Features for DETs and "another" #416

Closed
AngledLuffa opened this issue Aug 20, 2023 · 44 comments
Comments

@AngledLuffa
Contributor

In comparing EWT and GUM, there are two different standards for the word another. In GUM, it has the feature PronType=Art, whereas in EWT, it has no features. Personally I would think additional features are generally valuable, hence posting it as an issue in EWT.

@amir-zeldes

EWT example

# sent_id = weblog-juancole.com_juancole_20030911085700_ENG_20030911_085700-0026
# text = 3) Make Iraq another Afghanistan, using the Republican Right's own tactics against them.
1       3       3       NUM     LS      _       3       nummod  3:nummod        SpaceAfter=No
2       )       )       PUNCT   -RRB-   _       1       punct   1:punct _
3       Make    make    VERB    VB      VerbForm=Inf    0       root    0:root  _
4       Iraq    Iraq    PROPN   NNP     Number=Sing     3       obj     3:obj|6:nsubj:xsubj     _
5       another another DET     DT      _       6       det     6:det   _
6       Afghanistan     Afghanistan     PROPN   NNP     Number=Sing     3       xcomp   3:xcomp SpaceAfter=No

GUM example

# sent_id = GUM_bio_higuchi-21
# s_prominence = 3
# s_type = decl
# transition = continue
# text = Another theme Higuchi repeated was the ambition and cruelty of the Meiji middle class.
1       Another another DET     DT      PronType=Art    2       det     2:det   Bridge=66<73|Discourse=joint-list_m:68->61:2|Entity=(73-abstract-acc:inf-cf2-2-coref
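
A discrepancy like the one above can be surfaced by tallying the FEATS column for a given word form in each treebank. The following is only a minimal sketch, not tooling used by anyone in this thread: `feats_for_form` is a hypothetical helper, and it assumes standard tab-separated, 10-column CoNLL-U token lines (the excerpts above show the tabs expanded to spaces).

```python
from collections import Counter


def feats_for_form(conllu_text, form):
    """Count (UPOS, FEATS) pairs seen for a word form, ignoring case."""
    counts = Counter()
    for line in conllu_text.splitlines():
        # Skip blank separator lines and sentence-level comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        if cols[1].lower() == form.lower():
            counts[(cols[3], cols[5])] += 1
    return counts


# The two "another" token lines quoted above, with real tabs restored
# (the truncated MISC column of the GUM line is replaced by "_").
ewt = "5\tanother\tanother\tDET\tDT\t_\t6\tdet\t6:det\t_\n"
gum = "1\tAnother\tanother\tDET\tDT\tPronType=Art\t2\tdet\t2:det\t_\n"

print(feats_for_form(ewt, "another"))
print(feats_for_form(gum, "another"))
```

Running the same tally over two full treebank files and diffing the keys would list every form whose features disagree.
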
@nschneid
Contributor

This is an error in GUM, right? I've always understood the English articles to be restricted to "a(n)" and "the", and that's how it is in EWT and the PronType guidelines.

@AngledLuffa
Contributor Author

Cunningham's law strikes again! That possibility was why I tagged Amir, at least.

@amir-zeldes
Contributor

Well, if the guidelines say so then we have to either change GUM or the guidelines... I'd prefer it to have a PronType because it's really just a fusion of the same "an" we tag as having that feature, and the adjective other.

Since we tag and deprel it DT/det, and not amod, I would expect it's supposed to match the behavior of the "an" component, but if others see it differently, I'm willing to copy the EWT behavior.

@nschneid
Contributor

nschneid commented Aug 20, 2023

Historically it is "an"+"other", but "another" as a whole functions differently. (For example, it can take "yet" as an advmod, which articles cannot.)

While we're at it I see GUM has PronType=Art for "both", "no", "(n)either", and "yonder" (query). I would change those as well. The guidelines suggest PronType=Tot for "both" and PronType=Neg for "no".

@amir-zeldes
Contributor

OK, so Neg for "no", Tot for "both", and nothing for the rest? Maybe also Neg for "neither" and Dem for "yonder"?

@nschneid
Contributor

Yeah, Dem for "yonder" in its det usage makes sense to me. (If we wanted to decouple the det function from UPOS, like we do for some other deprels, arguably "yonder" is an ADV and maybe we'd want to drop the PronType. But that would be a separate discussion; let's keep DET for now.)

In principle there could be values that cover {"either", "neither"} and "another". It doesn't seem we have those at present (but see UniversalDependencies/docs#732), so I'm fine with Neg for "neither" and blank for "either" and "another".

Tagging @dan-zeman in case he wants to weigh in.

@AngledLuffa
Contributor Author

AngledLuffa commented Aug 20, 2023 via email

@nschneid
Contributor

If you want to make that happen I think the way would be to open an issue on the docs repo, and include a table of all determiners with their proposed features (along the lines of https://universaldependencies.org/en/pos/PRON.html).

But that will take some discussion—in the meantime we can just use the features we have.

@dan-zeman
Member

and blank for "either" and "another"

I would use PronType=Ind for these two. Indefinite is sometimes used as a 'catch-the-rest' category.

@AngledLuffa
Contributor Author

I had posted an issue which could be used for building a standard

UniversalDependencies/docs#971

Any thoughts on words such as "either" or "another", for instance @dan-zeman's suggestion of PronType=Ind? There are others which might fit that, such as "any" or "every".

@nschneid
Contributor

Here's what we converged on in the other thread: https://universaldependencies.org/en/pos/DET.html

@AngledLuffa PRs to implement this welcome!

@amir-zeldes
Contributor

amir-zeldes commented Sep 29, 2023 via email

@nschneid
Contributor

@AngledLuffa any interest in implementing this? Would be great to have for the UD 2.13 release (deadline Nov. 1).

@AngledLuffa
Contributor Author

You have no idea how much of a PITA it's been trying to get Ssurgeon to support empty nodes :/

but I'm almost to the point where simple edits to node features are possible, I think

nschneid changed the title from "Another POS feature discrepancy" to 'Features for DETs and "another"' on Oct 14, 2023

@AngledLuffa
Contributor Author

CoreNLP didn't support empty nodes at all in the graph objects used for SemanticGraph

Stanza couldn't read or write those nodes either, it just always discarded them

Both of those are now fixed. CoreNLP still can't read or write empty nodes, but I'm just skipping that for now...
I still need to make it so that Ssurgeon can understand two graphs at once.

@nschneid
Contributor

I realized I should add these checks to my validation script and went ahead and added the features with some regex replacements.
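
For readers wondering what a regex replacement of this kind might look like, here is a hedged sketch. It is not the actual commit: the pattern, the `set_another_feats` name, and the choice to handle only "another" are assumptions made for illustration, and it presumes tab-separated CoNLL-U.

```python
import re

# Matches a token line for "another" tagged DET/DT. Group 1 is everything up
# to and including the tab before FEATS (column 6); group 2 is everything
# after FEATS. The FEATS column itself is matched by [^\t\n]* and discarded.
ANOTHER_DET = re.compile(
    r"^(\d+\t[Aa]nother\tanother\tDET\tDT\t)[^\t\n]*(\t.*)$",
    re.MULTILINE,
)


def set_another_feats(conllu_text):
    """Overwrite FEATS on every DET/DT "another" token with PronType=Ind."""
    return ANOTHER_DET.sub(r"\g<1>PronType=Ind\g<2>", conllu_text)
```

Because the pattern replaces the whole FEATS column, it is only safe for words whose expected features are fully known; anything with additional features would need a merge instead of an overwrite.
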

nschneid added a commit that referenced this issue Oct 17, 2023
@AngledLuffa
Contributor Author

LGTM, thanks. @amir-zeldes something similar for GUM etc? I'll take a look at PUD and the Pronouns datasets

@amir-zeldes
Contributor

Yes, it's on my list to implement the feature proposal from the table before the upcoming release, not done yet though.

@AngledLuffa
Contributor Author

In PUD, there are a few instances of "that" which do not match the new table:

19      that    that    DET     WDT     PronType=Rel    22      obj     18:ref  _
25      that    that    DET     WDT     PronType=Rel    27      obj     24:ref  _
16      that    that    DET     WDT     PronType=Rel    20      obj     15:ref  _

A larger context looks like this:

16      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  _
17      last    last    ADJ     JJ      Degree=Pos      18      amod    18:amod _
18      thing   thing   NOUN    NN      Number=Sing     2       parataxis       2:parataxis|22:obl      _
19      that    that    DET     WDT     PronType=Rel    22      obj     18:ref  _
20      the     the     DET     DT      Definite=Def|PronType=Art       21      det     21:det  _
21      Government      government      NOUN    NN      Number=Sing     22      nsubj   22:nsubj        _
22      wants   want    VERB    VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   18      acl:relcl       18:acl:relcl    SpaceAfter=No

Is it still Number=Sing if it's in a WDT context instead of a DT context?

@AngledLuffa
Contributor Author

Similarly, should half a million get the updated half features?

-11     half    half    DET     PDT     _       13      compound        13:compound     _
+11     half    half    DET     PDT     NumForm=Word|NumType=Frac|PronType=Ind  13      compound        13:compound     _
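
The diff above adds three features at once, and UD expects the FEATS column to list features in alphabetical order separated by `|`. A small merge helper keeps that ordering automatic. This is a sketch under assumptions: `merge_feats` is a hypothetical name, and it handles only simple `Name=Value` pairs.

```python
def merge_feats(feats, updates):
    """Apply `updates` to a FEATS string and return it sorted by feature name."""
    current = {}
    if feats != "_":
        for pair in feats.split("|"):
            name, value = pair.split("=", 1)
            current[name] = value
    current.update(updates)
    if not current:
        return "_"
    # Sort case-insensitively by feature name, as the FEATS column requires.
    ordered = sorted(current.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


print(merge_feats("_", {"PronType": "Ind", "NumType": "Frac", "NumForm": "Word"}))
```
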

@nschneid
Contributor

In PUD, there are a few instances of "that" which do not match the new table:

If that is relative it should be PRON not DET.

Similarly, should half a million get the updated half features?

Yes, that's half as PDT/DET.

@AngledLuffa
Contributor Author

If that is relative it should be PRON not DET.

So these that should be PRON and not DET?

16      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  _
17      last    last    ADJ     JJ      Degree=Pos      18      amod    18:amod _
18      thing   thing   NOUN    NN      Number=Sing     2       parataxis       2:parataxis|22:obl      _
19      that    that    DET     WDT     PronType=Rel    22      obj     18:ref  _
20      the     the     DET     DT      Definite=Def|PronType=Art       21      det     21:det  _
21      Government      government      NOUN    NN      Number=Sing     22      nsubj   22:nsubj        _
22      wants   want    VERB    VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   18      acl:relcl       18:acl:relcl    SpaceAfter=No
23      a       a       DET     DT      Definite=Ind|PronType=Art       24      det     24:det  _
24      producer        producer        NOUN    NN      Number=Sing     20      appos   20:appos|27:obl _
25      that    that    DET     WDT     PronType=Rel    27      obj     24:ref  _
26      she     she     PRON    PRP     Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs   27      nsubj   27:nsubj        _
27      admired admire  VERB    VBD     Mood=Ind|Tense=Past|VerbForm=Fin        24      acl:relcl       24:acl:relcl    SpaceAfter=No
13      of      of      ADP     IN      _       15      case    15:case _
14      total   total   ADJ     JJ      Degree=Pos      15      amod    15:amod _
15      closure closure NOUN    NN      Number=Sing     12      nmod    12:nmod:of|20:obl       _
16      that    that    DET     WDT     PronType=Rel    20      obj     15:ref  _
17      the     the     DET     DT      Definite=Def|PronType=Art       18      det     18:det  _
18      Bank    bank    NOUN    NN      Number=Sing     20      nsubj   20:nsubj        _
19      has     have    AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   20      aux     20:aux  _
20      shown   show    VERB    VBN     Tense=Past|VerbForm=Part        15      acl:relcl       15:acl:relcl    _
21      to      to      ADP     IN      _       22      case    22:case _
22      us      we      PRON    PRP     Case=Acc|Number=Plur|Person=1|PronType=Prs      20      obl     20:obl:to       SpaceAfter=No

@nschneid
Contributor

Yes
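
The retagging agreed on here (relative "that" becomes PRON, everything else on the line stays put) could be scripted roughly as follows. This is an illustrative sketch, not the change that went into PUD; `retag_relative_that` is a hypothetical function and it assumes tab-separated token lines.

```python
def retag_relative_that(line):
    """Change UPOS from DET to PRON on a relative "that" (WDT, PronType=Rel)."""
    cols = line.split("\t")
    is_relative_that = (
        len(cols) == 10
        and cols[2] == "that"
        and cols[3] == "DET"
        and cols[4] == "WDT"
        and "PronType=Rel" in cols[5].split("|")
    )
    if is_relative_that:
        cols[3] = "PRON"
    return "\t".join(cols)


# One of the PUD lines quoted above, with tabs restored.
print(retag_relative_that("19\tthat\tthat\tDET\tWDT\tPronType=Rel\t22\tobj\t18:ref\t_"))
```

Demonstrative "that" (tagged DT with PronType=Dem) does not match the condition and is returned unchanged.
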

@AngledLuffa
Contributor Author

UniversalDependencies/UD_English-PUD#20

should the dependencies be nsubj or are they fine as obj?

@nschneid
Contributor

nschneid commented Oct 18, 2023

obj is correct: "a producer that she admired" is a way of conveying "she admired the producer", only with "that" standing in for the producer and moved before "she".

@AngledLuffa
Contributor Author

Great, thanks. Based on that, I merged the PR as is

@AngledLuffa
Contributor Author

The Pronouns dataset doesn't have many errors:

UniversalDependencies/UD_English-Pronouns#8

@AngledLuffa
Contributor Author

What about all labeled as a PDT? Still the same features?

11      people  people  NOUN    NNS     Number=Plur     14      nsubj   14:nsubj        _
12      without without ADP     IN      _       13      case    13:case _
13      children        child   NOUN    NNS     Number=Plur     11      nmod    11:nmod:without _
14      express express VERB    VBP     Mood=Ind|Tense=Pres|VerbForm=Fin        4       conj    4:conj:and      _
15      through through ADP     IN      _       17      case    17:case _
16      their   they    PRON    PRP$    Number=Plur|Person=3|Poss=Yes|PronType=Prs      17      nmod:poss       17:nmod:poss    _
17      disapproval     disapproval     NOUN    NN      Number=Sing     14      obl     14:obl:through  _
18      all     all     DET     PDT     _       20      det:predet      20:det:predet   _
19      their   they    PRON    PRP$    Number=Plur|Person=3|Poss=Yes|PronType=Prs      20      nmod:poss       20:nmod:poss    _
20      hatred  hatred  NOUN    NN      Number=Sing     14      obj     14:obj  _
21      of      of      ADP     IN      _       23      case    23:case _
22      modern  modern  ADJ     JJ      Degree=Pos      23      amod    23:amod _
23      parenting       parenting       NOUN    NN      Number=Sing     20      nmod    20:nmod:of      SpaceAfter=No

@nschneid
Contributor

Yeah, PronType=Tot

@AngledLuffa
Contributor Author

Pronouns change looks good then?

@nschneid
Contributor

Here's an implementation:

# articles
("a", "DT"):{"Definite":"Ind","PronType":"Art","LEMMA":"a"},
("an", "DT"):{"Definite":"Ind","PronType":"Art","LEMMA":"a"},
("the", "DT"):{"Definite":"Def","PronType":"Art","LEMMA":"the"},
# demonstratives. Note: tagged PRON if not acting as det, but script will check either way
("this", "DT"):{"Number":"Sing","PronType":"Dem","LEMMA":"this"},
("that", "DT"):{"Number":"Sing","PronType":"Dem","LEMMA":"that"},
("these", "DT"):{"Number":"Plur","PronType":"Dem","LEMMA":"this"},
("those", "DT"):{"Number":"Plur","PronType":"Dem","LEMMA":"that"},
("yonder", "DT"):{"PronType":"Dem","LEMMA":"yonder"},
# total
("all", "DT"):{"PronType":"Tot","LEMMA":"all"},
("all", "PDT"):{"PronType":"Tot","LEMMA":"all"},
("both", "DT"):{"PronType":"Tot","LEMMA":"both"},
("both", "PDT"):{"PronType":"Tot","LEMMA":"both"},
("each", "DT"):{"PronType":["Tot","Rcp"],"LEMMA":"each"},
("every", "DT"):{"PronType":"Tot","LEMMA":"every"},
# negative
("no", "DT"):{"PronType":"Neg","LEMMA":"no"},
("neither", "DT"):{"PronType":"Neg","LEMMA":"neither"},
("nary", "PDT"):{"PronType":"Neg","LEMMA":"nary"},
# indefinite
("half", "PDT"):{"NumForm":"Word","NumType":"Frac","PronType":"Ind","LEMMA":"half"},
("any", "DT"):{"PronType":"Ind","LEMMA":"any"},
("some", "DT"):{"PronType":"Ind","LEMMA":"some"},
("another", "DT"):{"PronType":"Ind","LEMMA":"another"},
("either", "DT"):{"PronType":"Ind","LEMMA":"either"},
("such", "PDT"):{"PronType":"Ind","LEMMA":"such"},
("quite", "PDT"):{"PronType":"Ind","LEMMA":"quite"},
("many", "PDT"):{"PronType":"Ind","LEMMA":"many"},
# WH (interrogative or relative)
("that", "WDT"):{"PronType":"Rel","LEMMA":"that"}, # actually PRON
("which", "WDT"):{"PronType":["Int","Rel"],"LEMMA":"which"}, # DET or PRON
("what", "WDT"):{"PronType":["Int","Rel"],"LEMMA":"what"},
("whatever", "WDT"):{"PronType":["Int","Rel"],"LEMMA":"whatever"}
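
To make the shape of that table concrete, here is a hedged sketch of how entries like these might be checked against a token line. It is not the validation script itself: `EXPECTED` holds only three entries copied from the table, `check_token` is a hypothetical function, and treating a list value as "any one of these is acceptable" is an assumption about what the table means.

```python
# A three-entry excerpt of the table above, keyed by (lowercased form, XPOS).
EXPECTED = {
    ("another", "DT"): {"PronType": "Ind", "LEMMA": "another"},
    ("all", "PDT"): {"PronType": "Tot", "LEMMA": "all"},
    ("which", "WDT"): {"PronType": ["Int", "Rel"], "LEMMA": "which"},
}


def check_token(line):
    """Return a list of mismatches between a token line and EXPECTED."""
    cols = line.split("\t")
    form, lemma, xpos, feats = cols[1], cols[2], cols[4], cols[5]
    spec = EXPECTED.get((form.lower(), xpos))
    if spec is None:
        return []  # No rule for this form/tag pair.
    actual = {} if feats == "_" else dict(p.split("=", 1) for p in feats.split("|"))
    problems = []
    for name, want in spec.items():
        if name == "LEMMA":
            if lemma != want:
                problems.append(f"lemma: expected {want}, found {lemma}")
            continue
        # Assumption: a list of values means any single one is acceptable.
        allowed = want if isinstance(want, list) else [want]
        if actual.get(name) not in allowed:
            problems.append(f"{name}: expected one of {allowed}, found {actual.get(name)}")
    return problems


# The PDT "all" line from earlier in the thread, which has no features yet.
print(check_token("18\tall\tall\tDET\tPDT\t_\t20\tdet:predet\t20:det:predet\t_"))
```
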

@AngledLuffa
Contributor Author

AngledLuffa commented Oct 18, 2023

What about all in ADV sentences instead? Any features there? I don't see any on all_ADV in EWT

1       We      we      PRON    PRP     Case=Nom|Number=Plur|Person=1|PronType=Prs      4       nsubj   4:nsubj _
2       're     be      AUX     VBP     Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin   4       cop     4:cop   _
3       all     all     ADV     RB      _       4       advmod  4:advmod        _
4       set     set     ADJ     JJ      Degree=Pos      0       root    0:root  _

@nschneid
Contributor

No, if an ADV has features it would just be comparative or superlative I think

@AngledLuffa
Contributor Author

That's fair, but I'll just leave it for now.

@AngledLuffa
Contributor Author

Here's an update for PUD:

UniversalDependencies/UD_English-PUD#21

@nschneid
Contributor

@amir-zeldes implemented in GUM yet?

@amir-zeldes
Contributor

I think so - I implemented the table. "Another" now has just PronType=Ind, that's what we want, right?

@nschneid
Contributor

Yes, the table at https://universaldependencies.org/en/pos/DET.html.

@AngledLuffa are we done with this issue?

@amir-zeldes
Contributor

Great, feel free to spot check my work, it's all in the dev branch.

@AngledLuffa
Contributor Author

I think we're done - although it occurs to me no one updated LinES. Perhaps I can do that with my script

@AngledLuffa
Contributor Author

One thing I found when trying to script the changes to LinES is that they labeled non-English determiners as DET when part of a proper noun. Le Monde comes up pretty often. Should I treat that as The or would a different UPOS be more appropriate? Le petit (no capital, perhaps that is a typo) is the only example I found in EWT of Le, with a tag of PROPN, and there are none in GUM. It should be pointed out that The is never a PROPN in EWT. Perhaps Le_DET is better?

@nschneid
Contributor

Different treebanks have different policies re: analyzing foreign expressions. Some try to analyze the syntax of the foreign phrase, so DET and det. Another option is to treat all the words in the name as PROPN. Another option is X.
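
Whichever policy a treebank follows, a script that rewrites DET features needs to notice tokens like "Le" before touching them. One way to list them is sketched below, with the caveats that `ENGLISH_DET_FORMS` is simply the set of forms from the table earlier in this thread, `unexpected_determiners` is a hypothetical helper, and the two-token sample is made up for illustration rather than copied from LinES.

```python
# Forms taken from the determiner table posted earlier in this thread.
ENGLISH_DET_FORMS = {
    "a", "an", "the", "this", "that", "these", "those", "yonder",
    "all", "both", "each", "every", "half", "no", "neither", "nary",
    "any", "some", "another", "either", "such", "quite", "many",
    "which", "what", "whatever",
}


def unexpected_determiners(conllu_text):
    """Return (ID, FORM) for DET tokens whose form is not in the English list."""
    hits = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != "DET":
            continue
        if cols[1].lower() not in ENGLISH_DET_FORMS:
            hits.append((cols[0], cols[1]))
    return hits


# Made-up sample, NOT an actual LinES annotation.
sample = (
    "1\tLe\tle\tDET\tDT\t_\t2\tdet\t2:det\t_\n"
    "2\tMonde\tMonde\tPROPN\tNNP\tNumber=Sing\t0\troot\t0:root\t_\n"
)
print(unexpected_determiners(sample))
```

Tokens flagged this way can then be reviewed by hand or skipped, depending on the treebank's policy for foreign material.
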

@dan-zeman
Member

One thing I found when trying to script the changes to LinES is that they labeled non-English determiners as DET when part of a proper noun.

It depends on whether they decided to annotate foreign phrases following the foreign guidelines, which is legitimate in UD but optional. Even then, foreign multiword names would be a gray zone, because they can be considered English phrases that happen to be names.

@AngledLuffa
Contributor Author

I updated the UPOS tags on "each" and then made a PR in LinES which updates the features on DET. I suppose I'll merge it later today if I don't hear otherwise.
