Lemmas of English personal pronouns #517

Open
nschneid opened this issue Dec 21, 2017 · 58 comments

Comments

@nschneid
Contributor

nschneid commented Dec 21, 2017

It is not obvious how pronouns should be lemmatized (cf. #276 for Slavic). The UD_English corpus does the following:

Nominative (PRP):

I -> I
you -> you
he -> he
she -> she
it -> it
we -> we
they -> they

Accusative (PRP):

me -> I
you -> you
him -> he
her -> she
it -> it
us -> we
them -> they

Dependent possessive (PRP$):

my -> my (!)
your -> you
his -> he
her -> she
its -> its (!)
our -> we
your -> you
their -> they

The pattern here is that they are normalized to nominative case, except for "my" and "its", which should probably be "I" and "it", respectively.

Independent possessive (PRP, no morphological features): mine, yours, ours, theirs, etc.: no normalization

Reflexive (PRP): myself, yourself, ourselves, yourselves, themselves, etc.: no normalization

WH animate: who, whom, whoever, whomever: no normalization

I am not sure why whom, whomever, the independent possessives, and the reflexives aren't normalized to nominative as well.

There is one token where ’s in Let’s has been lemmatized as us (it should presumably be we for consistency).

That said, the simplest policy may be to use the lemma field only for spelling normalization (#513) and not perform case normalization at all. If the end user wants to map pronouns to nominative case, that is not hard to implement as postprocessing once spelling is consistent.

Thoughts?
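
The post-processing mentioned above (mapping pronouns to nominative case once spelling is consistent) could look roughly like this. It is a minimal sketch, not part of any UD tooling: the names `NOMINATIVE` and `to_nominative` are hypothetical, and the table covers only the accusative and dependent possessive forms listed in this comment. It ignores the tag, so an independent possessive "his" would also be mapped to "he".

```python
# Hypothetical lookup table: lowercased surface form -> nominative form.
# Covers only the accusative and dependent possessive forms listed above.
NOMINATIVE = {
    "me": "I", "my": "I",
    "him": "he", "his": "he",
    "her": "she",
    "its": "it",
    "us": "we", "our": "we",
    "your": "you",
    "them": "they", "their": "they",
}


def to_nominative(form: str) -> str:
    """Return the nominative pronoun for a case-marked form, else the form unchanged."""
    return NOMINATIVE.get(form.lower(), form)
```

With this table, `to_nominative("them")` gives `"they"`, while forms outside the table, such as "mine" or "myself", pass through unchanged.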

@rueter
Contributor

rueter commented Dec 21, 2017 via email

@dan-zeman
Member

Case normalization in lemmas is expected in languages where Case plays a more important role than in English and I would expect it in English as well.

@nschneid
Contributor Author

nschneid commented Dec 24, 2017 via email

@amir-zeldes
Contributor

I think different language-specific guidelines differ on this, and it would be good to stay consistent with other corpora in the respective languages, since what 'lemma' means in each language is rather different. We already have a split between UPOS and language-specific tags, I wouldn't want to see 'native vs. UD lemmas' as well if possible...

For GUM, we've simply used the behavior of the TreeTagger: PRP gets the nominative form (him -> he), PRP$ get their own form (my -> my, its -> its). The independent forms (mine etc.) technically have their own nominative form (mine is...) so they are lemmatized to themselves (mine -> mine). Basically this corresponds to only lemmatizing across case, and treating the possessive determiners as not a case form of the personal pronoun (which most of them are not, historically). I don't necessarily think this is ideal, but I think it doesn't matter much for personal pronouns, and inventing new standards for this sounds like it would ultimately create more work and complications than benefits...
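
The policy described here (lemmatize across case for PRP, leave PRP$ and the independent forms alone) can be sketched as below. This illustrates the behavior as described in this comment and is not TreeTagger's or GUM's actual code. `ACC_TO_NOM` and `gum_style_lemma` are hypothetical names.

```python
# Hypothetical table of accusative personal pronouns and their nominative forms.
ACC_TO_NOM = {"me": "I", "him": "he", "her": "she", "us": "we", "them": "they"}


def gum_style_lemma(form: str, xpos: str) -> str:
    """Normalize case for PRP only; PRP$ and other forms keep their own (lowercased) form."""
    lowered = form.lower()
    if lowered == "i":
        return "I"  # the first-person singular lemma stays capitalized
    if xpos == "PRP":
        # "him" -> "he"; forms without an entry ("mine", "he") map to themselves.
        return ACC_TO_NOM.get(lowered, lowered)
    # PRP$ ("my", "its", "her") is treated as a separate paradigm.
    return lowered
```

Note how the tag disambiguates "her": `gum_style_lemma("her", "PRP")` returns `"she"`, whereas `gum_style_lemma("her", "PRP$")` returns `"her"`.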

@nschneid
Contributor Author

For future reference, I'm finding many inconsistencies between columns in UD_English that point to tagging, morphology, or parse errors involving pronouns. Some commands:

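# PRP tokens with no morphological features
fgrep $'PRP\t_' */*.conllu
# PRP$ tokens on lines that also mention obj or iobj
egrep 'PRP\$.*i?obj' */*.conllu
# PRP tokens on lines that also mention nmod:poss
egrep $'PRP\t.*nmod:poss' */*.conllu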

egrep 'PRP\$.*nsubj' */*.conllu turns up several possessed gerunds (our agreeing to the deal, etc.). Not sure if this is the correct analysis. There aren't any instances of possessed gerunds with nmod:poss.
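
For anyone who prefers column-aware checks over grep, the first three searches could be approximated in Python as follows. This is a rough sketch under assumptions: `pronoun_inconsistencies` is a hypothetical helper, and it is stricter than the grep patterns because it compares the DEPREL column exactly instead of matching anywhere on the line (so it ignores the enhanced DEPS column).

```python
def pronoun_inconsistencies(conllu_text: str):
    """Yield (check_name, token_id, form) for pronoun rows that look suspicious."""
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip comments, blank lines, and malformed rows
        tok_id, form, _lemma, _upos, xpos, feats, _head, deprel = cols[:8]
        if xpos == "PRP" and feats == "_":
            yield ("PRP without features", tok_id, form)
        if xpos == "PRP$" and deprel in ("obj", "iobj"):
            yield ("PRP$ as object", tok_id, form)
        if xpos == "PRP" and deprel == "nmod:poss":
            yield ("PRP as nmod:poss", tok_id, form)
```

The generator takes the full text of a `.conllu` file, so `list(pronoun_inconsistencies(open(path, encoding="utf-8").read()))` would collect all hits for one file.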

@nschneid
Contributor Author

@sebschu do you have an opinion on pronoun lemmatization?

@dan-zeman dan-zeman added this to the v2.2 milestone Apr 24, 2018
@dan-zeman dan-zeman modified the milestones: v2.2, v2.4 Nov 13, 2018
@dan-zeman dan-zeman modified the milestones: v2.4, v2.5 Oct 6, 2019
@dan-zeman dan-zeman modified the milestones: v2.5, v2.6 Nov 9, 2019
@dan-zeman dan-zeman modified the milestones: v2.6, v2.7 May 14, 2020
@dan-zeman dan-zeman modified the milestones: v2.7, v2.8 Nov 14, 2020
@WaukyJose

Interesting discussion of the lemmatisation of pronominals. However, it seems that the programming experts giving their opinions ignore the issues involved in automatically analysing particular parts of speech, as in the analysis of pronouns, which demands a wider understanding of the functions underlying pronouns across the sentences and paragraphs of a text. The deictic element, for example, is mostly absent from the programming of pronoun detection and analysis, as in automatically determining the average of pronoun lemmas, which is of course not a bad idea. A big caveat here is that pronominals (a type of cohesive reference) signal backward and forward references (e.g., anaphoric, cataphoric). Nevertheless, it seems that NLP tools have deliberately been minimising this important aspect of the analysis of pronouns. Ignoring functional linguistic elements keeps new NLP programmers making and replicating the same big mistakes in the analysis of lemmatised pronouns.

@amir-zeldes
Contributor

@WaukyJose this is the documentation for Universal Dependencies, a project creating resources with syntactic, rather than semantic analyses. However some datasets do actually contain annotations from other projects, including explicit analysis of anaphora, cataphora, and other forms of coreference. If you're looking for English data covering both UD syntax and coreference, you may want to look at this one:

https://github.com/UniversalDependencies/UD_English-GUM

You can find coreference indices and entity types in the last column, inside the annotation Entity (e.g. Entity=(person-4) on a pronoun's line means that that pronoun refers to a person, all of whose mentions are indexed as '4' inside that document).
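
As an illustration, an annotation of the simple form quoted above could be read out of the last (MISC) column like this. The sketch assumes exactly the `Entity=(type-index)` shape shown in the example. Actual GUM releases may use a richer Entity format that this pattern would not match, so check the corpus documentation first. `ENTITY_RE` and `parse_entities` are hypothetical names.

```python
import re

# Matches the simple form quoted above, e.g. "(person-4)".
ENTITY_RE = re.compile(r"\((?P<type>[A-Za-z]+)-(?P<index>\d+)\)")


def parse_entities(misc: str):
    """Return (entity_type, index) pairs from the Entity annotation of a MISC value."""
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        if key == "Entity":
            return [(m["type"], int(m["index"])) for m in ENTITY_RE.finditer(value)]
    return []  # no Entity annotation on this token
```

For a pronoun line whose last column is `Entity=(person-4)`, the function returns one pair: the type `person` and the index `4`.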

@dan-zeman dan-zeman modified the milestones: v2.8, v2.9 Jun 17, 2021
@nschneid
Contributor Author

nschneid commented Jan 18, 2022

This issue has reared its head again in UniversalDependencies/UD_English-EWT#293, with some arguing that a standard for pronoun lemmas across Germanic languages should be attempted.

After making corrections for consistency, here is the full set of pronouns in EWT. For the lemma, the ones in italics are normalized to the first item in the row:

Personal pronouns

| | Nominative (Case=Nom) | Accusative (Case=Acc) | Dependent Genitive/Possessive (Poss=Yes) | Independent Genitive/Possessive (Poss=Yes) | Reflexive (Case=Acc, Reflex=Yes) | Variants |
|---|---|---|---|---|---|---|
| 1.sg | I | me | my | mine | myself | |
| 1.pl | we | us | our | ours | ourselves | |
| 2.sg | you | you | your | yours | yourself | u, ya, ye, thou; yo, thy |
| 2.pl | you | you | your | yours | yourselves | y'all |
| 3.sg.m | he | him | his | his | himself | |
| 3.sg.f | she | her | her | hers | herself | |
| 3.sg.n | it | it | its | (its) | itself | |
| 3.pl | they | them | their | theirs | themselves | |

(Items in parentheses are unattested in EWT.)

☞ Clearly my and its are outliers, as noted at the top of the issue. The least disruptive change would be to replace my => I and its => it. But we should at least make sure that EWT and GUM agree; GUM does not presently lemmatize possessives.

☞ The features do not currently distinguish dependent and independent genitives/possessives. Would it make sense to use Case=Gen instead of Poss=Yes for one of them? Or add another feature?

Other pronouns

| WH | Plain | -ever | Possessive | Variant |
|---|---|---|---|---|
| wh.anim | who, whom | whoever, whomever | whose | |
| wh.inanim | what | whatever | whose | wtf |
| wh.det | which | (whichever) | | |

☞ If personal pronouns are normalized for case, it would make sense to normalize whom => who and whomever => whoever.

☞ If dependent possessive personal pronouns are normalized, it would make sense to replace whose, although technically it is shared between who and what, so semantics would be required to resolve the correct lemma.

| INDEFINITE | one | body | thing |
|---|---|---|---|
| every | everyone | everybody | everything |
| any | anyone | anybody | anything |
| some | someone | somebody | something |
| no | no one | nobody | nothing |

No one is currently analyzed as det(one/NOUN, no/DET). Perhaps one should be PRON.

| DEMONSTRATIVE | sg | pl |
|---|---|---|
| prox | this | these |
| dist | that | those |

EXPLETIVE: there

GENERIC: one

RECIPROCAL: each other, one another [not PRON: see UniversalDependencies/UD_English-EWT#123]

For the remaining groups, only the plural demonstratives these and those are normalized, which makes sense.

N.B. when, wherever, somewhere, etc. are tagged as ADV, not PRON.

@amir-zeldes
Contributor

Thanks for writing this up so clearly! For convenience I will repeat what I said in the EWT issue - basically I think case forms like "them" should be lemmatized to the nominative "they", but possessive determiners form a separate paradigm because:

  • The determiners are not historically genitive forms of the pronouns (they correspond to Latin "meus, meo", not "ego, mihi")
  • The determiners have their own lemmas and full paradigms, incl. case in the other Germanic UD languages (German: mich -> ich = me -> I, and mein(er|e|es) -> mein); all things being equal I think English should do things the same as German, Dutch etc., unless there is a strong reason not to.
  • The independent forms can serve in any case form, indicating that they are not genitive forms either: "we both have cats; yours/NOM has met mine/ACC"
  • In colloquial speech under coordination, 's genitives are compatible with a coordinate true pronoun, e.g. "me and John's cat", whereas "my and John's cat" is dispreferred (but should be fine IMO if "John's" and "my" were both truly genitives); admittedly the existence of both forms makes this particular argument weaker than the rest
  • One of the most popular English lemmatizers of the past two decades, TreeTagger, lemmatized "my" to "my", leading to this lemmatization behavior being present in a lot of corpora (e.g. all of the ones here), and the same seems to be true of the COCA family of corpora

I would like to see this behave as similarly as possible across Germanic languages, though of course not at all costs :)

@dan-zeman dan-zeman modified the milestones: v2.14, v2.15 May 15, 2024
@nschneid
Contributor Author

Somehow it seems we missed "none" (and, as noted in the PTB tag guidelines, "naught"). Will add these to the PRON table with PronType=Neg.

nschneid added a commit that referenced this issue May 30, 2024
@nschneid
Contributor Author

nschneid commented May 30, 2024

@dan-zeman points out that PronType should apply to grammatical adverbs (pro-adverbs). We use it already for WH-adverbs and here and there. What else should be added? I am thinking of:

  • PronType=Neg: never, nowhere, neither
  • PronType=Tot: always, everywhere
  • PronType=Ind: sometime(s), someplace, somewhere, anytime, anyplace, anywhere, ever, either
  • PronType=Dem: now, then

@amir-zeldes thoughts on the above list? https://en.wikipedia.org/wiki/Pro-form is useful, though I'm not sure we want to start dealing with "however", "therefore", and so on.
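
To make the proposal concrete, the list above can be written as a lookup keyed on lemma and restricted to ADV. This is only a sketch of the proposed values, not the final guidelines. `PROPOSED_ADV_PRONTYPE`, `ADV_PRONTYPE` and `proposed_prontype` are hypothetical names, and the WH-adverbs and "here"/"there" that already carry PronType are left out.

```python
# Proposed PronType values for pro-adverbs, copied from the list above.
# "sometime(s)" is expanded to both spellings.
PROPOSED_ADV_PRONTYPE = {
    "Neg": ["never", "nowhere", "neither"],
    "Tot": ["always", "everywhere"],
    "Ind": ["sometime", "sometimes", "someplace", "somewhere",
            "anytime", "anyplace", "anywhere", "ever", "either"],
    "Dem": ["now", "then"],
}

# Invert into a lemma -> PronType lookup.
ADV_PRONTYPE = {
    lemma: pron_type
    for pron_type, lemmas in PROPOSED_ADV_PRONTYPE.items()
    for lemma in lemmas
}


def proposed_prontype(lemma: str, upos: str):
    """Return the proposed PronType for an ADV lemma, or None if the list does not cover it."""
    if upos != "ADV":
        return None  # the list above concerns adverbs only
    return ADV_PRONTYPE.get(lemma.lower())
```

Under this sketch, adverbial "never" gets `Neg`, while "however" and "therefore" get nothing, matching the hesitation about discourse connectives.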

@amir-zeldes
Contributor

I think that mostly makes sense; for "either" and "neither", I'm not sure it should apply in determiner/preconj usage, since they're not exactly pronouns there. But this is all mainly useful if other languages implement this as well. For 'therefore' and 'however' in the discourse use I think they are probably no longer perceived as pronominal, even if they are etymologically.

@nschneid
Contributor Author

for "either" and "neither", I'm not sure it should apply in determiner/preconj usage, since they're not exactly pronouns there

I would expect PronType to accord with the UPOS. At present preconj "either" and "neither" are tagged CCONJ, so let's not give them a PronType. DETs do receive PronTypes though, as documented previously: https://universaldependencies.org/en/pos/DET.html

TBC, I listed "(n)either" above for the ADV uses ("I don't want a sandwich, either").

(I keep having to remind myself that "PronType" is a misnomer, it actually covers all pro-forms.)

@amir-zeldes
Contributor

Yeah, I think ProType would have been better! In any case, let me know what you want to do and I'll match it for GU corpora, this all sounds fine to me.

nschneid added a commit that referenced this issue Jun 18, 2024
@nschneid
Contributor Author

OK how about these guidelines: https://universaldependencies.org/en/pos/ADV.html

nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue Jun 22, 2024
…cies/docs#517); involves some changes from interrogative to free relative structure (#278)
nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue Jun 22, 2024
…d always be adverbs (#132); also apply PronType=Ind to the retagged ones (UniversalDependencies/docs#517)
nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue Jun 22, 2024
@nschneid
Contributor Author

OK how about these guidelines: https://universaldependencies.org/en/pos/ADV.html

Implemented in EWT! (modulo some existing PronType=Int annotations that should be PronType=Rel)

@AngledLuffa

So we should update none in PUD to be PRON with PronType=Neg?

(among other changes)

@AngledLuffa

anything to be done for however? that was left out of the EWT updates

anyway?

any_ADV, there_PRON left blank?

@nschneid
Contributor Author

anything to be done for however? that was left out of the EWT updates

anyway?

These are both mainly discourse connectives, so I'm not sure they need a PronType.

any_ADV, there_PRON left blank?

there_PRON: for expletive "there" I'm not sure if any of the PronType values would be a good fit. This is documented at https://universaldependencies.org/en/pos/PRON.html#expletive-there

any_ADV: "any" is normally DET. I see "any/ADV longer/ADV" and similar; not sure this is actually correct. Also "it doesn't hurt any/ADV" (= at all). Could these be DET attaching as advmod? Feels related to "some/DET 540,000 men". Curious to hear @amir-zeldes's take when he's back from vacation.

@AngledLuffa

however

mainly discourse connectives

Agreed that the discourse versions are fine w/o. They are not always discourse, though, especially however:

# sent_id = email-enronsent24_01-0036
# text = My goal, however optimistic, is to execute the risk policy by the end of today.
4       however however ADV     RB      _       5       advmod  5:advmod        _
5       optimistic      optimistic      ADJ     JJ      Degree=Pos      2       amod    2:amod  SpaceAfter=No

# sent_id = email-enronsent24_01-0093
# text = My goal, however optimistic, is to execute the risk policy by the end of today.

# sent_id = reviews-332105-0004
# text = I will reccommend his services however/whenever possible!
6       however however ADV     WRB     PronType=Int    3       advmod  3:advmod|9:advmod       SpaceAfter=No
7       /       /       SYM     SYM     _       8       cc      8:cc    SpaceAfter=No
8       whenever        whenever        ADV     WRB     PronType=Rel    6       conj    3:advmod|6:conj|9:advmod        _

(those are the only ones I saw for however)

@nschneid
Contributor Author

Technically you're right, the "however optimistic" ones should be PronType=Int. I suppose these are just uses of "however" that modify a non-predicate ADJ or ADV.

  • EWT query
  • GUM query
    • plus an instance of "however our judgments might differ" that I believe is similar
  • PUD: (no results)

"however/whenever possible": as "however" is the first item in coordination I suppose it should be the head of the free relative

@AngledLuffa

Technically you're right

(insert satisfied seal meme here)

@nschneid
Contributor Author

Aha, apparently "however" receives a different xpos: RB for the discourse connective use and WRB for the interrogative or relative use! (This is documented in the PTB tagging guidelines.) So we can require PronType conditional on that.
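
A conditional requirement of this kind might be checked along the following lines. This is a hedged sketch and not the actual validator or neaten.py logic: `check_however` is a hypothetical function that takes the XPOS and FEATS columns of a token whose lemma is "however".

```python
def check_however(xpos: str, feats: str):
    """Return an error message if a 'however' token breaks the XPOS/PronType rule, else None."""
    has_prontype = any(f.startswith("PronType=") for f in feats.split("|"))
    if xpos == "WRB" and not has_prontype:
        return "WRB 'however' should carry a PronType"
    if xpos == "RB" and has_prontype:
        return "RB 'however' (discourse connective) should not carry a PronType"
    return None  # the two columns agree
```

A check like this only ties the two columns together. A "however optimistic" token tagged RB with no features would still pass, so the XPOS itself has to be corrected first.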

nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue Jun 26, 2024
…iated spellings (UniversalDependencies/docs#517 - also fix neaten.py cause of false negative in #532); some typos (including "develope", #526)