Lemmas of English personal pronouns #517

nschneid · 2017-12-21T00:42:35Z

It is not obvious how pronouns should be lemmatized (cf. #276 for Slavic). The UD_English corpus does the following:

Nominative (PRP):

I -> I
you -> you
he -> he
she -> she
it -> it
we -> we
they -> they

Accusative (PRP):

me -> I
you -> you
him -> he
her -> she
it -> it
us -> we
them -> they

Dependent possessive (PRP$):

my -> my (!)
your -> you
his -> he
her -> she
its -> its (!)
our -> we
your -> you
their -> they

The pattern here is that they are normalized to nominative case, except for "my" and "its", which should probably be "I" and "it", respectively.

Independent possessive (PRP, no morphological features): mine, yours, ours, theirs, etc.: no normalization

Reflexive (PRP): myself, yourself, ourselves, yourselves, themselves, etc.: no normalization

WH animate: who, whom, whoever, whomever: no normalization

I am not sure why whom, whomever, the independent possessives, and the reflexives aren't normalized to nominative as well.

There is one token where ’s in Let’s has been lemmatized as us (it should presumably be we for consistency).

That said, the simplest policy may be to use the lemma field only for spelling normalization (#513) and not perform case normalization at all. If the end user wants to map pronouns to nominative case, that is not hard to implement as postprocessing once spelling is consistent.
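To illustrate how small that postprocessing step could be, here is a minimal sketch. The table and the function name `to_nominative` are hypothetical and not part of any UD tool; the sketch assumes the input is already spelling-normalized and is only ever called on personal pronoun forms.

```python
# Hypothetical lookup table: pronoun form -> nominative lemma.
# Independent possessives ("mine"), reflexives ("myself") and WH forms are
# deliberately left out, so they fall through unchanged.
NOMINATIVE = {
    "me": "I", "my": "I",
    "us": "we", "our": "we",
    "him": "he", "his": "he",
    "her": "she",
    "its": "it",
    "them": "they", "their": "they",
    "your": "you",
}


def to_nominative(form: str) -> str:
    """Map a personal pronoun form to its nominative lemma (sketch only)."""
    key = form.lower()
    if key == "i":
        return "I"  # keep the conventional capitalization of "I"
    return NOMINATIVE.get(key, key)


print(to_nominative("Them"))
print(to_nominative("mine"))
```

Note that "his" is ambiguous between the dependent and independent possessive, so a real implementation would probably need the tag or features as well as the form.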

Thoughts?


rueter · 2017-12-21T10:20:28Z

For morphologically rich/normal languages, the lemma also serves as a point of disambiguation together with its POS sibling. Since spelling normalization is being discussed, it might serve our purpose to provide a spelling[norm]=xxx in MISC to cover the misspellings.


dan-zeman · 2017-12-23T19:32:48Z

Case normalization in lemmas is expected in languages where Case plays a more important role than in English and I would expect it in English as well.

nschneid · 2017-12-24T02:47:31Z

I guess I am not sure what the guiding principles are/should be for pronoun normalization. It is clear that English nouns should be normalized by number and verbs by number, person, and tense. So why are the pronouns normalized by case but not person or number?

If the goal is to remove all inflectional information, shouldn't all personal pronouns map to the same lemma? Or is the goal to collapse dimensions of a paradigm which tend to have common stems? By the common stem criterion it would make sense to give possessives and accusatives the same lemma, and perhaps "he"/"him"/"his", but it does not feel intuitive to give "I", "we", "me", and "our" the same lemma.

From a more semantic/practical perspective, I could see an argument that number and person are relevant to reference resolution whereas case is primarily grammatical and is encoded in the syntactic relations.

Finally, one could argue that it's best to avoid worrying about all of these competing criteria for closed-class POS categories and just keep the (spelling-normalized) word as the lemma, because the benefits of lemmatization in dealing with the long tail are not relevant as they are for open classes. English doesn't have that many distinct pronouns to begin with, and their commonalities are exposed in morphological features, so what does lemmatization buy us?


amir-zeldes · 2017-12-25T15:20:26Z

I think different language-specific guidelines differ on this, and it would be good to stay consistent with other corpora in the respective languages, since what 'lemma' means in each language is rather different. We already have a split between UPOS and language-specific tags, I wouldn't want to see 'native vs. UD lemmas' as well if possible...

For GUM, we've simply used the behavior of the TreeTagger: PRP gets the nominative form (him -> he), PRP$ get their own form (my -> my, its -> its). The independent forms (mine etc.) technically have their own nominative form (mine is...) so they are lemmatized to themselves (mine -> mine). Basically this corresponds to only lemmatizing across case, and treating the possessive determiners as not a case form of the personal pronoun (which most of them are not, historically). I don't necessarily think this is ideal, but I think it doesn't matter much for personal pronouns, and inventing new standards for this sounds like it would ultimately create more work and complications than benefits...
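As a rough sketch of the policy just described (not TreeTagger's or GUM's actual code), the lemma can be made to depend on the XPOS tag so that only PRP forms are normalized across case. The name `gum_style_lemma` and the small table are assumptions made for illustration.

```python
# Hypothetical table of accusative PRP forms and their nominative lemmas.
ACC_TO_NOM = {"me": "I", "us": "we", "him": "he", "her": "she", "them": "they"}


def gum_style_lemma(form: str, xpos: str) -> str:
    """Normalize case for PRP only; PRP$ and everything else keep their own form."""
    lower = form.lower()
    if xpos == "PRP":
        if lower == "i":
            return "I"
        # Independent possessives and reflexives are not in the table,
        # so "mine" and "myself" are returned unchanged.
        return ACC_TO_NOM.get(lower, lower)
    return lower


print(gum_style_lemma("her", "PRP"))
print(gum_style_lemma("her", "PRP$"))
```

The "her" case shows why the tag is needed: the same form gets a different lemma depending on whether it is tagged PRP or PRP$.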

nschneid · 2017-12-31T23:47:58Z

For future reference, I'm finding many inconsistencies between columns in UD_English that point to tagging, morphology, or parse errors involving pronouns. Some commands:

fgrep $'PRP\t_' */*.conllu
egrep 'PRP\$.*i?obj' */*.conllu
egrep $'PRP\t.*nmod:poss' */*.conllu

egrep 'PRP\$.*nsubj' */*.conllu turns up several possessed gerunds (our agreeing to the deal, etc.). Not sure if this is the correct analysis. There aren't any instances of possessed gerunds with nmod:poss.
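For anyone who prefers column-aware checks over grep, the first three commands can be approximated in Python along these lines. This is a sketch under the assumption of standard 10-column CoNLL-U rows; the function name `suspicious_pronouns` is made up here, and the checks are stricter than the grep patterns because they look only at the XPOS, FEATS and DEPREL columns rather than anywhere in the line.

```python
def suspicious_pronouns(lines):
    """Yield (reason, token_id, form) for pronoun rows whose columns disagree."""
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue  # skip comments, blank lines and malformed rows
        tok_id, form, xpos, feats, deprel = cols[0], cols[1], cols[4], cols[5], cols[7]
        if xpos == "PRP" and feats == "_":
            yield ("PRP without features", tok_id, form)
        if xpos == "PRP$" and deprel in ("obj", "iobj"):
            yield ("PRP$ as object", tok_id, form)
        if xpos == "PRP" and deprel == "nmod:poss":
            yield ("PRP as nmod:poss", tok_id, form)


# Invented rows for demonstration only; they are not taken from UD_English.
rows = [
    "# text = demo",
    "1\tmine\tmine\tPRON\tPRP\t_\t2\tnsubj\t_\t_",
    "2\tmy\tmy\tPRON\tPRP$\tPoss=Yes\t3\tobj\t_\t_",
    "3\thim\the\tPRON\tPRP\tCase=Acc\t4\tnmod:poss\t_\t_",
    "",
]
for hit in suspicious_pronouns(rows):
    print(hit)
```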

nschneid · 2018-04-22T19:25:13Z

@sebschu do you have an opinion on pronoun lemmatization?

WaukyJose · 2020-12-23T14:31:06Z

Interesting discussion of the lemmatisation of pronominals. However, it seems that the programming experts giving their opinions overlook the issues involved in automatically analysing particular parts of speech, such as pronouns, whose analysis demands a wider understanding of the functions pronouns serve across the sentences and paragraphs of a text. The deictic element, for example, is mostly absent from the programming of pronoun detection and analysis, as in automatically determining the average of pronoun lemmas, which is of course not a bad idea. A big caveat here is that pronominals (a type of cohesive reference) signal backward and forward reference (e.g., anaphoric, cataphoric). Nevertheless, it seems as if NLP tools have deliberately been minimising this important aspect of the analysis of pronouns. Ignoring functional linguistic elements keeps new NLP programmers meeting and replicating the same big mistakes in the analysis of lemmatised pronouns.

amir-zeldes · 2020-12-23T23:05:03Z

@WaukyJose this is the documentation for Universal Dependencies, a project creating resources with syntactic, rather than semantic analyses. However some datasets do actually contain annotations from other projects, including explicit analysis of anaphora, cataphora, and other forms of coreference. If you're looking for English data covering both UD syntax and coreference, you may want to look at this one:

https://github.com/UniversalDependencies/UD_English-GUM

You can find coreference indices and entity types in the last column, inside the annotation Entity (e.g. Entity=(person-4) on a pronoun's line means that that pronoun refers to a person, all of whose mentions are indexed as '4' inside that document).
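As a rough illustration, an annotation of the shape shown above could be read out of the last (MISC) column like this. The sketch assumes exactly the `Entity=(type-index)` form from the example; the helper name `entity_mentions` is hypothetical, and the annotation format in a given GUM release may be richer than this.

```python
import re

# Matches "(person-4)"-style mentions: an entity type, a hyphen, an index.
ENTITY = re.compile(r"\(([A-Za-z]+)-(\d+)\)")


def entity_mentions(misc: str):
    """Return a list of (entity_type, index) pairs found in a MISC value."""
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        if key == "Entity":
            return [(etype, int(idx)) for etype, idx in ENTITY.findall(value)]
    return []


print(entity_mentions("Entity=(person-4)|SpaceAfter=No"))
```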

nschneid · 2022-01-18T03:01:36Z

This issue has reared its head again in UniversalDependencies/UD_English-EWT#293, with some arguing that a standard for pronoun lemmas across Germanic languages should be attempted.

After making corrections for consistency, here is the full set of pronouns in EWT. For the lemma, the ones in italics are normalized to the first item in the row:

Personal pronouns

| | Nominative `Case=Nom` | Accusative `Case=Acc` | Dependent Genitive/Possessive `Poss=Yes` | Independent Genitive/Possessive `Poss=Yes` | Reflexive `Case=Acc`, `Reflex=Yes` | Variants |
|---|---|---|---|---|---|---|
| 1.sg | I | me | my | mine | myself | |
| 1.pl | we | us | our | ours | ourselves | |
| 2.sg | you | you | your | yours | yourself | u, ya, ye, thou; yo, thy |
| 2.pl | you | you | your | yours | yourselves | y'all |
| 3.sg.m | he | him | his | his | himself | |
| 3.sg.f | she | her | her | hers | herself | |
| 3.sg.n | it | it | its | (its) | itself | |
| 3.pl | they | them | their | theirs | themselves | |

(Items in parentheses are unattested in EWT.)

☞ Clearly my and its are outliers, as noted at the top of the issue. The least disruptive change would be to replace my => I and its => it. But we should at least make sure that EWT and GUM agree; GUM does not presently lemmatize possessives.

☞ The features do not currently distinguish dependent and independent genitives/possessives. Would it make sense to use Case=Gen instead of Poss=Yes for one of them? Or add another feature?

Other pronouns

| WH | Plain | -ever | Possessive | Variant |
|---|---|---|---|---|
| wh.anim | who, whom | whoever, whomever | whose | |
| wh.inanim | what | whatever | whose | wtf |
| wh.det | which | (whichever) | | |

☞ If personal pronouns are normalized for case, it would make sense to normalize whom => who and whomever => whoever.

☞ If dependent possessive personal pronouns are normalized, it would make sense to replace the lemma of whose as well, although technically it is shared between who and what, so semantics would be required to resolve the correct lemma.

| INDEFINITE | one | body | thing |
|---|---|---|---|
| every | everyone | everybody | everything |
| any | anyone | anybody | anything |
| some | someone | somebody | something |
| no | no one | nobody | nothing |

☞ No one is currently analyzed as det(one/NOUN, no/DET). Perhaps one should be PRON.

| DEMONSTRATIVE | sg | pl |
|---|---|---|
| prox | this | these |
| dist | that | those |

EXPLETIVE
there

GENERIC
one

RECIPROCAL
each other, one another [not PRON: see UniversalDependencies/UD_English-EWT#123]

For the remaining groups, only the plural demonstratives these and those are normalized, which makes sense.

N.B. when, wherever, somewhere, etc. are tagged as ADV, not PRON.

amir-zeldes · 2022-01-18T17:05:36Z

Thanks for writing this up so clearly! For convenience I will repeat what I said in the EWT issue - basically I think case forms like "them" should be lemmatized to the nominative "they", but possessive determiners form a separate paradigm because:

- The determiners are not historically genitive forms of the pronouns (they correspond to Latin "meus, meo", not "ego, mihi")
- The determiners have their own lemmas and full paradigms, incl. case in the other Germanic UD languages (German: mich -> ich = me -> I, and mein(er|e|es) -> mein); all things being equal I think English should do things the same as German, Dutch etc., unless there is a strong reason not to.
- The independent forms can serve in any case form, indicating that they are not genitive forms either: "we both have cats; yours/NOM has met mine/ACC"
- In colloquial speech under coordination, 's genitives are compatible with a coordinate true pronoun, e.g. "me and John's cat", whereas "my and John's cat" is dispreferred (but should be fine IMO if "John's" and "my" were both truly genitives); admittedly the existence of both forms makes this particular argument weaker than the rest
- One of the most popular English lemmatizers of the past two decades, TreeTagger, lemmatized "my" to "my", leading to this lemmatization behavior being present in a lot of corpora (e.g. all of the ones here), and the same seems to be true of the COCA family of corpora

I would like to see this behave as similarly as possible across Germanic languages, though of course not at all costs :)

nschneid · 2024-05-30T23:32:47Z

Somehow it seems we missed "none" (and, as noted in the PTB tag guidelines, "naught"). Will add these to the PRON table with PronType=Neg.

nschneid · 2024-05-30T23:46:26Z

@dan-zeman points out that PronType should apply to grammatical adverbs (pro-adverbs). We use it already for WH-adverbs and here and there. What else should be added? I am thinking of:

PronType=Neg: never, nowhere, neither
PronType=Tot: always, everywhere
PronType=Ind: sometime(s), someplace, somewhere, anytime, anyplace, anywhere, ever, either
PronType=Dem: now, then

@amir-zeldes thoughts on the above list? https://en.wikipedia.org/wiki/Pro-form is useful, though I'm not sure we want to start dealing with "however", "therefore", and so on.
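To make the proposal concrete, the list could be applied as a simple lookup keyed on lemma and restricted to tokens tagged ADV. This is only a sketch of the proposal as written above, not the annotation that was eventually adopted; the names `PROPOSED_PRONTYPE` and `proposed_prontype` are invented for the example.

```python
# The proposed assignments, copied from the list above.
PROPOSED_PRONTYPE = {
    "Neg": ["never", "nowhere", "neither"],
    "Tot": ["always", "everywhere"],
    "Ind": ["sometime", "sometimes", "someplace", "somewhere",
            "anytime", "anyplace", "anywhere", "ever", "either"],
    "Dem": ["now", "then"],
}

# Invert to lemma -> PronType value for direct lookup.
LEMMA_TO_PRONTYPE = {
    lemma: value
    for value, lemmas in PROPOSED_PRONTYPE.items()
    for lemma in lemmas
}


def proposed_prontype(lemma: str, upos: str):
    """Return the proposed PronType for a pro-adverb, or None if not covered."""
    if upos != "ADV":
        return None  # the proposal concerns grammatical adverbs only
    return LEMMA_TO_PRONTYPE.get(lemma.lower())


print(proposed_prontype("never", "ADV"))
print(proposed_prontype("either", "CCONJ"))
```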

amir-zeldes · 2024-06-13T17:05:58Z

I think that mostly makes sense; for "either" and "neither", I'm not sure it should apply in determiner/preconj usage, since they're not exactly pronouns there. But this is all mainly useful if other languages implement this as well. For 'therefore' and 'however' in the discourse use I think they are probably no longer perceived as pronominal, even if they are etymologically.

nschneid · 2024-06-18T15:19:47Z

> for "either" and "neither", I'm not sure it should apply in determiner/preconj usage, since they're not exactly pronouns there

I would expect PronType to accord with the UPOS. At present preconj "either" and "neither" are tagged CCONJ, so let's not give them a PronType. DETs do receive PronTypes though, as documented previously: https://universaldependencies.org/en/pos/DET.html

TBC, I listed "(n)either" above for the ADV uses ("I don't want a sandwich, either").

(I keep having to remind myself that "PronType" is a misnomer; it actually covers all pro-forms.)

amir-zeldes · 2024-06-18T15:48:44Z

Yeah, I think ProType would have been better! In any case, let me know what you want to do and I'll match it for GU corpora, this all sounds fine to me.

nschneid · 2024-06-18T21:38:51Z

OK how about these guidelines: https://universaldependencies.org/en/pos/ADV.html


nschneid · 2024-06-22T18:31:45Z

> OK how about these guidelines: https://universaldependencies.org/en/pos/ADV.html

Implemented in EWT! (modulo some existing PronType=Int annotations that should be PronType=Rel)

AngledLuffa · 2024-06-25T05:21:45Z

So we should update none in PUD to be PRON with PronType=Neg?

(among other changes)

AngledLuffa · 2024-06-25T07:05:48Z

anything to be done for however? that was left out of the EWT updates

anyway?

any_ADV, there_PRON left blank?

nschneid · 2024-06-25T23:36:33Z

> anything to be done for however? that was left out of the EWT updates
>
> anyway?

These are both mainly discourse connectives, so I'm not sure they need a PronType.

> any_ADV, there_PRON left blank?

there_PRON: for expletive "there" I'm not sure if any of the PronType values would be a good fit. This is documented at https://universaldependencies.org/en/pos/PRON.html#expletive-there

any_ADV: "any" is normally DET. I see "any/ADV longer/ADV" and similar; not sure this is actually correct. Also "it doesn't hurt any/ADV" (= at all). Could these be DET attaching as advmod? Feels related to "some/DET 540,000 men". Curious to hear @amir-zeldes's take when he's back from vacation.

AngledLuffa · 2024-06-25T23:54:43Z

> however
>
> mainly discourse connectives

Agreed that the discourse versions are fine w/o. They are not always discourse, though, especially however:

# sent_id = email-enronsent24_01-0036
# text = My goal, however optimistic, is to execute the risk policy by the end of today.
4       however however ADV     RB      _       5       advmod  5:advmod        _
5       optimistic      optimistic      ADJ     JJ      Degree=Pos      2       amod    2:amod  SpaceAfter=No

# sent_id = email-enronsent24_01-0093
# text = My goal, however optimistic, is to execute the risk policy by the end of today.

# sent_id = reviews-332105-0004
# text = I will reccommend his services however/whenever possible!
6       however however ADV     WRB     PronType=Int    3       advmod  3:advmod|9:advmod       SpaceAfter=No
7       /       /       SYM     SYM     _       8       cc      8:cc    SpaceAfter=No
8       whenever        whenever        ADV     WRB     PronType=Rel    6       conj    3:advmod|6:conj|9:advmod        _

(those are the only ones I saw for however)

nschneid · 2024-06-26T02:10:37Z

Technically you're right, the "however optimistic" ones should be PronType=Int. I suppose these are just uses of "however" that modify a non-predicate ADJ or ADV.

EWT query
GUM query
- plus an instance of "however our judgments might differ" that I believe is similar
PUD: (no results)

"however/whenever possible": as "however" is the first item in coordination I suppose it should be the head of the free relative

AngledLuffa · 2024-06-26T02:12:45Z

> Technically you're right

(insert satisfied seal meme here)

nschneid · 2024-06-26T02:28:55Z

Aha, apparently "however" receives a different xpos: RB for the discourse connective use and WRB for the interrogative or relative use! (This is documented in the PTB tagging guidelines.) So we can require PronType conditional on that.
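A conditional requirement like that could be checked roughly as follows. This is a sketch, not the actual validator rule; the function name `however_prontype_problem` and the message strings are assumptions, and the input is taken to be the 10 tab-separated fields of one CoNLL-U token row.

```python
def however_prontype_problem(cols):
    """Return a message if 'however' has an xpos/PronType mismatch, else None."""
    lemma, xpos, feats = cols[2], cols[4], cols[5]
    if lemma != "however":
        return None
    has_prontype = any(f.startswith("PronType=") for f in feats.split("|"))
    if xpos == "WRB" and not has_prontype:
        return "WRB 'however' lacks PronType"
    if xpos == "RB" and has_prontype:
        return "RB 'however' should not carry PronType"
    return None


# Rows quoted earlier in this thread, with tab separators restored.
rb_row = "4\thowever\thowever\tADV\tRB\t_\t5\tadvmod\t5:advmod\t_".split("\t")
wrb_row = ("6\thowever\thowever\tADV\tWRB\tPronType=Int\t3\tadvmod"
           "\t3:advmod|9:advmod\tSpaceAfter=No").split("\t")
print(however_prontype_problem(rb_row))
print(however_prontype_problem(wrb_row))
```

A check of this kind only enforces consistency between the two columns. It would not by itself catch the "however optimistic" rows above, which are tagged RB and so would first need to be retagged.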


