Possessive pronouns: consistency across annotations #293

nschneid · 2022-01-15T17:54:42Z

A variety of issues including typos like "it's" for "its", incorrect XPOS/feats for possessive "her", miscellaneous pronouns like "thy" lacking proper feats (cf. #230), etc.

- tagged PRP$ but not syntactically possessive (most are incorrectly tagged "her"; some are valid)
- syntactically but not morphologically possessive
- "it's" tagged as PRP$ but missing typo annotation
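
The three checks above can be approximated with a small script. This is only a minimal sketch under stated assumptions, not the queries actually used for this issue: it assumes "syntactically possessive" means attached as nmod:poss, "morphologically possessive" means Poss=Yes, and the typo annotation is Typo=Yes. The function names and the sample rows are invented for illustration and are not taken from EWT.

```python
def parse_tokens(conllu_text):
    """Yield a dict for each plain token line in CoNLL-U text."""
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges (1-2), empty nodes (1.1).
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        feats = {}
        if cols[5] != "_":
            feats = dict(pair.split("=", 1) for pair in cols[5].split("|"))
        yield {"form": cols[1], "upos": cols[3], "xpos": cols[4],
               "feats": feats, "deprel": cols[7]}


def flag_possessive_candidates(conllu_text):
    """Return (form, reason) pairs for a human to review (some hits are valid)."""
    flagged = []
    for tok in parse_tokens(conllu_text):
        tagged_poss = tok["xpos"] == "PRP$"
        attached_poss = tok["deprel"] == "nmod:poss"  # assumed syntactic test
        if tagged_poss and not attached_poss:
            flagged.append((tok["form"], "PRP$ without nmod:poss"))
        if (tagged_poss and tok["form"].lower() == "it's"
                and tok["feats"].get("Typo") != "Yes"):
            flagged.append((tok["form"], "it's as PRP$ without Typo=Yes"))
        if (attached_poss and tok["upos"] == "PRON"
                and tok["feats"].get("Poss") != "Yes"):  # assumed morphological test
            flagged.append((tok["form"], "nmod:poss without Poss=Yes"))
    return flagged


# Invented rows (not a real sentence); lemma values are placeholders.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1", "her", "she", "PRON", "PRP$", "Poss=Yes|PronType=Prs", "2", "obj", "_", "_"),
    ("2", "it's", "its", "PRON", "PRP$", "Poss=Yes|PronType=Prs", "3", "nmod:poss", "_", "_"),
    ("3", "you", "you", "PRON", "PRP", "PronType=Prs", "4", "nmod:poss", "_", "_"),
    ("4", "its", "its", "PRON", "PRP$", "Poss=Yes|PronType=Prs", "5", "nmod:poss", "_", "_"),
])

for form, reason in flag_possessive_candidates(SAMPLE):
    print(form, "->", reason)
```

The output is a list of candidates rather than errors, since some PRP$ tokens outside nmod:poss are valid, as noted above.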

nschneid · 2022-01-15T19:54:02Z

For possessive pronoun "yo" as in "yo mama", I'm having trouble deciding on Style between Colloquial, Vernacular, and Slang.

nschneid · 2022-01-15T20:45:14Z

OK I've decided that in the context of African-American language, "yo" would be Vernacular, but it's been borrowed into wider use as Slang in the expression "yo mama".

nschneid · 2022-01-16T00:39:21Z

Remaining issues:

- Independent genitive personal pronouns retain case in the lemma. Should they? E.g. ours (PTB: PRP) does not lemmatize as we, but our does.
- Do features provide a way to distinguish these independent forms from the attributive forms? Both are Poss=Yes and PronType=Prs.
- WH-pronouns whom (WP) and whose (WP$): should the lemma be case-normalized to who? I don't see why not. And if so I suppose the lemma of whomever (WP) should be whoever.
- WH-pronouns need to be manually reviewed for PronType=Rel vs. PronType=Int.
  - e.g., to whom will the letter be sent? should be Int, the individual or entity to whom they are addressed should be Rel
  - I checked all the "whose" cases: all EWT tokens should be Rel (no examples like "Whose book is this?"). GUM has an Int example that should be Rel.

@dan-zeman Guessing you'll have opinions here

nschneid · 2022-01-16T01:00:08Z

OK the lemma situation is weirder than I thought: the non-WH attributive genitives in EWT are all case-normalized except "my" for some reason. In GUM none of them are, but the accusative forms are.

I think the simplest policy would be to case-normalize all pronouns to the nominative, including both attributive and independent genitives. Any objections?
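
An inconsistency like the one described here (the same pronoun form lemmatized differently within or across treebanks) can be surfaced with a quick audit. The following is a rough sketch, not the procedure actually used: it assumes the pronouns of interest carry the XPOS tags PRP, PRP$, WP, or WP$, and the function name and sample rows are invented for illustration.

```python
from collections import defaultdict

PRONOUN_XPOS = {"PRP", "PRP$", "WP", "WP$"}  # assumed set of tags to audit


def lemma_variants(conllu_text):
    """Map (lowercased form, xpos) to its lemmas, keeping only conflicting ones."""
    seen = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Skip comments, blank lines, multiword ranges, and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        form, lemma, xpos = cols[1].lower(), cols[2], cols[4]
        if xpos in PRONOUN_XPOS:
            seen[(form, xpos)].add(lemma)
    return {key: sorted(lemmas) for key, lemmas in seen.items() if len(lemmas) > 1}


# Invented rows standing in for tokens pooled from one or more treebanks.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1", "my", "my", "PRON", "PRP$", "Poss=Yes|PronType=Prs", "2", "nmod:poss", "_", "_"),
    ("1", "My", "I", "PRON", "PRP$", "Poss=Yes|PronType=Prs", "2", "nmod:poss", "_", "_"),
    ("1", "our", "we", "PRON", "PRP$", "Poss=Yes|PronType=Prs", "2", "nmod:poss", "_", "_"),
    ("1", "Our", "we", "PRON", "PRP$", "Poss=Yes|PronType=Prs", "2", "nmod:poss", "_", "_"),
])

print(lemma_variants(SAMPLE))
```

Whichever lemma policy is chosen, an empty result from this kind of audit would indicate that each form is at least lemmatized uniformly.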

amir-zeldes · 2022-01-16T19:47:38Z

I somehow have a feeling we've already discussed it, and I thought splitting off the determiners' lemmas was the result of that discussion... But if not, here are some arguments why I think the determiners should have separate lemmas (but I fully agree with them -> they):

- The determiners are not historically genitive forms of the pronouns (they correspond to Latin "meus, meo", not "ego, mihi")
- The determiners have their own lemmas and full paradigms, incl. case in the other Germanic UD languages (German: mich -> ich = me -> I, and mein(er|e|es) -> mein); all things being equal I think English should do things the same as German, Dutch etc., unless there is a strong reason not to.
- The independent forms can serve in any case form, indicating that they are not genitive forms either: "we both have cats; yours/NOM has met mine/ACC"
- In colloquial speech under coordination, 's genitives are compatible with a coordinate true pronoun, e.g. "me and John's cat", whereas "my and John's cat" is dispreferred (but should be fine IMO if "John's" and "my" were both truly genitives); admittedly the existence of both forms makes this particular argument weaker than the rest
- One of the most popular English lemmatizers of the past two decades, TreeTagger, lemmatized "my" to "my", leading to this lemmatization behavior being present in a lot of corpora (e.g. all of the ones here), and the same seems to be true of the COCA family of corpora

I can't remember whether we discussed this in an e-mail chain or one of the repos, but I feel like the distinct lemmatization of the determiner possessives was already hashed out somewhere. There may have been other arguments, but for me it just seems like the determiners are a separate paradigm (in languages like German this is very clear, since they have all case forms themselves).

amir-zeldes · 2022-01-16T19:50:08Z

Oh and about the other things:

- Yes, some feature to distinguish "mine" might be nice
- whom -> who sounds right (it's literally just the old accusative), and by analogy whomever -> whoever
- whose is indeed the historical genitive of who, so I'm for lemmatizing to the nominative (and UD German does so as well)

nschneid · 2022-01-16T20:20:37Z

Thanks. While the etymology and analogies to German etc. are interesting, I don't think we should be bound by those. (The fact that "whose" is historically genitive but "my" is not shouldn't matter for synchronic analysis because they both serve the same possessive function.) In any case, I am mainly interested, on a practical level, in what users of English corpora will expect from lemmas.

To me it is very confusing to have me => I and our => we but my => my. Would people be surprised by lemmatizing only to remove accusative case (in personal and WH pronouns, converting them to nominative), and leaving all other pronouns alone (both kinds of possessives, reflexives)?

Would people be surprised if we dropped case normalization altogether? I honestly don't see a compelling need to relate me and I by assigning them the same lemma, given that their similarities and differences are made precise by features. And these pronouns are extremely frequent so there's no sparsity issue.

Also, it is a bit surprising to me that we currently normalize case but not number in pronouns, given that we normalize number in nouns. But saying that pronouns as closed-class items have no inflectional normalization at all in the lemma would be a simple enough policy. (For English, given the small size of these paradigms.)

amir-zeldes · 2022-01-16T20:35:32Z

Not sure I understand - I would definitely also lemmatize our -> our, like my. I vs. me is typical nom/acc, why not lemmatize to I? And what related languages do is not irrelevant, if we want to use UD for language comparison.

nschneid · 2022-01-16T20:46:08Z

As I understand it UD doesn't really attempt to standardize lemmas across languages, though. The features and deps are the crosslinguistic interface.

Since nom/acc doesn't occur in English outside of a few pronouns, do we really need to normalize it in the lemma? And if we did normalize it, why not also normalize number, so us => I?

nschneid · 2022-01-16T20:53:42Z

Oh and FWIW currently we have these => this and those => that. So number is normalized in nouns and determiners and demonstrative pronouns but not other pronouns.

amir-zeldes · 2022-01-16T21:19:53Z

It's relevant in that some researchers will use the treebanks to get stats on how many pronominal lemmas a language has, how many inflected forms each pron lemma has on avg, per lang etc. I'm not saying it's the only consideration, but if we can get the Germanic languages to behave the same and nothing speaks against it, then I would do it that way

rueter · 2022-01-16T21:50:05Z

It's interesting to watch how languages of minimal morphological variation are difficult to deal with, probably due to the irregularity.
I was hoping to see parallels drawn:
Who, whom, whose, whose
He, him, his, his
She, her, her, hers
I, me, my (!mine eyes have seen...), mine
I'm not really sure this is automatically clear for some native monolinguals, though, ...
I found the determiner with me and John's dog interesting in that the word ordering already tells me I should avoid it in good style. So how about him and John's dog vs his and John's dog or even John and Paul's cat. Now we do say John and Paul's, which answers the question Whose?
This would tell us why his/my and John's doesn't sound right.
The fourth column with possessive pronouns allows for: the dog is John's and mine

Suppletion is always interesting. The adjective forms good, better and best are now being lemmatized as "good".

Do we want parallels? They might help translation machines.

Some languages actually have the analogical us >> I.
But these are languages where paradigms regular and irregular are everyday things. (Uralic languages, Scandinavian, at least).

nschneid · 2022-01-16T22:46:54Z

I think coordination of pronouns and possessives is a murky area independent of what counts as genitive. The 's clitic can apply to an entire coordinated phrase or to one of its elements. I follow the rule that there's a semantic distinction between "John's and Mary's articles" (separate) and "John and Mary's articles" (joint), but it's the sort of thing that is taught prescriptively which suggests variation among native speakers. There are also people who avoid "you and me" at all costs and say "you and I" even as an object, but I wouldn't say that it warrants categorizing "I" as accusative; it's just a peculiarity of patterns of coordination.

nschneid · 2022-01-16T22:53:18Z

But I take @rueter's point that, going by similarity of forms (despite some suppletion where nominatives stand out), person+gender+number seems to be the primary axis of distinguishing personal pronouns, and case and possessiveness are secondary.

LarsAhrenberg · 2022-01-17T10:05:42Z

I support @amir-zeldes and believe it would be nice if all Germanic treebanks could agree on a common approach.

The UD guidelines on lemmas say: "Except perhaps in rare cases of suppletion, one form should be chosen as the lemma of a verb, noun, determiner, or pronoun paradigm." So the question I suppose boils down to what a pronoun paradigm is. Should the possessives be included or not? For example, at present, the Swedish treebanks and Norwegian_Bokmaal do it differently, where Swedish uses the nominative form also for the possessives, while N_Bokmaal has a separate lemma for the possessive. This seems quite unnecessary. While there seem to be linguistic arguments for both alternatives, following a common principle in this case would increase the usefulness of the Germanic treebanks as a whole.

dan-zeman · 2022-01-28T17:53:47Z

> Independent genitive personal pronouns retain case in the lemma. Should they? E.g. ours (PTB: PRP) does not lemmatize as we, but our does.

I agree with others that it would be useful if it is consistent across at least the English treebanks, but unless there are strong reasons for an English-specific treatment, then also across the Germanic branch. The treatment of my and our should not differ. I tend to prefer them being tagged DET in an analogy with German; then the lemma of my would be my, not I; but also the lemma of our would be our, not we. I would then not use the Case feature at all. However, I admit that the adjective-like behavior has somehow faded in English (in comparison to German), so it would be possible to say instead that they are PRON in genitive (Case=Gen would be used), then their lemma should be the nominative form, and if we also normalize Number (I'm not sure what is the consensus in English but I would do it), then the lemma I would cover the forms I, me, my, we, us, our.

I tend to agree with @amir-zeldes that the independent form ours is not a genitive. Whether and how it is distinguished from our depends on whether our is a genitive pronoun or rather a caseless determiner. In the latter case I could even imagine lemmatizing both forms to our; although that would mean that ours would also be DET, and I assume it might be more acceptable for people to say it's PRON.
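
To make the alternatives for my/our concrete, here is a small illustrative sketch. It only restates the options weighed above and is not the resolution adopted later; the policy names ("det", "pron_gen", "pron_gen_number") and the function are hypothetical labels introduced for this example.

```python
# Attributive possessives and the nominative pronoun each would map to.
POSSESSIVE_TO_NOMINATIVE = {"my": "I", "our": "we"}
# Extra step if Number is normalized in the lemma as well.
PLURAL_TO_SINGULAR = {"we": "I"}

POLICIES = ("det", "pron_gen", "pron_gen_number")  # hypothetical labels


def analyze_possessive(form, policy):
    """Return the UPOS, lemma, and Case value a policy assigns to my/our."""
    form = form.lower()
    if form not in POSSESSIVE_TO_NOMINATIVE:
        raise ValueError(f"not covered by this sketch: {form!r}")
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    if policy == "det":
        # Caseless determiner with its own lemma, by analogy with German.
        return {"upos": "DET", "lemma": form, "case": None}
    # Genitive pronoun whose lemma is the nominative form.
    lemma = POSSESSIVE_TO_NOMINATIVE[form]
    if policy == "pron_gen_number":
        lemma = PLURAL_TO_SINGULAR.get(lemma, lemma)
    return {"upos": "PRON", "lemma": lemma, "case": "Gen"}


for policy in POLICIES:
    print(policy, analyze_possessive("our", policy))
```

Under the third option the single lemma I covers I, me, my, we, us, and our, which is the outcome described above for the case where Number is also normalized.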

dan-zeman · 2022-01-28T17:57:29Z

> As I understand it UD doesn't really attempt to standardize lemmas across languages

It didn't attempt to standardize them because it seemed too hard a task when we had the plate full with other issues; but I don't think it would be wrong to attempt it. The more annotation approaches we can make similar across languages (especially related languages), the better!

dan-zeman · 2022-01-28T18:00:08Z

> I somehow have a feeling we've already discussed it

I have it, too, but no idea where exactly to search for it. This is one reason why I hate seeing guidelines discussion (including language-specific guidelines) outside the main docs issue tracker. Folks, this place is about EWT. Guidelines should be sorted out in docs, and the issues here should then only discuss what should be fixed in EWT in order to match the guidelines.

amir-zeldes · 2022-01-28T19:24:37Z

> This is one reason why I hate seeing guidelines discussion (including language-specific guidelines) outside the main docs issue tracker. Folks, this place is about EWT

I definitely sympathize with not being able to find things and keeping general discussions in docs, but this is (started out as?) an English-specific issue, and I guess EWT being the first UD English dataset, it is often taken as the default corpus for English.

For me (and I'm guessing many others) it would be fairly surprising to lemmatize "our" to "I", and specifically there is prior art in English lemmatization practices to keep person forms distinct, as well as for lemmatizing the possessives to themselves. I don't mind making them upos=DET, although I will point out that doing so is maybe a little incongruent with attaching them as nmod:poss (which is maybe a mistake, but it is the status quo).

In terms of consistency within Germanic, I think there is no option of lemmatizing them to "I" in German etc., because they inflect with a full paradigm (deu. meiner, meinen, meinem, meines). I may be influenced by historical factors here, but I just don't see these things as genitives (and they distribute very differently from the proper English genitive NPs in 's), and it's not just a question of comparison to German, since we could also have historical data in English where it would be nice to have consistency (the Penn Parsed Historical Corpora tag these as pronouns, and of course in Old English we still have min.NOM, mine.ACC, mines.GEN as in German).

dan-zeman · 2022-01-28T19:48:16Z

> > This is one reason why I hate seeing guidelines discussion (including language-specific guidelines) outside the main docs issue tracker. Folks, this place is about EWT
>
> I definitely sympathize with not being able to find things and keeping general discussions in docs, but this is (started out as?) an English-specific issue, and I guess EWT being the first UD English dataset, it is often taken as the default corpus for English.

Even if EWT were the only English treebank in UD, I would prefer its issue tracker to be limited to bugs in EWT, while English-specific guidelines would be discussed at the repo where they are documented, i.e., docs. (Also, as happened here, guidelines discussions easily jump to examples from other languages sooner or later.) Obviously there is even stronger motive to go to docs since EWT is no longer the only English treebank, but one of nine in the latest release. It is even possible that users responsible for the other English treebanks are watching the docs repository but not EWT; and if this is the case, then it is bad because ideally all English treebanks should converge to the same set of guidelines.

nschneid · 2022-01-28T19:48:39Z

UniversalDependencies/docs#517 so that folks not following English treebanks specifically will be able to participate.

I think we need a live discussion though.

(But discussions that seem to be about fine-grained EWT or EWT+GUM things have a way of turning into broader guidelines questions, alas....)

nschneid · 2022-10-08T21:43:49Z

See resolution at UniversalDependencies/docs#517

nschneid added a commit that referenced this issue Jan 15, 2022

possessive pronouns: mistagged 'her' and some typos (#293)

9dc489e

nschneid added a commit that referenced this issue Jan 15, 2022

more possessive pronoun typos (#293)

251a3d8

nschneid added a commit that referenced this issue Jan 15, 2022

'it's' for 'its' (#293)

df3dcbe

nschneid added a commit that referenced this issue Jan 15, 2022

archaic 2sg pronouns "thou" (+ "art"), "thy" (#293)

9edf51d

nschneid added a commit that referenced this issue Jan 15, 2022

yo (#293)

e687ec6

nschneid added a commit that referenced this issue Jan 15, 2022

possessive you/ur (#293)

a8f625a

nschneid added a commit that referenced this issue Jan 16, 2022

PRP vs. PRP$ (#293)

eaaf982

nschneid added a commit that referenced this issue Jan 16, 2022

'whose' lemma (#293), ye/ya/y'all, typos

4a9ffbe

nschneid added a commit that referenced this issue Jan 18, 2022

pronoun lemmas/xpos incl. 'let's', indep poss 'his' (#293)

e74ecfc

nschneid mentioned this issue Jan 18, 2022

Lemmas of English personal pronouns UniversalDependencies/docs#517

Open

nschneid added the pronouns/determiners/numbers label Jan 29, 2022

nschneid added this to the v2.11 milestone Oct 2, 2022

nschneid pushed a commit that referenced this issue Oct 12, 2022

Align personal pronouns with updated guidelines (#368): Case=Gen for …

eb83bb0

…dependent possessives; check consistency of xpos/feats incl. variant forms closes #293

nschneid closed this as completed Oct 12, 2022

nschneid mentioned this issue Dec 10, 2022

Documentation of lemmatization decisions in English corpora #131

Open

amir-zeldes mentioned this issue Nov 25, 2023

Incorrect independent pronoun lemmas UniversalDependencies/UD_English-Pronouns#10

Open
