Possessive pronouns: consistency across annotations #293

Closed
3 tasks done
nschneid opened this issue Jan 15, 2022 · 22 comments

Comments

@nschneid
Contributor

nschneid commented Jan 15, 2022

A variety of issues including typos like "it's" for "its", incorrect XPOS/feats for possessive "her", miscellaneous pronouns like "thy" lacking proper feats (cf. #230), etc.

@nschneid
Contributor Author

For possessive pronoun "yo" as in "yo mama", I'm having trouble deciding on Style between Colloquial, Vernacular, and Slang.

@nschneid
Contributor Author

OK I've decided that in the context of African-American language, "yo" would be Vernacular, but it's been borrowed into wider use as Slang in the expression "yo mama".

nschneid added a commit that referenced this issue Jan 15, 2022
nschneid added a commit that referenced this issue Jan 15, 2022
nschneid added a commit that referenced this issue Jan 16, 2022
@nschneid
Contributor Author

nschneid commented Jan 16, 2022

Remaining issues:

  • Independent genitive personal pronouns retain case in the lemma. Should they? E.g. ours (PTB: PRP) does not lemmatize as we, but our does.
  • Do features provide a way to distinguish these independent forms from the attributive forms? Both are Poss=Yes and PronType=Prs.
  • WH-pronouns whom (WP) and whose (WP$): should the lemma be case-normalized to who? I don't see why not. And if so I suppose the lemma of whomever (WP) should be whoever.
  • WH-pronouns need to be manually reviewed for PronType=Rel vs. PronType=Int.
    • e.g., to whom will the letter be sent? should be Int, the individual or entity to whom they are addressed should be Rel
    • I checked all the "whose" cases: all EWT tokens should be Rel (no examples like "Whose book is this?"). GUM has an Int example that should be Rel.

@dan-zeman Guessing you'll have opinions here

@nschneid
Contributor Author

nschneid commented Jan 16, 2022

OK the lemma situation is weirder than I thought: the non-WH attributive genitives in EWT are all case-normalized except "my" for some reason. In GUM none of them are, but the accusative forms are.

I think the simplest policy would be to case-normalize all pronouns to the nominative, including both attributive and independent genitives. Any objections?
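
A minimal sketch of how the form-to-lemma inconsistency described above could be audited. It assumes plain CoNLL-U input and selects tokens by `UPOS=PRON` plus `Poss=Yes`; the function names and the five-token sample are invented for illustration and are not taken from EWT or GUM.

```python
from collections import defaultdict


def parse_feats(feats):
    """Turn 'Poss=Yes|PronType=Prs' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def possessive_lemmas(conllu_text):
    """Map each lowercased possessive-pronoun form to the set of lemmas it receives."""
    lemmas = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank separators and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        form, lemma, upos = cols[1], cols[2], cols[3]
        feats = parse_feats(cols[5])
        if upos == "PRON" and feats.get("Poss") == "Yes":
            lemmas[form.lower()].add(lemma)
    return dict(lemmas)


def self_lemmatized(lemmas):
    """Forms whose only lemma is the form itself, i.e. not case-normalized."""
    return sorted(form for form, ls in lemmas.items() if ls == {form})


def row(idx, form, lemma, upos, xpos, feats, head, deprel):
    """Build one 10-column CoNLL-U token line (DEPS and MISC left empty)."""
    return "\t".join([str(idx), form, lemma, upos, xpos, feats,
                      str(head), deprel, "_", "_"])


# Invented sample mirroring the situation described: "our" => we, but "my" => my.
SAMPLE = "\n".join([
    "# text = My dog saw our cat",
    row(1, "My", "my", "PRON", "PRP$",
        "Number=Sing|Person=1|Poss=Yes|PronType=Prs", 2, "nmod:poss"),
    row(2, "dog", "dog", "NOUN", "NN", "Number=Sing", 3, "nsubj"),
    row(3, "saw", "see", "VERB", "VBD", "Mood=Ind|Tense=Past|VerbForm=Fin", 0, "root"),
    row(4, "our", "we", "PRON", "PRP$",
        "Number=Plur|Person=1|Poss=Yes|PronType=Prs", 5, "nmod:poss"),
    row(5, "cat", "cat", "NOUN", "NN", "Number=Sing", 3, "obj"),
])

if __name__ == "__main__":
    table = possessive_lemmas(SAMPLE)
    print(table)
    print(self_lemmatized(table))
```

Running the same two functions over a whole treebank file would list which possessive forms are case-normalized and which are not; a form that maps to more than one lemma would also show up as a set with several members.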

@amir-zeldes
Contributor

I somehow have a feeling we've already discussed it, and I thought splitting off the determiners' lemmas was the result of that discussion... But if not, here are some arguments why I think the determiners should have separate lemmas (but I fully agree with them -> they):

  • The determiners are not historically genitive forms of the pronouns (they correspond to Latin "meus, meo", not "ego, mihi")
  • The determiners have their own lemmas and full paradigms, incl. case in the other Germanic UD languages (German: mich -> ich = me -> I, and mein(er|e|es) -> mein); all things being equal I think English should do things the same as German, Dutch etc., unless there is a strong reason not to.
  • The independent forms can serve in any case form, indicating that they are not genitive forms either: "we both have cats; yours/NOM has met mine/ACC"
  • In colloquial speech under coordination, 's genitives are compatible with a coordinate true pronoun, e.g. "me and John's cat", whereas "my and John's cat" is dispreferred (but should be fine IMO if "John's" and "my" were both truly genitives); admittedly the existence of both forms makes this particular argument weaker than the rest
  • One of the most popular English lemmatizers of the past two decades, TreeTagger, lemmatized "my" to "my", leading to this lemmatization behavior being present in a lot of corpora (e.g. all of the ones here), and the same seems to be true of the COCA family of corpora

I can't remember whether we discussed this in an e-mail chain or one of the repos, but I feel like the distinct lemmatization of the determiner possessives was already hashed out somewhere. There may have been other arguments, but for me it just seems like the determiners are a separate paradigm (in languages like German this is very clear, since they have all case forms themselves).

@amir-zeldes
Contributor

Oh and about the other things:

  • Yes, some feature to distinguish "mine" might be nice
  • whom -> who sounds right (it's literally just the old accusative), and by analogy whomever -> whoever
  • whose is indeed the historical genitive of who, so I'm for lemmatizing to the nominative (and UD German does so as well)

@nschneid
Contributor Author

Thanks. While the etymology and analogies to German etc. are interesting, I don't think we should be bound by those. (The fact that "whose" is historically genitive but "my" is not shouldn't matter for synchronic analysis because they both serve the same possessive function.) In any case, I am mainly interested, on a practical level, in what users of English corpora will expect from lemmas.

To me it is very confusing to have me => I and our => we but my => my. Would people be surprised by lemmatizing only to remove accusative case (in personal and WH pronouns, converting them to nominative), and leaving all other pronouns alone (both kinds of possessives, reflexives)?

Would people be surprised if we dropped case normalization altogether? I honestly don't see a compelling need to relate me and I by assigning them the same lemma, given that their similarities and differences are made precise by features. And these pronouns are extremely frequent so there's no sparsity issue.

Also, it is a bit surprising to me that we currently normalize case but not number in pronouns, given that we normalize number in nouns. But saying that pronouns as closed-class items have no inflectional normalization at all in the lemma would be a simple enough policy. (For English, given the small size of these paradigms.)

@amir-zeldes
Contributor

Not sure I understand - I would definitely also lemmatize our -> our, like my. I vs. me is typical nom/acc, why not lemmatize to I? And what related languages do is not irrelevant, if we want to use UD for language comparison.

@nschneid
Contributor Author

nschneid commented Jan 16, 2022

As I understand it UD doesn't really attempt to standardize lemmas across languages, though. The features and deps are the crosslinguistic interface.

Since nom/acc doesn't occur in English outside of a few pronouns, do we really need to normalize it in the lemma? And if we did normalize it, why not also normalize number, so us => I?

@nschneid
Contributor Author

nschneid commented Jan 16, 2022

Oh and FWIW currently we have these => this and those => that. So number is normalized in nouns and determiners and demonstrative pronouns but not other pronouns.

@amir-zeldes
Contributor

It's relevant in that some researchers will use the treebanks to get stats on how many pronominal lemmas a language has, how many inflected forms each pron lemma has on avg, per lang etc. I'm not saying it's the only consideration, but if we can get the Germanic languages to behave the same and nothing speaks against it, then I would do it that way
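
To make the kind of statistic mentioned here concrete, the following is a rough sketch of how the lemma policy changes the counts. Both form-to-lemma tables are hypothetical illustrations of the two positions in this thread, restricted to first-person forms; they are not the actual EWT or GUM annotation.

```python
def paradigm_stats(form_to_lemma):
    """Return (number of lemmas, mean number of distinct forms per lemma)."""
    forms_by_lemma = {}
    for form, lemma in form_to_lemma.items():
        forms_by_lemma.setdefault(lemma, set()).add(form)
    n_lemmas = len(forms_by_lemma)
    mean_forms = sum(len(forms) for forms in forms_by_lemma.values()) / n_lemmas
    return n_lemmas, mean_forms


# Hypothetical policy A: possessives keep their own lemmas,
# only the accusative is normalized to the nominative.
SEPARATE_POSSESSIVES = {
    "I": "I", "me": "I", "my": "my", "mine": "mine",
    "we": "we", "us": "we", "our": "our", "ours": "ours",
}

# Hypothetical policy B: every form is normalized to the nominative.
ALL_TO_NOMINATIVE = {
    "I": "I", "me": "I", "my": "I", "mine": "I",
    "we": "we", "us": "we", "our": "we", "ours": "we",
}

if __name__ == "__main__":
    print(paradigm_stats(SEPARATE_POSSESSIVES))
    print(paradigm_stats(ALL_TO_NOMINATIVE))
```

With the same eight forms, the first table yields six lemmas and the second yields two, so cross-treebank comparisons of lemma inventories are only meaningful if the treebanks follow the same policy.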

@rueter

rueter commented Jan 16, 2022

It's interesting to watch how languages of minimal morphological variation are difficult to deal with, probably due to the irregularity.
I was hoping to see parallels drawn:
Who, whom, whose, whose
He, him, his, his
She, her, her, hers
I, me, my (!mine eyes have seen...), mine
I'm not really sure this is automatically clear for some native monolinguals, though, ...
I found the determiner with me and John's dog interesting in that the word ordering already tells me I should avoid it in good style. So how about him and John's dog vs his and John's dog or even John and Paul's cat. Now we do say John and Paul's, which answers the question Whose?
This would tell us why his/my and John's doesn't sound right.
The fourth column with possessive pronouns allows for: the dog is John's and mine

Suppletion is always interesting. The adjective forms good, better and best are now being lemmatized as "good".

Do we want parallels? They might help translation machines.

Some languages actually have the analogical us >> I.
But these are languages where paradigms regular and irregular are everyday things. (Uralic languages, Scandinavian, at least).

@nschneid
Contributor Author

I think coordination of pronouns and possessives is a murky area independent of what counts as genitive. The 's clitic can apply to an entire coordinated phrase or to one of its elements. I follow the rule that there's a semantic distinction between "John's and Mary's articles" (separate) and "John and Mary's articles" (joint), but it's the sort of thing that is taught prescriptively which suggests variation among native speakers. There are also people who avoid "you and me" at all costs and say "you and I" even as an object, but I wouldn't say that it warrants categorizing "I" as accusative; it's just a peculiarity of patterns of coordination.

@nschneid
Contributor Author

But I take @rueter's point that, going by similarity of forms (despite some suppletion where nominatives stand out), person+gender+number seems to be the primary axis of distinguishing personal pronouns, and case and possessiveness are secondary.

@LarsAhrenberg

I support @amir-zeldes and believe it would be nice if all Germanic treebanks could agree on a common approach.

The UD guidelines on lemmas say: "Except perhaps in rare cases of suppletion, one form should be chosen as the lemma of a verb, noun, determiner, or pronoun paradigm." So the question, I suppose, boils down to what a pronoun paradigm is. Should the possessives be included or not? For example, at present, the Swedish treebanks and Norwegian_Bokmaal do it differently: Swedish uses the nominative form also for the possessives, while N_Bokmaal has a separate lemma for the possessive. This seems quite unnecessary. While there seem to be linguistic arguments for both alternatives, following a common principle in this case would increase the usefulness of the Germanic treebanks as a whole.

@dan-zeman
Member

> • Independent genitive personal pronouns retain case in the lemma. Should they? E.g. ours (PTB: PRP) does not lemmatize as we, but our does.

I agree with others that it would be useful if it is consistent across at least the English treebanks, but unless there are strong reasons for an English-specific treatment, then also across the Germanic branch. The treatment of my and our should not differ. I tend to prefer them being tagged DET in an analogy with German; then the lemma of my would be my, not I; but also the lemma of our would be our, not we. I would then not use the Case feature at all. However, I admit that the adjective-like behavior has somehow faded in English (in comparison to German), so it would be possible to say instead that they are PRON in genitive (Case=Gen would be used), then their lemma should be the nominative form, and if we also normalize Number (I'm not sure what is the consensus in English but I would do it), then the lemma I would cover the forms I, me, my, we, us, our.

I tend to agree with @amir-zeldes that the independent form ours is not a genitive. Whether and how it is distinguished from our depends on whether our is a genitive pronoun or rather a caseless determiner. In the latter case I could even imagine lemmatizing both forms to our; although that would mean that ours would also be DET, and I assume it might be more acceptable for people to say it's PRON.
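
The alternatives weighed above can be laid out side by side. This is only a sketch: the policy names, the `analyze` helper, and the choice to list just `Poss` and `PronType` next to `Case` are assumptions made for illustration, not an agreed guideline.

```python
# Nominative lemma for each attributive possessive (case normalization).
NOMINATIVE = {"my": "I", "our": "we"}
# Further collapse of number, so that one lemma covers I, me, my, we, us, our.
NUMBER_NORMALIZED = {"I": "I", "we": "I"}


def analyze(form, policy):
    """Return the UPOS, lemma, and features a possessive would get under a policy."""
    form = form.lower()
    if form not in NOMINATIVE:
        raise ValueError(f"not covered by this sketch: {form}")
    if policy == "det":
        # Caseless determiner: lemmatized to itself, no Case feature.
        return {"upos": "DET", "lemma": form,
                "feats": {"Poss": "Yes", "PronType": "Prs"}}
    if policy in ("pron_gen", "pron_gen_number"):
        # Genitive pronoun: lemma is the nominative form, Case=Gen is used.
        lemma = NOMINATIVE[form]
        if policy == "pron_gen_number":
            lemma = NUMBER_NORMALIZED[lemma]
        return {"upos": "PRON", "lemma": lemma,
                "feats": {"Case": "Gen", "Poss": "Yes", "PronType": "Prs"}}
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for policy in ("det", "pron_gen", "pron_gen_number"):
        print(policy, analyze("our", policy))
```

Under any one of these policies "my" and "our" are treated alike, which is the consistency requirement stated above; the policies differ only in which lemma and which features the pair receives.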

@dan-zeman
Member

> As I understand it UD doesn't really attempt to standardize lemmas across languages

It didn't attempt to standardize them because it seemed too hard a task when we had the plate full with other issues; but I don't think it would be wrong to attempt it. The more annotation approaches we can make similar across languages (especially related languages), the better!

@dan-zeman
Member

> I somehow have a feeling we've already discussed it

I have it, too, but no idea where exactly to search for it. This is one reason why I hate seeing guidelines discussion (including language-specific guidelines) outside the main docs issue tracker. Folks, this place is about EWT. Guidelines should be sorted out in docs, and the issues here should then only discuss what should be fixed in EWT in order to match the guidelines.

@amir-zeldes
Contributor

> This is one reason why I hate seeing guidelines discussion (including language-specific guidelines) outside the main docs issue tracker. Folks, this place is about EWT

I definitely sympathize with not being able to find things and keeping general discussions in docs, but this is (started out as?) an English-specific issue, and I guess EWT being the first UD English dataset, it is often taken as the default corpus for English.

For me (and I'm guessing many others) it would be fairly surprising to lemmatize "our" to "I", and specifically there is prior art in English lemmatization practices to keep person forms distinct, as well as for lemmatizing the possessives to themselves. I don't mind making them upos=DET, although I will point out that doing so is maybe a little incongruent with attaching them as nmod:poss (which is maybe a mistake, but it is the status quo).

In terms of consistency within Germanic, I think there is no option of lemmatizing them to "I" in German etc., because they inflect with a full paradigm (deu. meiner, meinen, meinem, meines). I may be influenced by historical factors here, but I just don't see these things as genitives (and they distribute very differently from the proper English genitive NPs in 's), and it's not just a question of comparison to German, since we could also have historical data in English where it would be nice to have consistency (the Penn Parsed Historical Corpora tag these as pronouns, and of course in Old English we still have min.NOM, mine.ACC, mines.GEN as in German).

@dan-zeman
Member

> > This is one reason why I hate seeing guidelines discussion (including language-specific guidelines) outside the main docs issue tracker. Folks, this place is about EWT
>
> I definitely sympathize with not being able to find things and keeping general discussions in docs, but this is (started out as?) an English-specific issue, and I guess EWT being the first UD English dataset, it is often taken as the default corpus for English.

Even if EWT were the only English treebank in UD, I would prefer its issue tracker to be limited to bugs in EWT, while English-specific guidelines would be discussed at the repo where they are documented, i.e., docs. (Also, as happened here, guidelines discussions easily jump to examples from other languages sooner or later.) Obviously there is even stronger motive to go to docs since EWT is no longer the only English treebank, but one of nine in the latest release. It is even possible that users responsible for the other English treebanks are watching the docs repository but not EWT; and if this is the case, then it is bad because ideally all English treebanks should converge to the same set of guidelines.

@nschneid
Contributor Author

UniversalDependencies/docs#517 so that folks not following English treebanks specifically will be able to participate.

I think we need a live discussion though.

(But discussions that seem to be about fine-grained EWT or EWT+GUM things have a way of turning into broader guidelines questions, alas....)

@nschneid
Contributor Author

nschneid commented Oct 8, 2022

See resolution at UniversalDependencies/docs#517


5 participants