Possessive pronouns: consistency across annotations #293
Comments
OK I've decided that in the context of African-American language, "yo" would be Vernacular, but it's been borrowed into wider use as Slang in the expression "yo mama". |
Remaining issues:
@dan-zeman Guessing you'll have opinions here |
OK the lemma situation is weirder than I thought: the non-WH attributive genitives in EWT are all case-normalized except "my" for some reason. In GUM none of them are, but the accusative forms are. I think the simplest policy would be to case-normalize all pronouns to the nominative, including both attributive and independent genitives. Any objections? |
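One way to make this kind of inconsistency visible is to tabulate which lemmas each pronoun form actually receives in a treebank. The following is only a minimal sketch, not the script used for EWT or GUM: it assumes standard 10-column CoNLL-U input, and the function names and the toy rows are hypothetical.

```python
from collections import defaultdict


def pronoun_lemma_table(conllu_text):
    """Map each lowercased PRON form to the set of lemmas assigned to it."""
    table = defaultdict(set)
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        form, lemma, upos = cols[1], cols[2], cols[3]
        if upos == "PRON":
            table[form.lower()].add(lemma)
    return table


def inconsistent_forms(table):
    """Return only the forms that were given more than one lemma."""
    return {form: sorted(lemmas) for form, lemmas in table.items() if len(lemmas) > 1}


# Hypothetical toy rows (not real treebank data): "my" gets two different lemmas.
TOY = "\n".join("\t".join(row) for row in [
    ["1", "my", "my", "PRON", "PRP$", "_", "2", "nmod:poss", "_", "_"],
    ["2", "book", "book", "NOUN", "NN", "_", "0", "root", "_", "_"],
    ["1", "My", "I", "PRON", "PRP$", "_", "2", "nmod:poss", "_", "_"],
    ["2", "our", "we", "PRON", "PRP$", "_", "3", "nmod:poss", "_", "_"],
])

print(inconsistent_forms(pronoun_lemma_table(TOY)))
```

Running the same two functions over each treebank's `.conllu` files would list every pronoun form whose lemma varies within that treebank.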
I somehow have a feeling we've already discussed it, and I thought splitting off the determiners' lemmas was the result of that discussion... But if not, here are some arguments why I think the determiners should have separate lemmas (but I fully agree with them -> they):
I can't remember whether we discussed this in an e-mail chain or one of the repos, but I feel like the distinct lemmatization of the determiner possessives was already hashed out somewhere. There may have been other arguments, but for me it just seems like the determiners are a separate paradigm (in languages like German this is very clear, since they have all case forms themselves). |
Oh and about the other things:
|
Thanks. While the etymology and analogies to German etc. are interesting I don't think we should be bound by those. (The fact that "whose" is historically genitive but "my" is not shouldn't matter for synchronic analysis because they both serve the same possessive function.) In any case, I am mainly interested, on a practical level, in what users of English corpora will expect from lemmas. To me it is very confusing to have me => I and our => we but my => my. Would people be surprised by lemmatizing only to remove accusative case (in personal and WH pronouns, converting them to nominative), and leaving all other pronouns alone (both kinds of possessives, reflexives)? Would people be surprised if we dropped case normalization altogether? I honestly don't see a compelling need to relate me and I by assigning them the same lemma, given that their similarities and differences are made precise by features. And these pronouns are extremely frequent so there's no sparsity issue. Also, it is a bit surprising to me that we currently normalize case but not number in pronouns, given that we normalize number in nouns. But saying that pronouns as closed-class items have no inflectional normalization at all in the lemma would be a simple enough policy. (For English, given the small size of these paradigms.) |
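To make the candidate policies concrete, here is a small sketch that applies three of them to a toy first-person paradigm: normalizing everything to the nominative, normalizing only accusatives, and no case normalization at all. The paradigm table, category labels, and policy names are all hypothetical and purely for illustration; nothing here reflects what any treebank actually does.

```python
# Hypothetical toy paradigm: (form, nominative form, category).
PARADIGM = [
    ("I", "I", "nom"), ("me", "I", "acc"),
    ("my", "I", "poss_attr"), ("mine", "I", "poss_indep"),
    ("we", "we", "nom"), ("us", "we", "acc"),
    ("our", "we", "poss_attr"), ("ours", "we", "poss_indep"),
]


def lemma(form, nominative, category, policy):
    """Return the lemma a given policy would assign to one pronoun form."""
    if policy == "normalize_all":
        return nominative  # my => I, ours => we, me => I
    if policy == "accusative_only":
        # only accusatives are mapped to the nominative; possessives stay as-is
        return nominative if category == "acc" else form
    if policy == "no_normalization":
        return form  # every form is its own lemma
    raise ValueError(f"unknown policy: {policy}")


def lemma_inventory(policy):
    """Collect the distinct lemmas a policy produces over the toy paradigm."""
    return {lemma(f, n, c, policy) for f, n, c in PARADIGM}


for policy in ("normalize_all", "accusative_only", "no_normalization"):
    print(policy, sorted(lemma_inventory(policy)))
```

The point of the comparison is just that the lemma inventory changes a lot depending on the policy, so whichever one is picked should be applied uniformly (e.g. my and our should not be treated differently).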
Not sure I understand - I would definitely also lemmatize our -> our, like my. I vs. me is typical nom/acc, why not lemmatize to I? And what related languages do is not irrelevant, if we want to use UD for language comparison. |
As I understand it UD doesn't really attempt to standardize lemmas across languages, though. The features and deps are the crosslinguistic interface. Since nom/acc doesn't occur in English outside of a few pronouns, do we really need to normalize it in the lemma? And if we did normalize it, why not also normalize number, so us => I? |
Oh and FWIW currently we have these => this and those => that. So number is normalized in nouns and determiners and demonstrative pronouns but not other pronouns. |
It's relevant in that some researchers will use the treebanks to get stats on how many pronominal lemmas a language has, how many inflected forms each pron lemma has on avg, per lang etc. I'm not saying it's the only consideration, but if we can get the Germanic languages to behave the same and nothing speaks against it, then I would do it that way |
It's interesting to watch how languages of minimal morphological variation are difficult to deal with, probably due to the irregularity. Suppletion is always interesting. The adjective forms good, better and best are now being lemmatized as "good". Do we want parallels? They might help translation machines. Some languages actually have the analogical us >> I. |
I think coordination of pronouns and possessives is a murky area independent of what counts as genitive. The 's clitic can apply to an entire coordinated phrase or to one of its elements. I follow the rule that there's a semantic distinction between "John's and Mary's articles" (separate) and "John and Mary's articles" (joint), but it's the sort of thing that is taught prescriptively which suggests variation among native speakers. There are also people who avoid "you and me" at all costs and say "you and I" even as an object, but I wouldn't say that it warrants categorizing "I" as accusative; it's just a peculiarity of patterns of coordination. |
But I take @rueter's point that, going by similarity of forms (despite some suppletion where nominatives stand out), person+gender+number seems to be the primary axis of distinguishing personal pronouns, and case and possessiveness are secondary. |
I support @amir-zeldes and believe it would be nice if all Germanic treebanks could agree on a common approach. The UD guidelines on lemmas say: "Except perhaps in rare cases of suppletion, one form should be the chosen as the lemma of a verb, noun, determiner, or pronoun paradigm." So the question I suppose boils down to what a pronoun paradigm is. Should the possessives be included or not? For example, at present, the Swedish treebanks and Norwegian_Bokmaal do it differently, where Swedish uses the nominative form also for the possessives, while N_Bokmaal has a separate lemma for the possessive. This seems quite unnecessary. While there seem to be linguistic arguments for both alternatives, following a common principle in this case would increase the usefulness of the Germanic treebanks as a whole. |
I agree with others that it would be useful if it is consistent across at least the English treebanks, but unless there are strong reasons for an English-specific treatment, then also across the Germanic branch. The treatment of my and our should not differ. I tend to prefer them being tagged I tend to agree with @amir-zeldes that the independent form ours is not a genitive. Whether and how it is distinguished from our depends on whether our is a genitive pronoun or rather a caseless determiner. In the latter case I could even imagine lemmatizing both forms to our; although that would mean that ours would also be |
It didn't attempt to standardize them because it seemed too hard a task when we had the plate full with other issues; but I don't think it would be wrong to attempt it. The more annotation approaches we can make similar across languages (especially related languages), the better! |
I have it, too, but no idea where exactly to search for it. This is one reason why I hate seeing guidelines discussion (including language-specific guidelines) outside the main docs issue tracker. Folks, this place is about EWT. Guidelines should be sorted out in |
I definitely sympathize with not being able to find things and keeping general discussions in docs, but this is (started out as?) an English-specific issue, and I guess EWT being the first UD English dataset, it is often taken as the default corpus for English. For me (and I'm guessing many others) it would be fairly surprising to lemmatize "our" to "I", and specifically there is prior art in English lemmatization practices to keep person forms distinct, as well as for lemmatizing the possessives to themselves. I don't mind making them In terms of consistency within Germanic, I think there is no option of lemmatizing them to "I" in German etc., because they inflect with a full paradigm (deu. meiner, meinen, meinem, meines). I may be influenced by historical factors here, but I just don't see these things as genitives (and they distribute very differently from the proper English genitive NPs in 's), and it's not just a question of comparison to German, since we could also have historical data in English where it would be nice to have consistency (the Penn Parsed Historical Corpora tag these as pronouns, and of course in Old English we still have min.NOM, mine.ACC, mines.GEN as in German). |
Even if EWT were the only English treebank in UD, I would prefer its issue tracker to be limited to bugs in EWT, while English-specific guidelines would be discussed at the repo where they are documented, i.e., |
UniversalDependencies/docs#517 so that folks not following English treebanks specifically will be able to participate. I think we need a live discussion though. (But discussions that seem to be about fine-grained EWT or EWT+GUM things have a way of turning into broader guidelines questions, alas....) |
See resolution at UniversalDependencies/docs#517 |
…dependent possessives; check consistency of xpos/feats incl. variant forms closes #293
A variety of issues including typos like "it's" for "its", incorrect XPOS/feats for possessive "her", miscellaneous pronouns like "thy" lacking proper feats (cf. #230), etc.
PRP$ but not syntactically possessive (most are incorrectly tagged "her"; some are valid)
PRP$ but missing typo annotation
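A check along the lines of the first item could be sketched as follows. This is not the script used for the fix. It assumes 10-column CoNLL-U input and approximates "syntactically possessive" as having the deprel `nmod:poss`, which is an assumption made here for illustration; the function name and the toy sentence are hypothetical. Anything it flags is only a candidate for manual review, since some such tokens are valid.

```python
def prp_dollar_not_possessive(conllu_text):
    """List tokens tagged PRP$ whose deprel is not nmod:poss (review candidates)."""
    flagged = []
    sent_id = None
    for line in conllu_text.splitlines():
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip comments, blank lines, token ranges, empty nodes
        tok_id, form, xpos, deprel = cols[0], cols[1], cols[4], cols[7]
        # Assumption: a possessive use attaches as nmod:poss.
        if xpos == "PRP$" and deprel != "nmod:poss":
            flagged.append((sent_id, tok_id, form, deprel))
    return flagged


# Hypothetical toy sentence: object "her" mis-tagged PRP$, possessive "her" tagged correctly.
TOY_SENT = "\n".join([
    "# sent_id = toy-1",
    "\t".join(["1", "I", "I", "PRON", "PRP", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "saw", "see", "VERB", "VBD", "_", "0", "root", "_", "_"]),
    "\t".join(["3", "her", "she", "PRON", "PRP$", "_", "2", "obj", "_", "_"]),
    "\t".join(["4", "with", "with", "ADP", "IN", "_", "6", "case", "_", "_"]),
    "\t".join(["5", "her", "her", "PRON", "PRP$", "_", "6", "nmod:poss", "_", "_"]),
    "\t".join(["6", "dog", "dog", "NOUN", "NN", "_", "2", "obl", "_", "_"]),
])

print(prp_dollar_not_possessive(TOY_SENT))
```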