Should Number=Ptan be used instead of Number=Plur for English plural-only words? #999

Closed
rhdunn opened this issue Nov 28, 2023 · 23 comments

@rhdunn

rhdunn commented Nov 28, 2023

Words such as "glasses" and "species" are in their plural form according to English pluralization rules. Regarding the lemmas, with some exceptions:

  1. EWT uses the plural form as the lemma, matching its intended use;
  2. GUM uses the singular form as the lemma, presumably using the machine-lemmatized plural form.

EWT is correct here, but these cases are very likely to confuse lemmatizers trained on the UD English corpora, because they do not follow the regular English plural lemmatization rules.

Using the existing plurale tantum annotation (Number=Ptan) would make the intention clearer and allow lemmatizers to differentiate between NNS/Number=Plur and NNS/Number=Ptan.

Note: This should also apply to dates like 1980s, where the lemma retains the form's plural suffix.

Relevant issues:

  1. EWT -- Ambiguous lemmatization of pluralia tantum UD_English-EWT#374
  2. PUD -- Plural nouns not using singular lemma, possible pluralia tantum UD_English-PUD#33
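
To make the proposal concrete, here is a minimal Python sketch contrasting the two annotations of "glasses" as CoNLL-U token lines. The rows are invented for illustration (they are not copied from EWT or GUM), and `summarize` and `lemma_keeps_plural` are hypothetical helper names.

```python
# Hypothetical CoNLL-U token lines with the 10 tab-separated columns
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
# These rows are invented for illustration, not copied from EWT or GUM.
ROWS = {
    "current":  "5\tglasses\tglasses\tNOUN\tNNS\tNumber=Plur\t2\tobj\t_\t_",
    "proposed": "5\tglasses\tglasses\tNOUN\tNNS\tNumber=Ptan\t2\tobj\t_\t_",
    "regular":  "5\tcats\tcat\tNOUN\tNNS\tNumber=Plur\t2\tobj\t_\t_",
}


def summarize(row: str) -> tuple[str, str, str]:
    """Return (form, lemma, feats) from one CoNLL-U token line."""
    cols = row.split("\t")
    return cols[1], cols[2], cols[5]


def lemma_keeps_plural(row: str) -> bool:
    """True when the lemma is identical to the (plural) form."""
    form, lemma, _ = summarize(row)
    return form.lower() == lemma.lower()


for name, row in ROWS.items():
    print(name, summarize(row), lemma_keeps_plural(row))
```

In the "current" row, "glasses" and "cats" carry the same XPOS and FEATS even though only one of them keeps the suffix in its lemma. In the "proposed" row, the FEATS column alone tells the two cases apart.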
@nschneid
Contributor

I was today years old when I learned of the Ptan value. :) It is documented and several languages do use it, so in principle it would make sense to adopt it for English as well.

It would have the benefit of explaining why lemmas contain the plural morphology; lemmatizers/checkers would have to implement a fixed list of pluralia tantum anyway, and this makes it explicit in the data.

A counterargument might be that it is a rare value that does not really have morphosyntactic consequences for English beyond the lemma—morphosyntactically it is a kind of plural, so users may expect Number=Plur.

@nschneid
Contributor

Here is a nice little summary of pluralia tantum and singularia tantum in English: https://english.stackexchange.com/questions/407446/does-english-have-any-singularia-tantum-besides-mass-nouns

@amir-zeldes and I are on board with implementing Number=Ptan for English. It would help if somebody else could submit PRs.

Let's NOT implement Number=Coll because collective nouns comprise a very large set.

@nschneid
Contributor

nschneid commented Nov 30, 2023

I've rewritten https://universaldependencies.org/en/feat/Number.html to better reflect how we currently use the feature. The page isn't updating immediately, but you can look at the diff.

Note that, contra the original post above, I don't think species falls under the category of pluralia tantum.

Here's how I defined it:

Ptan: plurale tantum

Some nouns appear only in the plural form, with a regular plural suffix and plural agreement, but lack a singular counterpart (at least when serving as a nominal head). (The lemma is therefore the plural form.) These form a relatively closed set. Semantically, they often denote a mass-like collection, or a doublet object.

Note that some nouns have endings that look like regular plural endings, but are not: linguistics and Xerxes are singular, and species and series may be singular or plural, but none of these are pluralia tantum.

Examples

  • clothes, scissors, riches

@nschneid
Contributor

Note that even for pluralia tantum, the "s" can sometimes be chopped off when used attributively or in a compound ("pant leg", "scissor kick"). Hence the qualification "(at least when serving as a nominal head)".

@amir-zeldes
Contributor

Right - looks like I responded in the wrong issue! All I need is a list/notes if you want to kick anything OFF the list in the GUM validator.

@nschneid
Contributor

nschneid commented Nov 30, 2023

Current GUM validator lists where the lemma is allowed to end with "s":

  • NNPS: ["Netherlands","Analytics","Olympics","Commons","Paralympics","Vans", "Andes","Forties","Philippines"]
  • NNS: ["surroundings","energetics","politics","jeans","clothes","electronics","means","feces", "biceps","triceps","news","species","economics","arrears","glasses","thanks","series"]
  • plus pluralized decades

Additional items in EWT that are being flagged:

  • regards troops supplies grounds goods headquarters contents pants manners memoirs respects whereabouts finances proceedings savings specifics genitals remains wares barracks environs orthodontics earnings statistics panties tenterhooks dynamics geopolitics trousers hackles auspices confines fives billiards genetics scissors eatables furnishings

I don't have clear intuitions about all of these, hoping somebody else can weigh in.

Note that only a subset are pluralia tantum—not "economics", "series", "species", or "news" (at least).
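
A rough sketch of how an exemption check of this kind could look, assuming the two lists quoted above. This is not the GUM validator's actual code: the function name `lemma_s_allowed` and the decade pattern are assumptions made for illustration.

```python
import re

# Exemption lists copied from the comment above.
NNPS_EXEMPT = {
    "Netherlands", "Analytics", "Olympics", "Commons", "Paralympics",
    "Vans", "Andes", "Forties", "Philippines",
}
NNS_EXEMPT = {
    "surroundings", "energetics", "politics", "jeans", "clothes",
    "electronics", "means", "feces", "biceps", "triceps", "news",
    "species", "economics", "arrears", "glasses", "thanks", "series",
}

# Assumed pattern for "pluralized decades" such as 1980s, 80s or '80s.
# The real validator's rule is not shown in this thread.
DECADE = re.compile(r"^'?(\d{2}|\d{4})s$")


def lemma_s_allowed(lemma: str, xpos: str) -> bool:
    """Return True if this lemma is acceptable for a plural-tagged token."""
    if not lemma.endswith("s"):
        return True  # nothing to check
    if xpos == "NNPS":
        return lemma in NNPS_EXEMPT
    if xpos == "NNS":
        return lemma in NNS_EXEMPT or bool(DECADE.match(lemma))
    return True  # other tags are out of scope for this sketch


print(lemma_s_allowed("glasses", "NNS"))  # exempt lemma
print(lemma_s_allowed("troops", "NNS"))   # flagged: not in the list
```

With these lists, an EWT lemma such as "troops" is flagged, which matches the "additional items" situation described above.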

@sylvainkahane
Contributor

I don't understand what you want to do. In English, there are only two numbers, singular and plural. OK, some nouns can only be singular or only plural. This is an interesting property, but it concerns the lexicon. These nouns behave as normal singular or plural nouns. If you add another value for Number, you will break one of the most common rules of English: "The verb agrees in number with its subject." What could be annotated is the fact that a noun is massive or countable. Pluralia tantum are just massive plural nouns, no? I understand the question about the lemma, but it must not act on the grammar: "massive vs countable" must be a lexical feature, which plays a role in some constructions: choice of the determiner, choice of number for generic use, etc. Note also an additional problem: the same lexeme can have different senses which are massive and countable (compare "I like fish" (massive) and "I like fishes" (countable)). Do we want to annotate this? (It is interesting but difficult to do without manually revising the treebanks.)

@AngledLuffa

AngledLuffa commented Nov 30, 2023

> OK, some nouns can only be singular or only plural.

That's literally the proposal - annotate those nouns. Ptan stands for "plurale tantum", where "tantum" in some way represents the fact that the word is of a class which is never singular. (TBH I'm not really clear where that terminology came from.) So for example, you could say "I have chips. Here, have one", but it would be rather weird to say "I have savings. Here, have one".

@rhdunn
Author

rhdunn commented Nov 30, 2023

There are various plural-only words such as in "I put on my glasses". With these, the lemma is the same as the form, not the depluralized form. I.e. the lemma for "glasses" in this case is "glasses" not "glass". The proposal is to mark these uses as Number=Ptan instead of Number=Plur.

If Number=Plur is used, it is harder for NLP systems to learn the depluralization rules, resulting in inconsistent lemmas in generated output, especially if there are plural instances.
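
As a toy illustration of the point, the sketch below shows a rule-based lemmatizer for common nouns that strips the plural suffix for Number=Plur but keeps the form for Number=Ptan. The suffix rules in `naive_singular` are a deliberately crude assumption, not how any real UD lemmatizer works.

```python
def naive_singular(form: str) -> str:
    """Very rough English depluralization (an assumption, not a real lemmatizer)."""
    lower = form.lower()
    if lower.endswith("ies") and len(lower) > 4:
        return lower[:-3] + "y"          # parties -> party
    if lower.endswith(("ches", "shes", "sses", "xes", "zes")):
        return lower[:-2]                # boxes -> box, glasses -> glass
    if lower.endswith("s") and not lower.endswith("ss"):
        return lower[:-1]                # cats -> cat
    return lower


def lemmatize_noun(form: str, number: str) -> str:
    """Keep the form as the lemma for Ptan; strip the suffix for Plur."""
    if number == "Ptan":
        return form.lower()
    if number == "Plur":
        return naive_singular(form)
    return form.lower()


print(lemmatize_noun("glasses", "Ptan"))  # lemma keeps the suffix
print(lemmatize_noun("glasses", "Plur"))  # regular plural of "glass"
print(lemmatize_noun("clothes", "Plur"))  # suffix rule misfires on a plural-only noun
```

Without the Ptan value, the only way for such a system to get "clothes" or "scissors" right is a hard-coded exception list.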

@rhdunn
Author

rhdunn commented Nov 30, 2023

These are Latin terms:

  1. plurale tantum -- plural only (https://en.wikipedia.org/wiki/Plurale_tantum, https://en.wiktionary.org/wiki/plurale_tantum#English)
  2. singulare tantum -- singular only (https://en.wiktionary.org/wiki/singulare_tantum#English)

@sylvainkahane
Contributor

We know what "plurale tantum" means. That is not the question. A plurale tantum is a lexical unit whose occurrences are all plural. The properties of the lexical unit must not be confused with those of its occurrences. Number=Plur is a property of the occurrences. Number=Ptan does not make any sense to me (at least in English). This property is attached to the lexical unit itself. It must be encoded by another feature.

@nschneid
Contributor

@sylvainkahane I understand Number=Ptan to be a subtype of Number=Plur. As you say, it is a subtype that concerns itself with the overall lexeme as reflected in the lemma. Another way to do this would be something like LexNumber=PlurOnly|Number=Plur. I don't have any particular preference between the two implementations but Number=Ptan is already part of the universal guidelines. Perhaps @dan-zeman would like to weigh in.

@rhdunn
Author

rhdunn commented Nov 30, 2023

Would Number=Plur,Ptan also work? -- UD allows multi-valued features.
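
For reference, a small sketch of how a consumer could read a multi-valued FEATS string such as `Number=Plur,Ptan` (features separated by `|`, multiple values separated by `,`). The helper names `parse_feats` and `has_number` are hypothetical.

```python
def parse_feats(feats: str) -> dict[str, set[str]]:
    """Parse a CoNLL-U FEATS string into {feature name: set of values}."""
    if feats == "_":  # "_" marks an empty FEATS column
        return {}
    parsed: dict[str, set[str]] = {}
    for pair in feats.split("|"):
        name, _, values = pair.partition("=")
        parsed[name] = set(values.split(","))
    return parsed


def has_number(feats: str, value: str) -> bool:
    """True if the Number feature includes the given value."""
    return value in parse_feats(feats).get("Number", set())


print(has_number("Number=Plur,Ptan", "Plur"))  # both values are present
print(has_number("Number=Ptan", "Plur"))       # Plur is only implied here
```

With the multi-valued encoding, code that simply tests for Plur keeps working unchanged. With a bare Number=Ptan, the consumer has to know that Ptan implies Plur.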

@nschneid
Contributor

I do have a slight worry that algorithms projecting agreement features onto verbs from their subjects would naïvely copy Ptan, when it really only applies to nouns. So that would be a practical argument in favor of separating LexNumber.

@amir-zeldes
Contributor

Ptan is a canonical annotation value of UD, and I assume in any language where it is used, it implies Plur, so @sylvainkahane's objection is not really English-specific IMO, it sounds like a general criticism of Ptan. But other languages do use it in exactly this way, so it is UD English that's unusual here.

I think it's not so odd to have more specific values that imply other values. For example, in many languages, numbers are essentially nouns (e.g. Semitic), or there is no really strong distinction between ADJ and NOUN, but we still use the more specific tag where appropriate. Should we avoid NUM in Arabic just because it obscures the fact that Arabic cardinal numbers are morphosyntactically also nouns? I think in such cases it can be understood that a language should implement the most specific labels possible, and an implication hierarchy such as Ptan -> (subtype of) Plur is understood.

> I do have a slight worry that algorithms projecting agreement features onto verbs from their subject

Such algorithms wouldn't get very far anyway: even just for plain coordination of two singulars you need to switch to Plur, so I would say the problem would be with the algorithm, not the annotation. I don't mind if UD says as a whole that Ptan is not a value of Number, but if it is, then I see nothing about the English case to suggest this shouldn't be used here.

@dan-zeman
Member

Yes, Number=Ptan can be viewed as a special case of Number=Plur, so if a language uses it, then agreement between Ptan subject and Plur verb is not unexpected.

In some languages (English, probably), the special bit is a property of the lexeme, as @sylvainkahane points out. In others (Czech, for instance) it also has morphosyntactic implications (you must use different forms of numerals with plurale tantum than with normal plural nouns).
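
The "special case of Plur" reading can be sketched as a one-entry implication table that agreement-aware code consults before comparing values. This is only an illustration under that assumption: the names `AGREEMENT_NUMBER`, `agreement_number` and `subject_verb_agree` are hypothetical, and the verb's Number value is passed in directly rather than read from any treebank.

```python
# Implication discussed above: Ptan is treated as a subtype of Plur.
AGREEMENT_NUMBER = {"Ptan": "Plur"}


def agreement_number(number: str) -> str:
    """Map a Number value to the value relevant for agreement."""
    return AGREEMENT_NUMBER.get(number, number)


def subject_verb_agree(subject_number: str, verb_number: str) -> bool:
    """Compare subject and verb number after applying the implication."""
    return agreement_number(subject_number) == agreement_number(verb_number)


print(subject_verb_agree("Ptan", "Plur"))  # "The scissors are sharp."
print(subject_verb_agree("Ptan", "Sing"))  # "*The scissors is sharp."
```

The sketch covers a single subject noun only. It does not handle coordination, where two singular conjuncts call for a plural verb.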

@amir-zeldes
Contributor

OK, so if we're doing this we need a list. Here's what I gleaned from the above plus the GUM exempt plural form lemmas (presumably EWT has some more):

lemma xpos notes
Netherlands NNPS  
Analytics NNPS  
Olympics NNPS  
Commons NNPS  
Paralympics NNPS  
Vans NNPS  
Andes NNPS  
Philippines NNPS  
Maldives NNPS  
surroundings NNS  
energetics NNS  
politics NNS  
jeans NNS  
clothes NNS  
electronics NNS  
means NNS  
feces NNS  
remains NNS  
news NNS  
economics NNS listed by @nschneid as non-Ptan, but if functioning as NNS I think it is
arrears NNS  
glasses NNS  
thanks NNS  
ergonomics NNS  
aesthetics NNS  
twenties NNS and 20s, 1920s etc. for years (but not: as plural for sets of 20)
thirties NNS  
NNS  
pants NNS  
scissors NNS  

Not Ptan: species, series, biceps, triceps

Any additions/comments welcome!

@AngledLuffa

Disagree on series: one TV show or one set of 7 playoff games is one series

@amir-zeldes
Contributor

> Disagree on series: one TV show or one set of 7 playoff games is one series

Right, one series - so there can be a single one, or multiple series. So it's not Ptan, just a noun whose singular form is identical to the plural form (like "sheep"), no?

@nschneid
Contributor

nschneid commented Dec 1, 2023

Yeah: "That series is canceled." "Those series are canceled." It ends in -ies because of the Latin source, nothing to do with pluralization.

Ptan means the form cannot be used in the singular or made singular in the same sense. (Maybe "economics" is valid as Ptan in the plural: If I say "the economics are sound" that has nothing to do with multiple "economic"s, and is a different sense from talking about the field of economics.)

@AngledLuffa

My mistake, sounds good

nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue Dec 25, 2023
@nschneid
Contributor

nschneid commented Dec 25, 2023

Implemented for EWT. Scripts: UniversalDependencies/UD_English-EWT@547b675...cd0d92f#diff-e02db0ba7788687b383704df1414689e399e8d52709bac84029ab9a86d64c109

I disambiguated "respects" manually, but haven't worked on the "-ics" nouns ("politics", etc.), except to apply Ptan to the ones in the list. In EWT the default for such nouns is NNS whereas in GUM it is NN. Absent evidence from subject-verb agreement it can be hard to tell whether an "-ics" token should be considered singular or plural....
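
For readers who want the general shape of such a conversion, here is a minimal sketch. It is not the EWT script linked above: `PTAN_LEMMAS` is only a subset of the list proposed earlier in this thread, and `retag_feats` is a hypothetical helper.

```python
# Subset of the lemma list proposed earlier in this thread (illustrative only).
PTAN_LEMMAS = {
    "surroundings", "jeans", "clothes", "glasses", "thanks",
    "arrears", "remains", "pants", "scissors", "politics",
}


def retag_feats(lemma: str, xpos: str, feats: str) -> str:
    """Replace Number=Plur with Number=Ptan for listed, plural-tagged lemmas."""
    if xpos not in {"NNS", "NNPS"} or lemma not in PTAN_LEMMAS:
        return feats
    pairs = feats.split("|")
    return "|".join("Number=Ptan" if p == "Number=Plur" else p for p in pairs)


print(retag_feats("glasses", "NNS", "Number=Plur"))   # eyewear sense
print(retag_feats("glass", "NNS", "Number=Plur"))     # "two glasses of water"
print(retag_feats("politics", "NN", "Number=Sing"))   # singular-tagged -ics noun
```

Keying on the lemma rather than the form leaves regular plurals such as "glasses (of water)" untouched. Checking the tag first leaves "-ics" tokens that a treebank already treats as singular untouched as well.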

nschneid added a commit to UniversalDependencies/UD_English-EWT that referenced this issue Dec 26, 2023
@nschneid
Contributor

nschneid commented May 5, 2024

Number=Ptan was implemented for English, at least EWT and GUM!
