
prefLabel is non-deterministic when multiple rdfs:label are present in the source #164

Open
alexskr opened this issue Dec 1, 2022 · 11 comments

Comments

@alexskr
Member

alexskr commented Dec 1, 2022

We have discrepancies in how the prefLabel is generated when multiple rdfs:label entries are present: for the same term, one label is chosen in the staging environment and a different label is chosen in production.

For example, in the UPHENO ontology, the preferred name for term ID http://purl.obolibrary.org/obo/FBbt_00000002 is 'Drosophila tagma' in production but 'tagma' in staging.


@alexskr
Member Author

alexskr commented Dec 1, 2022

Another example: the term ID `www.semanticweb.org/rbmor/ontologies/2021/1/1/untitled-ontology-133#Agitation` is sometimes named "Agitation" and sometimes "Restlessness".

@graybeal

graybeal commented Dec 5, 2022

I think that is the behavior I would expect. There are 3 labels and no prefLabel in the ontology shown here, and AFAIK there can be no enforced ordering in SPARQL when the triples are requested, so whichever one comes back first becomes the preferred label. (We have to pick one and we can't pick more than one.) Since the authors haven't designated a preferred label, neither can we.

We could choose in this case to always provide the alphabetically first label, the longest label, or the longest label that has the alphabetically first label in it (alphabetical order breaking ties in options 2 or 3). That should make the label consistent and most inclusive. I think option 2 is best.

Another option in some cases: in the example you provided on Slack (below), there is a label specified by the property obo:IAO_0000589, which is "An alternative name for a class or property which is unique across the OBO Foundry." We could use this property, when present, to choose a label. Although it is not one of the original rdfs:label values in the ontology, it has the advantage of being a single label (so, consistent from one parsing to the next), and in the OBO world it makes the label unique across OBO (apparently).


    <!-- http://purl.obolibrary.org/obo/FBbt_00000002 -->

    <owl:Class rdf:about="http://purl.obolibrary.org/obo/FBbt_00000002">
        <owl:equivalentClass>
            <owl:Class>
                <owl:intersectionOf rdf:parseType="Collection">
                    <rdf:Description rdf:about="http://purl.obolibrary.org/obo/UBERON_6000002"/>
                    <owl:Restriction>
                        <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/BFO_0000050"/>
                        <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/NCBITaxon_7227"/>
                    </owl:Restriction>
                </owl:intersectionOf>
            </owl:Class>
        </owl:equivalentClass>
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/FBbt_00057001"/>
        <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/UBERON_6000002"/>
        <rdfs:subClassOf>
            <owl:Restriction>
                <owl:onProperty rdf:resource="http://purl.obolibrary.org/obo/BFO_0000050"/>
                <owl:someValuesFrom rdf:resource="http://purl.obolibrary.org/obo/FBbt_00000001"/>
            </owl:Restriction>
        </rdfs:subClassOf>
        <metadata:prefixIRI rdf:datatype="http://www.w3.org/2001/XMLSchema#string">FBbt:00000002</metadata:prefixIRI>
        <metadata:treeView rdf:resource="http://purl.obolibrary.org/obo/FBbt_00000001"/>
        <obo4:part_of rdf:resource="http://purl.obolibrary.org/obo/FBbt_00000001"/>
        <obo:IAO_0000115>The three main divisions of the whole organism formed from groups of segments.</obo:IAO_0000115>
        <obo:IAO_0000589 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">arthropod tagma (drosophila)</obo:IAO_0000589>
        <oboInOwl:hasDbXref>UBERON:6000002</oboInOwl:hasDbXref>
        <oboInOwl:hasOBONamespace>fly_anatomy.ontology</oboInOwl:hasOBONamespace>
        <oboInOwl:id rdf:datatype="http://www.w3.org/2001/XMLSchema#string">FBbt:00000002</oboInOwl:id>
        <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/fbbt#FB_gloss"/>
        <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/fbbt#cur"/>
        <oboInOwl:inSubset rdf:resource="http://purl.obolibrary.org/obo/fbbt#larval_OF"/>
        <rdfs:label>Drosophila tagma</rdfs:label>
        <rdfs:label>tagma</rdfs:label>
        <skos:notation rdf:datatype="http://www.w3.org/2001/XMLSchema#string">FBbt:00000002</skos:notation>
    </owl:Class>

@syphax-bouazzouni

The cause is the way prefLabel is handled when no skos:prefLabel is found.
We use the first rdfs:label found as the prefLabel (see https://github.com/ncbo/ontologies_linked_data/blob/master/lib/ontologies_linked_data/models/ontology_submission.rb#L678)

So yes, it's effectively random and depends on the order of the labels that the triple store returns.
What is strange in your data is that the order of the returned labels (see the label property below) is not the same as the order in the triple store (see the property "http://www.w3.org/2000/01/rdf-schema#label" below). Maybe the order is random and changes at each request (when nothing is cached).
[screenshot of the term's label and rdfs:label properties, not preserved]

@syphax-bouazzouni

> We could choose in this case to always provide the alphabetically first label, the longest label, or the longest label that has the alphabetically first label in it

+1

@graybeal

graybeal commented Dec 5, 2022

> (maybe the order is random and change at each request (when no cache))

Yes, my understanding is that the order of results returned from a SPARQL query is by definition undefined unless a sort order is explicitly specified in the query. Not a BioPortal limitation. :-)
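To make the point concrete, here is a minimal Ruby sketch (not BioPortal code) using the two labels from the FBbt_00000002 example. It enumerates every order in which a store could return the labels and compares "take the first" with "take the minimum under a fixed ordering". The variable names are illustrative assumptions.

```ruby
# Illustration only: the two rdfs:label values from the FBbt_00000002 example.
labels = ["Drosophila tagma", "tagma"]

# Every order in which a triple store might hand the labels back.
possible_orders = labels.permutation.to_a

# Taking the first element depends on the order that was returned...
first_picks = possible_orders.map(&:first).uniq

# ...whereas selecting by a fixed total order (plain String comparison) does not.
stable_picks = possible_orders.map(&:min).uniq

puts "first: #{first_picks.inspect}"
puts "min:   #{stable_picks.inspect}"
```

Any rule that ranks the labels by their content rather than by their position gives the same stability; String comparison is just the simplest such rule.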

@jonquet

jonquet commented Dec 6, 2022

The best scenario, to me, would be to inform the ontology developer that they are not following good practice by not declaring a preferred label property.
I would not vote for an alphabetical selection, as it would "show" that we have done something to cope with the ontology developer's bad practice.

Personally, I would implement the system so that BioPortal skips every label if no preferred one is declared (when there are multiple labels, of course); the automatic fallback already in place would then pick up the end of the URI as the pref label:
FBbt_00000002 in our example.
It would not take long for the ontology developer to learn how to declare a preferred label in their ontology.

I don't think it's good that the technology always compensates for the lack of design... BioPortal needs a pref label to offer its full service. Let's brand this and accept the fact that resources that are not especially well designed will be handled less well.
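A rough Ruby sketch of this proposal, for discussion only. It assumes the fallback means "URI fragment if present, otherwise the last path segment"; the helper names `label_from_uri` and `strict_pref_label` are hypothetical and are not the existing BioPortal implementation.

```ruby
require "uri"

# Hypothetical helper: derive a label from the end of the term URI
# (the fragment if there is one, otherwise the last path segment).
def label_from_uri(uri_string)
  uri = URI.parse(uri_string)
  return uri.fragment if uri.fragment && !uri.fragment.empty?

  uri.path.split("/").last
end

# Hypothetical selection rule following the proposal:
# - an author-declared preferred label always wins;
# - a single rdfs:label is unambiguous, so it is used;
# - with several (or no) rdfs:label values, fall back to the URI.
def strict_pref_label(declared_pref_label, rdfs_labels, uri_string)
  return declared_pref_label if declared_pref_label
  return rdfs_labels.first if rdfs_labels.size == 1

  label_from_uri(uri_string)
end

puts strict_pref_label(nil, ["Drosophila tagma", "tagma"],
                       "http://purl.obolibrary.org/obo/FBbt_00000002")
```

The result is deterministic because the ambiguous labels are never consulted, at the cost of showing an opaque identifier to users.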

@alexskr
Member Author

alexskr commented Dec 6, 2022

I agree that it's not ontoportal's responsibility to fix problems with ontologies. Ideally, it should be handled with other ontology linting/validating tools if they exist; however, the issue I would like to address is that every time the same ontology submission gets processed, ontoportal displays it differently. This complicates troubleshooting, migration, and development efforts.

@graybeal

graybeal commented Dec 7, 2022

I do not consider it a "problem" that requires fixing that the provider has not followed a good practice. We are in no position to bird-dog our ontology providers that closely.

However, the system does require that we have a prefLabel for every term, because we use the prefLabel as, you know, the preferred label. Arguably it's bad form for us to make that publicly visible in the way we do, since it doesn't really reflect that this is the BP-designated prefLabel and not the ontology author's designation. So that's something we should fix someday (it's apparent on examination of the TTL file, because we use our local prefLabel property, not the standard one).

So given that we have to make a choice, let's make the best choice we can, which I think is to choose a label for the prefLabel in a way that we will get the same label every time as long as the author doesn't add a new choice or eliminate our choice. The way to do this most usefully is to take the most detailed label offered (that's the longest, and therefore most likely to include terms in the search list), and if there is more than one of maxLength, pick the alphabetically first one among those.

(I like the idea of the OBO one but am not recommending it, because the OBO one may be unattractive in some cases and the ontology owner may not like it.)

If it were possible I'd say add a tool-tip to that PrefLabel title that says "Author-specified prefLabel, or if not specified, longest available label"
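The longest-label rule described above could be sketched in Ruby roughly as follows. This is an illustration under assumptions, not the code in ontologies_linked_data: `longest_label` is a hypothetical name, and the labels are assumed to arrive as plain strings.

```ruby
# Hypothetical helper: pick the longest label; among labels of equal
# (maximum) length, pick the alphabetically first one.
# Returns nil for an empty list.
def longest_label(labels)
  # Sorting key: negative length first (so longer labels rank lower),
  # then the label itself as the tie-breaker.
  labels.min_by { |label| [-label.length, label] }
end

puts longest_label(["tagma", "Drosophila tagma"])
puts longest_label(["Restlessness", "Agitation"])
```

Because the key depends only on the label text, the result is the same regardless of the order in which the triple store returns the labels.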

graybeal changed the title from "perfLabel is non-deterministic when multiple rdfs:label are present in the source" to "prefLabel is non-deterministic when multiple rdfs:label are present in the source" on Dec 7, 2022
@mdorf
Member

mdorf commented Mar 1, 2023

Added an array .sort call for cases with multiple labels:

            label = rdfs_labels.sort[0]

This isn't a perfect solution, but it adds some determinism in selecting the prefLabel
https://github.com/ncbo/ontologies_linked_data/blob/master/lib/ontologies_linked_data/models/ontology_submission.rb#L687
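For readers wondering what `sort[0]` selects in practice: Ruby compares strings byte by byte, so uppercase ASCII letters sort before lowercase ones. The sketch below shows this on the example labels from this issue plus one made-up pair; the case-insensitive variant is only an illustration of an alternative, not something the fix does.

```ruby
# The two labels from the FBbt_00000002 example.
labels = ["tagma", "Drosophila tagma"]
puts labels.sort[0]   # "Drosophila tagma" ("D" sorts before "t")

# Made-up labels showing the effect of case on the default ordering.
mixed = ["apple", "Zebra"]
puts mixed.sort[0]    # "Zebra" (uppercase "Z" sorts before lowercase "a")

# Hypothetical case-insensitive alternative; the original string is used
# as a secondary key so that "Apple" vs "apple" is still deterministic.
case_insensitive_first = mixed.min_by { |label| [label.downcase, label] }
puts case_insensitive_first   # "apple"
```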

@matuskalas

Does this mean it will be sorted alphanumerically by the rdfs:label literal?
Would it then also mean that the language is not taken into account?
Or is there a proper handling of languages with the fallback order of something like the following?

1. @en-us
2. @en
3. no xml:lang tag
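A fallback order like the one listed above might look like this in Ruby. Everything here is a hypothetical sketch: the `Label` struct, `LANG_PRIORITY`, and `pick_label` are invented for illustration and do not describe how BioPortal stores or selects labels.

```ruby
# Hypothetical representation of a label: its text plus its language tag
# (nil when the literal has no xml:lang).
Label = Struct.new(:value, :lang)

# Fallback order from the list above: en-us, then en, then untagged.
LANG_PRIORITY = ["en-us", "en", nil].freeze

def pick_label(labels)
  LANG_PRIORITY.each do |lang|
    # Language tags are case-insensitive, so compare in lowercase.
    candidates = labels.select { |label| label.lang&.downcase == lang }
    # Within one language, sort so the choice stays deterministic.
    return candidates.map(&:value).min unless candidates.empty?
  end
  # No label in any preferred language: fall back to all labels.
  labels.map(&:value).min
end

labels = [
  Label.new("Unruhe", "de"),
  Label.new("Restlessness", "en"),
  Label.new("Agitation", nil)
]
puts pick_label(labels)
```

Here the `en` label is chosen even though the untagged "Agitation" would sort first alphabetically, which is the difference between a language-aware rule and a plain sort.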

@mdorf
Member

mdorf commented Mar 1, 2023

> Does this mean it will be sorted alphanumerically by the rdfs:label literal? Would it then also mean that the language is not taken into account? Or is there a proper handling of languages with the fallback order of something like the following?
>
> 1. @en-us
> 2. @en
> 3. no xml:lang tag

This is a short-term fix to implement a deterministic order of selecting a prefLabel during the ontology processing stage. A more complete support for language-based prefLabel(s) is in the works.
