Master Information class issue #1947

Open
4 of 7 tasks
turbomam opened this issue Apr 30, 2024 · 16 comments

@turbomam

turbomam commented Apr 30, 2024

  • create Information class as a direct subclass of NamedThing
  • make DataObject a direct subclass of Information.
    • it's possible that we could place DataObject as a subclass of some intermediate class like Data in the future
  • add a new Configuration class as a direct subclass of NamedThing
    • intermediate classes may be added in the future
  • optimize mappings between these new NMDC classes and OBI, IAO, etc classes
  • ensure that the new NMDC classes don't add new, ambiguous relationships
  • add a new Calibration class as a subclass (possibly indirect) of Information.
    • this will require especially careful analysis of the relationships linking Calibration to other classes
  • make any slots that model any kind of input or output relationship subproperties of has_input or has_output
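The class placement proposed in the task list can be sketched in plain Python. This is only an illustration of the hierarchy, not LinkML and not the nmdc-schema source; the `id`, `name` and `url` fields and the `ancestors` helper are assumptions added for the example, and Calibration is shown as a direct subclass even though the task list allows an indirect one.

```python
from dataclasses import dataclass


@dataclass
class NamedThing:
    """Root class; stands in for the nmdc-schema NamedThing."""
    id: str
    name: str = ""


@dataclass
class Information(NamedThing):
    """Proposed grouping class, a direct subclass of NamedThing."""


@dataclass
class DataObject(Information):
    """Placed directly under Information (a Data layer could be added later)."""
    url: str = ""


@dataclass
class Calibration(Information):
    """Shown as a direct subclass here; an indirect one is also allowed."""


@dataclass
class Configuration(NamedThing):
    """Proposed as a direct subclass of NamedThing, not of Information."""


def ancestors(cls: type) -> list[str]:
    """Return the class names from cls up to NamedThing (hypothetical helper)."""
    return [c.__name__ for c in cls.__mro__ if c is not object]
```

For example, `ancestors(DataObject)` walks from DataObject through Information to NamedThing, while Configuration never passes through Information.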
@turbomam

The nmdc-schema has well-developed subclasses of NamedThing for MaterialEntities and PlannedProcesses. Those two classes (including all of their subclasses) are implicitly disjoint with one another.

The nmdc-schema also has a DataObject class, which is implicitly disjoint from both MaterialEntities and PlannedProcesses, but it is not placed under any intermediate organizing class.

A reasonable grouping class would be Information. Information would be disjoint from MaterialEntities and PlannedProcesses, but like MaterialEntities, Information could be either the input into a PlannedProcess or the output from one. A DataObject as output from a DataGeneration is explicitly modeled in the berkeley-schema-fy24.

WorkflowExecutions in the berkeley-schema-fy24 also have inputs and outputs, but I can't recall right now whether these relationships are populated with DataObjects. That doesn't appear to be explicitly constrained in berkeley-schema-fy24.

A common but not completely satisfying definition of Information is "anything that decreases uncertainty". For example, one doesn't know how a PlannedProcess was executed unless information is provided. Likewise, one is uncertain of the results of a PlannedProcess until Information is observed, saved, etc.

There are multiple patterns by which information can be associated with a process in a linked data model.

  1. The information values can be bound directly to the process instance
  2. They can be bound into an instance of another class with a direct relationship to the process
  3. They can be mentioned (but not bound, embedded, included etc) by linking to a file or web resource
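A minimal Python sketch of the three patterns above, for illustration only. `InstrumentSettings`, its fields, the `settings_url` attribute and the example.org URL are all hypothetical; none of them are nmdc-schema names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstrumentSettings:
    """Pattern 2: information held in its own instance (hypothetical class)."""
    id: str
    resolution: str
    polarity: str


@dataclass
class Process:
    id: str
    # Pattern 1: value bound directly to the process instance.
    resolution: Optional[str] = None
    # Pattern 2: direct relationship to an instance of another class.
    settings: Optional[InstrumentSettings] = None
    # Pattern 3: the information is only mentioned, via a link.
    settings_url: Optional[str] = None


shared = InstrumentSettings(id="settings:1", resolution="high", polarity="positive")

p1 = Process(id="proc:1", resolution="high")
p2 = Process(id="proc:2", settings=shared)
p3 = Process(id="proc:3", settings_url="https://example.org/settings/1.json")


def searchable_resolution(p: Process) -> Optional[str]:
    """Return the resolution if it can be found without fetching a file."""
    if p.resolution is not None:
        return p.resolution
    if p.settings is not None:
        return p.settings.resolution
    return None  # pattern 3: the value lives in the external resource
```

With patterns 1 and 2 the value is reachable from the process record itself; with pattern 3 a search would have to open the linked resource.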

One consideration for selecting between those patterns is whether users need the ability to search through the information, and the degree to which a constrained number of information patterns will be associated with a large number of processes. Direct search over a small, highly repeated set of information patterns is strong justification for making the information patterns first class citizens in their own table, collection, etc.
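One rough way to gauge the "small, highly repeated set" condition is to count distinct information patterns against the number of processes that use them. The settings tuples below are made up for the example, and `reuse_ratio` is just an illustrative measure, not an established metric.

```python
from collections import Counter

# Hypothetical (resolution, polarity) settings seen on six process instances.
process_settings = [
    ("high", "positive"),
    ("high", "positive"),
    ("high", "negative"),
    ("high", "positive"),
    ("low", "positive"),
    ("high", "negative"),
]

# How many processes share each distinct pattern.
pattern_counts = Counter(process_settings)
distinct_patterns = len(pattern_counts)

# Average number of processes per distinct pattern.
reuse_ratio = len(process_settings) / distinct_patterns
```

A high ratio suggests the patterns are worth storing once, in their own table or collection, and referencing from each process.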

DataObjects are currently used to capture process results and follow pattern 3. The DataObjects generally (?) link to their external resource with the url slots.

The ideal modeling of Information in the nmdc-schema will take advantage of hierarchical organization and will use a minimal number of relationship patterns.

@turbomam changed the title from "Master information class issue" to "Master Information class issue" on Apr 30, 2024
@turbomam

turbomam commented Apr 30, 2024

I assume that several people will want to have input into the implementation of this issue. I would like one primary contact person. Could that be @kheal ?

@turbomam

turbomam commented Apr 30, 2024

Provide some tools for interrogating OBI (or some subset) with a large-context LLM. This could intrinsically be expensive.

input token limits via API:

  • Gemini 1.5 (via Vertex API): 1 M
  • Claude 3 opus: 200 k
  • ChatGPT 4 Turbo: 128 k

Using these models through their APIs requires more coding than using them through their web interfaces, but they offer more traceability and repeatability.

Qualitatively, I feel like Claude gives better results than Gemini 1.5, but it is more expensive and harder to set up.

BBOP staff are provided with funding for ChatGPT

See also https://artificialanalysis.ai/

@turbomam

wc -w obi.owl

431 137 obi.owl.txt
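A back-of-the-envelope check of that word count against the input limits listed earlier. Two assumptions are baked in: that "431 137" means 431,137 words, and a tokens-per-word ratio of 1.3, which is a guess for English prose and is probably too low for OWL/XML markup.

```python
# Input token limits quoted in the earlier comment.
TOKEN_LIMITS = {
    "Gemini 1.5 (Vertex API)": 1_000_000,
    "Claude 3 Opus": 200_000,
    "ChatGPT 4 Turbo": 128_000,
}

OBI_WORDS = 431_137      # assumes "431 137" in the wc output means 431,137 words
TOKENS_PER_WORD = 1.3    # assumed ratio; markup may tokenize far less efficiently


def models_that_fit(words: int, tokens_per_word: float = TOKENS_PER_WORD) -> list[str]:
    """Return the models whose input limit covers the estimated token count."""
    estimated_tokens = int(words * tokens_per_word)
    return [name for name, limit in TOKEN_LIMITS.items() if estimated_tokens <= limit]
```

Under these assumptions, `models_that_fit(OBI_WORDS)` leaves only the 1 M limit as a candidate for the whole file, which is one argument for interrogating a subset of OBI instead.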

@turbomam

turbomam commented Apr 30, 2024

Would https://curategpt.io/ be helpful?

what OBI classes can be used to model the settings applied to analytical instruments in general? do not include classes that model one specific instrument.

[Screenshots: 2024-04-29 at 10:59:29 PM and 11:01:29 PM]

@turbomam

turbomam commented Apr 30, 2024

see also

  • OBI:0000654 device setting (a quality?)
  • OBI:0000818 calibration (a process)
  • IAO:0000109 measurement datum

@turbomam

I am especially interested in linking slots, like url

Where is it allowed to be used?

https://microbiomedata.github.io/nmdc-schema/url/

| Name | Description | Modifies Slot |
| --- | --- | --- |
| DataObject | An object that primarily consists of symbols that represent information | no |
| ImageValue | An attribute value representing an image | no |
| Protocol | | no |

Where has it been used in practice?

PREFIX nmdc: <https://w3id.org/nmdc/>
select
?st ?ot ?odt (count(?s) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?s nmdc:url ?o .
        optional {
            ?s a ?st
        }
        optional {
            ?o a ?ot
        }
        BIND (IF(isIRI(?o), "IRI", 
                IF(isLiteral(?o), str(datatype(?o)), "Unknown")) 
            AS ?odt) 
    }
}
group by ?st ?ot ?odt

| st | ot | odt | count |
| --- | --- | --- | --- |
| nmdc:DataObject | | xsd:string | 175976 |
| nmdc:ImageValue | | xsd:string | 7 |

@turbomam

There are also websites and homepage_website slots.

There's a UrlValue class in the nmdc-schema but not in the berkeley-schema-fy24

@turbomam

turbomam commented Apr 30, 2024

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
?st ?p ?ot ?odt (count(?s) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?s ?p ?o .
        ?p rdfs:subPropertyOf* nmdc:websites .
        optional {
            ?s a ?st
        }
        optional {
            ?o a ?ot
        }
        BIND (IF(isIRI(?o), "IRI", 
                IF(isLiteral(?o), str(datatype(?o)), "Unknown")) 
            AS ?odt) 
    }
}
group by ?st ?p ?ot ?odt

homepage_website does not appear to be used in the nmdc-graph-2024-04-11 GraphDB repository

| st | p | ot | odt | count |
| --- | --- | --- | --- | --- |
| nmdc:Study | nmdc:websites | | xsd:string | 35 |
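The `rdfs:subPropertyOf*` path in the query matches nmdc:websites itself plus anything beneath it. A small Python sketch of that "self plus descendants" behaviour follows; the edge placing homepage_website under websites is an assumption made for the example, and `self_and_subproperties` is a hypothetical helper.

```python
# Hypothetical slot hierarchy: child -> parent, as rdfs:subPropertyOf edges.
SUB_PROPERTY_OF = {
    "homepage_website": "websites",  # assumed parent, for illustration
}


def self_and_subproperties(root: str, edges: dict[str, str]) -> set[str]:
    """Mimic `?p rdfs:subPropertyOf* root`: root plus all of its descendants."""
    found = {root}
    changed = True
    while changed:  # repeat until no new descendant is discovered
        changed = False
        for child, parent in edges.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found
```

Because the zero-length path is included, a result set containing only nmdc:websites is consistent with homepage_website simply having no triples in the repository.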

@turbomam

turbomam commented Apr 30, 2024

maybe make websites a subproperty of url

potential problems:

  • url is single valued and websites is multi-valued
  • websites has a pattern constraint

We should at least assert see_alsos
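The two potential problems can be written down as a simple check. This is a sketch with a made-up `SlotDef` structure, not the LinkML metamodel, and the pattern string is a placeholder rather than the constraint actually used in the schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotDef:
    """Hypothetical, simplified slot description."""
    name: str
    multivalued: bool = False
    pattern: Optional[str] = None


def potential_problems(child: SlotDef, parent: SlotDef) -> list[str]:
    """List mismatches to review before making child a subproperty of parent."""
    problems = []
    if child.multivalued and not parent.multivalued:
        problems.append(f"{child.name} is multi-valued but {parent.name} is single-valued")
    if child.pattern and child.pattern != parent.pattern:
        problems.append(f"{child.name} has a pattern constraint not shared by {parent.name}")
    return problems


url = SlotDef("url")
# Placeholder pattern, not the one in the schema.
websites = SlotDef("websites", multivalued=True, pattern=r"^https?://.*$")
```

`potential_problems(websites, url)` reports both mismatches, which is why asserting see_alsos may be the safer first step.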

@turbomam

To what degree do the DataObjects use the url slot?

PREFIX nmdc: <https://w3id.org/nmdc/>
select
?p (count(?do) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?do a nmdc:DataObject ;
            ?p ?o .
    }
}
group by ?p
order by desc(count(?do))

| p | count |
| --- | --- |
| rdf:type | 179528 |
| nmdc:name | 179528 |
| dcterms:description | 179528 |
| nmdc:url | 175976 |
| nmdc:file_size_bytes | 172963 |
| nmdc:md5_checksum | 169777 |
| nmdc:data_object_type | 165839 |
| nmdc:type | 164546 |
| nmdc:was_generated_by | 4847 |
| nmdc:alternative_identifiers | 146 |

3,552 out of 179,528 DataObjects are missing urls
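The arithmetic behind that figure, using the counts from the table. It assumes each DataObject asserts at most one url, so the nmdc:url triple count equals the number of DataObjects with a url.

```python
total_data_objects = 179_528  # rdf:type count
with_url = 175_976            # nmdc:url count (one per DataObject assumed)

# DataObjects that never assert nmdc:url.
missing_url = total_data_objects - with_url

# Share of all DataObjects that lack a url.
missing_fraction = missing_url / total_data_objects
```

That works out to a little under 2% of all DataObjects.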

@turbomam

Slot analysis of the DataObjects that don't assert url in nmdc-graph-2024-04-11

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX dcterms: <http://purl.org/dc/terms/>
select *
where {
    graph <https://api.microbiomedata.org> {
        ?do a nmdc:DataObject .
    }
    minus {
        ?do nmdc:url ?url
    }
    optional {
        ?do nmdc:name ?name
    }
    optional {
        ?do dcterms:description ?description .
        bind(replace(?description, " for.*$", "") as ?description_pattern)
    }
    optional {
        ?do nmdc:file_size_bytes ?file_size_bytes
    }
    optional {
        ?do nmdc:md5_checksum ?md5_checksum
    }
    optional {
        ?do nmdc:data_object_type ?data_object_type
    }
    optional {
        ?do nmdc:type ?nmdc_type
    }
    optional {
        ?do nmdc:was_generated_by ?generator
    }
    optional {
        ?do nmdc:alternative_identifiers ?alternative_identifiers
    }
}

| description_pattern | Count |
| --- | --- |
| Assembled AGP file | 44 |
| Assembled contigs fasta | 44 |
| Assembled scaffold fasta | 44 |
| Filtered read data | 1 |
| Filtered read data stats | 1 |
| Full scan GC-MS (but not GC QExactive, which is EI-HMS) | 42 |
| High res MS with high res CID MSn (and possibly some low res MSn) | 14 |
| High res MS with high res HCD MSn | 43 |
| High res MS with high res HCD MSn and low res CID MSn | 175 |
| High res MS with low res CID MSn | 116 |
| High resolution MS spectra only | 2118 |
| Metagenome Alignment BAM file | 44 |
| Metagenome Contig Coverage Stats | 44 |
| Raw sequencer read data | 822 |
| Total Result | 3552 |

None of these DataObjects assert any of the following slots either:

  • file_size_bytes
  • md5_checksum
  • data_object_type
  • was_generated_by
  • alternative_identifiers
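The query itself returns one row per DataObject, so the description_pattern counts have to be tallied afterwards. A minimal Python sketch of the same `replace(?description, " for.*$", "")` step plus the tally follows; the description strings are invented for the example, and only the text before " for" mirrors real patterns.

```python
import re
from collections import Counter

# Hypothetical descriptions; the sample names are made up.
descriptions = [
    "Raw sequencer read data for sample A",
    "Raw sequencer read data for sample B",
    "Assembled AGP file for sample A",
    "High resolution MS spectra only",
]


def description_pattern(description: str) -> str:
    """Drop the first ' for' and everything after it, like the SPARQL replace()."""
    return re.sub(r" for.*$", "", description)


# Count how many descriptions collapse to each pattern.
pattern_counts = Counter(description_pattern(d) for d in descriptions)
```

Note that the expression also truncates at words that merely start with "for" (" format", " forward"), so a pattern containing such a word would be cut short.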

@kheal

kheal commented Jul 23, 2024

@turbomam Can we close this or convert to a discussion? I think the only outstanding sub/referenced issues are #1852 and
"make any slots that model any kind of input or output relationship subproperties of has_input or has_output".

#1852 is blocked but planned for completion by myself or @brynnz22.
The second non-completed task on this (making has_input and has_output subslots) seems more tractable as a separate issue.

@turbomam

@kheal I tried to be self-sufficient and convert it into a discussion myself. I was asked what category to put it in. I'll look for the most popular discussion categories, but if you have any guidelines, please tell me.

@kheal

kheal commented Jul 26, 2024

@turbomam - most of the categories in the discussions are default. I did make the schema clean up category to put issues that were more about making changes to the schema that were stylistic or to better adhere to best practices.
