Master Information class issue #1947
The nmdc-schema has well-developed subclasses of NamedThing for MaterialEntities and PlannedProcesses. Those two classes (including all of their subclasses) are implicitly disjoint with one another. The nmdc-schema also has a DataObject class, which is implicitly disjoint from both MaterialEntities and PlannedProcesses, but it is not placed under any intermediate organizing class. A reasonable grouping class would be Information.
WorkflowExecutions in the berkeley-schema-fy24 also have inputs and outputs, but I can't recall right now whether these relationships are populated with DataObjects. That doesn't appear to be explicitly constrained in berkeley-schema-fy24.
A common but not completely satisfying definition of Information is "anything that decreases uncertainty". For example, one doesn't know how a PlannedProcess was executed unless information is provided. Likewise, one is uncertain of the results of a PlannedProcess until Information is observed, saved, etc. There are multiple patterns by which information can be associated with a process in a linked data model.
One consideration for selecting between those patterns is whether users need the ability to search through the information, and the degree to which a constrained number of information patterns will be associated with a large number of processes. Direct search over a small, highly repeated set of information patterns is strong justification for making the information patterns first-class citizens in their own table, collection, etc.
DataObjects are currently used to capture process results and follow pattern 3. The DataObjects generally (?) link to their external resource through the url slot.
The ideal modeling of Information in the nmdc-schema will take advantage of hierarchical organization and will use a minimal number of relationship patterns. |
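The hierarchical organization described above can be sketched as a small parent map. This is only an illustration under stated assumptions: the singular class names MaterialEntity and PlannedProcess, the placement of Information directly under NamedThing, and the helper names `ancestors`, `top_level_branch`, and `implicitly_disjoint` are all hypothetical and are not taken from the schema source.

```python
# Hypothetical parent map: class name -> direct parent (None marks the root).
# The placement of Information is the proposal under discussion, not the
# current state of the schema.
PARENT = {
    "NamedThing": None,
    "MaterialEntity": "NamedThing",
    "PlannedProcess": "NamedThing",
    "Information": "NamedThing",
    "DataObject": "Information",
}


def ancestors(cls):
    """Return the superclasses of cls, nearest first, ending at the root."""
    chain = []
    parent = PARENT[cls]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain


def top_level_branch(cls):
    """Return the direct child of the root that cls belongs to."""
    lineage = [cls] + ancestors(cls)
    # The last element is the root; the one before it is the branch.
    return lineage[-2] if len(lineage) > 1 else cls


def implicitly_disjoint(a, b):
    """Treat two classes as disjoint when they sit in different branches."""
    return top_level_branch(a) != top_level_branch(b)
```

With this map, `ancestors("DataObject")` gives `["Information", "NamedThing"]`, and DataObject stays in a different branch from both MaterialEntity and PlannedProcess.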
I assume that several people will want to have input into the implementation of this issue. I would like one primary contact person. Could that be @kheal? |
Provide some tools for interrogating OBI (or some subset) with a large-context LLM. This could be intrinsically expensive. Input token limits via API:
Using these models through their APIs requires more coding than using them through their web interfaces, but the APIs offer more traceability and repeatability. Qualitatively, I feel like Claude gives better results than Gemini 1.5, but it is more expensive and harder to set up. BBOP staff are provided with funding for ChatGPT. See also https://artificialanalysis.ai/ |
wc -w obi.owl
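A word count like the one from `wc -w` can be turned into a rough check against an API's input token limit. The sketch below is an estimate only: the default ratio of 1.3 tokens per word is an assumed rule of thumb for English prose, OWL/XML markup will likely tokenize less efficiently, and the function names are hypothetical. The token limit is left as a caller-supplied value.

```python
def count_words(text):
    """Count whitespace-delimited words, similar to `wc -w`."""
    return len(text.split())


def estimate_tokens(word_count, tokens_per_word=1.3):
    """Rough token estimate; tokens_per_word is an assumed ratio, not a measurement."""
    return round(word_count * tokens_per_word)


def fits_in_context(word_count, token_limit, tokens_per_word=1.3):
    """Return True when the estimated token count is within token_limit."""
    return estimate_tokens(word_count, tokens_per_word) <= token_limit
```

For a real decision, counting tokens with the tokenizer of the chosen model would be more reliable than any fixed ratio.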
|
Would https://curategpt.io/ be helpful?
|
see also
|
I am especially interested in linking slots, like url.
Where is it allowed to be used? https://microbiomedata.github.io/nmdc-schema/url/
Where has it been used in practice?
PREFIX nmdc: <https://w3id.org/nmdc/>
select
?st ?ot ?odt (count(?s) as ?count)
where {
graph <https://api.microbiomedata.org> {
?s nmdc:url ?o .
optional {
?s a ?st
}
optional {
?o a ?ot
}
BIND (IF(isIRI(?o), "IRI",
IF(isLiteral(?o), str(datatype(?o)), "Unknown"))
AS ?odt)
}
}
group by ?st ?ot ?odt
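Since the same usage question applies to other linking slots, the query above can be generated per slot. This is a minimal sketch, assuming the slot is addressed as `nmdc:<slot>` in the same named graph; the function name `usage_query` and the slot-name validation rule are hypothetical.

```python
import re
from string import Template

# Same shape as the query above, with the slot and graph left as parameters.
USAGE_QUERY = Template("""PREFIX nmdc: <https://w3id.org/nmdc/>
select
?st ?ot ?odt (count(?s) as ?count)
where {
  graph <$graph> {
    ?s nmdc:$slot ?o .
    optional { ?s a ?st }
    optional { ?o a ?ot }
    BIND (IF(isIRI(?o), "IRI",
          IF(isLiteral(?o), str(datatype(?o)), "Unknown"))
          AS ?odt)
  }
}
group by ?st ?ot ?odt
""")


def usage_query(slot, graph="https://api.microbiomedata.org"):
    """Build the usage-count query for one slot name."""
    # Accept only simple identifiers so the slot cannot inject extra SPARQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", slot):
        raise ValueError(f"not a simple slot name: {slot!r}")
    return USAGE_QUERY.substitute(slot=slot, graph=graph)
```

Calling `usage_query("url")` reproduces the query above, and `usage_query("websites")` gives the equivalent for the websites slot without the subproperty traversal.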
|
There are also websites and homepage_website slots. There's a UrlValue class in the nmdc-schema but not in the berkeley-schema-fy24. |
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
?st ?p ?ot ?odt (count(?s) as ?count)
where {
graph <https://api.microbiomedata.org> {
?s ?p ?o .
?p rdfs:subPropertyOf* nmdc:websites .
optional {
?s a ?st
}
optional {
?o a ?ot
}
BIND (IF(isIRI(?o), "IRI",
IF(isLiteral(?o), str(datatype(?o)), "Unknown"))
AS ?odt)
}
}
group by ?st ?p ?ot ?odt
homepage_website does not appear to be used in the nmdc-graph-2024-04-11 GraphDB repository.
|
maybe make potential problems:
We should at least assert |
To what degree do the
PREFIX nmdc: <https://w3id.org/nmdc/>
select
?p (count(?do) as ?count)
where {
graph <https://api.microbiomedata.org> {
?do a nmdc:DataObject ;
?p ?o .
}
}
group by ?p
order by desc(count(?do))
3,552 out of 179,528 |
Slot analysis of the
PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX dcterms: <http://purl.org/dc/terms/>
select *
where {
graph <https://api.microbiomedata.org> {
?do a nmdc:DataObject .
}
minus {
?do nmdc:url ?url
}
optional {
?do nmdc:name ?name
}
optional {
?do dcterms:description ?description .
bind(replace(?description, " for.*$", "") as ?description_pattern)
}
optional {
?do nmdc:file_size_bytes ?file_size_bytes
}
optional {
?do nmdc:md5_checksum ?md5_checksum
}
optional {
?do nmdc:data_object_type ?data_object_type
}
optional {
?do nmdc:type ?nmdc_type
}
optional {
?do nmdc:was_generated_by ?generator
}
optional {
?do nmdc:alternative_identifiers ?alternative_identifiers
}
}
none assert any of these either
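The description_pattern binding in the query above strips everything from the first " for" onward, so descriptions that differ only in their trailing identifier collapse into one pattern. The same step can be sketched offline on exported descriptions. The function names and any sample descriptions used with them are hypothetical, not values from the graph.

```python
import re
from collections import Counter


def description_pattern(description):
    """Mirror replace(?description, " for.*$", "") from the query."""
    return re.sub(r" for.*$", "", description)


def count_patterns(descriptions):
    """Count how many descriptions fall under each pattern."""
    return Counter(description_pattern(d) for d in descriptions)
```

One caveat of this pattern, in SPARQL and Python alike: it also truncates at words that merely begin with "for" after a space, so the resulting groups are worth a manual check.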
|
@turbomam Can we close this or convert to a discussion? I think the only outstanding sub/referenced issues are #1852 and #1852 is blocked but planned for completion by myself or @brynnz22. |
@kheal I tried to be self-sufficient and convert it into a discussion myself. I was asked what category to put it in. I'll look for the most popular discussion categories, but if you have any guidelines, please tell me. |
@turbomam - most of the categories in the discussions are defaults. I did make the schema clean-up category for issues that are about stylistic changes to the schema or changes that better adhere to best practices. |
Information class as a direct subclass of NamedThing
DataObject a direct subclass of Information.
Configuration class as a direct subclass of NamedThing
Calibration class as a subclass (possibly indirect) of Information.
Calibration to other classes has_input or has_output