Master `Information` class issue #1947

turbomam · 2024-04-30T01:20:15Z

create Information class as a direct subclass of NamedThing
make DataObject a direct subclass of Information.
- it's possible that we could place DataObject as a subclass of some intermediate class like Data in the future
add a new Configuration class as a direct subclass of NamedThing
- further intermediate classes may be added later
optimize mappings between these new NMDC classes and OBI, IAO, etc. classes
ensure that the new NMDC classes don't add new, ambiguous relationships
add a new Calibration class as a subclass (possibly indirect) of Information.
- this will require especially careful analysis of the relationships linking Calibration to other classes
make any slots that model any kind of input or output relationship subproperties of has_input or has_output
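The proposed placement can be sketched with plain Python classes. This is only an illustration: the real classes would be defined in the LinkML schema, the class bodies are placeholders, and `ancestors` is a hypothetical helper. Calibration is shown as a direct subclass of Information, although the task list allows it to be indirect.

```python
# Illustrative only: mirrors the hierarchy proposed in the task list above.
class NamedThing:
    """Root class, as in the nmdc-schema."""


class Information(NamedThing):
    """Proposed grouping class, a direct subclass of NamedThing."""


class DataObject(Information):
    """Existing class, proposed to sit directly under Information."""


class Configuration(NamedThing):
    """Proposed new class, a direct subclass of NamedThing per the task list."""


class Calibration(Information):
    """Proposed new class; shown as direct here, but it may end up indirect."""


def ancestors(cls):
    """Hypothetical helper: names of all superclasses, nearest first."""
    return [c.__name__ for c in cls.__mro__[1:] if c is not object]


print(ancestors(DataObject))
print(ancestors(Calibration))
```

If an intermediate class such as Data were added later, only the base class of `DataObject` would change; `issubclass(DataObject, Information)` would still hold.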


turbomam · 2024-04-30T01:58:07Z

The nmdc-schema has well-developed subclasses of NamedThing for MaterialEntities and PlannedProcesses. Those two classes (including all of their subclasses) are implicitly disjoint with one another.

The nmdc-schema also has a DataObject class, which is implicitly disjoint from both MaterialEntities and PlannedProcesses, but it is not placed under any intermediate organizing class.

A reasonable grouping class would be Information. Information would be disjoint from MaterialEntities and PlannedProcesses, but like MaterialEntities, Information could be either the input into a PlannedProcesses or the output from one. A DataObject as output from a DataGeneration is explicitly modeled in the berkeley-schema-fy24.

WorkflowExecutions in the berkeley-schema-fy24 also have inputs and outputs, but I can't recall right now whether these relationships are populated with DataObjects. That doesn't appear to be explicitly constrained in berkeley-schema-fy24.

A common but not completely satisfying definition of Information is "anything that decreases uncertainty". For example, one doesn't know how a PlannedProcess was executed unless information is provided. Likewise, one is uncertain of the results of a PlannedProcess until Information is observed, saved, etc.

There are multiple patterns by which information can be associated with a process in a linked data model:

1. The information values can be bound directly to the process instance.
2. They can be bound into an instance of another class with a direct relationship to the process.
3. They can be mentioned (but not bound, embedded, included, etc.) by linking to a file or web resource.
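A minimal sketch of the three patterns, using Python dataclasses rather than the actual schema. All class names, field names, identifiers, and the example URL are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Settings:
    """Second pattern: values bound into an instance of their own class."""
    id: str
    values: Dict[str, object]


@dataclass
class Process:
    id: str
    # First pattern: values bound directly to the process instance.
    inline_values: Dict[str, object] = field(default_factory=dict)
    # Second pattern: a direct relationship to an instance of another class.
    settings_id: Optional[str] = None
    # Third pattern: the information is only mentioned via a link.
    resource_url: Optional[str] = None


def processes_using(settings: Dict[str, Settings], processes: List[Process],
                    key: str, value: object) -> List[str]:
    """Find processes whose linked Settings instance has key == value."""
    return [
        p.id for p in processes
        if p.settings_id in settings and settings[p.settings_id].values.get(key) == value
    ]


# One shared Settings instance reused by many processes (second pattern).
shared = Settings(id="settings:1", values={"temperature_celsius": 25})
settings_by_id = {shared.id: shared}
procs = [Process(id=f"proc:{i}", settings_id=shared.id) for i in range(3)]
# A process that only links to an external file (third pattern).
procs.append(Process(id="proc:3", resource_url="https://example.org/settings.json"))

matches = processes_using(settings_by_id, procs, "temperature_celsius", 25)
print(matches)
```

With the second pattern, the shared values are stored once and can be searched directly. With the third, the values stay opaque unless the linked resource is fetched and parsed.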

One consideration for selecting between those patterns is whether users need the ability to search through the information, and the degree to which a constrained number of information patterns will be associated with a large number of processes. Direct search over a small, highly repeated set of information patterns is strong justification for making the information patterns first-class citizens in their own table, collection, etc.

DataObjects are currently used to capture process results and follow pattern 3. The DataObjects generally (?) link to their external resource with the url slot.

The ideal modeling of Information in the nmdc-schema will take advantage of hierarchical organization and will use a minimal number of relationship patterns.

turbomam · 2024-04-30T02:11:42Z

I assume that several people will want to have input into the implementation of this issue. I would like one primary contact person. Could that be @kheal ?

turbomam · 2024-04-30T02:20:31Z

Add Calibration class and associated slots and enums berkeley-schema-fy24#133
Edit slots and usages to accommodate multiple detection configurations for single MassSpectrometry instance berkeley-schema-fy24#141
add configuration_data_objects slot berkeley-schema-fy24#136
Add permissible values to FileTypeEnum and DataCategoryEnum to accomodate workflow configuration DataObjects berkeley-schema-fy24#140
Refactor has_calibration to use a class #1918
Remove has_calibration from WorkflowExecution subclasses after MassSpectrometry change sheets #1852
Implement migrator that populates WorkflowExecutionActivity.has_calibration field with a Calibration.id value #1761
update has_calibration from Class: MetabolomicsAnalysisActivity #1570
Example DataObject required for ChromatographicSeparationProcess-GC-has_calibration.yaml in berkeley-schema-fy24 #1850
document has_calibration in metabolomics analysis activity #304
Edit slot usages to accommodate multiple detection configurations for single MassSpectrometry instance #1910
Add permissible values to FileTypeEnum and DataCategoryEnum to accomodate workflow configuration DataObjects #1912
Metabolomics Schema Updates Meta Issue #1905
DO NOT MERGE: Cumulative configurations etc (reuse as documentation only) berkeley-schema-fy24#137

turbomam · 2024-04-30T02:29:02Z

Provide some tools for interrogating OBI (or some subset) with a large-context LLM. This could intrinsically be expensive.

input token limits via API:

Gemini 1.5 (via Vertex API): 1 M
Claude 3 opus: 200 k
ChatGPT 4 Turbo: 128 k
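As a rough aid for choosing a model, the limits above can be compared against the token count of whatever OBI subset is being submitted. This is a sketch: `models_that_fit` is a hypothetical helper, the token count must come from the provider's own tokenizer, and the limits are the ones quoted in this comment, which may have changed since.

```python
# Input token limits as quoted in this thread; treat them as a snapshot.
INPUT_TOKEN_LIMITS = {
    "Gemini 1.5 (via Vertex API)": 1_000_000,
    "Claude 3 Opus": 200_000,
    "ChatGPT 4 Turbo": 128_000,
}


def models_that_fit(n_tokens, limits=INPUT_TOKEN_LIMITS):
    """Return the models whose input limit is at least n_tokens, largest first."""
    fitting = [(limit, name) for name, limit in limits.items() if limit >= n_tokens]
    return [name for limit, name in sorted(fitting, reverse=True)]


# Hypothetical prompt size, for illustration only.
print(models_that_fit(150_000))
```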

Using these models through their APIs requires more coding than using them through their web interfaces, but they offer more traceability and repeatability.

Qualitatively, I feel like Claude gives better results than Gemini 1.5, but it is more expensive and harder to set up.

BBOP staff are provided with funding for ChatGPT.

see also

OBI:0000654 device setting (a quality?)
OBI:0000818 calibration (a process)
IAO:0000109 measurement datum

turbomam · 2024-04-30T13:40:28Z

I am especially interested in linking slots, like url

Where is it allowed to be used?

https://microbiomedata.github.io/nmdc-schema/url/

Name	Description	Modifies Slot
DataObject	An object that primarily consists of symbols that represent information	no
ImageValue	An attribute value representing an image	no
Protocol		no

Where has it been used in practice?

PREFIX nmdc: <https://w3id.org/nmdc/>
select
?st ?ot ?odt (count(?s) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?s nmdc:url ?o .
        optional {
            ?s a ?st
        }
        optional {
            ?o a ?ot
        }
        BIND (IF(isIRI(?o), "IRI", 
                IF(isLiteral(?o), str(datatype(?o)), "Unknown")) 
            AS ?odt) 
    }
}
group by ?st ?ot ?odt

st	ot	odt	count
nmdc:DataObject		xsd:string	175976
nmdc:ImageValue		xsd:string	7

turbomam · 2024-04-30T13:43:45Z

There are also websites and homepage_website slots.

There's a UrlValue class in the nmdc-schema but not in the berkeley-schema-fy24

turbomam · 2024-04-30T13:46:39Z

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select
?st ?p ?ot ?odt (count(?s) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?s ?p ?o .
        ?p rdfs:subPropertyOf* nmdc:websites .
        optional {
            ?s a ?st
        }
        optional {
            ?o a ?ot
        }
        BIND (IF(isIRI(?o), "IRI", 
                IF(isLiteral(?o), str(datatype(?o)), "Unknown")) 
            AS ?odt) 
    }
}
group by ?st ?p ?ot ?odt

homepage_website does not appear to be used in the nmdc-graph-2024-04-11 GraphDB repository.

st	p	ot	odt	count
nmdc:Study	nmdc:websites		xsd:string	35

turbomam · 2024-04-30T13:49:10Z

maybe make websites a subproperty of url

potential problems:

url is single valued and websites is multi-valued
websites has a pattern constraint

We should at least assert see_alsos
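The two concerns can be checked mechanically before asserting the subproperty relationship. The sketch below uses plain dicts as stand-ins for slot definitions. The `subproperty_concerns` helper is hypothetical, and the pattern string is a placeholder, not the actual constraint on websites.

```python
def subproperty_concerns(child, parent):
    """List reasons why declaring `child` a subproperty of `parent` needs care."""
    concerns = []
    # A multivalued child under a single-valued parent changes cardinality expectations.
    if child.get("multivalued", False) and not parent.get("multivalued", False):
        concerns.append(f"{child['name']} is multivalued but {parent['name']} is single-valued")
    # A pattern on the child only means its values are checked more strictly than the parent's.
    if child.get("pattern") and not parent.get("pattern"):
        concerns.append(f"{child['name']} has a pattern constraint that {parent['name']} lacks")
    return concerns


# Stand-in slot definitions; the pattern value is a placeholder for illustration.
url = {"name": "url", "multivalued": False}
websites = {"name": "websites", "multivalued": True, "pattern": r"^https?://.*$"}

concerns = subproperty_concerns(websites, url)
for concern in concerns:
    print(concern)
```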

turbomam · 2024-04-30T13:55:20Z

To what degree do the DataObjects use the url slot?

PREFIX nmdc: <https://w3id.org/nmdc/>
select
?p (count(?do) as ?count)
where {
    graph <https://api.microbiomedata.org> {
        ?do a nmdc:DataObject ;
            ?p ?o .
    }
}
group by ?p
order by desc(count(?do))

p	count
rdf:type	179528
nmdc:name	179528
dcterms:description	179528
nmdc:url	175976
nmdc:file_size_bytes	172963
nmdc:md5_checksum	169777
nmdc:data_object_type	165839
nmdc:type	164546
nmdc:was_generated_by	4847
nmdc:alternative_identifiers	146

3,552 out of 179,528 DataObjects are missing urls
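The figure follows from the table above by subtraction. The sketch below repeats that arithmetic for every predicate. It assumes each DataObject contributes at most one triple per predicate, which would not hold for a multivalued slot such as alternative_identifiers, so the numbers for such slots are an upper bound on the missing DataObjects.

```python
# Triple counts per predicate, copied from the query results above.
counts = {
    "rdf:type": 179528,
    "nmdc:name": 179528,
    "dcterms:description": 179528,
    "nmdc:url": 175976,
    "nmdc:file_size_bytes": 172963,
    "nmdc:md5_checksum": 169777,
    "nmdc:data_object_type": 165839,
    "nmdc:type": 164546,
    "nmdc:was_generated_by": 4847,
    "nmdc:alternative_identifiers": 146,
}

# Every DataObject has an rdf:type triple, so that count is taken as the total.
total = counts["rdf:type"]

# Number of DataObjects lacking each predicate, under the one-triple assumption.
missing = {predicate: total - n for predicate, n in counts.items()}

for predicate, n in missing.items():
    print(f"{predicate}\t{n}\t{n / total:.1%}")
```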

turbomam · 2024-04-30T14:13:08Z

Slot analysis of the DataObjects that don't assert url in nmdc-graph-2024-04-11

PREFIX nmdc: <https://w3id.org/nmdc/>
PREFIX dcterms: <http://purl.org/dc/terms/>
select *
where {
    graph <https://api.microbiomedata.org> {
        ?do a nmdc:DataObject .
    }
    minus {
        ?do nmdc:url ?url
    }
    optional {
        ?do nmdc:name ?name
    }
    optional {
        ?do dcterms:description ?description .
        bind(replace(?description, " for.*$", "") as ?description_pattern)
    }
    optional {
        ?do nmdc:file_size_bytes ?file_size_bytes
    }
    optional {
        ?do nmdc:md5_checksum ?md5_checksum
    }
    optional {
        ?do nmdc:data_object_type ?data_object_type
    }
    optional {
        ?do nmdc:type ?nmdc_type
    }
    optional {
        ?do nmdc:was_generated_by ?generator
    }
    optional {
        ?do nmdc:alternative_identifiers ?alternative_identifiers
    }
}

description_pattern	Count
Assembled AGP file	44
Assembled contigs fasta	44
Assembled scaffold fasta	44
Filtered read data	1
Filtered read data stats	1
Full scan GC-MS (but not GC QExactive, which is EI-HMS)	42
High res MS with high res CID MSn (and possibly some low res MSn)	14
High res MS with high res HCD MSn	43
High res MS with high res HCD MSn and low res CID MSn	175
High res MS with low res CID MSn	116
High resolution MS spectra only	2118
Metagenome Alignment BAM file	44
Metagenome Contig Coverage Stats	44
Raw sequencer read data	822
Total Result	3552

None of these DataObjects assert any of the following slots either:

file_size_bytes
md5_checksum
data_object_type
was_generated_by
alternative_identifiers
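The `description_pattern` column comes from the `replace(?description, " for.*$", "")` call in the query. A Python equivalent is sketched below, together with a check that the per-pattern counts add up to the reported total. The example descriptions passed to the function are made up for illustration; only the counts are taken from the table above.

```python
import re


def description_pattern(description):
    """Drop everything from the first ' for' onward, like the SPARQL replace()."""
    return re.sub(r" for.*$", "", description)


# Hypothetical descriptions; real ones end with a sample or run identifier.
print(description_pattern("Raw sequencer read data for sample-A"))
print(description_pattern("High resolution MS spectra only"))

# Counts per pattern, copied from the table above.
reported = {
    "Assembled AGP file": 44,
    "Assembled contigs fasta": 44,
    "Assembled scaffold fasta": 44,
    "Filtered read data": 1,
    "Filtered read data stats": 1,
    "Full scan GC-MS (but not GC QExactive, which is EI-HMS)": 42,
    "High res MS with high res CID MSn (and possibly some low res MSn)": 14,
    "High res MS with high res HCD MSn": 43,
    "High res MS with high res HCD MSn and low res CID MSn": 175,
    "High res MS with low res CID MSn": 116,
    "High resolution MS spectra only": 2118,
    "Metagenome Alignment BAM file": 44,
    "Metagenome Contig Coverage Stats": 44,
    "Raw sequencer read data": 822,
}

total_without_url = sum(reported.values())
print(total_without_url)
```

The total matches the 3,552 DataObjects without a url found earlier, so the patterns account for every one of them.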

kheal · 2024-07-23T16:39:06Z

@turbomam Can we close this or convert to a discussion? I think the only outstanding sub/referenced issues are #1852 and
"make any slots that model any kind of input or output relationship subproperties of has_input or has_output".

#1852 is blocked but planned for completion by myself or @brynnz22.
The second non-completed task on this (making has_input and has_output subslots) seems more tractable as a separate issue.

turbomam · 2024-07-26T18:49:50Z

@kheal I tried to be self-sufficient and convert it into a discussion myself. I was asked what category to put it in. I'll look for the most popular discussion categories, but if you have any guidelines please tell me.

kheal · 2024-07-26T20:08:09Z

@turbomam - most of the categories in the discussions are default. I did make the schema clean up category to put issues that were more about making changes to the schema that were stylistic or to better adhere to best practices.

turbomam changed the title ~~Master information class issue~~ Master Information class issue Apr 30, 2024
