In LinkML there is a clean separation of schema and data (of course, schema is just data conforming to the metamodel).
This makes it easy to create reusable schemas or standards that can be applied across multiple datasets.
However, sometimes it's convenient to have a bespoke schema for an individual dataset. These aren't typically referred to as schemas or data models, but perhaps as data dictionaries.
There are a number of frameworks that bundle these together, e.g.:

- A dump of a SQL database containing both DDL and INSERTs
Of course, this use case is already supported in LinkML, in that you can always define a schema YAML for a one-off dataset in CSV, JSON, or whatever format. But it might be more convenient if these could be bundled into one file (no cheating - not a tar file...). It might also make it easier to map to the above frameworks.
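For concreteness, a minimal sketch of the status quo, with schema and data shipped as two separate files (the schema id and field names below are illustrative, not from a real dataset):

```yaml
## my_schema.yaml -- a standalone LinkML schema for a one-off dataset
id: https://example.org/temp-recordings
name: temp-recordings
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
classes:
  TemperatureRecording:
    attributes:
      site_id:
        range: string
      temperature_celsius:
        range: float
```

with the data in a separate `my.csv`:

```csv
site_id,temperature_celsius
A1,18.5
```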
Complicating the picture, these other frameworks sometimes contain ways of describing the dataset itself - how it was collected, who collected it. Sometimes this is generic, sometimes domain specific.
There are a few approaches here, depending on whether we consider the composed representation schema-first or data-first.
## schema-first
In a schema-first approach, we would simply allow the schema to be the vehicle for dataset distribution. We could add slots to classes to indicate the source of the data (as csvw does).
E.g.
```yaml
classes:
  TemperatureRecording:
    data_files:  ## new metaslot
      - location: ./my.csv
    attributes:
      ...
```
This would be generic.
If we wanted to capture additional metadata about the dataset, this could go in the schema metadata itself, but that mixes concerns a bit: we obviously want to keep the schema metadata generic and not introduce domain-specific modeling.
But this could be done with annotations; e.g.
```yaml
id: temp-recordings
annotations:
  sampling_protocols: <domain-specific model here...>
  ...
classes:
  TemperatureRecording:
    data_files:  ## new metaslot
      - location: ./my.csv
    annotations:
      collection_start_date: ...  ## domain-specific
    attributes:
      ...
```

The schema of the annotations would be encoded by a meta-schema; see https://linkml.io/linkml/schemas/annotations.html#validation-of-annotations
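As a rough, hypothetical illustration (not taken from the linked docs), such a meta-schema could model the domain-specific annotation keys as ordinary LinkML attributes, which annotation values could then be validated against:

```yaml
## hypothetical annotation meta-schema (illustrative names)
id: https://example.org/temp-recording-annotations
name: temp-recording-annotations
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
classes:
  TemperatureRecordingAnnotations:
    description: Permissible domain-specific annotations for this schema
    attributes:
      sampling_protocols:
        range: string
        multivalued: true
      collection_start_date:
        range: date
```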
A variant of this is that a dataset schema inherits from the LinkML metamodel, as in the sketch below.
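A minimal sketch of that variant, assuming the metamodel is importable as `linkml:meta` and that `class_definition` is the metamodel class to specialize; the extension class and its slot are hypothetical:

```yaml
id: https://example.org/temp-recordings-schema
name: temp-recordings-schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:meta
classes:
  DataAwareClassDefinition:
    is_a: class_definition
    description: A class definition that also carries pointers to its data files
    attributes:
      data_files:
        multivalued: true
        range: string  ## hypothetical; a fuller model would describe location, format, etc.
```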
## data-first
With this approach, every community would define its own dataset schema (hopefully extending dcat/schema.org/d4d/etc.), but we would have a mechanism for saying that parts of the data map to a schema; e.g.:
```yaml
id: my-dataset
sampling_protocols: ...
uri_prefixes:  ## everything here must conform to linkml prefixes
  ...
data_files:
  - location: my.csv
    fields:  ## everything here must conform to linkml attributes
      ...
```
- the schema for this dataset would map `uri_prefixes` to `linkml:prefixes` and `fields` to `linkml:attributes` (see the sketch after this list)
- a variant of this is for the dataset schema to import the metamodel, allowing reuse of its components
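A hedged sketch of what that mapping could look like, using LinkML's `slot_uri` to point the community's slots at metamodel elements; the schema id and class structure are made up for illustration, and only the `linkml:prefixes` / `linkml:attributes` targets come from the text above:

```yaml
## hypothetical community dataset schema (names illustrative)
id: https://example.org/my-dataset-schema
name: my-dataset-schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
classes:
  Dataset:
    attributes:
      sampling_protocols:
        range: string  ## domain-specific; no metamodel mapping
      uri_prefixes:
        slot_uri: linkml:prefixes  ## interpret as metamodel prefixes
      data_files:
        range: DataFile
        multivalued: true
        inlined_as_list: true
  DataFile:
    attributes:
      location:
        range: string
      fields:
        slot_uri: linkml:attributes  ## interpret as metamodel attributes
```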