In LinkML there is a clean separation of schema and data (of course, schema is just data conforming to the metamodel).
This makes it easy to create reusable schemas or standards that can be applied across multiple datasets.
However, sometimes it's convenient to have a bespoke schema for an individual dataset. These aren't typically referred to as schemas or data models, but perhaps as data dictionaries.
There are a number of frameworks that bundle these together, e.g.:

- A dump of a SQL database containing both DDL and INSERTs
Of course, this use case is already supported in LinkML, in that you can always define a schema YAML for a one-off dataset in CSV, JSON, or whatever format. But it might be more convenient if these could be bundled into one file (no cheating - not a tar file...). It might also make it easier to map to the above frameworks.
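For concreteness, a minimal sketch of the status quo, with schema and data shipped as two separate files (the schema id and field names below are illustrative, not from a real dataset):

```yaml
## my_schema.yaml -- a standalone LinkML schema for a one-off dataset
id: https://example.org/temp-recordings
name: temp-recordings
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
classes:
  TemperatureRecording:
    attributes:
      site_id:
        range: string
      temperature_celsius:
        range: float
```

with the data in a separate `my.csv`:

```csv
site_id,temperature_celsius
A1,18.5
```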
Complicating the picture, these other frameworks sometimes contain ways of describing the dataset itself - how it was collected, who collected it. Sometimes this is generic, sometimes domain specific.
There are a few approaches here, depending on whether we consider the composed representation schema-first or data-first.
## schema-first
In a schema-first approach, we would simply allow the schema to be the vehicle for dataset distribution. We could add slots to classes to indicate the source of the data (as csvw does).
E.g.
```yaml
classes:
  TemperatureRecording:
    data_files:  ## new metaslot
      - location: ./my.csv
    attributes:
      ...
```
This would be generic.
If we wanted to capture additional metadata about the dataset, this could go in the schema metadata itself, but that mixes concerns a bit: we obviously want to keep the schema metadata generic and not introduce domain-specific modeling.
But this could be done with annotations; e.g.
```yaml
id: temp-recordings
annotations:
  sampling_protocols: <domain-specific model here...>
  ...
classes:
  TemperatureRecording:
    data_files:  ## new metaslot
      - location: ./my.csv
    annotations:
      collection_start_date: ...  ## domain-specific
    attributes:
      ...
```

The schema of the annotations would be encoded by a meta-schema; see https://linkml.io/linkml/schemas/annotations.html#validation-of-annotations
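As a rough, hypothetical illustration (not taken from the linked docs), such a meta-schema could model the domain-specific annotation keys as ordinary LinkML attributes, which annotation values could then be validated against:

```yaml
## hypothetical annotation meta-schema (illustrative names)
id: https://example.org/temp-recording-annotations
name: temp-recording-annotations
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
classes:
  TemperatureRecordingAnnotations:
    description: Permissible domain-specific annotations for this schema
    attributes:
      sampling_protocols:
        range: string
        multivalued: true
      collection_start_date:
        range: date
```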
A variant of this is that a dataset schema inherits from the LinkML metamodel, as in the sketch below.
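A minimal sketch of that variant, assuming the metamodel is importable as `linkml:meta` and that `class_definition` is the metamodel class to specialize; the extension class and its slot are hypothetical:

```yaml
id: https://example.org/temp-recordings-schema
name: temp-recordings-schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:meta
classes:
  DataAwareClassDefinition:
    is_a: class_definition
    description: A class definition that also carries pointers to its data files
    attributes:
      data_files:
        multivalued: true
        range: string  ## hypothetical; a fuller model would describe location, format, etc.
```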
## data-first
With this approach, every community would define its own dataset schema (hopefully extending dcat/schema.org/d4d/etc.), but we would have a mechanism for saying that parts of the data map to a schema; e.g.:
```yaml
id: my-dataset
sampling_protocols: ...
uri_prefixes:  ## everything here must conform to linkml prefixes
  ...
data_files:
  - location: my.csv
    fields:  ## everything here must conform to linkml attributes
      ...
```
- the schema for this dataset would map `uri_prefixes` to `linkml:prefixes` and `fields` to `linkml:attributes` (see the sketch after this list)
- a variant of this is for the dataset schema to import the metamodel, allowing reuse of its components
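A hedged sketch of what that mapping could look like, using LinkML's `slot_uri` to point the community's slots at metamodel elements; the schema id and class structure are made up for illustration, and only the `linkml:prefixes` / `linkml:attributes` targets come from the text above:

```yaml
## hypothetical community dataset schema (names illustrative)
id: https://example.org/my-dataset-schema
name: my-dataset-schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
classes:
  Dataset:
    attributes:
      sampling_protocols:
        range: string  ## domain-specific; no metamodel mapping
      uri_prefixes:
        slot_uri: linkml:prefixes  ## interpret as metamodel prefixes
      data_files:
        range: DataFile
        multivalued: true
        inlined_as_list: true
  DataFile:
    attributes:
      location:
        range: string
      fields:
        slot_uri: linkml:attributes  ## interpret as metamodel attributes
```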