Example data files for NOM sample preparation #146

Merged
merged 7 commits into from
May 15, 2024

Conversation

anastasiyaprymolenna
Collaborator

Valid metadata to use for NOM workflows. Includes extraction and SPE protocols.

@turbomam The test fails to validate a CURIE, but the output from poetry is not informative as to which CURIE is not accepted. Is there a way to make poetry give a more informative traceback than what follows? Or is there a way to speed up testing for just one example file? There are 66 CURIEs in that file to test if I am to do it one by one.

  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/bin/linkml-run-examples", line 8, in <module>
    sys.exit(cli())
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml/workspaces/example_runner.py", line 313, in cli
    runner.process_examples()
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml/workspaces/example_runner.py", line 138, in process_examples
    self.process_examples_from_list(input_examples, fmt, False)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml/workspaces/example_runner.py", line 206, in process_examples_from_list
    rdflib_dumper.dump(obj, to_file=output_file, schemaview=sv, prefix_map=self.prefix_map)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/rdflib_dumper.py", line 168, in dump
    super().dump(element, to_file, schemaview=schemaview, fmt=fmt, prefix_map=prefix_map)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/dumper_root.py", line 19, in dump
    output_file.write(self.dumps(element, **_))
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/rdflib_dumper.py", line 186, in dumps
    return self.as_rdf_graph(element, schemaview, prefix_map=prefix_map).\
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/rdflib_dumper.py", line 68, in as_rdf_graph
    self.inject_triples(element, schemaview, g)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/rdflib_dumper.py", line 141, in inject_triples
    v_node = self.inject_triples(v, schemaview, graph, slot.range)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/rdflib_dumper.py", line 141, in inject_triples
    v_node = self.inject_triples(v, schemaview, graph, slot.range)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/rdflib_dumper.py", line 141, in inject_triples
    v_node = self.inject_triples(v, schemaview, graph, slot.range)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/rdflib_dumper.py", line 113, in inject_triples
    return self._as_uri(element, id_slot, schemaview)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/dumpers/rdflib_dumper.py", line 193, in _as_uri
    return schemaview.namespaces().uri_for(element_id)
  File "/home/prym311/.cache/pypoetry/virtualenvs/nmdc-schema-3tqC4PG7-py3.10/lib/python3.10/site-packages/linkml_runtime/utils/namespaces.py", line 228, in uri_for
    raise ValueError(f"{TypedNode.yaml_loc(uri_or_curie)}Unknown CURIE prefix: {prefix}")
ValueError: : Unknown CURIE prefix: @base
make: *** [project.Makefile:84: examples/output] Error 1

@anastasiyaprymolenna anastasiyaprymolenna changed the title from "Nom example sample prep metadata" to "NOM example sample prep metadata" May 9, 2024
@turbomam
Member

Your concern is valid, but it's not really Poetry's fault.

"ValueError: : Unknown CURIE prefix: @base" means that there's some value that is supposed to be a CURIE, but it doesn't have a prefix.

Yes, the files can be validated individually with linkml-validate but that doesn't catch everything that linkml-run-examples does.

I'll look through this over the next day or two and share my insights.

@turbomam
Member

@pkalita-lbl what is the recommended next debugging step when a validation run crashes like this and doesn't report an id or line number for the offending value?

@turbomam
Member

I'm running this now, from src/scripts

#!/bin/bash

# Path to the schema file
SCHEMA="../../nmdc_schema/nmdc_materialized_patterns.yaml"

# Directory containing the data files
DATA_DIR="../../src/data/valid"

# Loop over all files in the directory
for file in "$DATA_DIR"/*; do
  # Extract the class name from the filename by cutting on the first hyphen
  class_name=$(basename "$file" | cut -d'-' -f1)

  echo "$file"
  echo "$class_name"

  # Run the linkml-validate command
  linkml-validate --schema "$SCHEMA" --target-class "$class_name" "$file"
done
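
The same loop could also be driven from Python. The following is only a minimal sketch: it builds the linkml-validate command line for each file with the same --schema and --target-class options used in the script above, and it does not run anything. The helper names class_name_from_filename and build_validate_command are made up for illustration.

```python
from pathlib import Path


def class_name_from_filename(path: str) -> str:
    """Return the text before the first hyphen of the file name (same rule as `cut -d'-' -f1`)."""
    return Path(path).name.split("-", 1)[0]


def build_validate_command(schema: str, data_file: str) -> list[str]:
    """Assemble the linkml-validate invocation for one example file (hypothetical helper)."""
    return [
        "linkml-validate",
        "--schema", schema,
        "--target-class", class_name_from_filename(data_file),
        data_file,
    ]


if __name__ == "__main__":
    schema = "../../nmdc_schema/nmdc_materialized_patterns.yaml"
    for data_file in ["../../src/data/valid/Database-mags.yaml"]:
        # Print the command instead of executing it; pass the list to subprocess.run to execute.
        print(" ".join(build_validate_command(schema, data_file)))
```

Returning the command as a list keeps file names with spaces intact if it is later handed to subprocess.run.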

@pkalita-lbl

You can see from the stacktrace that the problem isn't in the validation. It emanates from the line where ExampleRunner attempts to serialize the example data as a TTL-file. If it were me I'd point my local linkml development environment at this schema and example directory and instrument that line (however you prefer to do that -- your IDE debugger, print statements, etc) to see how far it gets before the exception is raised. Then if necessary I'd debug down into linkml-runtime (where the exception is actually thrown) as well.

This also speaks to the fact that linkml-run-examples might need a verbose flag and additional logging statements to (optionally) tell the user things about what example file is being processed, etc.
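
As a rough illustration of the kind of per-file reporting described above, here is a small sketch in plain Python. It is not part of linkml-run-examples; process_each and the process callable are hypothetical names, and the callable stands in for whatever step (validation, RDF serialization, etc.) is being debugged.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger("example_debug")


def process_each(paths: Iterable[str], process: Callable[[str], None]) -> dict[str, str]:
    """Run `process` on every path and collect failures instead of stopping at the first one."""
    failures: dict[str, str] = {}
    for path in paths:
        logger.info("processing %s", path)
        try:
            process(path)
        except Exception as exc:  # deliberately broad: this is a debugging aid
            failures[path] = f"{type(exc).__name__}: {exc}"
            logger.error("failed on %s: %s", path, failures[path])
    return failures
```

Because every file is attempted, a single run names all failing files rather than only the first one that raises.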

@turbomam
Member

@pkalita-lbl and I made the same discovery around the same time. The linkml-convert exercise didn't help. I'll try converting each valid example to RDF, which is where the absence of a CURIE prefix is especially problematic.

@turbomam
Member

@cmungall is there some way to assert a default CURIE base in the RDF conversion process?

@turbomam
Member

Step-wise RDF generation over the example files is revealing several errors:

@turbomam
Member

src/data/valid/Database-AssemblyAnalysis-1.yaml

jsonschema.exceptions.ValidationError: {'nmdc:FailureCategorization': {'type': 'nmdc:FailureCategorization', 'qc_failure_what': 'other', 'qc_failure_where': 'MetagenomeAssembly'}} is not of type 'array'

Failed validating 'type' in schema['properties']['metagenome_assembly_set']['items']['properties']['has_failure_categorization']:
    {'items': {'$ref': '#/$defs/FailureCategorization'}, 'type': 'array'}

On instance['metagenome_assembly_set'][0]['has_failure_categorization']:
    {'nmdc:FailureCategorization': {'qc_failure_what': 'other',
                                    'qc_failure_where': 'MetagenomeAssembly',
                                    'type': 'nmdc:FailureCategorization'}}

src/data/valid/Database-biosample-exhasutive.yaml

jsonschema.exceptions.ValidationError: {'nmdc:TextValue': {'type': 'nmdc:TextValue', 'has_raw_value': 'lime;1 kg/acre;2022-11-16T16:05:42+0000'}} is not of type 'array'

Failed validating 'type' in schema['properties']['biosample_set']['items']['properties']['agrochem_addition']:
    {'description': 'Addition of fertilizers, pesticides, etc. - amount '
                    'and time of applications',
     'items': {'$ref': '#/$defs/TextValue'},
     'type': 'array'}

On instance['biosample_set'][0]['agrochem_addition']:
    {'nmdc:TextValue': {'has_raw_value': 'lime;1 '
                                         'kg/acre;2022-11-16T16:05:42+0000',
                        'type': 'nmdc:TextValue'}}

src/data/valid/Database-mags.yaml

jsonschema.exceptions.ValidationError: {'nmdc:MagBin': {'type': 'nmdc:MagBin', 'bin_name': 'bins.3', 'bin_quality': 'LQ', 'completeness': 2.0, 'contamination': 0.0, 'gene_count': 294, 'num_16s': 0, 'num_23s': 0, 'num_5s': 0, 'num_t_rna': 1, 'number_of_contig': 11}} is not of type 'array'

Failed validating 'type' in schema['properties']['mags_set']['items']['properties']['mags_list']:
    {'items': {'$ref': '#/$defs/MagBin'}, 'type': 'array'}

On instance['mags_set'][0]['mags_list']:
    {'nmdc:MagBin': {'bin_name': 'bins.3',
                     'bin_quality': 'LQ',
                     'completeness': 2.0,
                     'contamination': 0.0,
                     'gene_count': 294,
                     'num_16s': 0,
                     'num_23s': 0,
                     'num_5s': 0,
                     'num_t_rna': 1,
                     'number_of_contig': 11,
                     'type': 'nmdc:MagBin'}}

src/data/valid/Database-NOM-material-processing.yaml

ValueError: File "Database-NOM-material-processing.yaml", line 10, col 19: Unknown CURIE prefix: @base

src/data/valid/Database-ReadQcAnalysisActivity-quality_fail.yaml

jsonschema.exceptions.ValidationError: {'nmdc:FailureCategorization': {'type': 'nmdc:FailureCategorization', 'qc_failure_what': 'malformed_data', 'qc_failure_where': 'ReadQcAnalysisActivity'}} is not of type 'array'

Failed validating 'type' in schema['properties']['read_qc_analysis_set']['items']['properties']['has_failure_categorization']:
    {'items': {'$ref': '#/$defs/FailureCategorization'}, 'type': 'array'}

On instance['read_qc_analysis_set'][0]['has_failure_categorization']:
    {'nmdc:FailureCategorization': {'qc_failure_what': 'malformed_data',
                                    'qc_failure_where': 'ReadQcAnalysisActivity',
                                    'type': 'nmdc:FailureCategorization'}}

src/data/valid/MetabolomicsAnalysis-1.yaml

jsonschema.exceptions.ValidationError: {'nmdc:MetaboliteIdentification': {'type': 'nmdc:MetaboliteIdentification', 'alternative_identifiers': ['kegg:C00583', 'cas:57-55-6'], 'highest_similarity_score': 0.9534156546099186, 'metabolite_identified': 'chebi:16997'}} is not of type 'array'

Failed validating 'type' in schema['properties']['has_metabolite_identifications']:
    {'items': {'$ref': '#/$defs/MetaboliteIdentification'}, 'type': 'array'}

On instance['has_metabolite_identifications']:
    {'nmdc:MetaboliteIdentification': {'alternative_identifiers': ['kegg:C00583',
                                                                   'cas:57-55-6'],
                                       'highest_similarity_score': 0.9534156546099186,
                                       'metabolite_identified': 'chebi:16997',
                                       'type': 'nmdc:MetaboliteIdentification'}}

@turbomam
Member

@anastasiyaprymolenna do you want me to fix these for you and push them back to your branch?

We'll also have to think about when we are going to back-merge berkeley-schema-fy24 main back into this. There have been a lot of changes.

to src/data/problem/valid
@turbomam
Member

The illegal CURIEs in src/data/valid/Database-NOM-material-processing.yaml were from known_as assertions where the nmdc prefix wasn't provided. For example:

known_as: chem-99-000005
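
A quick way to hunt for values like this before running the full example pipeline might look like the sketch below. It assumes the example data has already been loaded into plain dicts and lists, and that the caller supplies the names of slots expected to hold CURIEs; find_unprefixed and the sample structure (records, items) are illustrative only, with just known_as and the chem-99 values taken from this thread.

```python
from typing import Any, Iterator


def find_unprefixed(node: Any, curie_slots: set[str], path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, value) for string values in `curie_slots` that contain no ':' separator."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key in curie_slots:
                values = value if isinstance(value, list) else [value]
                for v in values:
                    if isinstance(v, str) and ":" not in v:
                        yield child, v
            # Keep descending so nested objects are checked too.
            yield from find_unprefixed(value, curie_slots, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_unprefixed(item, curie_slots, f"{path}[{index}]")


if __name__ == "__main__":
    # Made-up structure for illustration; only known_as comes from the example file.
    data = {"records": [{"id": "nmdc:x-1", "items": [{"known_as": "chem-99-000005"}]}]}
    for location, value in find_unprefixed(data, {"known_as"}):
        print(f"{location}: {value!r} has no prefix")
```

This only checks for a missing colon; it does not confirm that a prefix is actually declared in the schema.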

@turbomam
Member

turbomam commented May 15, 2024

MetabolomicsAnalysis-1.yaml failed because has_metabolite_identifications did not have inlined_as_list set to true.

It also concerns me that the identifier for the metabolite_identified used a lower-case chebi prefix, although apparently that isn't being checked. We should use the same prefixes as defined in the schema, CHEBI in this case.
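
Since the prefix case apparently isn't being checked, a small stand-alone check could flag it. The sketch below is an assumption-laden illustration, not something LinkML provides: prefix_case_mismatches is a hypothetical helper, and the set of declared prefixes has to be supplied by the caller (for example, copied from the schema's prefixes section).

```python
def prefix_case_mismatches(curies, declared_prefixes):
    """Return (curie, suggested_prefix) pairs whose prefix matches a declared one only when case is ignored."""
    by_lower = {p.lower(): p for p in declared_prefixes}
    mismatches = []
    for curie in curies:
        prefix, sep, _local = curie.partition(":")
        if not sep or prefix in declared_prefixes:
            continue  # no prefix at all, or an exact match: nothing to report here
        suggestion = by_lower.get(prefix.lower())
        if suggestion is not None:
            mismatches.append((curie, suggestion))
    return mismatches


if __name__ == "__main__":
    # The declared set here is assumed for illustration.
    for curie, suggestion in prefix_case_mismatches(["chebi:16997"], {"CHEBI", "nmdc"}):
        print(f"{curie}: prefix should probably be written as {suggestion}")
```

Prefixes that are not declared in any casing are silently skipped here; catching those would need a separate check.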

@turbomam
Member

Same thing for

  • has_failure_categorization wrt Database-AssemblyAnalysis-1.yaml
  • mags_list wrt Database-mags.yaml

@turbomam
Member

turbomam commented May 15, 2024

@anastasiyaprymolenna

Summary for tonight:

  • thanks for calling my attention to this
  • I apologize that it was hard to debug
  • I am going to merge the changes but Database-biosample-exhasutive.yaml still needs some more work.
    • I may roll that into other MIxS integrations I've been working on.
    • all multivalued slots whose range is a class should be set to inlined_as_list unless the class in its range has an identifier slot, usually id in nmdc-schema.
      range: TextValue
  • I don't know why these inlined_as_list issues became a systematic problem at this time.
    • did we bump up to a newer LinkML version?
    • were inlined_as_list assertions broadly removed from the schema? I really don't think so
  • but that's somewhat moot, because all of the changes I've introduced here really are the right way to do things. I'm surprised they weren't a problem in the past
  • if you haven't been doing this systematically, please start for the time being: run make squeaky-clean all test after each change to the schema or example data files. I expect that we will be able to roll that back eventually.
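
The inlined_as_list rule of thumb from the summary above can be sketched as a simple check. This is only an illustration over plain dictionaries, not LinkML's SchemaView API: the slots and classes structures, the has_identifier flag, and the function name are all simplifications invented for the example.

```python
def slots_needing_inlined_as_list(slots: dict, classes: dict) -> list[str]:
    """List multivalued slots whose range is an identifier-less class but which lack inlined_as_list."""
    needing = []
    for name, slot in slots.items():
        if not slot.get("multivalued"):
            continue
        range_class = classes.get(slot.get("range"))
        if range_class is None:
            continue  # range is a type or enum, not a class
        if range_class.get("has_identifier"):
            continue  # objects with an identifier can be referenced rather than inlined as a list
        if not slot.get("inlined_as_list"):
            needing.append(name)
    return sorted(needing)


if __name__ == "__main__":
    # Simplified, hand-written stand-ins for schema content.
    slots = {
        "mags_list": {"range": "MagBin", "multivalued": True},
        "has_metabolite_identifications": {
            "range": "MetaboliteIdentification", "multivalued": True, "inlined_as_list": True,
        },
    }
    classes = {
        "MagBin": {"has_identifier": False},
        "MetaboliteIdentification": {"has_identifier": False},
    }
    print(slots_needing_inlined_as_list(slots, classes))
```

A real check would need to read the slot and class definitions from the schema itself, including inherited slots and slot_usage overrides, which this sketch ignores.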

@turbomam turbomam merged commit 4640e42 into main May 15, 2024
2 checks passed
@turbomam turbomam deleted the nom-example-sample-prep-metadata branch May 15, 2024 02:36
@turbomam turbomam changed the title from "NOM example sample prep metadata" to "Example data files for NOM sample preparation" May 15, 2024

3 participants