Implement migrator that removes used slot from WorkflowExecution (file: migrator_from_X_to_PR31.py) #139

Merged
merged 33 commits into from
May 14, 2024
Changes from 32 commits
Commits
3bf56ca
create migrator and schema changes to remove used slot
Apr 23, 2024
917bcf9
revert core.yaml and basic_classes.yaml
Apr 23, 2024
b5faa03
recommit moving instrument_used slot
Apr 23, 2024
2e1b8b5
Removed a modified nmdc.yaml from pull request, no change
Apr 23, 2024
c084845
Merge branch 'main' into migrate-PR31
Apr 25, 2024
4a3c40f
add variable name changes
Apr 29, 2024
d18b335
update doc string
Apr 29, 2024
399ed81
add testing sets
Apr 29, 2024
987d0e4
stash changes to remote
May 6, 2024
b9afcc7
most recent updates
May 7, 2024
2c6236e
update doc strings
May 7, 2024
1af3e6e
add separate function to add instrument_name slot to omics_processing…
May 7, 2024
862660c
update doc string
May 7, 2024
55c8a7c
passing the batton
May 9, 2024
fc133cd
finish up migrator to use difflib SequenceMatcher
brynnz22 May 10, 2024
3ea8d49
remove doc string
brynnz22 May 10, 2024
088fa36
add backticks
brynnz22 May 13, 2024
3fa1be4
Update nmdc_schema/migrators/migrator_from_X_to_PR31.py
brynnz22 May 13, 2024
558fd83
update variable names
brynnz22 May 13, 2024
2d0e923
Remove white space
brynnz22 May 13, 2024
0e3d0cc
change elif to else
brynnz22 May 13, 2024
fdbe33f
add doc test
brynnz22 May 13, 2024
55ae53e
close paranthese;
brynnz22 May 13, 2024
31ba535
remove quotes from doctest
brynnz22 May 13, 2024
c05961a
add quotes
brynnz22 May 13, 2024
81b0e96
Update nmdc_schema/migrators/migrator_from_X_to_PR31.py
brynnz22 May 14, 2024
be0f002
umcomment lines
brynnz22 May 14, 2024
6811f0e
try removing instrument
brynnz22 May 14, 2024
51d6396
make instrument_used inlined:false
brynnz22 May 14, 2024
6adb530
remove inlined: false
brynnz22 May 14, 2024
26640cb
add regex pattern for instrument_used
brynnz22 May 14, 2024
0661278
move instrument_used out of aliases
brynnz22 May 14, 2024
fa31add
remove instrument_used regex pattern
brynnz22 May 14, 2024
85 changes: 85 additions & 0 deletions nmdc_schema/migrators/migrator_from_X_to_PR31.py
@@ -0,0 +1,85 @@
from nmdc_schema.migrators.migrator_base import MigratorBase
from nmdc_schema.migrators.adapters.adapter_base import AdapterBase
from difflib import SequenceMatcher


class Migrator(MigratorBase):
    r"""
    Migrates data from X to PR31. Removes the `used` slot from `WorkflowExecution` subclasses after checking
    that its value matches the value in the `instrument_name` slot of the corresponding `DataGeneration`
    instance.
    """

    _from_version = "X"
    _to_version = "PR31"

    def upgrade(self):
        r"""Migrates the database from conforming to the original schema, to conforming to the new schema."""

        workflow_execution_collection_names = [
            "mags_activity_set",
            "metabolomics_analysis_activity_set",
            "metagenome_annotation_activity_set",
            "metagenome_assembly_set",
            "metagenome_sequencing_activity_set",
            "metatranscriptome_activity_set",
            "nom_analysis_activity_set",
            "omics_processing_set",
            "read_based_taxonomy_analysis_activity_set",
            "read_qc_analysis_activity_set",
            "metaproteomics_analysis_activity_set",
        ]

        for collection_name in workflow_execution_collection_names:
            self.adapter.process_each_document(
                collection_name=collection_name,
                pipeline=[self.remove_used_slot],
            )

    def preprocess_string(self, s):
        r"""
        Normalizes a string prior to using SequenceMatcher. Removes spaces, hyphens, and underscores
        so that these characters do not interfere when difflib's SequenceMatcher looks for matching
        subsequences between two strings.

        >>> m = Migrator()
        >>> m.preprocess_string('a b_-_c -de:f g')
        'abcde:fg'
        """

        return s.replace(" ", "").replace("_", "").replace("-", "")

    def remove_used_slot(self, doc: dict) -> dict:
        r"""
        Removes the `used` slot from a `WorkflowExecution` document if its value is sufficiently similar
        (a `SequenceMatcher` ratio of at least 0.8, after preprocessing) to the `instrument_name` slot of
        the corresponding `OmicsProcessing` document. Otherwise, logs an error and leaves the slot in place.

        >>> from nmdc_schema.migrators.adapters.dictionary_adapter import DictionaryAdapter
        >>>
        >>> database = {'omics_processing_set':[{'id':'nmdc:omcp-123', 'instrument_name':'nmdc:wfc-456'}]}  # in this example, our data store is a Python dictionary
        >>> adapter = DictionaryAdapter(database=database)
        >>> m = Migrator(adapter=adapter)
        >>> m.remove_used_slot({'id': 'nmdc:metab-123', 'used': 'nmdc:wfc-456', 'was_informed_by': 'nmdc:omcp-123'})
        {'id': 'nmdc:metab-123', 'was_informed_by': 'nmdc:omcp-123'}
        """

        if "used" in doc:
            # Look up the OmicsProcessing document that this workflow execution was informed by
            omics_processing_doc = self.adapter.get_document_having_value_in_field(
                collection_name="omics_processing_set", field_name="id", value=doc["was_informed_by"]
            )

            # Preprocess instrument strings to ignore hyphens, underscores, and blank spaces
            processed_workflow_instrument_string = self.preprocess_string(doc["used"])
            processed_omics_doc_instrument_string = self.preprocess_string(omics_processing_doc["instrument_name"])

            similarity_ratio = SequenceMatcher(None, processed_workflow_instrument_string, processed_omics_doc_instrument_string).ratio()
            threshold = 0.8
            if similarity_ratio >= threshold:
                # Log near matches (not exact ones) so they can be reviewed, then drop the slot
                if similarity_ratio < 1.0:
                    self.logger.info(f"Workflow with id {doc['id']} has instrument: {doc['used']} matches OmicsProcessing doc instrument: {omics_processing_doc['instrument_name']} well enough")
                doc.pop("used")
            else:
                self.logger.error(f"Workflow doc {doc['id']} with instrument: {doc['used']} does not match {omics_processing_doc['instrument_name']}")

        return doc
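The similarity check in `remove_used_slot` can be tried in isolation. The following is a minimal sketch, not the migrator itself: `normalize` and `is_similar` are hypothetical helper names, and the instrument strings are made up for illustration. It assumes the same normalization and the same 0.8 threshold as the diff above.

```python
from difflib import SequenceMatcher


def normalize(s: str) -> str:
    """Drop spaces, underscores, and hyphens (mirrors `preprocess_string`)."""
    return s.replace(" ", "").replace("_", "").replace("-", "")


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Return True when the normalized strings have a ratio >= threshold."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


# Hypothetical instrument strings, not taken from NMDC data.
print(is_similar("Illumina NovaSeq", "Illumina_Nova-Seq"))  # True: identical after normalization
print(is_similar("Illumina NovaSeq", "Orbitrap"))  # False: too few matching characters
```

The normalization does not change case, so under this approach strings that differ only in capitalization can still score below 1.0.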


@@ -126,13 +126,19 @@ data_generation_set:
- nmdc:dobj-12-jdhk9537
- nmdc:dobj-12-yx0tfp52
instrument_used:
- nmdc:inst-12-yx0tfp52
- nmdc:inst-14-xx07be40
part_of:
- nmdc:dgns-11-34xj1150
processing_institution: Battelle
type: nmdc:NucleotideSequencing
associated_studies:
- nmdc:sty-11-34xj1150
instrument_set:
- id: nmdc:inst-14-xx07be40
model: novaseq
name: Illumina NovaSeq
vendor: illumina
type: nmdc:Instrument
workflow_chain_set:
- id: nmdc:wfch-11-ab
analyte_category: metagenome
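In the example data above, the id `nmdc:inst-14-xx07be40` appears both under `instrument_used` and as a record in `instrument_set`. A rough sketch of a consistency check for that relationship follows. The function name `find_unresolved_instruments` and the `data_generation_set` document id are hypothetical, and the database is assumed to be a plain Python dictionary.

```python
def find_unresolved_instruments(database: dict) -> list:
    """List (document id, instrument id) pairs whose instrument id has no record in `instrument_set`."""
    known_ids = {inst["id"] for inst in database.get("instrument_set", [])}
    unresolved = []
    for doc in database.get("data_generation_set", []):
        for inst_id in doc.get("instrument_used", []):
            if inst_id not in known_ids:
                unresolved.append((doc.get("id"), inst_id))
    return unresolved


database = {
    "data_generation_set": [
        {
            "id": "nmdc:dgns-00-example",  # hypothetical id, for illustration only
            "instrument_used": ["nmdc:inst-14-xx07be40"],
            "type": "nmdc:NucleotideSequencing",
        },
    ],
    "instrument_set": [
        {
            "id": "nmdc:inst-14-xx07be40",
            "model": "novaseq",
            "name": "Illumina NovaSeq",
            "vendor": "illumina",
            "type": "nmdc:Instrument",
        },
    ],
}

print(find_unresolved_instruments(database))  # [] because every reference resolves
```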
3 changes: 2 additions & 1 deletion src/schema/basic_classes.yaml
@@ -73,7 +73,6 @@ classes:
slots:
- has_input
- has_output
- instrument_used
- processing_institution
- protocol_link
- start_date
@@ -309,6 +308,7 @@ classes:
- mod_date
- part_of
- principal_investigator
- instrument_used
slot_usage:
has_input:
required: true
@@ -521,6 +521,7 @@ slots:
range: Instrument
multivalued: true
description: What instrument was used during DataGeneration or MaterialProcessing.
pattern: "^nmdc:inst-[0-9][a-z]{0,6}[0-9]-[A-Za-z0-9]{1,}(\\.[A-Za-z0-9]{1,})*(_[A-Za-z0-9_\\.-]+)?$"

model:
range: InstrumentModelEnum
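The `pattern` shown in this hunk for `instrument_used` can be exercised with Python's `re` module. This is a small sketch under the assumption that the YAML double-quoted `\\.` unescapes to the regex `\.`. The constant name `INSTRUMENT_ID_PATTERN` is hypothetical, and the candidate strings come from the example data and doctest elsewhere in this diff.

```python
import re

# Regex from the `pattern` line above, with the YAML escaping removed.
INSTRUMENT_ID_PATTERN = re.compile(
    r"^nmdc:inst-[0-9][a-z]{0,6}[0-9]-[A-Za-z0-9]{1,}(\.[A-Za-z0-9]{1,})*(_[A-Za-z0-9_\.-]+)?$"
)

candidates = [
    "nmdc:inst-14-xx07be40",  # instrument id from the example data
    "nmdc:inst-12-yx0tfp52",  # instrument id from the example data
    "Illumina NovaSeq",       # a free-text instrument name, not an id
    "nmdc:wfc-456",           # id with a different type code
]

for candidate in candidates:
    print(candidate, bool(INSTRUMENT_ID_PATTERN.match(candidate)))
```

Only the two `nmdc:inst-...` ids match. The free-text name and the `wfc` id are rejected.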
3 changes: 2 additions & 1 deletion src/schema/core.yaml
@@ -1183,7 +1183,8 @@ classes:
description:
A process that takes one or more samples as inputs and generates
one or more samples as outputs.

slots:
- instrument_used
notes:
- This class is a replacement for BiosampleProcessing.
slot_usage: