Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement migrator that removes used slot from WorkflowExecution (file: migrator_from_X_to_PR31.py) #139

Merged
merged 33 commits into from
May 14, 2024

Conversation

anastasiyaprymolenna
Copy link
Collaborator

This PR removes used slot from all WorfklowExecution subclasses. This info. should already be on the corresponding DataGeneration instances in the instrument_name slot (soon to be changed to instrument_used so will need to come before that migration).

The migrator also checks that the value in the used slot on the WorkflowExecution classes matches the value on the DataGeneration (currently omics_processing_set) instances in the instrument_name slot.

Also included as a part of this PR, is a minor schema change - moving instrument_used slot off of PlannedProcess parent class and put directly on MaterialProcessing and DataGenerationclasses so it is no longer an option to be used on WorkflowExecution.

In testing, the data_generation_set in Database-neon_Biosample_to_DataObject_NEON.yaml did not match any known schema. I did not see anything that would cause this, I commented this out for the time being and the tests passed, but I need a second pair of eyes on it.

@anastasiyaprymolenna anastasiyaprymolenna changed the title Create migrator and schema changes to remove used slot Migrator to remove used slot from WorkflowExecution Apr 23, 2024
Copy link

@eecavanna eecavanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @anastasiyaprymolenna, thanks for making this migrator. I reviewed migrator_from_X_to_PR31.py only (I'm not fluent in LinkML yet). The structure of the migrator looks good to me and I understand what the methods of the Migrator class do. I have some suggestions regarding a couple variable names, which I left as review comments. Also, I have suggestions regarding referring to things as "DataGeneration" in comments while the code still refers to the omics_processing_set collection (I put those suggestions in review comments also).

nmdc_schema/migrators/migrator_from_X_to_PR31.py Outdated Show resolved Hide resolved
nmdc_schema/migrators/migrator_from_X_to_PR31.py Outdated Show resolved Hide resolved
nmdc_schema/migrators/migrator_from_X_to_PR31.py Outdated Show resolved Hide resolved
nmdc_schema/migrators/migrator_from_X_to_PR31.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@brynnz22 brynnz22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend combining these functions into one, because this check_instrument_function does not appear to do anything. According to line 34 check_instrument_name is being given an omics document, not a workflow_execution doc and this function is checking to see if the slot used exists in an omics_document (which it does not - used exists on a WorkflowExecution). So the first if statement does not work for any document being passed into this function, and there is no else statement. It is then getting an omics doc (data_generation_doc should be omics_doc since we are referring to the old collection) and searching for instrument_name, which could technically return a lot of documents since an intstrument name exists in more than one omics doc. What we want instead:

  • Loop through workflow execution docs and retrieve the corresponding omics doc id from the was_informed_by slot
  • use the get_document_having_value_in_field adapter to search the omics_processing_set on the id field using that value from the was_informed_by slot from workflow execution doc.
  • Check to see if the omics_doc["instrument_name"] matches the workflow_execution["used"] and if it does remove the used slot from the workflow execution doc.
  • If it doesn't, log error.

My recommendation is to do this all in one function. I don't see the need to do this in two functions since you need to iterate through the workflow executions anyway to get the was_informed_by to search for the corresponding omics doc.

I tried to make individual comments. I hope this is clear! Let me know if you need more clarity!

@brynnz22
Copy link
Collaborator

I added using difflib library's SequenceMatcher to account for 'used' slots that don't exactly match the values of the instrument_name slot in the OmicsProcessing docs. E.g. 7T-FT ICR MS and 7T_FTICR_MS. These should still be considered the same, so this migrator still drops the used slot if they are a match. It sets a match threshold to 0.8. Anything lower than this will throw an error. I've double checked and everything lines up as it should - removing all the used slots from the WorkflowExecution docs.

@brynnz22
Copy link
Collaborator

@eecavanna I made all of your requested changes. I changed those variable names for processed strings so they make more sense, added a doctest to that function, removed the white space. Let me know if you see anything else that needs adjusting.

@eecavanna
Copy link

The migrator looks good to me. I'm not too familiar with the YAML files, so I only reviewed them in terms of their YAML syntax, which looks good to me. Based on that, I'm comfortable with this branch being merged into the main branch.

There is one YAML file that has a bunch of lines commented out. In case you want the lines to be disabled but still be accessible to readers, another option is to delete them and then add a single-line comment with the Git commit hash; something like: The example of bla bla was removed in commit #1234abcd. People will be able to search the Git history for that commit and see the deleted lines.

@eecavanna eecavanna self-requested a review May 14, 2024 04:01
Co-authored-by: eecavanna <134325062+eecavanna@users.noreply.github.com>
@brynnz22
Copy link
Collaborator

I'm not sure why those lines are commented out. I will fix it.

@brynnz22 brynnz22 requested a review from turbomam May 14, 2024 17:07
@brynnz22 brynnz22 self-requested a review May 14, 2024 18:49
@brynnz22 brynnz22 merged commit b38da6a into main May 14, 2024
2 checks passed
@brynnz22 brynnz22 deleted the migrate-PR31 branch May 14, 2024 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants