Relax certain required: true slots in schema #583

Closed
Tracked by #587
sujaypatil96 opened this issue Jan 6, 2023 · 12 comments

@sujaypatil96
Collaborator

sujaypatil96 commented Jan 6, 2023

There are a few slots on the Biosample class in the NMDC Schema, most of which are set as required: true in the schema.

This issue requests relaxing the required: true constraint on certain Biosample slots to recommended: true, so that biosample records fetched from upstream sources such as the GOLD database can be accommodated.

We are requesting the relaxation for the following slots:

  1. env_broad_scale
  2. env_local_scale
  3. env_medium
  4. alt
  5. depth
  6. subsurface_depth

See microbiomedata/sample-annotator#113 for more details.
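
To illustrate what the requested change would mean in practice, here is a minimal Python sketch. It is not the NMDC or LinkML validator; it assumes that a missing required slot blocks a record while a missing recommended slot only produces a warning. The function name `check_biosample` and the sample record are hypothetical.

```python
# Illustrative only: a toy check, not the LinkML/NMDC validator.
ENV_TRIAD = ("env_broad_scale", "env_local_scale", "env_medium")


def check_biosample(record, required=(), recommended=()):
    """Return (errors, warnings) listing slots that are absent or empty."""
    def missing(slots):
        return [s for s in slots if record.get(s) in (None, "", [])]

    return missing(required), missing(recommended)


# Hypothetical record fetched from an upstream source without a full triad.
record = {"id": "example:1", "env_broad_scale": "some biome term"}

# Current schema: the triad is required, so the gaps count as errors.
strict_errors, strict_warnings = check_biosample(record, required=ENV_TRIAD)

# Requested change: the triad is only recommended, so the gaps become warnings.
relaxed_errors, relaxed_warnings = check_biosample(record, recommended=ENV_TRIAD)
```

Under the strict setting the record would be held back for the two empty slots; under the relaxed setting it would load and the same two slots would merely be flagged.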

@sujaypatil96 sujaypatil96 added the schema change Term updates to NMDC Schema label Jan 6, 2023
@mslarae13
Contributor

alt I agree, elevation is typically used for soil.

the env triad is required for data search

depth, if we're talking about soils, should be required. It's a vital slot for data reuse (as are geographic location and lat/lon, and I'll suggest we make those required).

idk what subsurface depth is?

One thing to consider: as the Biosample class currently stands, depth, while important for soil, sediment, and water, isn't relevant for plants. We will need to think about what is required for all biosamples vs. only certain types.
depth is really only required for certain types. All the BioScales plant and (maybe) rhizosphere samples won't have depth.

@mslarae13 mslarae13 mentioned this issue Jan 6, 2023
99 tasks
@cmungall
Collaborator

cmungall commented Jan 9, 2023

Regarding the env triad, the choices in the general case are:

  1. relax schema and defer annotation, and have the samples not discoverable via certain search patterns in the interim
  2. keep schema strict and force annotation prior to ingest

Option 1 adds overhead, because the updates would have to be made later using change sheets.

Option 2 adds some complexity to the ingest, since we essentially have to merge two curation streams.

Note that in the specific case of BioScales, we need to merge two streams anyway. Here is the spreadsheet that we got from ORNL:

https://docs.google.com/spreadsheets/d/1A6bynpzssAUpnDzoAQPZ-8L5HU2y3IuWX7mRiRrasgk/edit#gid=195687079

It includes the triad. It also includes other metadata we need to load.
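
As a rough sketch of what merging two curation streams before ingest could look like, the Python below fills gaps in upstream records from a supplementary source and then separates records that carry the full triad from those that do not. The function names, sample ids, and term labels are all hypothetical placeholders, and the rule that upstream values win over supplementary ones is an assumption, not something decided in this thread.

```python
# Illustrative only: merge a supplementary curation stream into upstream
# records before ingest, then hold back anything still missing the triad.
ENV_TRIAD = ("env_broad_scale", "env_local_scale", "env_medium")


def merge_streams(upstream, supplement):
    """Fill gaps in upstream records from supplement rows keyed by sample id.

    Values already present upstream are kept; only empty slots are filled.
    """
    merged = {}
    for sample_id, record in upstream.items():
        combined = dict(record)  # copy so the input is not mutated
        for slot, value in supplement.get(sample_id, {}).items():
            if not combined.get(slot):
                combined[slot] = value
        merged[sample_id] = combined
    return merged


def split_by_triad(records):
    """Separate records that carry the full triad from those that do not."""
    ready, blocked = {}, {}
    for sample_id, record in records.items():
        target = ready if all(record.get(s) for s in ENV_TRIAD) else blocked
        target[sample_id] = record
    return ready, blocked


# Hypothetical data: placeholder ids and term labels, not real records.
upstream = {
    "s1": {"env_broad_scale": "biome A"},
    "s2": {"env_broad_scale": "biome B"},
}
supplement = {
    "s1": {"env_local_scale": "feature A", "env_medium": "material A"},
}

ready, blocked = split_by_triad(merge_streams(upstream, supplement))
```

Here `s1` ends up in `ready` because the supplement completes its triad, while `s2` stays in `blocked` until someone annotates it.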

@sujaypatil96
Collaborator Author

Apologies, some of the slots I listed above are not actually enforced as required: true; it's mostly just the envo triad. I think you said that keeping the envo triad required is pretty essential, so we won't make any changes to that.

@ssarrafan
Collaborator

@sujaypatil96 moving to the next sprint but please let me know if you won't be actively working on it for the next few weeks.

@sujaypatil96
Collaborator Author

@ssarrafan we plan to address this at the metadata call today.

@sujaypatil96
Collaborator Author

After a brief discussion, @turbomam and I agree with point 2 from @cmungall: keep schema strict and force annotation prior to ingest.

This approach is only possible in this case because ORNL has provided a supplementary file.

@mslarae13
Contributor

see #612

@mslarae13
Contributor

Leave envo required. We should be able to populate these for soil via GOLD's addition of what Stan provided.
Then for other sample types, use the envo -> gold mapping to fill out the envo slots.
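
A mapping-based fill like the one described above might look roughly like this in Python. This is a sketch under assumptions: the lookup is keyed by a GOLD ecosystem classification path, and every key and ENVO CURIE shown is a placeholder rather than a real mapping entry. `fill_triad` and `GOLD_TO_ENVO` are hypothetical names.

```python
# Illustrative only: the mapping keys and ENVO CURIEs below are placeholders.
GOLD_TO_ENVO = {
    ("Environmental", "Terrestrial", "Soil"): {
        "env_broad_scale": "ENVO:PLACEHOLDER_BIOME",
        "env_local_scale": "ENVO:PLACEHOLDER_FEATURE",
        "env_medium": "ENVO:PLACEHOLDER_MATERIAL",
    },
}


def fill_triad(record, gold_path, mapping=GOLD_TO_ENVO):
    """Return a copy of record with empty triad slots filled from the mapping.

    Slots that already have a value are left untouched, and an unmapped
    GOLD path leaves the record unchanged.
    """
    filled = dict(record)
    for slot, term in mapping.get(tuple(gold_path), {}).items():
        if not filled.get(slot):
            filled[slot] = term
    return filled


# A record with one curated slot keeps it; the other two come from the mapping.
soil = fill_triad({"id": "example:soil", "env_medium": "ENVO:CURATED"},
                  ["Environmental", "Terrestrial", "Soil"])

# A path with no mapping entry is returned as-is, so it would still fail
# a strict required-slot check and need manual annotation.
unmapped = fill_triad({"id": "example:other"}, ["Host-associated", "Plants"])
```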

@aclum
Contributor

aclum commented Jan 20, 2023

We may have to revisit leaving all the envo slots as required: true. Per Reddy, envo terms don't exist for endosphere, so for the BioScales endosphere samples env_broad_scale is populated but env_local_scale and env_medium are not. @mslarae13 @cmungall @emileyfadrosh

@ssarrafan
Collaborator

@sujaypatil96 is this issue still being worked on? I'll move it to the next sprint due to the current activity, but let me know if it can be closed or if it needs to go to the backlog.

@aclum
Contributor

aclum commented Jan 28, 2023

So far we've found workarounds for the environmental terms, so those slots are still required for now.

@sujaypatil96
Collaborator Author

GOLD filled in the missing values for the MIxS environmental triad for the BioScales project, so we've decided not to relax the schema and to leave it as is. For now, at least, we don't need the changes from the original request, so I think this issue can be closed.

@aclum aclum closed this as completed Jan 31, 2023
Projects
Status: Done
Status: ✅ SubPort 1 - Done