Support searching for available zarr stores #59

Closed
briannapagan opened this issue May 21, 2024 · 7 comments
Labels: good first issue

@briannapagan
Collaborator

At GES DISC we are getting ready to expose our production public zarr stores. Many open questions remain about how to make these stores easily searchable, especially because we publish zarr stores at the variable level, not the collection level.

In umm_json we are specifying zarr as the instance_information, example here:
https://cmr.earthdata.nasa.gov/search/variables.umm_json?instance-format=zarr&provider=GES_DISC

I think python_cmr hasn't yet brought over much of the CMR API's functionality for VariableQuery, and I would like to add a piece of code to make searching for variable zarr stores more intuitive for the user:

```python
import re

from cmr import VariableQuery


def get_all_zarr_stores(provider=None):
    """Return all variables that carry instance_information, optionally for one provider."""
    api = VariableQuery()
    if provider:
        api = api.provider(provider)
    # Keep only entries with a non-empty instance_information field.
    return [var for var in api.get_all() if var.get("instance_information")]


def query_zarr_stores(zarr_stores, short_name, version, variable=None):
    """Return stores whose native_id matches short_name, version and (optionally) variable, in that order."""
    pattern = short_name + ".*" + version + ".*"
    if variable:
        pattern += variable + ".*"
    # re.match anchors at the start of native_id; entries without a native_id are skipped.
    return [
        store for store in zarr_stores
        if "native_id" in store and re.match(pattern, store["native_id"])
    ]


# One way to search, knowing only the provider
provider = "GES_DISC"
zarr_stores = get_all_zarr_stores(provider=provider)

# Another way (perhaps more intuitive): the user supplies short_name, version and variable
short_name = "GPM_3IMERGHH"
version = "06"
variable = "precipitationCal"
zarr_stores = query_zarr_stores(zarr_stores, short_name, version, variable)
```

Any thoughts/feedback before I suggest a PR to add instance_information as an additional query parameter in VariableQuery?

@briannapagan briannapagan self-assigned this May 21, 2024
@liredell

For the second option, maybe adding DOI to the searchable fields to find all the zarr stores from a particular data collection.

@chuckwondo
Collaborator

@briannapagan, I'm not sure this totally makes sense as logic that belongs in python_cmr because there is no way to have the CMR query return only variables that have associated instance_information.

We must simply filter the results that the CMR returns to us, as you have written in your example. If we were to add such logic to python_cmr, then that might open us up to adding all sorts of post query filtering for any sort of property that may or may not be returned in the results, but which we cannot specify as part of the CMR query itself. This is logic that I suggest does not belong in this library.

For example, I would suggest simplifying your code to something like so (which would remain in your code, not be put into python_cmr):

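```python
from cmr import VariableQuery

# Keep only the variables that include instance_information (filtered after the query returns)
zarr_stores = [var for var in VariableQuery().get_all() if "instance_information" in var]
# OR, for a single provider (here provider is a string such as "GES_DISC")
zarr_stores = [var for var in VariableQuery().provider(provider).get_all() if "instance_information" in var]
```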

@briannapagan
Collaborator Author

briannapagan commented May 22, 2024

Hi @chuckwondo, thanks for the feedback. There are CMR endpoints that we can build into python_cmr, for example:
https://cmr.uat.earthdata.nasa.gov/search/variables?instance-format=zarr&provider=GES_DISC&keyword=C1225808238-GES_DISC&pretty=true
https://cmr.uat.earthdata.nasa.gov/search/variables?instance-format=zarr
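
As a rough sketch of what those endpoints take, the query strings above can be assembled with the standard library alone. The helper name `build_variable_search_url` is made up for illustration and is not part of python_cmr; the parameter names are the ones that appear in the URLs above.

```python
from urllib.parse import urlencode

# Base URL taken from the UAT examples above.
CMR_UAT_VARIABLES = "https://cmr.uat.earthdata.nasa.gov/search/variables"


def build_variable_search_url(base_url=CMR_UAT_VARIABLES, **params):
    """Hypothetical helper: append the non-None search parameters to base_url.

    Keyword names are written with underscores and sent with hyphens
    (instance_format -> instance-format), matching the URLs above.
    """
    query = {
        name.replace("_", "-"): value
        for name, value in params.items()
        if value is not None
    }
    return f"{base_url}?{urlencode(query)}" if query else base_url


url = build_variable_search_url(instance_format="zarr", provider="GES_DISC")
print(url)
```

Nothing here contacts the CMR; it only builds the URL string, so the result can be pasted into a browser or handed to any HTTP client.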

We haven't added much to VariableQuery. Other queries have dedicated methods; for example, queries.py has this one for short_name:

```python
    def short_name(self, short_name: str) -> Self:
        """
        Filter by short name (aka product or collection name).

        :param short_name: name of collection
        :returns: self
        """

        if not short_name:
            return self

        self.params['short_name'] = short_name
        return self
```

I want the same now for instance_information

@chuckwondo
Collaborator

What I'm saying is that the CMR does not support querying by instance_information, so adding instance_information to the query params won't work. In fact, it will simply generate an error response. For example, the URL https://cmr.uat.earthdata.nasa.gov/search/variables?instance-information=true returns this response:

```json
{
  "errors": [
    "Parameter [instance_information] was not recognized."
  ]
}
```
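
For anyone calling the endpoint directly, a minimal sketch of spotting such a rejection might look like the following. It assumes the error body is JSON with a top-level "errors" list, as in the response above; `cmr_errors` is a hypothetical helper name, not something python_cmr provides.

```python
import json


def cmr_errors(response_text):
    """Return the error messages found in a CMR JSON response body, or an empty list."""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        # Not JSON at all, so there is no "errors" list to report.
        return []
    if isinstance(payload, dict):
        return list(payload.get("errors", []))
    return []


# The response body shown above, as a string.
body = '{"errors": ["Parameter [instance_information] was not recognized."]}'
errors = cmr_errors(body)
if errors:
    print("CMR rejected the query:", "; ".join(errors))
```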

This means that the only way to perform filtering is after you get your query results back from the CMR. You cannot tell the CMR not to return variables that do or don't have associated instance_information.

@chuckwondo
Collaborator

However, if you mean instance-format, not instance-information, then sure, open a PR to add an instance_format method.
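
A minimal sketch of what such a method could look like, modeled on the short_name method quoted earlier in this thread. The class below is a stand-in written for illustration, not the real VariableQuery, and the `instance_format` params key is an assumption about how the value would be passed to the CMR.

```python
class VariableQuerySketch:
    """Stand-in for a query class; NOT the real cmr.VariableQuery."""

    def __init__(self):
        # Search parameters collected by the chained filter methods.
        self.params = {}

    def instance_format(self, instance_format):
        """
        Filter by instance format (for example "zarr").

        :param instance_format: format of the variable instance
        :returns: self
        """

        # Mirror short_name: an empty value leaves the query unchanged.
        if not instance_format:
            return self

        # Assumption: the parameter is stored under the key "instance_format".
        self.params["instance_format"] = instance_format
        return self


query = VariableQuerySketch().instance_format("zarr")
print(query.params)
```

Returning self keeps the method chainable with the existing filters, in the same style as `VariableQuery().provider(provider)` earlier in the thread.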

@briannapagan
Collaborator Author

Sorry @chuckwondo I meant instance_format :)

@briannapagan
Collaborator Author

> For the second option, maybe adding DOI to the searchable fields to find all the zarr stores from a particular data collection.

@liredell this would require DOI to be searchable for variables; that is a feature that would have to be requested from CMR.
