slice using non-index coordinates #2028

Closed
crusaderky opened this issue Mar 29, 2018 · 21 comments
Comments

@crusaderky
Contributor

It should be relatively straightforward to allow slicing on coordinates that are not backed by an IndexVariable, or in other words coordinates that are on a dimension with a different name, as long as they are 1-dimensional (unsure about the multidimensional case).

E.g. given this array:

a = xarray.DataArray(
    [10, 20, 30],
    dims=['country'],
    coords={
        'country': ['US', 'Germany', 'France'],
        'currency': ('country', ['USD', 'EUR', 'EUR'])
    })

<xarray.DataArray (country: 3)>
array([10, 20, 30])
Coordinates:
  * country   (country) <U7 'US' 'Germany' 'France'
    currency  (country) <U3 'USD' 'EUR' 'EUR'

This is currently not possible:

a.sel(currency='EUR')


ValueError: dimensions or multi-index levels ['currency'] do not exist

It should be interpreted as a shorthand for:

a.sel(country=a.currency == 'EUR')

<xarray.DataArray (country: 2)>
array([20, 30])
Coordinates:
  * country   (country) <U7 'Germany' 'France'
    currency  (country) <U3 'EUR' 'EUR'
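
A minimal sketch of the proposed shorthand, assuming a 1-dimensional non-index coordinate. The helper name `sel_by_coord` is hypothetical and not part of xarray; it only illustrates the equivalence described above by building a boolean mask and applying it positionally.

```python
import xarray as xr

a = xr.DataArray(
    [10, 20, 30],
    dims=["country"],
    coords={
        "country": ["US", "Germany", "France"],
        "currency": ("country", ["USD", "EUR", "EUR"]),
    },
)


def sel_by_coord(da, name, value):
    """Hypothetical helper (not an xarray API): select the entries where a
    1-D non-index coordinate equals ``value``."""
    coord = da.coords[name]
    if coord.ndim != 1:
        raise ValueError(f"{name!r} must be 1-dimensional")
    (dim,) = coord.dims
    # Boolean mask along the coordinate's own dimension.
    mask = (coord == value).values
    return da.isel({dim: mask})


result = sel_by_coord(a, "currency", "EUR")
print(result)
```

This does a linear scan on every call, so it is a convenience rather than an indexed lookup.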
@max-sixty
Collaborator

I agree this is harder than it should be.

Here's one way:

In [28]: a.where(a.currency=='EUR', drop=True)
Out[28]:
<xarray.DataArray (country: 2)>
array([20., 30.])
Coordinates:
  * country   (country) <U7 'Germany' 'France'
    currency  (country) <U3 'EUR' 'EUR'

I'm not sure whether .sel should work for non-IndexVariables - thoughts?
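
One detail worth noting about the `where` approach: the output above shows `20.` and `30.`, so the integer data came back as floats. The sketch below, which assumes the same array `a` as in the original post, compares it with positional boolean masking, which keeps the original dtype.

```python
import xarray as xr

a = xr.DataArray(
    [10, 20, 30],
    dims=["country"],
    coords={
        "country": ["US", "Germany", "France"],
        "currency": ("country", ["USD", "EUR", "EUR"]),
    },
)

# where() masks with NaN before dropping, so integer data is promoted to float.
by_where = a.where(a.currency == "EUR", drop=True)

# Boolean masking through isel() only picks positions; the dtype is unchanged.
by_mask = a.isel(country=(a.currency == "EUR").values)

print(by_where.dtype, by_mask.dtype)
```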

@shoyer
Member

shoyer commented Mar 29, 2018

We've discussed this before: #934

I agree that this would be nice to support in theory. The challenge is that we would need to create (and then possibly throw away?) a pandas.Index to do the actual indexing, or use a numpy search function like np.isin(). Neither of these is very efficient.

Conceptually, I think it makes sense to support indexing on arbitrary variables, which is simply more expensive if an index is not already set. Dimension coordinates would not be special except that they have indexes created automatically.

@shoyer
Member

shoyer commented Mar 29, 2018

This has some connections to the broader indexes refactor envisioned in #1603.

@max-sixty
Collaborator

What's the easiest way to select on multiple values? Is it really this:

In [63]: da = xr.DataArray(np.random.rand(3,2), dims=list('ab'), coords={'c':(('a',),list('xyz'))})

In [64]: da.sel(a=(np.isin(da.c, list('xy'))))
Out[64]:
<xarray.DataArray (a: 2, b: 2)>
array([[0.383989, 0.174317],
       [0.698948, 0.815993]])
Coordinates:
    c        (a) <U1 'x' 'y'
Dimensions without coordinates: a, b

@shoyer
Member

shoyer commented Mar 29, 2018

@maxim-lian Probably. Or you could make the pandas.Index explicitly, e.g., da.sel(a=da.c.to_index().get_indexer(['x', 'y'])).

We should really add DataArray.isin() (#1268).
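
The two options for selecting on multiple values can be sketched side by side. This assumes a recent xarray in which `DataArray.isin()` is available, uses deterministic data instead of `np.random.rand` so the result is reproducible, and applies the result with `isel` because both approaches produce positional information.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(3, 2),
    dims=list("ab"),
    coords={"c": (("a",), list("xyz"))},
)
wanted = ["x", "y"]

# Option 1: build a pandas.Index on the fly and look up integer positions.
# Caveats: get_indexer() requires unique index values and returns -1 for
# labels that are missing, which isel() would read as "last element".
positions = da.c.to_index().get_indexer(wanted)
by_index = da.isel(a=positions)

# Option 2: boolean membership mask via DataArray.isin().
mask = da.c.isin(wanted)
by_isin = da.isel(a=mask.values)

print(by_index.identical(by_isin))
```

The index lookup returns rows in the order of `wanted`, while the mask keeps the array's original order; here the two coincide.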

@stale

stale bot commented Feb 27, 2020

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically

@stale stale bot added the stale label Feb 27, 2020
@crusaderky
Contributor Author

Still relevant

@dhruvbalwada

dhruvbalwada commented Apr 15, 2020

I am a little confused about the documentation relating to this issue. It says in the documentation at http://xarray.pydata.org/en/stable/data-structures.html#coordinates
"non-dimension coordinates are variables that contain coordinate data, but are not a dimension coordinate. They can be multidimensional (see Working with Multidimensional Coordinates), and there is no relationship between the name of a non-dimension coordinate and the name(s) of its dimension(s). Non-dimension coordinates can be useful for indexing or plotting; otherwise, xarray does not make any direct use of the values associated with them. They are not used for alignment or automatic indexing, nor are they required to match when doing arithmetic (see Coordinates)."

Has this issue been resolved? If so, an example of how to do this would be helpful in the documentation. If not, should the documentation be corrected?

@dcherian
Contributor

#3925 would fix this for 1D non-dim coords. We should update the docs (ping @TomNicholas)

@gewitterblitz

gewitterblitz commented Sep 17, 2021

@dcherian any recommendations for 2D non-dim coords?

I would like to subset a dataarray based on slices for x and y coordinates

[Image: screenshot, 2021-09-17]

@dcherian
Contributor

xoak should work here: https://xoak.readthedocs.io/en/latest/

Here's an example with ocean model output: https://pop-tools.readthedocs.io/en/latest/examples/xoak-example.html .

If you can wait a while, this will all work better once #5692 is merged.

@ivanakcheurov

> I agree this is harder than it should be.
>
> Here's one way:
>
> In [28]: a.where(a.currency=='EUR', drop=True)
> Out[28]:
> <xarray.DataArray (country: 2)>
> array([20., 30.])
> Coordinates:
>   * country   (country) <U7 'Germany' 'France'
>     currency  (country) <U3 'EUR' 'EUR'
>
> I'm not sure whether .sel should work for non-IndexVariables - thoughts?

@max-sixty, is there any update on the OP's question, or could you help me achieve the following?
I would like sel based on a non-dim coordinate to be as fast as sel based on the dim itself.
Timings:

# sel based on a non-dim coordinate 
# (using this coordinate directly .sel(product_id=26) would result in error "'no index found for coordinate product_id")
%timeit xds.sel(product=xds.product_id==26)
1.54 ms ± 64.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# sel based on the dim itself
%timeit xds.sel(product='GN91 Glove Medium')
499 µs ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit xds.where(xds.product_id==26, drop=True)
4.17 ms ± 39 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Anyways, xarray is brilliant and made my week :)

@covertg

covertg commented Aug 2, 2022

Hi all, wanted to ask what the status of this feature request is given all of the recent work by @benbovy on explicit indexes.

@benbovy
Member

benbovy commented Aug 2, 2022

Hi @covertg, as soon as there is a public API for setting non-dimension or custom indexes it should be ready. See #6849, which is actually already implemented in the scipy22 branch (+ an example here). I plan to re-submit it as a proper PR within the next few weeks.

@covertg

covertg commented Aug 2, 2022

Exciting news!! Thanks for the quick response and the huge amount of work on explicit indexes. I'll be excited and grateful to enjoy the public API once it comes into its own :)

@benbovy
Member

benbovy commented Oct 3, 2022

With the latest release, v2022.09.0, this is now possible via .set_xindex():

a = a.set_xindex("currency")

a.sel(currency="EUR")
# <xarray.DataArray (country: 2)>
# array([20, 30])
# Coordinates:
#   * country   (country) <U7 'Germany' 'France'
#   * currency  (country) <U3 'EUR' 'EUR'

Closed in #6971 (although set_xindex still needs to be documented in the User Guide).

@benbovy benbovy closed this as completed Oct 3, 2022
@aberges-grd

What about slices? My non-index coord is a datetime, and I need to select between two dates.

@benbovy
Member

benbovy commented Feb 7, 2023

@aberges-grd If your non-index coordinate supports it (I guess it does?), you could assign a default index to the coordinate with set_xindex and then use slices for selection like any other (dimension) coordinate backed by a pandas index.
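
A minimal sketch of that suggestion, assuming xarray v2022.09.0 or later and a hypothetical array whose `time` coordinate lives on a dimension named `obs`. The names and dates are illustrative only.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: "time" is a non-index coordinate on the "obs" dimension.
da = xr.DataArray(
    np.arange(5),
    dims="obs",
    coords={"time": ("obs", pd.date_range("2023-01-01", periods=5))},
)

# Give the coordinate a default (pandas-backed) index ...
indexed = da.set_xindex("time")

# ... then label-based slices work; both endpoints are inclusive.
subset = indexed.sel(time=slice("2023-01-02", "2023-01-04"))
print(subset)
```

Range slicing of this kind relies on the coordinate values being sorted, as they are here.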

@gewitterblitz

Thanks @benbovy, it works well. I am curious about using set_xindex with 2-dimensional non-index coordinates. A use case could be datasets with x and y coordinates that need to be subset using longitude (x,y) and latitude (x,y) values. Any suggestions?

@benbovy
Member

benbovy commented Feb 8, 2023

@gewitterblitz there is a kdtree-based index example in #7041 that works with multi-dimensional coordinates. You could also have a look at https://xoak.readthedocs.io/en/latest/ (it doesn't use Xarray indexes - soon hopefully - so the current API is via Xarray accessors).

EDIT: seeing your previous #2028 (comment), not sure how you could use slices for label selection using those indexes as I don't think the wrapped scipy / sklearn kdtree objects support range queries. Other spatial indexes may support it (e.g., there's an example in https://github.com/martinfleis/xvec of selecting points using a shapely.box, although currently it only supports 1-d geometry coordinates).

@gewitterblitz

Thanks, @benbovy. Yep, the kdtree objects don't like the range based slices. xoak has worked well in the past though. I'll keep an eye on xoak-xarray integration.
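
For range-style subsetting on 2-dimensional coordinates without a spatial index, one fallback is plain boolean masking with `where(..., drop=True)`. This is a different technique from the kdtree and xoak approaches mentioned above, and the sketch below uses a made-up grid: the coordinate names `lon` and `lat` and all bounds are assumptions for illustration.

```python
import numpy as np
import xarray as xr

ny, nx = 4, 5
# Hypothetical regular grid stored as 2-D (y, x) coordinates.
lon2d, lat2d = np.meshgrid(np.linspace(-100, -96, nx), np.linspace(30, 33, ny))

da = xr.DataArray(
    np.arange(ny * nx).reshape(ny, nx),
    dims=("y", "x"),
    coords={"lon": (("y", "x"), lon2d), "lat": (("y", "x"), lat2d)},
)

# Boolean mask for a lon/lat bounding box.
inside = (
    (da.lon >= -99.5) & (da.lon <= -96.5) & (da.lat >= 30.5) & (da.lat <= 32.5)
)

# drop=True trims rows and columns that contain no selected points.
subset = da.where(inside, drop=True)
print(subset.shape)
```

On a curvilinear grid the trimmed result is the bounding rectangle in (y, x) space, so points outside the box remain as NaN, and the data is promoted to float.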
