Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: progress toward making groupby work with multiple arguments #924

Closed
wants to merge 2 commits into from

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Jul 29, 2016

Fixes #324

It definitely doesn't work properly yet, totally mixing up coordinates,
data variables and multi-indexes (as shown by the failing tests).

A simple example:

In [4]: coords = {'a': ('x', [0, 0, 1, 1]), 'b': ('y', [0, 0, 1, 1])}

In [5]: square = xr.DataArray(np.arange(16).reshape(4, 4), coords=coords, dims=['x', 'y'])

In [6]: square
Out[6]:
<xarray.DataArray (x: 4, y: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
Coordinates:
    b        (y) int64 0 0 1 1
    a        (x) int64 0 0 1 1
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3

In [7]: square.groupby(['a', 'b']).mean()
Out[7]:
<xarray.DataArray (a: 2, b: 2)>
array([[  2.5,   4.5],
       [ 10.5,  12.5]])
Coordinates:
  * a        (a) int64 0 1
  * b        (b) int64 0 1

In [8]: square.groupby(['x', 'y']).mean()
Out[8]:
<xarray.DataArray (x: 4, y: 4)>
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.],
       [ 12.,  13.,  14.,  15.]])
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3

More examples:
https://gist.github.com/shoyer/5cfa4d5751e8a78a14af25f8442ad8d5

It definitely doesn't work properly yet, totally mixing up coordinates,
data variables and multi-indexes.
@@ -131,7 +133,7 @@ class GroupBy(object):
DataArray.groupby
"""
def __init__(self, obj, group, squeeze=False, grouper=None, bins=None,
cut_kwargs={}):
cut_kwargs={}):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😳 these PEP8 violations are from my PR. Sorry! I have since started linting...

@rabernat
Copy link
Contributor

This looks like a really useful addition. It would be useful for me to have an example of how this is supposed to work. The tests are a starting point, but perhaps kind of trivial cases.

If I create the following 2D dataset:

ds = xr.Dataset({'foo': (['x','y'], np.random.rand(2,4))})

and then do

ds.groupby(['x','y']).sum()

What should the coordinates of the output be? Every unique combination of x and y?

@shoyer
Copy link
Member Author

shoyer commented Jul 31, 2016

@rabernat I updated the top post with examples. So yes, for your example, the coordinates of the output would have every unique combination of x and y. More generally, something like ds.groupby(['x', 'y']).mean() will be equivalent to ds.mean('z') for a dataset with dimensions (x, y, z). I think this is the only sane way to define these grouped operations.

Once we figure out squeezing out grouped/stacked dimensions (not quite working yet), this will let us write things like ds.groupby(['x', 'y']).apply(calculate_trend) or better yet with group_over, ds.group_over('time').apply(calculate_trend).

@RafalSkolasinski
Copy link

Hi, is there any active work on that feature? It would be really cool to have it.

@shoyer
Copy link
Member Author

shoyer commented Feb 11, 2017

@RafalSkolasinski This pull request was mostly working, but still needs some significant work to clean it up and update it to the current version of the codebase.

I don't think anyone is working on it currently (I'm not) but I'm sure someone will get to it eventually.

@RafalSkolasinski
Copy link

@shoyer I am considering contributing to this feature. Could you give me more details what needs to be done?

@shoyer
Copy link
Member Author

shoyer commented Feb 16, 2017

@RafalSkolasinski Sure, here is the current list:

  1. The implementation should probably switch to use set_index() (after stacking) to combine groupby arguments into a single MultiIndex, instead of the current version which constructs the MultiIndex explicitly in GroupBy.__init__, duplicating a lot of the set_index code.
  2. We need to consider how to handle multiple group-by arguments along orthogonal dimensions. For example, consider the 4x4 array from the gist at the top of this PR where grouping is done over 2x2 blocks. With the current PR, the array is flattened, so iteration is done over flat arrays of length 4, but we probably want to actually iterate over 2x2 arrays that match the shape of the original data. This could be done either by stacking/unstacking at each iteration step (slow, but would work) or by making groupby truly do indexing and concatenation over multiple dimensions at once (certainly faster and more elegant, but significantly more work to implement).
  3. The change also needs to be rebased to account for changes on master, but by the time you account for the above changes there may not be much of the current version remaining.
  4. We also need comprehensive test coverage for the desired behavior.

@pwolfram
Copy link
Contributor

@RafalSkolasinski and @shoyer, can I please get an update on this PR? This is something we need sometime soon too (cc @milenaveneziani).

@shoyer
Copy link
Member Author

shoyer commented Mar 14, 2017

I don't have any progress to report since my last comment.

@RafalSkolasinski
Copy link

@pwolfram Unfortunately nothing from my side yet.

@chunweiyuan
Copy link
Contributor

This functionality is quite important to us as well. How might we help? Should I simply fork groupby-multiple-args?

@jhamman
Copy link
Member

jhamman commented Jul 13, 2017

@chunweiyuan - yes, you are welcome to give this a shot.

@pwolfram
Copy link
Contributor

pwolfram commented Mar 8, 2018

Just to refresh here-- what needs done to finish this off?

@pwolfram
Copy link
Contributor

pwolfram commented Mar 8, 2018

@shoyer, it looks like your list above is the place to start from your branch, correct?

@shoyer
Copy link
Member Author

shoyer commented Mar 9, 2018

@pwolfram Personally, I am waiting to implement this until after #1603. I'm pretty sure that the alternative model for indexes proposed there will make this far easier to implement.

@pwolfram
Copy link
Contributor

pwolfram commented Mar 9, 2018

Thanks @shoyer, I find this feature extremely useful as I keep running into use cases where I can use it. Thanks for the update, given changes to xarray it sounds like the prudent course of action is as you outline. Thanks again for the quick reply!

@benbovy
Copy link
Member

benbovy commented Mar 9, 2018

I'd very happy to start working on #1603, but unfortunately I wish I could have more time for this at the moment (and before this, it is also probably a good idea that I continue the work in #1820, which has stalled for quite a while).

dcherian added a commit to dcherian/xarray that referenced this pull request Aug 16, 2024
@dcherian dcherian mentioned this pull request Aug 16, 2024
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support group_over
7 participants