Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataArray.to_dataset(dim) silently drops variable if it is already a dim #7823

Closed
4 tasks done
mgunyho opened this issue May 6, 2023 · 3 comments · Fixed by #8433
Closed
4 tasks done

DataArray.to_dataset(dim) silently drops variable if it is already a dim #7823

mgunyho opened this issue May 6, 2023 · 3 comments · Fixed by #8433
Labels

Comments

@mgunyho
Copy link
Contributor

mgunyho commented May 6, 2023

What happened?

If I have a DataArray da which I split into a Dataset using da.to_dataset(dim), and one of the values of da[dim] also happens to be one of the dimensions of da, that variable is silently missing from the resulting dataset.

What did you expect to happen?

If a variable cannot be created because it is already a dimension, it should raise an exception, or possibly issue a warning and rename the variable, so that no data is lost.

Minimal Complete Verifiable Example

import xarray as xr

da = xr.DataArray(
    np.zeros((3, 3)),
    coords={
        # note how 'foo' is one of the coordinate values, and also the name of a dimension
        "x": ["foo", "bar", "baz"],
        "foo": [1, 2, 3],
    }
)

# this produces a Dataset with the variables 'bar' and 'baz', 'foo' is missing (because it is already a coordinate)
print(da.to_dataset("x"))

# this produces a dataset with the variables 'foo', 'bar', and 'baz', as epected
print(da.rename({"foo": "qux"}).to_dataset("x"))

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

# Output of first conversion
<xarray.Dataset>
Dimensions:  (foo: 3)
Coordinates:
  * foo      (foo) int64 1 2 3
Data variables:
    bar      (foo) float64 0.0 0.0 0.0
    baz      (foo) float64 0.0 0.0 0.0

# Output of second conversion
<xarray.Dataset>
Dimensions:  (qux: 3)
Coordinates:
  * qux      (qux) int64 1 2 3
Data variables:
    foo      (qux) float64 0.0 0.0 0.0
    bar      (qux) float64 0.0 0.0 0.0
    baz      (qux) float64 0.0 0.0 0.0

Anything else we need to know?

This came up when I did to_dataset("param") on the fit result returned by curvefit, and one of the data dimensions happened to be the same as one of the arguments of the function which I was fitting. I was initially very confused by this.

Environment

INSTALLED VERSIONS

commit: None
python: 3.10.10 (main, Mar 01 2023, 21:10:14) [GCC]
python-bits: 64
OS: Linux
OS-release: 6.2.12-1-default
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: None

xarray: 2023.4.2
pandas: 2.0.1
numpy: 1.23.5
scipy: 1.10.1
netCDF4: None
pydap: None
h5netcdf: 1.1.0
h5py: 3.8.0
Nio: None
zarr: 2.14.2
cftime: 1.6.2
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: 1.3.7
dask: 2023.4.1
distributed: None
matplotlib: 3.7.1
cartopy: None
seaborn: 0.12.2
numbagg: None
fsspec: 2023.4.0
cupy: None
pint: None
sparse: 0.14.0
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.3.1
conda: None
pytest: 7.3.1
mypy: None
IPython: 8.13.2
sphinx: 6.2.1

@mgunyho mgunyho added bug needs triage Issue that has not been reviewed by xarray team member labels May 6, 2023
@max-sixty max-sixty removed the needs triage Issue that has not been reviewed by xarray team member label Nov 6, 2023
@max-sixty
Copy link
Collaborator

Sorry this didn't get a reply — this doesn't look like ideal behavior. 

I think raising would make sense. Though possibly it makes the success of the reshape dependent on the data at runtime, so it may be too conservative, and warning would be OK.

@mgunyho
Copy link
Contributor Author

mgunyho commented Nov 6, 2023

No worries, this hasn't come up that often for me.

I think this boils down to a choice of UX, do we

  • raise and explicitly ask the user to deal with the situation (more work for the user),
  • warn but leave the variable out (potentially confusing if the warning is missed), or
  • warn and automatically rename the variable (also potentially confusing).

Now that I've typed this out, I would definitely prefer to raise. I can send a PR to fix this.

@dcherian
Copy link
Contributor

dcherian commented Nov 7, 2023

I agree with raising. A PR would be welcome!

mgunyho added a commit to mgunyho/xarray that referenced this issue Nov 9, 2023
mgunyho added a commit to mgunyho/xarray that referenced this issue Nov 14, 2023
max-sixty pushed a commit that referenced this issue Nov 14, 2023
…of a coordinate (#8433)

* Add tests for issue #7823

* Use 2D array to properly test to_dataset error handling logic

* Raise exception if a variable is also a coordinate in to_dataset

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>

* Update whats-new

* Ensure that coordinates are in the original order

---------

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants