Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore ability to specify _FillValue as Python native integers #9258

Merged
merged 1 commit into from
Jul 22, 2024

Conversation

djhoese
Copy link
Contributor

@djhoese djhoese commented Jul 18, 2024

See #9136 for the longer discussion of this. In #9136 numpy 2 compatibility was fixed by working around integer overflow errors when a _FillValue was specified in an unsigned dtype (ex. np.uint8(255)) but needed to be cast to the signed dtype to be stored to a NetCDF file. This is all related to the use of the special _Unsigned attribute. However, the referenced PR broke the ability to specify _FillValue with native python integers (ex. -1).

This PR restores this behavior by handling both primary use cases:

  1. _FillValue corresponds to the unsigned in-memory data
  2. _FillValue corresponds to the signed on-disk data

I'm sure there are at least 10 ways of interpreting what the best solution should be. This one I think causes the least amount of disruption and restores previous behavior as far as I can tell.

CC @dcherian @kmuehlbauer

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@djhoese
Copy link
Contributor Author

djhoese commented Jul 22, 2024

Any chance someone could look at this before the next release as the current state of things would break my (and I assume many) users workflows regarding _FillValue and python native integers.

Comment on lines +523 to +526
try:
# user provided the on-disk signed fill
new_fill = signed_dtype.type(attrs["_FillValue"])
except OverflowError:
Copy link
Contributor

@dcherian dcherian Jul 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit ill-defined it seems, so let's raise SerializationWarning to encourage users to provide the signed on-disk value.

Do we have the except clause only for potentially-wrong backwards compatibile behaviour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it now, the except clause is probably the common case as far as xarray is concerned. That is, data loaded with xarray will use the decode logic which will use the unsigned integer version of the fill value. However, for anyone constructing their own DataArray/Dataset objects, there is a good chance of providing the signed version of the fill value (the try clause).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I should have said, my original thinking in the related issue was that _FillValue should always be the on-disk type, but then realized it makes sense to have it match the in-memory loaded data type. I too would like there to be only one way of doing it, but I'm not sure how likely that is. I suppose the current except clause should be the preferred way?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decode logic will treat -1 and 255 the same for on-disk int8 type (as intended).

my original thinking in the related issue was that _FillValue should always be the on-disk type, but then realized it makes sense to have it match the in-memory loaded data type.

CF says _FillValue must be the on-disk type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood @kmuehlbauer, the in-memory type is needed for masking decoding to work, but also said that maybe the unsigned and masking should be combined. I could also see an argument for it being wrong to have a _FillValue that doesn't work for and match the data it is attached to (in-memory).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct for the encoding pipeline. I can add a serialization warning for the case where the provided _FillValue is not already the on-disk signed type.

For the decoding, you say that it already overwrites the _FillValue after casting to unsigned. Which part overwrites _FillValue? The CFMaskCoder uses the unsigned _FillValue and retains it in .encoding["_FillValue"] from my tests with a real NetCDF file. I'm also just realizing that xarray .open_dataset always converts a pure integer variable to a floating point variable and replaces fills with NaN.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a serialization warning for the case where the provided _FillValue is not already the on-disk signed type.

Won't this warn for python integers though?

The CFMaskCoder uses the unsigned _FillValue and retains it in .encoding["_FillValue"] from my tests with a real NetCDF file.

Ah yes this is correct, we don't reset the type after masking and that would be why we want to combine UnsignedIntegerCoder and CFMaskCoder.

I think we can combine these in a future PR

I'm also just realizing that xarray .open_dataset always converts a pure integer variable to a floating point variable and replaces fills with NaN.

Yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a serialization warning for the case where the provided _FillValue is not already the on-disk signed type.

Won't this warn for python integers though?

To be clear this would be in the encode pipeline and would only happen if the fill provided doesn't fit into the on-disk dtype.

In [9]: np.dtype(np.int8).type(255)
---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
Cell In[9], line 1
----> 1 np.dtype(np.int8).type(255)

OverflowError: Python integer 255 out of bounds for int8

In [10]: np.dtype(np.int8).type(-1)
Out[10]: np.int8(-1)

In [11]: np.__version__
Out[11]: '2.0.0'

So 255 -> int8 is error, -1 -> int8 is fine.

BUT this means anyone doing a xr.open_dataset followed by a .to_netcdf will get the serialization warning with the logic as is (main and this PR) because the _FillValue was made unsigned in the decode pipeline.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUT this means anyone doing a xr.open_dataset followed by a .to_netcdf will get the serialization warning with the logic as is (main and this PR) because the _FillValue was made unsigned in the decode pipeline.

OK good argument.

It seems to me like this is good to go then? And we can follow up with a PR that merges UnsignedCoder and MaskCoder, and handles the FillValue properly. Does that sound right to you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the new serialization warning, yes, I think this is as I expect. Right now this PR should at least restore behavior. If anything, especially based on this discussion, this code is too flexible...but that's kind of how anyone claiming they make CF-compliant NetCDF files assumes things will work.

@dcherian
Copy link
Contributor

I'm going to merge to unblock the release.

Right now roundtripping with _Unsigned isn't actually a true roundtrip since the type of _FillValue is modified. To fully solve this, we'll need to merge UnsignedIntegerCoder and CFMaskCoder but this can be done later. Clearly we also need better roundtrip testing!

@dcherian dcherian merged commit 45d25b9 into pydata:main Jul 22, 2024
28 checks passed
@djhoese djhoese deleted the bugfix-nc-ufill branch July 22, 2024 18:27
dcherian added a commit that referenced this pull request Jul 24, 2024
* main: (54 commits)
  Adding `open_datatree` backend-specific keyword arguments (#9199)
  [pre-commit.ci] pre-commit autoupdate (#9202)
  Restore ability to specify _FillValue as Python native integers (#9258)
  add backend intro and how-to diagram (#9175)
  Fix copybutton for multi line examples in double digit ipython cells (#9264)
  Update signature for _arrayfunction.__array__ (#9237)
  Add encode_cf_datetime benchmark (#9262)
  groupby, resample: Deprecate some positional args (#9236)
  Delete ``base`` and ``loffset`` parameters to resample (#9233)
  Update dropna docstring (#9257)
  Grouper, Resampler as public api (#8840)
  Fix mypy on main (#9252)
  fix fallback isdtype method (#9250)
  Enable pandas type checking (#9213)
  Per-variable specification of boolean parameters in open_dataset (#9218)
  test push
  Added a space to the documentation (#9247)
  Fix typing for test_plot.py (#9234)
  Allow mypy to run in vscode (#9239)
  Revert "Test main push"
  ...
dcherian added a commit to JoelJaeschke/xarray that referenced this pull request Jul 25, 2024
…monotonic-variable

* main: (995 commits)
  Adding `open_datatree` backend-specific keyword arguments (pydata#9199)
  [pre-commit.ci] pre-commit autoupdate (pydata#9202)
  Restore ability to specify _FillValue as Python native integers (pydata#9258)
  add backend intro and how-to diagram (pydata#9175)
  Fix copybutton for multi line examples in double digit ipython cells (pydata#9264)
  Update signature for _arrayfunction.__array__ (pydata#9237)
  Add encode_cf_datetime benchmark (pydata#9262)
  groupby, resample: Deprecate some positional args (pydata#9236)
  Delete ``base`` and ``loffset`` parameters to resample (pydata#9233)
  Update dropna docstring (pydata#9257)
  Grouper, Resampler as public api (pydata#8840)
  Fix mypy on main (pydata#9252)
  fix fallback isdtype method (pydata#9250)
  Enable pandas type checking (pydata#9213)
  Per-variable specification of boolean parameters in open_dataset (pydata#9218)
  test push
  Added a space to the documentation (pydata#9247)
  Fix typing for test_plot.py (pydata#9234)
  Allow mypy to run in vscode (pydata#9239)
  Revert "Test main push"
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants