Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine UnsignedIntegerCoder and CFMaskCoder #9266

Closed
dcherian opened this issue Jul 22, 2024 · 5 comments
Closed

Combine UnsignedIntegerCoder and CFMaskCoder #9266

dcherian opened this issue Jul 22, 2024 · 5 comments

Comments

@dcherian
Copy link
Contributor

What is your issue?

Longer discussion here: #9258 (comment) but fundamentally, it's not possible to handle the _Unsigned attribute and masking separately so we'll need to merge these.

@djhoese will you be able to send a PR for this please?

@djhoese
Copy link
Contributor

djhoese commented Jul 22, 2024

I will try my best, but no guarantees on timeliness. I'll mostly be using this as yet another source of procrastination on the work I'm supposed to be doing. Anyway...

I think the high-level summary of our discussion on my PR (#9258) and the PR that came before it (#9136) as far as changes needed are:

  1. _FillValue as far as the user is concerned should always be the on-disk type. For the case where _Unsigned: "true" that means it will be signed; even in-memory. This means it will not always match the dtype of the data array it is attached/assigned to.
  2. Absorb the UnsignedIntegerCoder into the CFMaskCoder (or some logical not-too-long rename) so that _FillValue can be decoded to unsigned type if necessary, used for masking, and then discarded in favor of CF compliance and retaining the on-disk type of the _FillValue attribute.
  3. If it makes sense, add a serialization warning when in-memory _FillValues go through the encode process and they do not already match the on-disk dtype. For example, a uint8 DataArray with FillValue of 255, but is going to be saved to a NetCDF file as an int8 variable via _Unsigned: "true". 255 can be converted to a signed int8, but it will result in a warning as -1 should have been used instead.

@dcherian you seemed to know that xarray can't load integer arrays as integers and instead produces floats. Is this by design or something that is too complicated to fix?

@keewis
Copy link
Collaborator

keewis commented Jul 22, 2024

you seemed to know that xarray can't load integer arrays as integers and instead produces floats. Is this by design or something that is too complicated to fix?

The reason for this is that there is no representation for masked arrays with dtypes other than complex and float (and the new string dtype) that support masked values – and complex and float reuse nan for this, which actually means something else. Obviously that also means that if you don't need to apply _FillValue the conversion of other dtypes to float does not happen.

The logic is here:

def maybe_promote(dtype: np.dtype) -> tuple[np.dtype, Any]:
"""Simpler equivalent of pandas.core.common._maybe_promote
Parameters
----------
dtype : np.dtype
Returns
-------
dtype : Promoted dtype that can hold missing values.
fill_value : Valid missing value for the promoted dtype.
"""
# N.B. these casting rules should match pandas
dtype_: np.typing.DTypeLike
fill_value: Any
if isdtype(dtype, "real floating"):
dtype_ = dtype
fill_value = np.nan
elif isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.timedelta64):
# See https://github.com/numpy/numpy/issues/10685
# np.timedelta64 is a subclass of np.integer
# Check np.timedelta64 before np.integer
fill_value = np.timedelta64("NaT")
dtype_ = dtype
elif isdtype(dtype, "integral"):
dtype_ = np.float32 if dtype.itemsize <= 2 else np.float64
fill_value = np.nan
elif isdtype(dtype, "complex floating"):
dtype_ = dtype
fill_value = np.nan + np.nan * 1j
elif isinstance(dtype, np.dtype) and np.issubdtype(dtype, np.datetime64):
dtype_ = dtype
fill_value = np.datetime64("NaT")
else:
dtype_ = object
fill_value = np.nan
dtype_out = np.dtype(dtype_)
fill_value = dtype_out.type(fill_value)
return dtype_out, fill_value

To fix that we need one of (or both):

  • a maskedarray implementation that behaves like a normal array (for example marray)
  • truly nullable dtypes

Both of these are sufficiently hard to get right that nothing other than the pandas extension dtypes exist (that I'm aware of). To give you an idea of how hard this is, there are several deferred NEPs that attempt to resolve this (NEPs 12, 24, 25, 26).

@djhoese
Copy link
Contributor

djhoese commented Jul 22, 2024

Well it seems obvious when you say it like that 😉

Ok so I was focused on the encoding/decoding of dtypes, but it is more about the masking and operations that need to know the validity of a value. So while I might know that my data quality flag that is uint8 where a value of 1 means "bad"/mask, that doesn't mean that any numpy operation will know that even though I probably wouldn't use the variable in normal arithmetic operations anyway.

@djhoese
Copy link
Contributor

djhoese commented Jul 24, 2024

FYI I'm almost done with this at least as a first draft. I'll need to clean it up a lot as right now it is mostly a copy/paste job. Right now I'm just trying to get tests to pass and make sense. The hardest part is figuring out when things should be in .encoding and when they should be in .attrs, but I'm almost there.

@dcherian
Copy link
Contributor Author

Closed by #9274

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants