to_zarr with mode='a' fails when _FillValue is present, caused by open_zarr with mask_and_scale=False. #9053

Open · cpegel opened this issue May 30, 2024 · 13 comments
Labels
bug, topic-CF conventions, topic-zarr (Related to zarr storage library)

Comments

@cpegel

cpegel commented May 30, 2024

What happened?

When opening a Zarr dataset with open_zarr and then writing to it using to_zarr, we get a ValueError when the _FillValue attribute of at least one data variable or coordinate is present. This happens for example when opening the dataset with mask_and_scale=False.

ValueError: failed to prevent overwriting existing key _FillValue in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.

The error can be worked around by deleting the _FillValue attributes of all data variables and coordinates before calling to_zarr. However, if the Zarr metadata contains meaningful fill_value attributes beforehand, they will then be lost after a round-trip of open_zarr (with mask_and_scale=False) and to_zarr.
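For illustration, a minimal sketch of that workaround, reusing ds and zarr_path from the example below:

# Workaround sketch: drop _FillValue from every data variable and coordinate
# before writing back. Any meaningful fill_value metadata in the store is lost.
for name in ds.variables:
    ds[name].attrs.pop("_FillValue", None)
ds.to_zarr(zarr_path, mode='a')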

The same behavior, but without the cause being mask_and_scale=False, has been reported in the open issues #6069 and #6329 without any solutions.

What did you expect to happen?

I expect to be able to read from and then write to a Zarr store without having to delete attributes in between or losing the fill_value metadata available in the store. Calling to_zarr with mode='a' should simply write any DataArray's _FillValue attribute to the fill_value field in the Zarr metadata instead of failing with a ValueError.
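To make the distinction concrete, a small sketch (assuming the zarr-python v2 API) of where that fill_value lives in the store, independent of the _FillValue attribute that xarray exposes in .attrs:

import zarr

# zarr_path as in the example below
group = zarr.open_group(zarr_path, mode="r")
print(group["A"].fill_value)  # the Zarr-native fill_value from the array metadata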

Minimal Complete Verifiable Example

import xarray as xr

# Create a dataset and write to Zarr storage

ds = xr.Dataset(dict(A=xr.DataArray([1.0])))
# ds.A.attrs is empty here.

zarr_path = "/path/to/storage.zarr"
ds.to_zarr(zarr_path, mode='a')

# Read the dataset from Zarr again using `mask_and_scale=False`

ds = xr.open_zarr(zarr_path, mask_and_scale=False)
# ds.A.attrs is now {'_FillValue': nan}

# Write the dataset to Zarr again
ds.to_zarr(zarr_path, mode='a')  # raises the ValueError quoted above

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 07:53:56) [MSC v.1937 64 bit (AMD64)]
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: ('de_DE', 'cp1252')
libhdf5: None
libnetcdf: None

xarray: 2024.3.0
pandas: 2.2.0
numpy: 1.26.4
scipy: 1.12.0
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.18.0
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: 2024.4.1
distributed: 2024.4.1
matplotlib: 3.8.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2024.3.1
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.0.3
pip: 24.0
conda: None
pytest: None
mypy: None
IPython: 8.21.0
sphinx: None

@cpegel added the bug and needs triage labels on May 30, 2024

welcome bot commented May 30, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@ghiggi

ghiggi commented Jun 3, 2024

Hi @cpegel

I just stumbled on the same problem while rechunking some large datasets.

This also occurs when just doing ds = xr.open_zarr(zarr_path, decode_cf=False)!

I think fixing this is very relevant! One or two years ago, I was using the decode_cf=False, mask_and_scale=False options when rechunking datasets several terabytes in size with the rechunker package. These arguments allowed me to reduce the overall size of the rechunking tasks by a factor of 2 when the data was smartly saved as scaled integers, and I did not encounter any problems back then.

So something must have changed in the past year or so!

@ghiggi

ghiggi commented Jun 3, 2024

MVCE for our use case:

  • _FillValue / add_offset / scale_factor are allowed in the attrs of new variables to be appended.
  • _FillValue / add_offset / scale_factor are not allowed in the attrs of coordinates.
import xarray as xr

zarr_src_path = "/ltenas8/tmp/storage_src.zarr"
zarr_dst_path = "/ltenas8/tmp/storage_rechunked.zarr"

# Create a dataset and write to Zarr storage
ds = xr.Dataset(dict(A=xr.DataArray([1.0])))
ds = ds.assign_coords({"dim_0": [1]})
ds["A"].encoding = {"dtype": "uint16", 
                    "scale_factor": 1.0, 
                    "add_offset": 0.0,
                    "_FillValue": 65535}
ds["dim_0"].encoding = {"dtype": "uint16", 
                        "scale_factor": 1.0, 
                        "add_offset": 0.0,
                        "_FillValue": 65535}
ds.to_zarr(zarr_src_path, mode='w')

# Read the dataset from the Zarr store again (using `decode_cf=False, mask_and_scale=False`)
ds = xr.open_zarr(zarr_src_path, decode_cf=False, mask_and_scale=False)

# Do some rechunking
ds = ds.compute().chunk("auto")

# Write data with new chunking to another Zarr Store
ds.to_zarr(zarr_dst_path, mode='w')

# Add another variable to the Zarr Store
ds = ds.rename({"A": "B"}) # creation of another dummy variable 
ds = ds.compute().chunk("auto")

# This does not work if the coordinate has _FillValue, add_offset or scale_factor
ds["dim_0"].attrs  # still contains _FillValue / add_offset / scale_factor
ds.to_zarr(zarr_dst_path, mode='a')  # raises ValueError

# This works if we remove _FillValue / add_offset / scale_factor from the coordinates
# --> even though variable B keeps _FillValue / add_offset / scale_factor in its attributes!
ds["dim_0"].attrs = {}
ds["B"].attrs  # these encodings are allowed: {'add_offset': 0.0, 'scale_factor': 1.0, '_FillValue': 65535}
ds.to_zarr(zarr_dst_path, mode='a')

@kmuehlbauer
Contributor

@ghiggi Can you explain why you are using packed data (add_offset, scale_factor) without any packing? Why not use uint16 in the first place, instead of float?

@ghiggi

ghiggi commented Jun 3, 2024

Hey @kmuehlbauer. I am not sure I understand your question; I had a long day. What do you mean by 'without any packing'? Specifying the encoding as in the MVCE automatically packs the data and casts the source floating-point variables to uint16 when writing to disk. If the user opens the dataset with decode_cf=True, they get the float variables back (approximated to 2, 3 or so decimals) with invalid values masked to np.nan ...

@kmuehlbauer
Contributor

@ghiggi With scale_factor=1.0 and add_offset=0 this packing is a no-op:

unpacked = packed * scale_factor + add_offset

This means that any decimal places present in the unpacked data are effectively removed when packing, and I can't see how you would get them back when unpacking. Maybe I'm missing something, though.
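For contrast, a tiny numeric sketch (values invented purely for illustration) of packing with a non-trivial scale_factor, which does preserve two decimal places:

import numpy as np

scale_factor, add_offset, fill_value = 0.01, 0.0, 65535
unpacked = np.array([1.23, 4.56, np.nan])

# pack: round to integers, map NaN to the fill value
packed = np.where(
    np.isnan(unpacked), fill_value, np.round((unpacked - add_offset) / scale_factor)
).astype("uint16")

# unpack: packed * scale_factor + add_offset, with the fill value masked back to NaN
restored = np.where(packed == fill_value, np.nan, packed * scale_factor + add_offset)
# restored -> [1.23, 4.56, nan]; with scale_factor=1.0 the rounding step would
# simply have discarded the decimals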

But nevertheless I think I've detected the root cause of the issue:

if existing_variable_names:
    # We make sure that values to be appended are encoded *exactly*
    # as the current values in the store.
    # To do so, we decode variables directly to access the proper encoding,
    # without going via xarray.Dataset to avoid needing to load
    # index variables into memory.
    existing_vars, _, _ = conventions.decode_cf_variables(
        variables={
            k: self.open_store_variable(name=k) for k in existing_variable_names
        },
        # attributes = {} since we don't care about parsing the global
        # "coordinates" attribute
        attributes={},
    )
    # Modified variables must use the same encoding as the store.
    vars_with_encoding = {}
    for vn in existing_variable_names:
        if self._mode in ["a", "a-", "r+"]:
            _validate_datatypes_for_zarr_append(
                vn, existing_vars[vn], variables[vn]
            )
        vars_with_encoding[vn] = variables[vn].copy(deep=False)
        vars_with_encoding[vn].encoding = existing_vars[vn].encoding
    vars_with_encoding, _ = self.encode(vars_with_encoding, {})
    variables_encoded.update(vars_with_encoding)

In this block the existing variables in the store are decoded (even if the dataset was opened with decode_cf=False) and their encoding overrides the encoding of the data to be appended. The undecoded data is thus treated as decoded, which produces the error. I'm not sure that removing the attrs as suggested in the error message is the right fix, since the undecoded data would then be encoded a second time.

I do not immediately see a solution. Maybe it would be enough to check whether the data to be written already has an encoding specified and only try to encode it in those cases?
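Something along these lines, as a rough sketch (not actual xarray code; the set of attribute names checked is an assumption):

from xarray.core.variable import Variable

# Heuristic sketch: treat a variable as still encoded (packed) if CF packing
# keys survive in .attrs, e.g. after open_zarr(..., mask_and_scale=False),
# and skip the second encoding pass for such variables.
CF_ENCODING_ATTRS = {"_FillValue", "add_offset", "scale_factor"}

def needs_encoding(var: Variable) -> bool:
    """Return False if the variable already looks encoded and should be written as-is."""
    return not (CF_ENCODING_ATTRS & set(var.attrs))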

@ghiggi

ghiggi commented Jun 4, 2024

Hey @kmuehlbauer. Shame on me, I was too tired yesterday to notice that in the MVCE I provided I unintentionally specified a scale_factor and add_offset that do not actually pack the data :-P
For my use case, I worked around the issue by decoding just the coordinates with the relevant options of xr.decode_cf.
I am also not sure what the best solution to this issue is, but thanks for diving into it.
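A sketch of that coordinate-only decoding (the use of xr.conventions.decode_cf_variable here is an assumption, not the exact code used above):

import xarray as xr

# Keep the data variables packed, but decode the coordinates so that
# _FillValue / scale_factor / add_offset move from .attrs back into .encoding.
ds = xr.open_zarr(zarr_src_path, decode_cf=False, mask_and_scale=False)
for name in ds.coords:
    ds[name] = xr.conventions.decode_cf_variable(name, ds[name].variable)
ds.to_zarr(zarr_dst_path, mode='a')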

@kmuehlbauer
Contributor

kmuehlbauer commented Jun 5, 2024

@jhamman, @rabernat Your expertise is needed here. How should encoding/decoding be handled in the case of undecoded or partially decoded data? Should this raise a more meaningful error?

@kmuehlbauer added the topic-CF conventions and topic-zarr labels and removed the needs triage label on Jun 5, 2024
@kmuehlbauer
Contributor

Also connected to #5405.

Again, the issue is caused by (unconditionally) encoding variables that are already encoded. See the comment above.

@rabernat
Contributor

Your expertise is needed here. How should encoding/decoding be handled in the case of undecoded or partially decoded data?

This is a tricky issue. From a fundamental point of view, this is challenging because there are really two layers of encoding: Xarray's encoding pipelines and the Zarr codecs. In the Zarr world, codecs handle all of the data transformations, while attrs are used to hold user-defined metadata. Xarray overloads the Zarr attrs with encoding information, and Zarr doesn't really know or understand this.

The opinion I have developed is that it would be best if we could push as much encoding as possible down to the Zarr layer, rather than Xarray. (This is already possible today for scale_factor and add_offset). When we originally implemented the Zarr backend, we just copied a lot from NetCDF, which delegated lots of encoding to Xarray. But, especially with Zarr V3, we have the ability to create new, custom codecs that do exactly what we want. CF-time encoding, for example, could be a Zarr codec.
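For example, a sketch of what this looks like today for scale/offset packing (assuming zarr-python v2 with numcodecs filters; the path is hypothetical):

import numcodecs
import xarray as xr

# Pack floats to uint16 at the Zarr layer via a filter, instead of via
# xarray's scale_factor / add_offset attributes; any Zarr client then
# decodes the data, not only xarray.
ds = xr.Dataset({"A": ("x", [1.23, 4.56])})
packing = numcodecs.FixedScaleOffset(offset=0.0, scale=100, dtype="f8", astype="u2")
ds.to_zarr(
    "/path/to/packed.zarr",  # hypothetical path
    mode="w",
    encoding={"A": {"filters": [packing]}},
)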

In the case where Xarray is handling the encoding, I feel like we need a more comprehensive set of compatibility checks between the encoding attributes and the zarr attributes on disk in the existing store. Right now these are quite ad hoc. We need a way to clearly distinguish which zarr attributes are actually encoding parameters.

@dopplershift
Contributor

dopplershift commented Jun 12, 2024

Wouldn't using a custom codec for writing potentially create interoperability challenges with other clients, though?

@rabernat
Contributor

rabernat commented Jun 12, 2024

"Custom codec" here means an official Zarr extension. We now have this possibility in Zarr V3. Yes, implementations would have to support it.

It's important to recognize that we are already essentially using such custom codecs (without any spec at the Zarr level) when we let Xarray handle the encoding. If a non-Xarray Zarr client opens data with cf-time encoding, or scale_factor and add_offset in the attrs, it will not be decoded properly. If we push this to the Zarr level, at least such clients will be able to raise an error rather than sending un-decoded data to the user.

@dopplershift
Contributor

👍 to official codec.

I agree it's good to formalize this and make the user experience more robust. I don't think it's necessarily better to raise an error rather than return undecoded data, though. Undecoded data can be introspected, which gives the user the chance to handle it manually. An error is a full stop with no workaround; it is, by definition, user-hostile, especially when the user is not the one who created the file.
