
Saving Parcels output directly in zarr format #1199

Merged (89 commits) on Oct 5, 2022

Conversation

erikvansebille (Member) commented Jul 20, 2022

With this PR, we implement saving Parcels trajectory output data directly in zarr format. With the merging of #1165 in v2.3.1, it was already possible to convert the output to a zarr file after the simulation; after this PR, no conversion is needed at all.

Advantages are:

  • No need to convert the pickled dumps to a file at the end of a simulation (avoiding memory issues such as #1091, 'Memory limitation in netcdf export step')
  • No need for an explicit ParticleFile.export() call
  • The Parcels output can be analysed directly during a simulation
  • Much simpler (and hence easier-to-maintain) code

The only major disadvantage so far is that zarr folders are not as 'transportable' as netCDF files. But it is fairly trivial to convert zarr data to a netCDF file using xarray.open_zarr() and Dataset.to_netcdf(), as sketched below.
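For example, a minimal conversion sketch (the file names are placeholders):

    import xarray as xr

    # Open the zarr store written by Parcels and write it out as a single netCDF file
    ds = xr.open_zarr("out.zarr")
    ds.to_netcdf("out.nc")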

Developments to do before merging:

  • Test in real-world examples how much faster/slower the new writing is
  • Implement support for controlling chunk sizes in the ParticleFile object (see the sketch after this list)
  • Further simplify the code of ParticleSet.write() (especially the collection.toDictionary() method)
  • Further clean up now-redundant methods and code
  • Double-check the tutorials (especially the three 'main ones' at the top of https://oceanparcels.org/#tutorials) to make sure they are in line with the new output format
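A minimal usage sketch of the intended interface is below; note that the chunks argument and its (trajectories, observations) layout are assumptions based on the to-do item above, not necessarily the final API.

    from datetime import timedelta
    import numpy as np
    import parcels

    # Minimal idealised fieldset (zero flow) just to have something to advect in
    fieldset = parcels.FieldSet.from_data(
        {"U": np.zeros((2, 2)), "V": np.zeros((2, 2))},
        {"lon": np.array([0.0, 1.0]), "lat": np.array([0.0, 1.0])},
    )
    pset = parcels.ParticleSet(fieldset=fieldset, pclass=parcels.JITParticle, lon=[0.5], lat=[0.5])

    # Output now goes straight to a zarr store; the chunks argument (chunk sizes as
    # (trajectories, observations)) is an assumption, not necessarily the final API
    output_file = pset.ParticleFile(name="out.zarr", outputdt=timedelta(hours=1), chunks=(100, 10))
    pset.execute(parcels.AdvectionRK4, runtime=timedelta(days=1), dt=timedelta(minutes=5),
                 output_file=output_file)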

erikvansebille and others added 30 commits July 12, 2022 16:18
Did not work because can't append in two dimensions at the same time
erikvansebille and others added 21 commits August 24, 2022 17:21
Simplifying (and speeding up) the writing to zarr in MPI mode, by letting each processor write to its own file and then combining with `xr.merge()` at the end
For easier merging of multiple zarr outputs in MPI mode
In xarray version 2022.6.0. Thanks for noticing, @JamiePringle!
Code to combine the output from an MPI run into a single zarr file, re-chunk the data, change variable types, and add variables to the zarr file, even for very large output sizes.
Note: the link will not work yet; only once merged into master
Change documentation_MPI.ipynb and documentation_LargeRunsOutput.ipynb to work around the slow .to_zarr() of datasets that contain datetime variables.
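For reference, a minimal sketch of combining the per-processor output after an MPI run; the proc*.zarr store names are an assumption, and xr.concat along the trajectory dimension is used here as one way to combine (the commits above mention xr.merge()):

    from glob import glob
    import xarray as xr

    # Each MPI rank is assumed to write its own zarr store inside the output folder
    stores = sorted(glob("out.zarr/proc*.zarr"))

    # Combine along the trajectory dimension (assuming all stores share the same obs length),
    # drop the per-store chunk encoding, re-chunk, and write a single zarr store
    ds = xr.concat([xr.open_zarr(s) for s in stores], dim="trajectory")
    for v in ds.variables.values():
        v.encoding.clear()
    ds = ds.chunk({"trajectory": 1000, "obs": 100})
    ds.to_zarr("out_combined.zarr")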
CKehl (Contributor) commented Sep 28, 2022

Comment: while working on #1247, I get errors that are related to the new ParticleSet writing.

____________________________________________________________________________ test_plotting[hist2d-aos] ____________________________________________________________________________

pset_mode = 'aos', mode = 'hist2d', tmpdir = local('/tmp/pytest-of-christian/pytest-8/test_plotting_hist2d_aos_0')

    @pytest.mark.parametrize('pset_mode', pset_modes)
    @pytest.mark.parametrize('mode', ['2d', '3d', 'movie2d', 'hist2d'])
    def test_plotting(pset_mode, mode, tmpdir):
        if mode == '3d' and sys.platform in ['linux', 'linux2']:
            logger.info('Skipping 3d test in linux Travis, since it fails to find display to connect')
            return
>       fp = create_outputfiles(tmpdir, pset_mode)

tests/test_scripts.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_scripts.py:38: in create_outputfiles
    output_file.close()
/var/scratch/test_env/lib/python3.8/site-packages/parcels-2.3.3.dev8-py3.8.egg/parcels/particlefile/baseparticlefile.py:217: in close
    self.export()
/var/scratch/test_env/lib/python3.8/site-packages/parcels-2.3.3.dev8-py3.8.egg/parcels/particlefile/particlefileaos.py:179: in export
    ds.to_netcdf(self.fname)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/core/dataset.py:1900: in to_netcdf
    return to_netcdf(
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/api.py:1077: in to_netcdf
    dump_to_store(
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/api.py:1124: in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:262: in store
    variables, attributes = self.encode(variables, attributes)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:352: in encode
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:352: in <dictcomp>
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/scipy_.py:203: in encode_variable
    variable = encode_nc3_variable(variable)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/netcdf3.py:95: in encode_nc3_variable
    data = coerce_nc3_dtype(var.data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = array([[                  0,                   0,                   0,
                          0,                   ...807,
        9223372036854775807, 9223372036854775807, 9223372036854775807,
        9223372036854775807]], dtype=int64)

    def coerce_nc3_dtype(arr):
        """Coerce an array to a data type that can be stored in a netCDF-3 file
    
        This function performs the dtype conversions as specified by the
        ``_nc3_dtype_coercions`` mapping:
            int64  -> int32
            uint64 -> int32
            uint32 -> int32
            uint16 -> int16
            uint8  -> int8
            bool   -> int8
    
        Data is checked for equality, or equivalence (non-NaN values) using the
        ``(cast_array == original_array).all()``.
        """
        dtype = str(arr.dtype)
        if dtype in _nc3_dtype_coercions:
            new_dtype = _nc3_dtype_coercions[dtype]
            # TODO: raise a warning whenever casting the data-type instead?
            cast_arr = arr.astype(new_dtype)
            if not (cast_arr == arr).all():
>               raise ValueError(
                    f"could not safely cast array from dtype {dtype} to {new_dtype}"
                )
E               ValueError: could not safely cast array from dtype int64 to int32

/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/netcdf3.py:66: ValueError
------------------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------------------
INFO: Compiled ObjectJITParticleAdvectionRK4 ==> /tmp/parcels-1000/lib91743b572285f1361b6ef8c2cdad600c_0.so
-------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------
INFO     parcels.tools.loggers:basekernel.py:255 Compiled ObjectJITParticleAdvectionRK4 ==> /tmp/parcels-1000/lib91743b572285f1361b6ef8c2cdad600c_0.so

In short: when writing the particle set, some of its attributes are converted to an incorrect dtype when the file is written as netCDF. This is currently not picked up by any tests, because they set zarr as a dependency, auto-detect zarr, and therefore only exercise the zarr path.
I don't have zarr installed myself, which is how I spotted the error.
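For illustration, a minimal sketch that reproduces the failing cast outside of Parcels, assuming (as in the traceback above) that the scipy/netCDF3 backend is used because netCDF4 is not available:

    import numpy as np
    import xarray as xr

    # The traceback shows arrays filled with 9223372036854775807 (the int64 maximum),
    # which cannot be represented in int32, so the netCDF3 coercion raises a ValueError
    ds = xr.Dataset({"obs": ("trajectory", np.array([0, np.iinfo(np.int64).max], dtype=np.int64))})
    ds.to_netcdf("repro.nc", engine="scipy")  # ValueError: could not safely cast array from dtype int64 to int32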

I can't fix it myself due to time constraints, and because I haven't looked into the zarr implementation. But it may be good to have a look at this (@erikvansebille).