
Saving Parcels output directly in zarr format #1199

Merged (89 commits) on Oct 5, 2022

Conversation

erikvansebille (Member) commented Jul 20, 2022

With this PR, we implement saving Parcels trajectory output data directly in zarr format. With the merging of #1165 in v2.3.1, it was already possible to convert the output to a zarr file after the simulation; after this PR, no conversion is needed at all.

Advantages are:

  • No need to convert the pickled dumps to a file at the end of a simulation (avoiding memory issues such as #1091, 'Memory limitation in netcdf export step')
  • No need for an explicit ParticleFile.export() call
  • The Parcels output can be analysed directly during a simulation
  • Much simpler (and hence easier-to-maintain) code

The only major disadvantage so far is that zarr folders are not as 'transportable' as netCDF files. But it is fairly trivial to convert zarr data to a netCDF file using xarray.open_zarr() and Dataset.to_netcdf(), as sketched below.
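For example, a minimal conversion sketch (the file names are placeholders):

    import xarray as xr

    # Open the zarr store written by Parcels and write it out as a single netCDF file
    ds = xr.open_zarr("out.zarr")
    ds.to_netcdf("out.nc")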

Developments to do before merging:

  • Test in real-world examples how much faster/slower the new writing is
  • Implement support for controlling chunk sizes in the ParticleFile object (see the sketch after this list)
  • Further simplify the code of ParticleSet.write() (especially the collection.toDictionary() method)
  • Further clean up now-redundant methods and code
  • Double-check the tutorials (especially the three 'main ones' at the top of https://oceanparcels.org/#tutorials) to make sure they are in line with the new output format
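A minimal usage sketch of the intended interface is below; note that the chunks argument and its (trajectories, observations) layout are assumptions based on the to-do item above, not necessarily the final API.

    from datetime import timedelta
    import numpy as np
    import parcels

    # Minimal idealised fieldset (zero flow) just to have something to advect in
    fieldset = parcels.FieldSet.from_data(
        {"U": np.zeros((2, 2)), "V": np.zeros((2, 2))},
        {"lon": np.array([0.0, 1.0]), "lat": np.array([0.0, 1.0])},
    )
    pset = parcels.ParticleSet(fieldset=fieldset, pclass=parcels.JITParticle, lon=[0.5], lat=[0.5])

    # Output now goes straight to a zarr store; the chunks argument (chunk sizes as
    # (trajectories, observations)) is an assumption, not necessarily the final API
    output_file = pset.ParticleFile(name="out.zarr", outputdt=timedelta(hours=1), chunks=(100, 10))
    pset.execute(parcels.AdvectionRK4, runtime=timedelta(days=1), dt=timedelta(minutes=5),
                 output_file=output_file)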

erikvansebille and others added 30 commits July 12, 2022 16:18
Did not work because can't append in two dimensions at the same time
erikvansebille and others added 21 commits August 24, 2022 17:21
Simplifying (and speeding up) the writing to zarr in MPI mode, by letting each processor write to its own file and then combining with `xr.merge()` at the end
For easier merging of multiple zarr outputs in MPI mode
In xarray version 2022.6.0. Thanks for noticing, @JamiePringle!
Code to combine the output from an MPI run into a single zarr file, re-chunk the data, change variable types, and add variables to the zarr file, even for very large output sizes.
Note: the link will not work yet; only once merged into master
Change documentation_MPI.ipynb and documentation_LargeRunsOutput.ipynb to work around the slow .to_zarr() of datasets that contain datetime variables.
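For reference, a minimal sketch of combining the per-processor output after an MPI run; the proc*.zarr store names are an assumption, and xr.concat along the trajectory dimension is used here as one way to combine (the commits above mention xr.merge()):

    from glob import glob
    import xarray as xr

    # Each MPI rank is assumed to write its own zarr store inside the output folder
    stores = sorted(glob("out.zarr/proc*.zarr"))

    # Combine along the trajectory dimension (assuming all stores share the same obs length),
    # drop the per-store chunk encoding, re-chunk, and write a single zarr store
    ds = xr.concat([xr.open_zarr(s) for s in stores], dim="trajectory")
    for v in ds.variables.values():
        v.encoding.clear()
    ds = ds.chunk({"trajectory": 1000, "obs": 100})
    ds.to_zarr("out_combined.zarr")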
CKehl (Contributor) commented Sep 28, 2022

Comment: while working on #1247, I get errors that are related to the new ParticleSet writing.

____________________________________________________________________________ test_plotting[hist2d-aos] ____________________________________________________________________________

pset_mode = 'aos', mode = 'hist2d', tmpdir = local('/tmp/pytest-of-christian/pytest-8/test_plotting_hist2d_aos_0')

    @pytest.mark.parametrize('pset_mode', pset_modes)
    @pytest.mark.parametrize('mode', ['2d', '3d', 'movie2d', 'hist2d'])
    def test_plotting(pset_mode, mode, tmpdir):
        if mode == '3d' and sys.platform in ['linux', 'linux2']:
            logger.info('Skipping 3d test in linux Travis, since it fails to find display to connect')
            return
>       fp = create_outputfiles(tmpdir, pset_mode)

tests/test_scripts.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/test_scripts.py:38: in create_outputfiles
    output_file.close()
/var/scratch/test_env/lib/python3.8/site-packages/parcels-2.3.3.dev8-py3.8.egg/parcels/particlefile/baseparticlefile.py:217: in close
    self.export()
/var/scratch/test_env/lib/python3.8/site-packages/parcels-2.3.3.dev8-py3.8.egg/parcels/particlefile/particlefileaos.py:179: in export
    ds.to_netcdf(self.fname)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/core/dataset.py:1900: in to_netcdf
    return to_netcdf(
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/api.py:1077: in to_netcdf
    dump_to_store(
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/api.py:1124: in dump_to_store
    store.store(variables, attrs, check_encoding, writer, unlimited_dims=unlimited_dims)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:262: in store
    variables, attributes = self.encode(variables, attributes)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:352: in encode
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/common.py:352: in <dictcomp>
    variables = {k: self.encode_variable(v) for k, v in variables.items()}
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/scipy_.py:203: in encode_variable
    variable = encode_nc3_variable(variable)
/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/netcdf3.py:95: in encode_nc3_variable
    data = coerce_nc3_dtype(var.data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arr = array([[                  0,                   0,                   0,
                          0,                   ...807,
        9223372036854775807, 9223372036854775807, 9223372036854775807,
        9223372036854775807]], dtype=int64)

    def coerce_nc3_dtype(arr):
        """Coerce an array to a data type that can be stored in a netCDF-3 file
    
        This function performs the dtype conversions as specified by the
        ``_nc3_dtype_coercions`` mapping:
            int64  -> int32
            uint64 -> int32
            uint32 -> int32
            uint16 -> int16
            uint8  -> int8
            bool   -> int8
    
        Data is checked for equality, or equivalence (non-NaN values) using the
        ``(cast_array == original_array).all()``.
        """
        dtype = str(arr.dtype)
        if dtype in _nc3_dtype_coercions:
            new_dtype = _nc3_dtype_coercions[dtype]
            # TODO: raise a warning whenever casting the data-type instead?
            cast_arr = arr.astype(new_dtype)
            if not (cast_arr == arr).all():
>               raise ValueError(
                    f"could not safely cast array from dtype {dtype} to {new_dtype}"
                )
E               ValueError: could not safely cast array from dtype int64 to int32

/var/scratch/bin-x86_64/anaconda/envs/OpenParcels_MPI/lib/python3.8/site-packages/xarray/backends/netcdf3.py:66: ValueError
------------------------------------------------------------------------------ Captured stderr call -------------------------------------------------------------------------------
INFO: Compiled ObjectJITParticleAdvectionRK4 ==> /tmp/parcels-1000/lib91743b572285f1361b6ef8c2cdad600c_0.so
-------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------
INFO     parcels.tools.loggers:basekernel.py:255 Compiled ObjectJITParticleAdvectionRK4 ==> /tmp/parcels-1000/lib91743b572285f1361b6ef8c2cdad600c_0.so

In short: when writing the particle set, some of its attributes are converted to an incorrect dtype when the file is written as netCDF. This is currently not picked up by any tests, because they set zarr as a dependency, auto-detect zarr, and therefore only exercise the zarr path.
I don't have zarr installed myself, which is how I spotted the error.
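For illustration, a minimal sketch that reproduces the failing cast outside of Parcels, assuming (as in the traceback above) that the scipy/netCDF3 backend is used because netCDF4 is not available:

    import numpy as np
    import xarray as xr

    # The traceback shows arrays filled with 9223372036854775807 (the int64 maximum),
    # which cannot be represented in int32, so the netCDF3 coercion raises a ValueError
    ds = xr.Dataset({"obs": ("trajectory", np.array([0, np.iinfo(np.int64).max], dtype=np.int64))})
    ds.to_netcdf("repro.nc", engine="scipy")  # ValueError: could not safely cast array from dtype int64 to int32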

I can't fix it myself due to time constraints, and because I haven't looked into the zarr implementation. But it may be good to have a look at this (@erikvansebille).