
xarray based ParticleFile to support netcdf and zarr output #1165

Merged · 10 commits · May 2, 2022

Conversation

@erikvansebille (Member) commented Apr 14, 2022

This PR implements exporting of ParticleFile objects by first creating an xarray.Dataset object, rather than using the core netCDF routines we used before. The big advantage is that this makes it much easier to also support other file formats such as zarr: once we have a Dataset, it can be trivially exported either with ds.to_netcdf() or with ds.to_zarr().

As suggested by @JamiePringle in #1164, the zarr format may offer much better performance in terms of disk space and speed. This PR will make it easier to test that for real-world applications.
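
Below is a minimal sketch of the export idea described above: build a single xarray.Dataset from the particle output arrays and pick the writer based on the requested file name. The variable names, dimension names and array shapes here are illustrative, not the exact Parcels internals.

```python
import numpy as np
import xarray as xr

# hypothetical particle output with dimensions (trajectory, observation)
n_traj, n_obs = 100, 50
ds = xr.Dataset(
    {
        "lon": (("traj", "obs"), np.random.rand(n_traj, n_obs)),
        "lat": (("traj", "obs"), np.random.rand(n_traj, n_obs)),
        "time": (("traj", "obs"), np.tile(np.arange(n_obs, dtype="float64"), (n_traj, 1))),
    }
)

outfile = "particles.zarr"  # or "particles.nc"
if outfile.endswith(".zarr"):
    ds.to_zarr(outfile, mode="w")  # zarr directory store
else:
    ds.to_netcdf(outfile)          # classic netCDF file
```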

To-do

  • Implement xarray-based Dataset creation for SOA
  • Implement xarray-based Dataset creation for AOS
  • Also implement on Nodes branch (do you think this is a good idea, @CKehl?)
  • Add unit tests for zarr to test_particle_file.py
  • Clean up code
  • Expand tutorials to explain use of zarr

Beyond scope of this PR (but interesting to explore)

  • Instead of dumping temporary numpy (*.npy) files to disk during the simulation, @daanreijnders suggested that we could create a zarr directory/file immediately during pset.execute(). That would mean no need for conversion upon completion of execute, and hence would directly fix issues like Memory limitation in netcdf export step #1091. To be explored!

This commit implements exporting of ParticleFile based on xarray, rather than core netcdf. The big advantage is that this will make it much easier to also support other file formats like zarr. zarr is also supported now for SOA (AOS to come)
@CKehl (Contributor) left a comment

From the code it looks more or less okay. Personally I don't have any experience with zarr, but I think I get the gist: instead of writing a single file of structured output, zarr writes z(ip)-compressed files, with one file per particle attribute. That certainly shrinks the size. Whether analysing this is faster, especially if one needs all the attributes, is questionable, but possible; I think that is still up for some benchmarking by Parcels users.

Most importantly, I suggest branding this as version 3.0. The reason: the output format changes distinctly, which means all the analysis scripts will need to be adapted. In short, this way of writing may lead to some incompatibility with user scripts that read Parcels output.

@erikvansebille (Member, Author)

Thanks for the response, @CKehl. Note that the implementation is now completely backward-compatible; the code can still output netcdf (this is the default for now), and these netcdf files are exactly the same as those currently produced on the master branch. To use the new zarr-formatted output, the only thing a user has to do is provide a filename ending in .zarr for the ParticleFile object.

And yes, I agree that we have to test-drive the zarr format in real-world simulations and release version 3.0 before we could ever make this the default behaviour. In the meantime, though, this PR does not break any backward compatibility.
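
A minimal usage sketch of that, assuming the standard Parcels API from the tutorials; the tiny idealised flow field here is only a placeholder so the example runs on its own, and the only change needed for zarr output is the file-name extension.

```python
from datetime import timedelta

import numpy as np
from parcels import AdvectionRK4, FieldSet, JITParticle, ParticleSet

# tiny idealised flow field, just so the example is self-contained
dims = {"lon": np.linspace(0, 1, 8), "lat": np.linspace(0, 1, 8)}
data = {"U": np.ones((8, 8)), "V": np.zeros((8, 8))}
fieldset = FieldSet.from_data(data, dims, mesh="flat")

pset = ParticleSet(fieldset, pclass=JITParticle, lon=[0.1], lat=[0.5])

# a name ending in ".zarr" selects the new zarr writer;
# any other name keeps the existing netcdf output
output_file = pset.ParticleFile(name="particles.zarr", outputdt=timedelta(hours=1))
pset.execute(AdvectionRK4, runtime=timedelta(hours=6), dt=timedelta(minutes=5),
             output_file=output_file)
```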

@JamiePringle (Collaborator) commented Apr 19, 2022

@CKehl and @erikvansebille, a couple of comments on zarr. Christian is right that zarr writes out compressed files, but the output scheme is more complex than presented above. Each variable has its own directory, and how the files within that directory are broken up depends on how the dataset is chunked. Chunking is similar in concept to chunking in netCDF, and the performance of reading and writing zarr depends on that choice, often significantly. The chunking of a dataset can be changed after it is written, but this can be time-consuming.

As I commented, I would use 10000 trajectories and all observations as the default chunk size, but I am sure this could be optimized (and different choices are optimal for different access patterns -- again, this should be familiar from netCDF).
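
A sketch of that chunking choice, assuming the output uses ("traj", "obs") dimensions as elsewhere in this thread; the file names are illustrative.

```python
import xarray as xr

# hypothetical existing output converted to zarr with explicit chunking
ds = xr.open_dataset("particles.nc")

# chunk 10000 trajectories at a time and keep all observations in one chunk
# (-1 means "use the full length of that dimension")
ds.chunk({"traj": 10000, "obs": -1}).to_zarr("particles.zarr", mode="w")
```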

I would spend a little time reading about why zarr was designed as it was. Essentially, data formats like hdf5 (the core of netCDF4) have re-implemented a filesystem. In zarr, you instead let your computer's heavily optimized file system do that job. This makes it much easier to share data over the network from cloud services, and allows much more graceful parallel writing of data and out-of-order output, especially for compressed data. For my large problem sizes, both of these are very, very useful.

Also, your file system is often better optimized for the actual physical layout of the storage than something like netCDF4/hdf5 could ever be. (but note that netCDF is starting to implement a zarr backend, and people are working on using zarr and netCDF together... Rich Signell is a good person to follow on this).

One way to look at why Zarr was designed the way it was is to look at these hints for optimal creation of netCDF: https://www.unidata.ucar.edu/software/netcdf/workshops/2011/performance/Time.html . Much of the design of Zarr was to avoid having to worry about this (except for the choice of chunking).

I am happy to talk more about the tradeoffs of Zarr versus netCDF4, but I think Zarr is a carefully designed data store that has many advantages (and some disadvantages) relative to netCDF, and there is a reason many of us who work with very large data sets are attracted to it.

@erikvansebille (Member, Author)

Thanks for this extensive write-up, @JamiePringle. What you say makes a lot of sense, and I'm glad that we have now implemented native zarr output through xarray in this PR; can you give it a go and see how it behaves?

Note that this PR does not include your optimized writing from #1091; should we think about including that too?

And in the future, we might even swap the *.npy temporary output for native .zarr output during execution, so that users who prefer netcdf output can still convert at the end but don't have to anymore. All with full backward compatibility. Exciting!

@JamiePringle (Collaborator)

I will try the pull request sometime this week.

I think it would be non-trivial to move my #1091 code into Parcels. The main issue is that a single *.npy file can contain drifter output from different times. I am not sure how it happens, but it does, and much of my code in #1091 deals with that. Also, my code saves memory by analyzing each MPI rank by itself... that is harder to do when they are all running in real time.

I agree that writing straight to zarr or netCDF from Parcels would be great. Every time I have done parallel I/O I have found that 1) you can make it work and 2) it takes significantly longer than expected.

I might start by having each rank write to their own Zarr files, and then merging them afterwards.

I will try to say something more intelligent after I look at the code that writes out the *.npy files -- there are some odd things in how time ends up in them that I need to understand before being able to talk sensibly about this...

Jamie

@erikvansebille (Member, Author)

Thanks @JamiePringle! I understand that MPI support for on-the-fly zarr dumping may be tricky, but going for one zarr directory per node would be a good start.

The main issue is that a single *.npy contain drifter output from different times. I am not sure how it happens, but it does.

This is probably because particles are being deleted between `time - outputdt/2` and `time + outputdt/2`.
See the `collectionsoa._to_write_particles()` function:
https://github.com/OceanParcels/parcels/blob/523a37227364fb2a3b6dfcf1904e3a487ce3c03b/parcels/collection/collectionsoa.py#L29-L37
which controls which particles are dumped to the dictionary at
https://github.com/OceanParcels/parcels/blob/523a37227364fb2a3b6dfcf1904e3a487ce3c03b/parcels/collection/collectionsoa.py#L849

Does this help? If you don't delete particles, are all the times in a single *.npy file the same?
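
A rough sketch of the kind of time-window test described above (not the exact code behind the links; variable names are illustrative): a particle is written at output time `time` when its own time falls within half an `outputdt` on either side.

```python
import numpy as np

def to_write_mask(particle_time, time, outputdt):
    """Select particles whose time lies in [time - outputdt/2, time + outputdt/2)."""
    half = outputdt / 2.0
    return (particle_time >= time - half) & (particle_time < time + half)

# example: output every 0.25 days (in seconds); the last particle falls outside the window
outputdt = 0.25 * 86400.0
particle_time = np.array([30.50, 30.50, 30.625]) * 86400.0
print(to_write_mask(particle_time, time=30.50 * 86400.0, outputdt=outputdt))
# -> [ True  True False]
```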

@JamiePringle (Collaborator)

@erikvansebille -- I am buried with reviews and end-of-term stuff right now. I will check whether keeping particles from being deleted eliminates this problem, but won't be able to do so until later this week or Monday. I also wonder if the issue might be the exact-equality checks in your tests and rounding errors... all of the time-to-die, release-time and write-time intervals are multiples of each other...

Anyway, I will let you know what I find in half a week or so.

One last thought -- when writing to zarr, it may make sense to arrange the chunking to optimize writing many trajectories at a single observation, and then re-chunk on conversion. The current output scheme is a little tricky, because each trajectory starts at obs=0 at the first time it is written. I think that is a fine scheme, but it means that at each output step you can write to many different observations. Given that observations containing no data do not take up much space in zarr (a uniform set of nan's, or any other constant value, compresses very well), having the temporary output organised by (trajectory, output_time_step) might make things much faster, because you can then write to many trajectories at one output timestep in a single operation... This is essentially what you almost do in the npy files...
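
A sketch of the write pattern suggested above, assuming a plain zarr array chunked so that one chunk holds all trajectories at a single observation index; writing one output step then touches exactly one chunk. The store layout and sizes are illustrative only.

```python
import numpy as np
import zarr

n_traj, n_obs = 100_000, 200
root = zarr.open_group("temp_output.zarr", mode="w")
lon = root.create_dataset("lon", shape=(n_traj, n_obs),
                          chunks=(n_traj, 1), dtype="float64",
                          fill_value=np.nan)

# at each output step, write all trajectories for one observation in a single operation
obs_index = 5
lon[:, obs_index] = np.random.rand(n_traj)
```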

Jamie

@JamiePringle (Collaborator)

@erikvansebille Indeed, the issue goes away if the particles do not die but live forever. I strongly suspect the issue has to do with the exact equality check... but it is still odd, because the issue pops up in the middle of a string of outputs... For example, the time since release for one of the outputs (converted to days by dividing by 86400) looks like this:

```
[28.75, 29.0, 29.25, 29.5, 29.75,
 30.0, 30.25, 30.5, nan, 30.75,
 31.0, 31.04166667]
```

where the sequence of numbers is from a sequence of *.npy files (your current output code handles this perfectly, and the gap is not in the netCDF file). I start particles every half day, and write output every quarter day. It is still unclear to me why the output for some particles skips a step. However, all particles that start at the same time skip the same *.npy file.

I will investigate later, but it is not a big deal because all of our codes handle it properly.

Jamie

@JamiePringle (Collaborator)

@erikvansebille I apologize. I was going to test this code, but I can't figure out how to change the pip install to get the version with these changes. How do I change
`pip install git+https://github.com/OceanParcels/parcels.git@master`
so that it installs the code from this branch, so I can test it?

Sorry, I am new to using GitHub at this level.

Jamie

@erikvansebille (Member, Author)

Hi @JamiePringle. I don't use pip myself, but I guess you 'only' have to go to the directory where Parcels is installed and then use `git checkout xarray_based_output` (see also https://git-scm.com/docs/git-checkout).

@JamiePringle (Collaborator)

That gave me the hint I needed. The pip command is
`pip install git+https://github.com/OceanParcels/parcels.git@xarray_based_output`

@daanreijnders (Collaborator)

Another easy way to install a package from a repo is to clone the repo, switch to the branch you're working from, and run `pip install -e .`. This installs the package in 'editable' mode: any changes you make are immediately available in Python, while at the same time all scripts (e.g. the one to convert .npy to .nc) are installed correctly.

@JamiePringle (Collaborator)

@erikvansebille I tested your code on my runs, and it did what it was supposed to and produced zarr output. The default chunking it picked was fine for most access patterns. As with all of these schemes (including netCDF), if you know exactly how you will most often access the data, you can do better by altering the chunking scheme (which is easy enough to do with the rechunker package).
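
For reference, a sketch of re-chunking an existing store; this uses plain xarray rather than the rechunker package mentioned above (rechunker is more memory-friendly for very large stores). Store names, dimension names and chunk sizes are illustrative.

```python
import xarray as xr

ds = xr.open_zarr("particles.zarr")

# drop the chunk encoding inherited from the source store, otherwise
# to_zarr may refuse to write with a different chunking
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)

# e.g. favour reading a small batch of trajectories at a time
ds.chunk({"traj": 100, "obs": -1}).to_zarr("particles_rechunked.zarr", mode="w")
```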

erikvansebille marked this pull request as ready for review on April 29, 2022 07:20
@erikvansebille (Member, Author) commented Apr 29, 2022

Thanks @JamiePringle, great to hear that the code works for you. Since it's completely backward-compatible, I suggest we now merge it into master. I've also just added information on zarr to the tutorial_output.ipynb tutorial.

Does not seem to work on pytest on local machine; but works outside of pytest?
erikvansebille deleted the xarray_based_output branch on June 23, 2023 13:05