
xarray based ParticleFile to support netcdf and zarr output #1165

Merged · 10 commits · May 2, 2022

Conversation

@erikvansebille (Member) commented Apr 14, 2022

This PR implements exporting of ParticleFile objects by first creating an xarray.Dataset object, rather than using the core netCDF routines we used before. The big advantage is that this makes it much easier to also support other file formats such as zarr: once we have a Dataset, it can be trivially exported either with ds.to_netcdf() or with ds.to_zarr().

As suggested by @JamiePringle in #1164, the zarr format may offer much better performance in terms of disk space and speed. This PR will make it easier to test that for real-world applications.
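
Below is a minimal sketch of the export idea described above: build a single xarray.Dataset from the particle output arrays and pick the writer based on the requested file name. The variable names, dimension names and array shapes here are illustrative, not the exact Parcels internals.

```python
import numpy as np
import xarray as xr

# hypothetical particle output with dimensions (trajectory, observation)
n_traj, n_obs = 100, 50
ds = xr.Dataset(
    {
        "lon": (("traj", "obs"), np.random.rand(n_traj, n_obs)),
        "lat": (("traj", "obs"), np.random.rand(n_traj, n_obs)),
        "time": (("traj", "obs"), np.tile(np.arange(n_obs, dtype="float64"), (n_traj, 1))),
    }
)

outfile = "particles.zarr"  # or "particles.nc"
if outfile.endswith(".zarr"):
    ds.to_zarr(outfile, mode="w")  # zarr directory store
else:
    ds.to_netcdf(outfile)          # classic netCDF file
```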

To-do

  • Implement xarray-based Dataset creation for SOA
  • Implement xarray-based Dataset creation for AOS
  • Also implement on Nodes branch (do you think this is a good idea, @CKehl?)
  • Add unit tests for zarr to test_particle_file.py
  • Clean up code
  • Expand tutorials to explain use of zarr

Beyond scope of this PR (but interesting to explore)

  • Instead of dumping temporary numpy (*.npy) files to disk during the simulation, @daanreijnders suggested that we could create a zarr directory/file immediately during pset.execute(). That would mean no need for conversion upon completion of execute, and hence would directly fix issues like Memory limitation in netcdf export step #1091. To be explored!

This commit implements exporting of ParticleFile based on xarray, rather than core netcdf. The big advantage is that this will make it much easier to also support other file formats like zarr. zarr is also supported now for SOA (AOS to come)
@CKehl (Contributor) left a comment

From the code it looks more or less okay. Personally I don't have any experience with zarr, but I think I get the gist: instead of writing a single file of structured output, zarr writes z(ip)-compressed files, with one file per particle attribute. That certainly shrinks the size. Whether analysing this is faster, especially if one needs all the attributes, is questionable, but possible; I think that is still up for some benchmarking by Parcels users.

Most importantly, I suggest branding this as version 3.0. The reason: the output format changes distinctly, which means all the analysis scripts will need to be adapted. In short, this way of writing may lead to some incompatibility with user scripts that read Parcels output.

@erikvansebille (Member, Author)

Thanks for the response, @CKehl. Note that the implementation is now completely backward-compatible; the code can still output netcdf (this is the default for now), and these netcdf files are exactly the same as those currently produced on the master branch. To use the new zarr-formatted output, the only thing a user has to do is provide a filename ending in .zarr for the ParticleFile object.

And yes, I agree that we have to test-drive the zarr format in real-world simulations and release version 3.0 before we could ever make this the default behaviour. In the meantime, though, this PR does not break any backward compatibility.
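
A minimal usage sketch of that, assuming the standard Parcels API from the tutorials; the tiny idealised flow field here is only a placeholder so the example runs on its own, and the only change needed for zarr output is the file-name extension.

```python
from datetime import timedelta

import numpy as np
from parcels import AdvectionRK4, FieldSet, JITParticle, ParticleSet

# tiny idealised flow field, just so the example is self-contained
dims = {"lon": np.linspace(0, 1, 8), "lat": np.linspace(0, 1, 8)}
data = {"U": np.ones((8, 8)), "V": np.zeros((8, 8))}
fieldset = FieldSet.from_data(data, dims, mesh="flat")

pset = ParticleSet(fieldset, pclass=JITParticle, lon=[0.1], lat=[0.5])

# a name ending in ".zarr" selects the new zarr writer;
# any other name keeps the existing netcdf output
output_file = pset.ParticleFile(name="particles.zarr", outputdt=timedelta(hours=1))
pset.execute(AdvectionRK4, runtime=timedelta(hours=6), dt=timedelta(minutes=5),
             output_file=output_file)
```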

@JamiePringle (Collaborator) commented Apr 19, 2022

@CKehl and @erikvansebille, a couple of comments on zarr. Christian is right that zarr writes out compressed files, but the output scheme is more complex than presented above. Each variable has its own directory, and how the files within that directory are broken up depends on how the dataset is chunked. Chunking is similar in concept to chunking in netCDF, and the performance of reading and writing zarr depends on that choice, often significantly. The chunking of a dataset can be changed after it is written, but this can be time-consuming.

As I commented, I would use 10000 trajectories and all observations as the default chunk size, but I am sure this could be optimized (and different choices are optimal for different access patterns -- again, this should be familiar from netCDF).
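
A sketch of that chunking choice, assuming the output uses ("traj", "obs") dimensions as elsewhere in this thread; the file names are illustrative.

```python
import xarray as xr

# hypothetical existing output converted to zarr with explicit chunking
ds = xr.open_dataset("particles.nc")

# chunk 10000 trajectories at a time and keep all observations in one chunk
# (-1 means "use the full length of that dimension")
ds.chunk({"traj": 10000, "obs": -1}).to_zarr("particles.zarr", mode="w")
```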

I would spend a little time reading about why zarr was designed as it was. Essentially, data formats like hdf5 (the core of netCDF4) have re-implemented a filesystem. In zarr, you instead let your computer's heavily optimized file system do that job. This makes it much easier to share data over the network from cloud services, and allows much more graceful parallel writing of data and out-of-order output, especially for compressed data. For my large problem sizes, both of these are very, very useful.

Also, your file system is often better optimized for the actual physical layout of the storage than something like netCDF4/hdf5 could ever be. (but note that netCDF is starting to implement a zarr backend, and people are working on using zarr and netCDF together... Rich Signell is a good person to follow on this).

One way to look at why Zarr was designed the way it was is to look at these hints for optimal creation of netCDF: https://www.unidata.ucar.edu/software/netcdf/workshops/2011/performance/Time.html . Much of the design of Zarr was to avoid having to worry about this (except for the choice of chunking).

I am happy to talk more about the tradeoffs of Zarr versus netCDF4, but I think Zarr is a carefully designed data store that has many advantages (and some disadvantages) relative to netCDF, and there is a reason many of us who work with very large data sets are attracted to it.

@erikvansebille (Member, Author)

Thanks for this extensive write-up, @JamiePringle. What you say makes a lot of sense, and I'm glad that we have now implemented native zarr output through xarray in this PR; can you give it a go and see how it behaves?

Note that this PR does not include your optimized writing from #1091; should we think about including that too?

And in the future, we might even swap the *.npy temporary output for native .zarr output during execution, so that users who prefer netcdf output can still convert at the end but don't have to anymore. All with full backward compatibility. Exciting!

@JamiePringle (Collaborator)

I will try the pull request sometime this week.

I think it would be non-trivial to move my #1091 code into Parcels. The main issue is that a single *.npy file can contain drifter output from different times. I am not sure how it happens, but it does, and much of my code in #1091 deals with that. Also, my code saves memory by analyzing each MPI rank by itself... that is harder to do when they are all running in real time.

I agree that writing straight to zarr or netCDF from Parcels would be great. Every time I have done parallel I/O I have found that 1) you can make it work and 2) it takes significantly longer than expected.

I might start by having each rank write to their own Zarr files, and then merging them afterwards.

I will try to say something more intelligent after I look at the code that writes out the *.npy files -- there are some odd things in how time ends up in them that I need to understand before being able to talk sensibly about this...

Jamie

@erikvansebille (Member, Author)

Thanks @JamiePringle! I understand that MPI support for on-the-fly zarr dumping may be tricky, but going for one zarr directory per node would be a good start.

The main issue is that a single *.npy contain drifter output from different times. I am not sure how it happens, but it does.

This is probably because particles are being deleted between `time - outputdt/2` and `time + outputdt/2`.
See the `collectionsoa._to_write_particles()` function:
https://github.com/OceanParcels/parcels/blob/523a37227364fb2a3b6dfcf1904e3a487ce3c03b/parcels/collection/collectionsoa.py#L29-L37
which controls which particles are dumped to the dictionary at
https://github.com/OceanParcels/parcels/blob/523a37227364fb2a3b6dfcf1904e3a487ce3c03b/parcels/collection/collectionsoa.py#L849

Does this help? If you don't delete particles, are all the times in a single *.npy file the same?
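
A rough sketch of the kind of time-window test described above (not the exact code behind the links; variable names are illustrative): a particle is written at output time `time` when its own time falls within half an `outputdt` on either side.

```python
import numpy as np

def to_write_mask(particle_time, time, outputdt):
    """Select particles whose time lies in [time - outputdt/2, time + outputdt/2)."""
    half = outputdt / 2.0
    return (particle_time >= time - half) & (particle_time < time + half)

# example: output every 0.25 days (in seconds); the last particle falls outside the window
outputdt = 0.25 * 86400.0
particle_time = np.array([30.50, 30.50, 30.625]) * 86400.0
print(to_write_mask(particle_time, time=30.50 * 86400.0, outputdt=outputdt))
# -> [ True  True False]
```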

@JamiePringle (Collaborator)

@erikvansebille -- I am buried with reviews and end-of-term stuff right now. I will check whether keeping particles from being deleted eliminates this problem, but won't be able to do so until later this week or Monday. I also wonder if the issue might be the exact-equality checks in your tests and rounding errors... all of the time-to-die, release-time and write-time intervals are multiples of each other...

Anyway, I will let you know what I find in half a week or so.

One last thought -- when writing to zarr, it may make sense to arrange the chunking to optimize writing many trajectories at a single observation, and then re-chunk on conversion. The current output scheme is a little tricky, because each trajectory starts at obs=0 at the first time it is written. I think that is a fine scheme, but it means that at each output step you can write to many different observations. Given that observations containing no data do not take up much space in zarr (a uniform set of nan's, or any other constant value, compresses very well), having the temporary output organised by (trajectory, output_time_step) might make things much faster, because you can then write to many trajectories at one output timestep in a single operation... This is essentially what you almost do in the npy files...
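
A sketch of the write pattern suggested above, assuming a plain zarr array chunked so that one chunk holds all trajectories at a single observation index; writing one output step then touches exactly one chunk. The store layout and sizes are illustrative only.

```python
import numpy as np
import zarr

n_traj, n_obs = 100_000, 200
root = zarr.open_group("temp_output.zarr", mode="w")
lon = root.create_dataset("lon", shape=(n_traj, n_obs),
                          chunks=(n_traj, 1), dtype="float64",
                          fill_value=np.nan)

# at each output step, write all trajectories for one observation in a single operation
obs_index = 5
lon[:, obs_index] = np.random.rand(n_traj)
```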

Jamie

@JamiePringle (Collaborator)

@erikvansebille Indeed, the issue goes away if the particles do not die but live forever. I strongly suspect the issue has to do with the exact equality check... but it is still odd, because the issue pops up in the middle of a string of outputs... For example, the time since release for one of the outputs (converted to days by dividing by 86400) looks like this:

```
[28.75, 29.0, 29.25, 29.5, 29.75,
 30.0, 30.25, 30.5, nan, 30.75,
 31.0, 31.04166667]
```

where the sequence of numbers is from a sequence of *.npy files (your current output code handles this perfectly, and the gap is not in the netCDF file). I start particles every half day, and write output every quarter day. It is still unclear to me why the output for some particles skips a step. However, all particles that start at the same time skip the same *.npy file.

I will investigate later, but it is not a big deal because all of our codes handle it properly.

Jamie

@JamiePringle (Collaborator)

@erikvansebille I apologize. I was going to test this code, but I can't figure out how to change the pip install to get the version with these changes. How do I change
`pip install git+https://github.com/OceanParcels/parcels.git@master`
so that it installs the code from this branch, so I can test it?

Sorry, I am new to using GitHub at this level.

Jamie

@erikvansebille (Member, Author)

Hi @JamiePringle. I don't use pip myself, but I guess you 'only' have to go to the directory where Parcels is installed and then use `git checkout xarray_based_output` (see also https://git-scm.com/docs/git-checkout).

@JamiePringle (Collaborator)

That gave me the hint I needed. The pip command is
`pip install git+https://github.com/OceanParcels/parcels.git@xarray_based_output`

@daanreijnders (Collaborator)

Another easy way to install a package from a repo is to clone the repo, switch to the branch you're working from, and run `pip install -e .`. This installs the package in 'editable' mode: any changes you make are immediately available in Python, while at the same time all scripts (e.g. the one to convert .npy to .nc) are installed correctly.

@JamiePringle (Collaborator)

@erikvansebille I tested your code on my runs, and it did what it was supposed to and produced zarr output. The default chunking it picked was fine for most access patterns. As with all of these schemes (including netCDF), if you know exactly how you will most often access the data, you can do better by altering the chunking scheme (which is easy enough to do with the rechunker package).
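
For reference, a sketch of re-chunking an existing store; this uses plain xarray rather than the rechunker package mentioned above (rechunker is more memory-friendly for very large stores). Store names, dimension names and chunk sizes are illustrative.

```python
import xarray as xr

ds = xr.open_zarr("particles.zarr")

# drop the chunk encoding inherited from the source store, otherwise
# to_zarr may refuse to write with a different chunking
for var in ds.variables:
    ds[var].encoding.pop("chunks", None)

# e.g. favour reading a small batch of trajectories at a time
ds.chunk({"traj": 100, "obs": -1}).to_zarr("particles_rechunked.zarr", mode="w")
```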

erikvansebille marked this pull request as ready for review on April 29, 2022 07:20
@erikvansebille (Member, Author) commented Apr 29, 2022

Thanks @JamiePringle, great to hear that the code works for you. Since it's completely backward-compatible, I suggest we now merge it into master. I've also just added information on zarr to the tutorial_output.ipynb tutorial.

Does not seem to work on pytest on local machine; but works outside of pytest?
erikvansebille deleted the xarray_based_output branch on June 23, 2023 13:05