Parquet reference files from git with simplecache #455

Open
wachsylon opened this issue May 6, 2024 · 1 comment

@wachsylon

Hi,

I thought it might be a good idea to put the lazy reference parquet files into git. Using this data directly from git does not seem to be possible, e.g. our GitLab server does not allow byte-range requests, which I guess are required at some point.

So I thought I could add a simplecache:: in the URL and ended up with a catalog which contains entries configured like this:

  pressure-level_analysis_daily:
    args:
      chunks: auto
      consolidated: false
      storage_options:
        lazy: true
        remote_protocol: http
      urlpath: reference::simplecache::{{CATALOG_DIR}}/disk/E5pl00_1D/combined.parq
    driver: zarr

I thought that this would trigger downloads of the required reference files before opening the dataset. For opening (getting metadata and coordinates) it seems to work. But when getting the real variable data, it seems like the caching is not fast enough or not synchronized correctly. Especially if I use it with dask, I get a lot of OS errors or incomplete parquet errors when accessing it the first time. When accessing it a third time, it usually works, as if the caching has only finished by then.

Is there a way to configure the process to wait for the caching before using the parquets? Or am I on the wrong track here?

Btw I also tried simplecache::reference:: but then all data that is used and referenced in the reference files is also cached. I only want to cache the reference parquets however...

Thanks and best,
Fabi

@martindurant
Member

My guess is that you have multiple threads on the worker (or multiple workers that see the same filesystem). Since simplecache really is simple, it assumes that if a file is present, it is the whole cached file. So if one thread starts to download and another tries to open the file before the download finishes, it will read the partial file on disk. That would account for what you see. I haven't yet thought of how you can solve this...
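To make that failure mode concrete, here is a toy model of it. This is not fsspec's code: `naive_cached_read` and `demo_partial_read` are hypothetical names, and the "download" is just a local write stopped partway. It only shows that a "file exists, so it must be complete" check returns truncated bytes while a writer is still in progress.

```python
import os
import tempfile
from typing import Optional, Tuple


def naive_cached_read(path: str) -> Optional[bytes]:
    """Hypothetical cache lookup: treat any existing file as fully cached."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    return None  # not cached yet, the caller would start a download


def demo_partial_read() -> Tuple[int, int]:
    """Return (bytes seen by the reader, bytes in the full file)."""
    payload = b"x" * 1000  # stands in for one reference parquet file
    with tempfile.TemporaryDirectory() as cache_dir:
        target = os.path.join(cache_dir, "refs.parq")
        # "Thread A" is partway through writing the download to its final name.
        with open(target, "wb") as writer:
            writer.write(payload[:400])
            writer.flush()
            # "Thread B" checks the cache now and gets a truncated file.
            seen = naive_cached_read(target)
            writer.write(payload[400:])
    return len(seen), len(payload)
```

A parquet reader handed the truncated bytes would fail on the missing footer, which would be consistent with the "incomplete parquet" errors reported above.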

It may be worth opening an issue on fsspec, whereby the cacher downloads to a different filename and moves to the final destination when done (which may result in some files downloading multiple times, but that's not too bad).
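A minimal sketch of that idea, using only the standard library: write to a temporary name in the same directory, then rename into place once the download is complete. `cache_atomically` and its `fetch` callable are hypothetical and are not part of fsspec; the sketch assumes the temporary file and the final path are on the same filesystem, so that `os.replace` is an atomic rename.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def cache_atomically(fetch: Callable[[], bytes], final_path: str) -> str:
    """Store fetch() at final_path so readers never observe a partial file."""
    final = Path(final_path)
    if final.exists():
        # Presence now implies completeness, because only finished
        # downloads are ever renamed to this path.
        return str(final)
    final.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=final.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(fetch())
        os.replace(tmp_name, final)  # atomic swap into the final name
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return str(final)
```

Two threads that both miss the cache will each download into their own temporary file, and the last rename wins. That is the "some files downloading multiple times" cost mentioned above, but no reader sees a half-written file.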
