MultiZarrToZarr append method - coo_map not working as expected #479

Open
sreesanjeevkg opened this issue Jul 18, 2024 · 5 comments

Comments

@sreesanjeevkg

I have approximately 4000 kerchunk JSON files in the mentioned directory. All of them need to be combined with MultiZarrToZarr into a single reference file, and all of them require HTTP calls. When I call MultiZarrToZarr.translate() on all of them at once, I get a server disconnected error. As a quick workaround, I tried appending to the reference file in batches, but the append call raised an error.

It seems that coo_map is not working as expected for the append operation.

Additionally, I'm wondering if there's a way to append directly to an empty path, rather than first creating a reference file and then appending to it.

Could you please provide guidance on how to improve this approach?

Screenshot 2024-07-18 at 5 52 41 PM

@sreesanjeevkg
Author

This is the server disconnected error I get when I try to access a large number of files:

Screenshot 2024-07-18 at 6 24 20 PM
Screenshot 2024-07-18 at 6 24 34 PM

@martindurant
Member

The following should fix the first issue:

--- a/kerchunk/combine.py
+++ b/kerchunk/combine.py
@@ -212,7 +212,7 @@ class MultiZarrToZarr:
         )
         mzz.coos = {}
         for var, selector in mzz.coo_map.items():
-            if selector.startswith("cf:") and "M" not in mzz.coo_dtypes.get(var, ""):
+            if isinstance(selector, str) and selector.startswith("cf:") and "M" not in mzz.coo_dtypes.get(var, ""):
                 import cftime
                 import datetime
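
To illustrate why the guard helps, here is a minimal sketch that does not depend on kerchunk. It assumes `coo_map` can hold non-string selectors (for example a list of explicit values or a callable), which have no `startswith` method. The helper names and the sample mapping are hypothetical; only the condition itself mirrors the patch.

```python
def is_cf_selector_unguarded(selector, dtype=""):
    # Mirrors the original condition: raises for non-string selectors.
    return selector.startswith("cf:") and "M" not in dtype


def is_cf_selector(selector, dtype=""):
    # Mirrors the patched condition: non-string selectors are skipped.
    return (
        isinstance(selector, str)
        and selector.startswith("cf:")
        and "M" not in dtype
    )


# Hypothetical coo_map mixing a string, a list, and a callable selector.
coo_map = {
    "time": "cf:time",
    "member": [0, 1, 2],
    "step": lambda index, fs, var, fn: index,
}

# The guarded check handles every selector type without raising.
results = {var: is_cf_selector(sel) for var, sel in coo_map.items()}

# The unguarded check fails as soon as it meets the list selector.
try:
    is_cf_selector_unguarded(coo_map["member"])
    unguarded_error = None
except AttributeError as exc:
    unguarded_error = exc
```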

As for your question: it would be totally reasonable to have append() create the reference set if it doesn't already exist, so that you would not have to have two different calls in your code.
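
Until append() does this itself, the create-or-append logic can live in user code. The following is only a sketch under stated assumptions: `batched`, `combine_in_batches`, and the `create`/`append` callables are hypothetical names, and the stand-in callables below write plain text instead of calling kerchunk, so that the control flow can be run on its own.

```python
import os
import tempfile


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def combine_in_batches(paths, out_path, batch_size, create, append):
    """Create `out_path` from the first batch if it is missing, then append.

    `create(batch, out_path)` and `append(batch, out_path)` are supplied by
    the caller; they are where the real kerchunk calls would go.
    """
    for batch in batched(paths, batch_size):
        if os.path.exists(out_path):
            append(batch, out_path)
        else:
            create(batch, out_path)


calls = []  # records which branch handled each batch


def fake_create(batch, out_path):
    # Stand-in for building the initial reference file.
    calls.append(("create", len(batch)))
    with open(out_path, "w") as f:
        f.write("\n".join(batch) + "\n")


def fake_append(batch, out_path):
    # Stand-in for appending to an existing reference file.
    calls.append(("append", len(batch)))
    with open(out_path, "a") as f:
        f.write("\n".join(batch) + "\n")


paths = [f"ref_{i}.json" for i in range(7)]
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "combined.txt")
    combine_in_batches(paths, out, 3, fake_create, fake_append)
    with open(out) as f:
        combined = f.read().split()
```

With kerchunk, `create` would presumably wrap `MultiZarrToZarr(batch, ...).translate(filename=out_path)` and `append` would wrap `MultiZarrToZarr.append(batch, original_refs=out_path, ...).translate(filename=out_path)`, passing the same `coo_map`, `concat_dims`, and remote options to both. Check those signatures against your installed version before relying on them.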

For the final issue with ServerDisconnect: this is probably happening during inlining of values. The backend HTTPFileSystem has a few ways to limit the number of concurrent connections allowed. Probably the easiest is to set the following:

fsspec.config.conf["nofiles_gather_batch_size"] = N

where N is a number well below the default of 1280. This setting applies to the current session only (but to all async backends) unless you explicitly save the config.
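
As a conceptual illustration of what a smaller batch size buys, the stdlib-only sketch below runs coroutines in consecutive batches so that no more than `batch_size` are in flight at once. This is not fsspec's actual implementation; `gather_in_batches` and `fake_fetch` are hypothetical names, and `fake_fetch` merely simulates a request.

```python
import asyncio


async def gather_in_batches(coro_factories, batch_size):
    """Await coroutines in consecutive batches of at most `batch_size`."""
    results = []
    for start in range(0, len(coro_factories), batch_size):
        chunk = coro_factories[start:start + batch_size]
        # Only the coroutines of the current chunk run concurrently.
        results.extend(await asyncio.gather(*(make() for make in chunk)))
    return results


async def fake_fetch(i, stats):
    # Simulated request that tracks how many calls are in flight.
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0)  # yield control, as a real request would
    stats["active"] -= 1
    return i * 2


stats = {"active": 0, "peak": 0}
factories = [lambda i=i: fake_fetch(i, stats) for i in range(10)]
results = asyncio.run(gather_in_batches(factories, batch_size=4))
```

Here the peak number of simultaneous "requests" never exceeds the batch size, which is the effect a lower `nofiles_gather_batch_size` is meant to have on connections to the server.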

@sreesanjeevkg
Author

Sure, thanks Martin. Could you open a PR with the changes to the combine method and merge them?

Also, is there any timeline for the append() feature request? When could it be pushed?

I will also try the fsspec config setting for the server requests.

@martindurant
Member

> any timeline on the feature request for the append()

I'm not sure when I'll get to it, but you can keep pinging me :)

@martindurant
Member

#481
