Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use config to select implementation #1982

Merged
merged 44 commits into from
Jul 26, 2024

Conversation

brokkoli71
Copy link
Member

fixes #1878

Using the config (https://github.com/pytroll/donfig), the user can specify now the implementation of all codecs, the CodecPipeline, Buffer and NDBuffer.
For each of these objects, the codec registry can deal with multiple different implementations and will use the one selected by the config.

Further changes:

  • All calls on classes Buffer and NDBuffer now get called on selected Implementation
  • Registry was expanded to register codec-pipelines, buffers and ndbuffers
  • Moved registry.py from zarr.codecs.registry to zarr.registry

# Conflicts:
#	src/zarr/array.py
#	src/zarr/config.py
#	src/zarr/group.py
#	src/zarr/metadata.py
#	tests/v3/test_config.py
# Conflicts:
#	src/zarr/abc/codec.py
#	src/zarr/array.py
#	src/zarr/codecs/pipeline.py
#	src/zarr/codecs/sharding.py
#	src/zarr/codecs/transpose.py
#	src/zarr/metadata.py
#	tests/v3/test_indexing.py
# Conflicts:
#	src/zarr/store/remote.py
#	tests/v3/test_buffer.py
@normanrz normanrz self-requested a review June 27, 2024 13:09
@normanrz normanrz added this to the 3.0.0 milestone Jun 27, 2024
@normanrz
Copy link
Contributor

@madsbk I was wondering what you think about overriding the default_buffer_prototype via config. Do you think that is a good idea or unneccessary?

# Conflicts:
#	src/zarr/codecs/registry.py
#	src/zarr/indexing.py
@madsbk
Copy link
Contributor

madsbk commented Jun 28, 2024

@madsbk I was wondering what you think about overriding the default_buffer_prototype via config. Do you think that is a good idea or unneccessary?

I think it would a good idea, or maybe add a default attribute to each AsyncArray instance, which would be set to the config value if not specified when creating the array?

@brokkoli71
Copy link
Member Author

brokkoli71 commented Jun 28, 2024

@madsbk

I think it would a good idea, or maybe add a default attribute to each AsyncArray instance, which would be set to the config value if not specified when creating the array?

Do you mean to have additionally to the BufferPrototype parameter in e.g. setitem another fallback BufferPrototype stored in the AsyncArray instance which might get set upon creation of the array? So the decision of which buffer to use would be like:

prototype in setitem → prototype in AsyncArray instance → config → numpy
(with "→" being the fallback if previous was not set)

@madsbk
Copy link
Contributor

madsbk commented Jun 29, 2024

Yes exactly, but I see your point, it might be a bit too many fall backs :)

In any case, if we allow modification of default_buffer_prototype, I think we need an another constant like numpy_buffer_prototype that is always backed by a numpy array for internal use. E.g. when reading the shard index, we always want to use numpy: https://github.com/zarr-developers/zarr-python/blob/v3/src/zarr/codecs/sharding.py#L610

@brokkoli71
Copy link
Member Author

brokkoli71 commented Jul 1, 2024

good point! @madsbk

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brokkoli71 - thank for pushing this forward. Most everything is looking great! Just a few comments and requests for documentation.

src/zarr/abc/codec.py Outdated Show resolved Hide resolved
src/zarr/config.py Outdated Show resolved Hide resolved
tests/v3/test_buffer.py Outdated Show resolved Hide resolved
tests/v3/test_config.py Outdated Show resolved Hide resolved
src/zarr/config.py Outdated Show resolved Hide resolved
@jhamman jhamman added the V3 Related to compatibility with V3 spec label Jul 1, 2024
src/zarr/config.py Outdated Show resolved Hide resolved
src/zarr/config.py Outdated Show resolved Hide resolved
src/zarr/registry.py Outdated Show resolved Hide resolved
entry_points = get_entry_points()
for e in entry_points.select(group="zarr.codecs"):
__lazy_load_codecs[e.name] = e
for e in entry_points.select(group="zarr"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in multiple libraries in my env declare a buffer class? How will I be able to pick the right one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean multiple implementations per entrypoint?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, multiple implementations of the same codec would overwrite each other in the __lazy_load_codecs dict

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that needs to change. The idea is that multiple libraries can provide implementations for a specific codec (e.g. CPU- and GPU-based gzip) via entrypoints and the user can select one of these implementations via config.

I was also concerned about the other entrypoints (e.g. buffer, pipeline). It would be nice if libraries could declare more than one of each. Again, the user could then select the class via the config through the fully-qualified name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm how would one specify that in entry_points.txt?
my ideas were

  1. seperate different implementations comma-separated like
[zarr.codecs]
gzip = package:EntrypointGzipCodec1, package:EntrypointGzipCodec2
[zarr]
buffer = package:TestEntrypointBuffer, package:AnotherTestEntrypointBuffer

but that would deviate from the PEP 508 standard of entry_points.txt

  1. provide names for different implementations (which do not have an effect) and have a group for buffer, ndbuffer, pipeline and every Codec
[zarr.codecs.gzip]
some_name = package:EntrypointGzipCodec1
another = package:EntrypointGzipCodec2
[zarr.buffer]
some_name = package:TestEntrypointBuffer
another = package:AnotherTestEntrypointBuffer

I'd prefer the second but I dont like having random names for each implementation that will not be used at parsing
any other ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to support both single items (e.g. zarr.buffer = package:TestEntrypointBuffer) and groups (e.g. zarr.buffer = { gpu_buffer = package:TestEntrypointBuffer, cpu_buffer = package:AnotherTestEntrypointBuffer })?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! allowed syntax is now all of the following:

[zarr.codecs]
gzip = package:EntrypointGzipCodec1
[zarr.codecs.gzip]
some_name = package:EntrypointGzipCodec2
another = package:EntrypointGzipCodec3

[zarr]
buffer = package:TestBuffer1
[zarr.buffer]
xyz = package:TestBuffer2
abc = package:TestBuffer3

src/zarr/config.py Outdated Show resolved Hide resolved
@brokkoli71 brokkoli71 requested a review from normanrz July 10, 2024 09:37
"json_indent": 2,
"codec_pipeline": {
"path": "zarr.codecs.pipeline.BatchedCodecPipeline",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing this, I wonder if we should use an object with a path (or class?) for codecs as well. While we don't need it right now, this could be useful in the future for backwards-compat. We could also add that in the future, though.
cc @jhamman @d-v-b

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the back and forth on this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I am proposing is

        {
            "array": {"order": "C"},
            "async": {"concurrency": None, "timeout": None},
            "json_indent": 2,
            "codec_pipeline": {
                "path": "zarr.codecs.pipeline.BatchedCodecPipeline",
                "batch_size": 1,
            },
            "codecs": {
                "blosc": { "path": "zarr.codecs.blosc.BloscCodec" },
                "gzip": { "path": "zarr.codecs.gzip.GzipCodec" },
                "zstd": { "path": "zarr.codecs.zstd.ZstdCodec" },
                "bytes": { "path": "zarr.codecs.bytes.BytesCodec" },
                "endian": { "path": "zarr.codecs.bytes.BytesCodec" },  # compatibility with earlier versions of ZEP1
                "crc32c": { "path": "zarr.codecs.crc32c_.Crc32cCodec" },
                "sharding_indexed": { "path": "zarr.codecs.sharding.ShardingCodec" },
                "transpose": { "path": "zarr.codecs.transpose.TransposeCodec" },
            },
            "buffer": "zarr.buffer.Buffer",
            "ndbuffer": "zarr.buffer.NDBuffer",
        }

src/zarr/registry.py Show resolved Hide resolved
brokkoli71 and others added 2 commits July 10, 2024 17:38
Co-authored-by: Norman Rzepka <code@normanrz.com>
@normanrz normanrz merged commit cbc0887 into zarr-developers:v3 Jul 26, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V3 Related to compatibility with V3 spec
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Use config to select CodecPipeline and Codec implementations
4 participants