
Codecs without array metadata #1632

Merged
merged 18 commits into zarr-developers:v3 from the codecs-without-array-metadata branch
Feb 7, 2024

Conversation

normanrz
Contributor

This PR removes the array_metadata attribute from the codecs. That makes it possible to construct Codec classes that can be reused for multiple arrays. It is kind of a precursor for #1623.

Instead there is now a chunk_metadata argument passed to the encode and decode methods. This chunk metadata can be the same object for all chunks, which is valid for regular chunk grids. When we have variable chunking in the future, each chunk can have its own metadata passed through the codec pipeline.

There are 2 new methods on the Codec interface:

  • validate validates a codec configuration against the array metadata. Consequently, it can only be called after the array metadata has been constructed (which also contains the list of codecs). For example, validate can be used to check that the number of dimensions in the order attribute of the transpose configuration matches the number of dimensions of the array. It cannot modify the codec configuration; it can only raise exceptions.
  • evolve can be used to create a new codec with a "fixed" configuration. For example, the blosc codec has a typesize attribute, which can trivially be determined once the dtype of the array is known. It would be poor UX to require the user to set this explicitly when constructing the codec. The evolve method can perform this modification. For now it only has two arguments, ndim and data_type. It is called before the array metadata is constructed, so we can't pass that in. I don't think this is the best interface we can come up with; ideas welcome. (A rough sketch of the resulting interface follows below.)
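
To make this concrete, here is a minimal sketch of what such a Codec interface could look like. This is an illustration only, not the exact API in this PR; the array_metadata parameter is deliberately left unannotated:

```python
# Illustrative sketch only -- not the exact API merged in this PR.
from abc import ABC
import numpy as np


class Codec(ABC):
    def evolve(self, *, ndim: int, data_type: np.dtype) -> "Codec":
        # Return self, or a new codec instance with attributes "fixed" from the
        # array's properties (e.g. blosc filling in typesize from data_type).
        return self

    def validate(self, array_metadata) -> None:
        # Raise an exception if this codec's configuration is incompatible with
        # the (already constructed) array metadata; must not modify the config.
        pass
```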

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@normanrz normanrz added the V3 Related to compatibility with V3 spec label Jan 12, 2024
@normanrz normanrz self-assigned this Jan 12, 2024
@@ -154,20 +163,19 @@ class ShardingCodecIndexLocation(Enum):

 @frozen
-class CoreArrayMetadata:
-    shape: ChunkCoords
+class ChunkMetadata:
Contributor

I'm thinking about the name for this class:

  • the "chunk" part: none of the attributes of this class are specific to chunks per se -- if we aliased chunk_shape to shape and data_type to dtype, then we would have a generic specification of an NDArray-like that wouldn't look out of place in numpy. So what if we drop the "chunk" references in the name and attributes?
    The old name was CoreArrayMetadata. "Core array metadata" feels like a better description than "chunk metadata", but it's a bit long, and I don't think we need to semantically distinguish between "core" arrays and "non-core" arrays (i.e., chunks) in code like this.
    The old name was CoreArrayMetadata. "core array metadata" feels like a better description than "chunk metadata", but it's a bit long, and I don't think we need to semantically distinguish between "core" arrays and "non-core arrays (i.e., chunks)" in code like this.

  • the "metadata" part: I'm not sure that I would call the shape and dtype of an array "metadata" in the conventional sense -- "property" seems like a better fit? But I'm not sure. And we are already in a module called metadata, so repeating that word in class names might be redundant; what do you think of moving from the <Thing>Metadata pattern to <Thing>, as long as the "metadata-ness" is conveyed by the module name?

A couple of alternative names for this class, off the top of my head:

  • ArraySpec
  • ArrayInfo
  • ArrayProps
  • NDArray (I kind of prefer this one, as long as we namespace it in the right module to make things unambiguous)

Contributor Author

I would find NDArray confusing, because it is not an array but only contains information about an array. ArraySpec or ArrayInfo are fine with me.

Contributor

Let's just go with ArraySpec, and if people hate it we can find a better name at that time?
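
For readers skimming this thread, a rough sketch of what an ArraySpec-style class could look like. The attribute set here is an assumption based on the discussion above, not the merged definition:

```python
# Hypothetical ArraySpec, inferred from this thread; the merged class may differ.
from dataclasses import dataclass
from typing import Any, Tuple

import numpy as np

ChunkCoords = Tuple[int, ...]


@dataclass(frozen=True)
class ArraySpec:
    shape: ChunkCoords   # shape of the (chunk) array flowing through the codec pipeline
    dtype: np.dtype      # in-memory dtype of that array
    fill_value: Any      # value used where no data has been written
```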

zarr/v3/metadata.py (outdated review thread, resolved)
-        return CoreArrayMetadata(
-            shape=self.shape,
+    def get_chunk_metadata(
+        self, _chunk_coords: ChunkCoords, runtime_configuration: RuntimeConfiguration
Contributor

@d-v-b d-v-b Jan 13, 2024

Neither _chunk_coords nor runtime_configuration gets used here. I'm guessing that this function takes _chunk_coords as an argument in anticipation of irregular chunking, but what's the use for runtime_configuration? (Never mind, the GitHub UI didn't show the full return value.) Maybe a docstring could help here.

Contributor Author

I'm guessing that this function takes _chunk_coords as an argument in anticipation of irregular chunking,

Exactly! We can leave it out for now.

zarr/v3/codecs/transpose.py (outdated review thread, resolved)
         for x, i in enumerate(self.order):
             inverse_order[x] = i
         chunk_array = chunk_array.transpose(inverse_order)
         return chunk_array

     async def encode(
-        self,
-        chunk_array: np.ndarray,
+        self, chunk_array: np.ndarray, chunk_metadata: ChunkMetadata
Contributor

Is it important that this method operate on chunks, as opposed to arrays (of which chunks are a subset)? If we want these functions to be agnostic to the semantics of chunking, and scoped more broadly to transforming arrays / bytes, then maybe renaming the chunk_array and chunk_metadata parameters to array and array_metadata would make sense?

Contributor Author

transpose is an ArrayArrayCodec; that is why the arguments are called chunk_array, in contrast to chunk_bytes.
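
A small sketch of the naming convention being described, with illustrative signatures (not the exact PR API): the argument name follows what the codec consumes and produces.

```python
# Illustrative only: chunk_array for array-to-array codecs, chunk_bytes for bytes-to-bytes codecs.
from typing import Optional
import numpy as np


class ArrayArrayCodec:
    # e.g. transpose: ndarray in, ndarray out
    async def encode(self, chunk_array: np.ndarray, chunk_metadata) -> Optional[np.ndarray]: ...
    async def decode(self, chunk_array: np.ndarray, chunk_metadata) -> np.ndarray: ...


class BytesBytesCodec:
    # e.g. gzip: raw bytes in, raw bytes out
    async def encode(self, chunk_bytes: bytes, chunk_metadata) -> Optional[bytes]: ...
    async def decode(self, chunk_bytes: bytes, chunk_metadata) -> bytes: ...
```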

     ) -> Optional[np.ndarray]:
         chunk_array = chunk_array.transpose(self.order)
         return chunk_array

-    def compute_encoded_size(self, input_byte_length: int) -> int:
+    def compute_encoded_size(self, input_byte_length: int, _chunk_metadata: ChunkMetadata) -> int:
Contributor

Thoughts on compute_encoded_size vs encoded_size, or get_encoded_size?

Contributor Author

All fine with me. I just copied it straight from the implementation recommendations in the spec.

zarr/v3/array_v2.py (outdated review thread, resolved)
@d-v-b
Contributor

d-v-b commented Jan 15, 2024

evolve can be used to create a new codec with a "fixed" configuration. For example, the blosc codec has a typesize attribute, which can trivially be determined once the dtype of the array is known. It would be poor UX to require the user to set this explicitly when constructing the codec. The evolve method can perform this modification. For now it only has two arguments, ndim and data_type. It is called before the array metadata is constructed, so we can't pass that in. I don't think this is the best interface we can come up with; ideas welcome.

Am I right in thinking that evolve is basically a copy, with constraints? Personally I prefer the word "copy" over "evolve", because that makes the semantics a bit more clear -- based on the name alone, it's not intuitive to me whether "evolve" modifies an object in place, or returns a new object. Of course intuitions can change with time 😉, so if lots of people are fine with "evolve" then I can adapt.

Can we be sure that ndim and data_type will be the only free parameters for all the codecs? If not, it might make sense to have a more open-ended API like the one pydantic uses for model_copy, wherein you pass an optional update parameter, which is a dict of {field_name: new_value}. Even if we keep evolve(self, ndim, data_type), it probably makes sense for that function to call a more general purpose copy method, unless we are super confident that ndim and data_type are the only free parameters.

@normanrz
Contributor Author

evolve can be used to create a new codec with a "fixed" configuration. For example, the blosc codec has a typesize attribute, which can trivially be determined once the dtype of the array is known. It would be poor UX to require the user to set this explicitly when constructing the codec. The evolve method can perform this modification. For now it only has two arguments, ndim and data_type. It is called before the array metadata is constructed, so we can't pass that in. I don't think this is the best interface we can come up with; ideas welcome.

Am I right in thinking that evolve is basically a copy, with constraints? Personally I prefer the word "copy" over "evolve", because that makes the semantics a bit more clear -- based on the name alone, it's not intuitive to me whether "evolve" modifies an object in place, or returns a new object. Of course intuitions can change with time 😉, so if lots of people are fine with "evolve" then I can adapt.

attrs calls it evolve and dataclasses calls it replace. I think it is a bit more than a copy because it changes the attributes.

Can we be sure that ndim and data_type will be the only free parameters for all the codecs? If not, it might make sense to have a more open-ended API like the one pydantic uses for model_copy, wherein you pass an optional update parameter, which is a dict of {field_name: new_value}. Even if we keep evolve(self, ndim, data_type), it probably makes sense for that function to call a more general purpose copy method, unless we are super confident that ndim and data_type are the only free parameters.

Very likely, there will be additional required arguments in the future. I already declared ndim and data_type as keyword-only arguments. That essentially provides the dict-like semantics that you proposed, but with typing. Codecs can add a **_kwargs argument to their signature to eat up unknown arguments.
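
A sketch of what that could look like for the blosc example. The attribute and argument names come from the discussion above; the real implementation may differ:

```python
# Keyword-only evolve with a **_kwargs catch-all, illustrated on a toy blosc-like codec.
from dataclasses import dataclass, replace
from typing import Optional

import numpy as np


@dataclass(frozen=True)
class BloscCodec:
    typesize: Optional[int] = None

    def evolve(self, *, data_type: np.dtype, **_kwargs) -> "BloscCodec":
        # Fill in typesize from the array's dtype if the user didn't set it explicitly.
        if self.typesize is None:
            return replace(self, typesize=data_type.itemsize)
        return self


codec = BloscCodec().evolve(ndim=3, data_type=np.dtype("float32"))  # typesize becomes 4
```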

@d-v-b
Contributor

d-v-b commented Jan 15, 2024

attrs calls it evolve and dataclasses calls it replace. I think it is a bit more than a copy because it changes the attributes.

ok! I'm fine with evolve.

Very likely, there will be additional required arguments in the future. I already declared ndim and data_type as keyword-only arguments. That essentially provides the dict-like semantics that you proposed, but with typing. Codecs can add a **_kwargs argument to their signature to eat up unknown arguments.

Another approach that could keep a constant function signature would be to use a single keyword argument, update, and implementations of evolve would define their own TypedDict for that kwarg. But I don't think this is a big deal either way.
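
For concreteness, the update-dict alternative might look roughly like this (hypothetical names, just to illustrate the shape of the API):

```python
# Hypothetical illustration of a per-codec TypedDict for an `update` argument.
from typing import TypedDict

import numpy as np


class CodecUpdate(TypedDict, total=False):
    ndim: int
    data_type: np.dtype


class SomeCodec:
    def evolve(self, update: CodecUpdate) -> "SomeCodec":
        # Apply only the keys present in `update` and return a new instance.
        ...
```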

One broad question: what do you think about using dtype internally instead of data_type? I know that the v3 spec uses the name data_type, but I think the user base of zarr-python is largely people working in the broader pydata ecosystem where dtype is standard, and so using data_type as a kwarg / attribute when dtype is available might be a source of friction.

@normanrz
Contributor Author

Very likely, there will be additional required arguments in the future. I already declared ndim and data_type as keyword-only arguments. That essentially provides the dict-like semantics that you proposed, but with typing. Codecs can add a **_kwargs argument to their signature to eat up unknown arguments.

Another approach that could keep a constant function signature would be to use a single keyword argument, update, and implementations of evolve would define their own TypedDict for that kwarg. But I don't think this is a big deal either way.

Wouldn't that weaken the type "safety"?

One broad question: what do you think about using dtype internally instead of data_type? I know that the v3 spec uses the name data_type, but I think the user base of zarr-python is largely people working in the broader pydata ecosystem where dtype is standard, and so using data_type as a kwarg / attribute when dtype is available might be a source of friction.

In zarrita, I only used dtype if it actually is an np.dtype. While data_type and dtype are similar, they are not exactly the same. Notably, the dtype also contains information about endianness. In the future, there will likely be data types that are not 1-to-1 translatable to dtypes.

@d-v-b
Contributor

d-v-b commented Jan 15, 2024

In zarrita, I only used dtype if it actually is an np.dtype. While data_type and dtype are similar, they are not exactly the same. Notably, the dtype also contains information about endianness. In the future, there will likely be data types that are not 1-to-1 translatable to dtypes.

Does the byteorder attribute of an np.dtype object not convey the endianness? And more broadly, I'm a little nervous about having two representations (dtype and data_type) of the same thing (same in the conceptual sense), as I can see this becoming a point of confusion. If data_type does end up using extra attributes beyond those present on np.dtype, I think the best way to do this would be by implementing an np.dtype-like class that either inherits from np.dtype or acts like it inherits from np.dtype, and I think we should call this attribute dtype inside the API, even if it's called data_type in the metadata document. In my opinion, there should probably be just one place where we convert an attribute called data_type to an attribute called dtype, and that's in the function that consumes or generates a metadata document.

@normanrz
Contributor Author

Does the byteorder attribute of an np.dtype object not convey the endianness?

It is the other way around: dtype has endianness via byteorder, data_type does not know about endianness.
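
To illustrate the distinction (a toy example, not code from this PR): a v3 data_type such as "float64" says nothing about byte order, whereas a numpy dtype carries it, so one data_type can correspond to more than one dtype.

```python
import numpy as np

# Both of these correspond to the v3 data_type "float64",
# but they are distinct numpy dtypes because byte order is part of the dtype.
little = np.dtype("<f8")
big = np.dtype(">f8")

assert little != big
assert little.newbyteorder(">") == big
```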

@d-v-b
Contributor

d-v-b commented Jan 15, 2024

Very likely, there will be additional required arguments in the future. I already declared ndim and data_type as keyword-only arguments. That essentially provides the dict-like semantics that you proposed, but with typing. Codecs can add a **_kwargs argument to their signature to eat up unknown arguments.

Another approach that could keep a constant function signature would be to use a single keyword argument, update, and implementations of evolve would define their own TypedDict for that kwarg. But I don't think this is a big deal either way.

Wouldn't that weaken the type "safety"?

In what way? I had a think about this, and a fundamental issue is that if the evolve function itself takes any keyword arguments (e.g., deep, to specify whether a deep or shallow copy should be used), then we need those keyword arguments to be separate from the names of the attributes we are updating. Putting all the updated attributes in their own dict avoids potential name collisions.

Does the byteorder attribute of an np.dtype object not convey the endianness?

It is the other way around: dtype has endianness via byteorder, data_type does not know about endianness.

Aha, sorry for misreading you. In that case, if np.dtype is semantically "broader" than data_type, then it seems OK to use np.dtype internally, unless there's something we need that it can't provide?

    ) -> Tuple[ArrayBytesCodec, ArraySpec]:
        for codec, chunk_metadata in codecs:
            if isinstance(codec, ArrayBytesCodec):
                return (codec, chunk_metadata)
Member

@normanrz - can you comment on whether we have a check elsewhere to make sure we never have two ArrayBytesCodecs in the array metadata object?
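
For illustration, such a check could look roughly like the sketch below; whether and where zarr-python actually enforces this is exactly the open question here.

```python
# Hypothetical validation; assumes the ArrayBytesCodec base class from this PR is in scope.
def check_exactly_one_array_bytes_codec(codecs) -> None:
    n = sum(isinstance(codec, ArrayBytesCodec) for codec in codecs)
    if n != 1:
        raise ValueError(
            f"Expected exactly one ArrayBytesCodec in the codec pipeline, found {n}."
        )
```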

        chunk_metadata: ArraySpec,
        runtime_configuration: RuntimeConfiguration,
    ) -> np.ndarray:
        codecs = list(self._codecs_with_resolved_metadata(chunk_metadata))[::-1]
Member

Do you think it would be easier to reason about this if we split the codec pipeline into three groups:

  1. BytesBytesCodecs
  2. BytesArrayCodecs
  3. ArrayArrayCodecs

That would cut down on the number of instance checks later on.
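
A sketch of what that partitioning could look like (illustrative only, assuming the three base classes from this PR are in scope):

```python
# Partition a codec pipeline into the three codec kinds up front,
# so later code doesn't need repeated isinstance checks.
def partition_codecs(codecs):
    array_array = [c for c in codecs if isinstance(c, ArrayArrayCodec)]
    array_bytes = [c for c in codecs if isinstance(c, ArrayBytesCodec)]
    bytes_bytes = [c for c in codecs if isinstance(c, BytesBytesCodec)]
    return array_array, array_bytes, bytes_bytes
```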

Contributor

@d-v-b d-v-b Jan 29, 2024

Alternatively, could we do a tagged union by adding a codec_type attribute with type Literal["BytesBytes", "BytesArray", "ArrayArray"]? Then we just need one class definition and attribute checks instead of isinstance. (Edit: we probably need three classes, because the function signatures of encode / decode are different, but I do think putting a codec_type attribute on those classes gives us discrimination that's a bit cleaner than isinstance checks.)
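
A sketch of the tagged-union idea (illustrative; class names and literal values mirror the comment above, not the merged code):

```python
# Discriminate codec kinds via a codec_type attribute instead of isinstance checks.
from dataclasses import dataclass
from typing import Literal


@dataclass
class ArrayArrayCodec:
    codec_type: Literal["ArrayArray"] = "ArrayArray"


@dataclass
class ArrayBytesCodec:
    codec_type: Literal["BytesArray"] = "BytesArray"


@dataclass
class BytesBytesCodec:
    codec_type: Literal["BytesBytes"] = "BytesBytes"


def is_array_array(codec) -> bool:
    return codec.codec_type == "ArrayArray"
```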

Contributor

Ah, my proposal is actually not an alternative to @jhamman's idea. I think we could do both of these things.

Contributor

Worth noting that the proposal to partition the codecs by type actually brings us back to the design of hdf5 / v2, where there is a separate filter pipeline (filters basically being ArrayArray transformations).

Contributor Author

Not sure it makes anything more readable: b825b4f

Contributor Author

Alternatively, could we do a tagged union by adding a codec_type attribute with type Literal["BytesBytes", "BytesArray", "ArrayArray"]? Then we just need one class definition and attribute checks instead of isinstance. (Edit: we probably need three classes, because the function signatures of encode / decode are different, but I do think putting a codec_type attribute on those classes gives us discrimination that's a bit cleaner than isinstance checks.)

I think isinstance actually works pretty well here (from a developer and mypy point of view). Why would an attribute check be better?

Contributor

@d-v-b d-v-b Jan 30, 2024

An attribute check would be better if we wanted to enable developers to use codecs without requiring that they inherit from our base classes (we could also achieve this with structural subtyping). And isinstance(x, ArrayArrayCodec) is a teeny bit more work to read / write than x.codec_type == "ArrayArray". That being said, I don't think either of these complaints should block your efforts here, so feel free to ignore.

zarr/v3/metadata.py (outdated review thread, resolved)
            if chunk_bytes_maybe is None:
                return None
            chunk_bytes = chunk_bytes_maybe

        return chunk_bytes

    def compute_encoded_size(self, byte_length: int) -> int:
        return reduce(lambda acc, codec: codec.compute_encoded_size(acc), self.codecs, byte_length)

    async def encode_partial(
Contributor

What would the consequences be of removing encode, renaming encode_partial to encode, and using exceptions for codecs that can't do partial encoding (i.e., for these codecs, if selection is anything other than a full selection, raise an exception, which the caller can handle)? Logically, encode_partial and decode_partial are just more general versions of encode and decode, so it feels safe to rely on this fact in the API design.

Contributor Author

I think encode and encode_partial are different enough to keep them separate. encode_partial has a number of limitations, the main one being that it currently only works with a single codec. Also, encode_partial takes a reference to the store and loads the data itself; encode takes a buffer.
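
For context, the difference in shape between the two operations, as described above, is roughly the following (illustrative signatures, not the exact PR API):

```python
# encode works on an in-memory buffer; encode_partial is handed a store reference
# and a selection, and reads/writes the affected region itself.
from typing import Optional

import numpy as np


class SomeShardingLikeCodec:
    async def encode(self, chunk_array: np.ndarray, chunk_metadata) -> Optional[bytes]:
        ...

    async def encode_partial(self, store_path, chunk_array: np.ndarray, selection, chunk_metadata) -> None:
        ...
```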

zarr/v3/codecs/__init__.py (outdated review thread, resolved)
@jhamman
Member

jhamman commented Feb 6, 2024

@normanrz - a quick rebase and this should be good to go.

@normanrz normanrz force-pushed the codecs-without-array-metadata branch from c503994 to b6a6d41 Compare February 7, 2024 13:55
@normanrz normanrz merged commit 5e85500 into zarr-developers:v3 Feb 7, 2024
7 checks passed
@normanrz normanrz deleted the codecs-without-array-metadata branch February 7, 2024 13:57
@jhamman jhamman added this to the 3.0.0.alpha milestone Apr 22, 2024