Add array storage helpers #2065

Open
wants to merge 13 commits into
base: v3

d-v-b
Contributor

@d-v-b d-v-b commented Aug 3, 2024

This PR adds nchunks, nbytes, and nchunks_initialized functionality from 2.x.

closes #2027
depends on #2064

details

Adds the following to array.py:

  • (AsyncArray / Array).nchunks : deprecated, the total number of chunks in the array. exists for 2.xx compatibility.
  • (AsyncArray / Array).cdata_shape : deprecated, the shape of the chunk grid. exists for 2.xx compatibility.
  • (AsyncArray / Array).nbytes : the total number of bytes that the array can store
  • (AsyncArray / Array)._iter_chunk_coords : an iterator over tuples of ints which represent positions in the chunk grid
  • (AsyncArray / Array)._iter_chunk_regions : an iterator over slices which represent the contiguous array region spanned by each chunk
  • (AsyncArray / Array)._iter_chunk_keys : an iterator over strings which represent the paths in storage for all the chunks
  • chunks_initialized(array): a function that takes an array and returns a tuple of the chunk keys for that array that exist in storage. this also has tests.
  • nchunks_initialized(array): deprecated, a function that calls len(chunks_initialized(array)). this exists for 2.xx compatibility.

All of the above _iter_chunk_* methods should be considered private and provisional. I added them because their functionality is valuable, but eventually I think we will have a better array API that renders these methods obsolete. If we think these are cluttering the array API, I'd be happy splitting them off into stand-alone functions.
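For readers unfamiliar with the 2.x attributes, here is a minimal sketch of the arithmetic behind cdata_shape, nchunks, and nbytes. It is not the code in this PR: the function names, the plain-tuple inputs, and the itemsize parameter are assumptions made for illustration.

```python
import math


def cdata_shape(array_shape, chunk_shape):
    """Shape of the chunk grid: chunks needed along each axis, rounded up."""
    # -(-s // c) is integer ceiling division, so partial edge chunks count.
    return tuple(-(-s // c) for s, c in zip(array_shape, chunk_shape))


def nchunks(array_shape, chunk_shape):
    """Total number of chunks: the product of the chunk grid shape."""
    return math.prod(cdata_shape(array_shape, chunk_shape))


def nbytes(array_shape, itemsize):
    """Bytes needed to hold every element uncompressed (assumed meaning)."""
    return math.prod(array_shape) * itemsize


array_shape, chunk_shape = (10, 7), (4, 3)
grid = cdata_shape(array_shape, chunk_shape)   # 3 chunks along each axis
total = nchunks(array_shape, chunk_shape)      # 3 * 3 chunks
size = nbytes(array_shape, itemsize=8)         # 70 elements of 8 bytes each
```

The rounding up matters: a 10 x 7 array with 4 x 3 chunks has partial chunks on both trailing edges, and those still count toward nchunks.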

  • adds a function iter_grid to indexing.py, which provides lexicographic iteration over the elements of a bounded, N-dimensional, positive grid (e.g., a grid of chunks).
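A rough sketch of what lexicographic grid iteration can look like, and how chunk regions might be derived from it. The names iter_grid_sketch and chunk_regions_sketch are hypothetical, the real iter_grid signature may differ, and clipping the last chunk to the array bounds is an assumption of this sketch.

```python
import itertools


def iter_grid_sketch(grid_shape):
    """Yield every coordinate of a bounded grid in lexicographic order."""
    # itertools.product varies the last axis fastest, which is lexicographic.
    return itertools.product(*(range(n) for n in grid_shape))


def chunk_regions_sketch(array_shape, chunk_shape):
    """Yield, per chunk, a tuple of slices covering that chunk's region."""
    grid_shape = tuple(-(-s // c) for s, c in zip(array_shape, chunk_shape))
    for coords in iter_grid_sketch(grid_shape):
        # Assumption: edge chunks are clipped to the array bounds.
        yield tuple(
            slice(i * c, min((i + 1) * c, s))
            for i, c, s in zip(coords, chunk_shape, array_shape)
        )


coords = list(iter_grid_sketch((2, 3)))
regions = list(chunk_regions_sketch((5,), (2,)))
```

Here coords starts at (0, 0) and ends at (1, 2), with the second axis advancing first; regions holds three one-dimensional slices, the last one shorter than the chunk size.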

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b requested review from jhamman and normanrz August 3, 2024 14:56
@d-v-b
Contributor Author

d-v-b commented Aug 3, 2024

@tomwhite let me know if this looks workable for you

@tomwhite
Contributor

tomwhite commented Aug 5, 2024

Thanks @d-v-b this looks great!

I wondered why you deprecated nchunks (and nchunks_initialized) though? The number of chunks in an array is something that should always be well-defined. Also, deprecating something usually means there's a better alternative, but I don't see one here.

@d-v-b
Contributor Author

d-v-b commented Aug 5, 2024

> I wondered why you deprecated nchunks (and nchunks_initialized) though? The number of chunks in an array is something that should always be well-defined. Also, deprecating something usually means there's a better alternative, but I don't see one here.

my thinking for this is twofold:

  • with the new chunks_initialized function that gives the names of the initialized chunks, one can easily do len(chunks_initialized(...)), i.e. we don't need a separate function to express the composition of chunks_initialized and len. similarly, nchunks is merely len(array._iter_chunk_keys). If this logic is unsound, or these deprecation warnings are a problem, then we can remove them, but see the second point:
  • we haven't yet figured out how we are going to express sharded arrays in the top-level array API, and I think those decisions might require rethinking how we express chunking more broadly. see conversations happening in this discussion. Until we solve that problem, I don't feel comfortable committing to any "here's what your chunks are like" APIs, especially if they are APIs that developed pre-sharding. hence adding private methods in this PR, and the deprecation warnings.

does this check out? I'm sorry if the warnings are inconvenient, but I really would like to find a proper expression of v3 semantics on the Array class and I worry that a blanket policy of forward-propagating v2-isms could be a hindrance to that effort.
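To make the composition argument concrete, here is a small sketch that uses a plain dict as a stand-in for a store. Everything in it is an assumption for illustration: the function names, the dict-as-store, and the "c/0/1" key layout. It is not the implementation in this PR.

```python
import itertools


def chunk_keys_sketch(grid_shape, prefix="c"):
    """Yield one storage key per position in the chunk grid (assumed layout)."""
    for coords in itertools.product(*(range(n) for n in grid_shape)):
        yield "/".join((prefix, *map(str, coords)))


def chunks_initialized_sketch(store, grid_shape):
    """Return the chunk keys that actually exist in the store."""
    return tuple(key for key in chunk_keys_sketch(grid_shape) if key in store)


# A dict stands in for a store; only two of the four chunks were written.
store = {"zarr.json": b"{}", "c/0/0": b"\x00", "c/1/1": b"\x00"}

initialized = chunks_initialized_sketch(store, (2, 2))
n_initialized = len(initialized)

# An iterator has no len(), so counting all chunks means consuming it.
n_total = sum(1 for _ in chunk_keys_sketch((2, 2)))
```

Because the initialized keys come back as a tuple, len() applies directly. The full key listing is lazy, so it has to be materialised or summed over to get a count.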

@d-v-b
Contributor Author

d-v-b commented Aug 5, 2024

> The number of chunks in an array is something that should always be well-defined.

to expand on this: v3 introduces two kinds of chunks, read chunks and write chunks. the number of read chunks may not equal the number of write chunks, so where v2 had a single nchunks quantity, v3 has two possible answers. that's why it is not straightforward to commit to this aspect of the array API.
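A worked example of the two counts, assuming a sharded layout in which each shard is a write chunk and each inner chunk is a read chunk. The shapes and the helper name count_chunks are made up for illustration.

```python
import math


def count_chunks(array_shape, chunk_shape):
    """Number of chunks of the given shape needed to tile the array."""
    return math.prod(-(-s // c) for s, c in zip(array_shape, chunk_shape))


array_shape = (100, 100)
shard_shape = (50, 50)   # assumed write-chunk shape (one stored object each)
inner_shape = (10, 10)   # assumed read-chunk shape (smallest readable unit)

n_write_chunks = count_chunks(array_shape, shard_shape)
n_read_chunks = count_chunks(array_shape, inner_shape)
```

With these shapes the array has 4 write chunks but 100 read chunks, so a single nchunks attribute would have to pick one of the two numbers.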

    """
    for key, value in dict.items():
        await self.set(key, value)
    return None
Member

IIUC, this is just a helper for our test suite, right? I think I'd favor moving this to either a utility function or a public API (something like Store.set_many()). Or perhaps our set_partial_values signature needs to evolve to handle this type of thing.

Contributor Author

right now it's just being used in the test suite, so i'd be happy splitting it off into a private stand-alone function that takes a store. the dict part is just for convenience; what we ultimately will need for proper batching is just a store method that takes tuple[tuple[key, value], ...], which I don't think we have yet?

@jhamman jhamman added the V3 Related to compatibility with V3 spec label Aug 9, 2024
Development

Successfully merging this pull request may close these issues.

[v3] Missing array attributes: nbytes, nchunks, nchunks_initialized