Strategy for chunking arrays #3433

Closed
TomNicholas opened this issue Aug 10, 2022 · 12 comments
Labels
new-feature (entirely novel capabilities or strategies), question (not sure it's a bug? questions welcome)

Comments

@TomNicholas
Contributor

TomNicholas commented Aug 10, 2022

I often want to test some parallel dask code for input with a range of different chunking structures. This seems like a possible candidate for a general chunks strategy that could live in hypothesis.

We actually made a simple strategy to test various chunking structures in both xhistogram here and in rechunker here, but now I'm wondering if there is a general solution I could implement. It would also be really useful to use in xarray's tests.

Dask's rules for how it defines chunk shapes are:

We always specify a chunks argument to tell dask.array how to break up the underlying array into chunks. We can specify chunks in a variety of ways:

  1. A uniform dimension size like 1000, meaning chunks of size 1000 in each dimension

  2. A uniform chunk shape like (1000, 2000, 3000), meaning chunks of size 1000 in the first axis, 2000 in the second axis, and 3000 in the third

  3. Fully explicit sizes of all blocks along all dimensions, like ((1000, 1000, 500), (400, 400), (5, 5, 5, 5, 5))

  4. A dictionary specifying chunk size per dimension like {0: 1000, 1: 2000, 2: 3000}. This is just another way of writing the forms 2 and 3 above

  5. Xarray adds another way to specify chunks, which is the same as (4) but with string dimension names instead of integer axis numbers. We distinguish that representation by referring to it as .chunksizes instead of just .chunks. (A quick illustration of these forms follows below.)
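
For concreteness, here is a quick illustration of those forms on a small array (a sketch that assumes dask and xarray are installed; the shape and chunk sizes are chosen so every form describes the same chunking):

import numpy as np
import dask.array as da
import xarray as xr

x = np.ones((8, 6))

a1 = da.from_array(x, chunks=2)                          # form 1: uniform size along every axis
a2 = da.from_array(x, chunks=(2, 2))                     # form 2: uniform chunk shape
a3 = da.from_array(x, chunks=((2, 2, 2, 2), (2, 2, 2)))  # form 3: fully explicit block lengths
a4 = da.from_array(x, chunks={0: 2, 1: 2})               # form 4: dict keyed by axis number
assert a1.chunks == a2.chunks == a3.chunks == a4.chunks == ((2, 2, 2, 2), (2, 2, 2))

# form 5 (xarray): dict keyed by dimension name, reported back via .chunksizes
xda = xr.DataArray(x, dims=["y", "x"]).chunk({"y": 2, "x": 2})
assert dict(xda.chunksizes) == {"y": (2, 2, 2, 2), "x": (2, 2, 2)}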

I think the most generally useful form of chunk strategy should produce (3), which can then be wrapped up by the user to make (4) or (5) if desired.

My attempt so far looks like this

from typing import Optional, Tuple

import hypothesis.strategies as st


@st.composite
def chunks(
    draw: st.DrawFn,
    shape: Tuple[int, ...],
    axes: Optional[Tuple[int, ...]] = None,
    min_chunk_length: int = 1,
    max_chunk_length: Optional[int] = None,
) -> st.SearchStrategy[Tuple[Tuple[int, ...], ...]]:
    """
    Generate different chunking patterns for an N-D array of data.

    Returns chunking structure as a tuple of tuples, with each inner tuple containing
    the block lengths along one dimension of the array.

    You can limit chunking to specific axes using the `axes` kwarg.
    """

    if min_chunk_length < 1:
        raise ValueError("min_chunk_length must be >= 1")

    if axes is None:
        axes = tuple(range(len(shape)))

    chunks = []
    for axis, n in enumerate(shape):

        if max_chunk_length:
            _max_chunk_length = min(max_chunk_length, n)  # cap at the axis length
        else:
            _max_chunk_length = n

        if axes is not None and axis in axes:
            block_lengths = draw(st.integers(min_value=min_chunk_length, max_value=_max_chunk_length))
        else:
            # don't chunk along this dimension
            block_lengths = n

        chunks.append(block_lengths)

    return tuple(chunks)

However this isn't actually quite what I want: it returns (2) rather than (3). To return (3) I need to work out how to make a strategy that creates a list of integers whose values sum to a given number, e.g. chunks=((3, 3, 2),) for a 1D array of length 8. Any pointers on the best way to do that would be appreciated!

Happy to submit a PR if this is of interest!

@TomNicholas
Contributor Author

cc @jacobtomlinson because I just saw dask/dask#9262. Maybe this should live in dask instead?

@TomNicholas
Contributor Author

I need to work out how to make a strategy to create a list of integers whose value sums to a given number

Would it be really inefficient to do this by (for each axis) first generating the length of the first block, then generating the length of the next, and so on? I have no intuition for whether or not that will create a strategy that simplifies effectively either.

@honno
Member

honno commented Aug 11, 2022

To return (3) I need to work out how to make a strategy that creates a list of integers whose values sum to a given number, e.g. chunks=((3, 3, 2),) for a 1D array of length 8.

My initial, quite rudimentary solution was to work on the basis of generating the size of what you want to chunk (e.g. if axes=None, presumably just the array size), and creating a sometimes-uniform list of chunk-sizes (whether they end up uniform depends on whether the size divides evenly).

    # (assumes `draw` and `size` come from an enclosing @st.composite function)
    n_chunks = draw(st.integers(1, size))
    base_chunksize = size // n_chunks
    chunks = [base_chunksize for _ in range(n_chunks)]
    remainder = size % n_chunks
    if remainder != 0:
        if n_chunks == 1:
            i = 0
        else:
            i = draw(st.integers(0, n_chunks - 1))
        chunks[i] = chunks[i] + remainder  # dump the remainder onto one chunk

Then you can sometimes "stagger" the chunk-sizes with draw(st.booleans()), for example

    if draw(st.booleans()):
        # Note the following isn't particularly exhaustive, or would shrink well,
        # just might be a good starting point. 
        take_i = draw(st.integers(0, n_chunks - 1))
        give_i = draw(st.integers(0, n_chunks - 1).filter(lambda n: n != take_i))
        assume(chunks[take_i] > 1)
        take_amount = draw(st.integers(1, chunks[take_i] - 1))
        chunks[take_i] -= take_amount
        chunks[give_i] += take_amount

And then once you see the final chunks list, you can return it in different ways

    if all(chunksize == chunks[0] for chunksize in chunks) and draw(st.booleans()):
        return chunks[0]  # 1.
    else:
        if draw(st.booleans()):
            return tuple(chunks)  # 2.
        else:
            if draw(st.booleans()):
                return {i: chunksize for i, chunksize in enumerate(chunks)}  # 3.
            else:
                return {str(i): chunksize for i, chunksize in enumerate(chunks)}  # 4.

Anywho this was a bit of a rush job, happy to formalise/test things a bit more—just wanted to see if that's what you're going for, or if Zac is going to fly by with a better solution 😅

@jacobtomlinson

This looks great! If this is super dask specific then maybe it should live in dask. But if a chunking strategy would be broadly useful then it could live here too.

We are also discussing in dask/distributed#6806 making public testing utilities more explicitly namespaced in dask for downstream projects to use. Right now we have various utilities in different packages, but this would be a good candidate for a public utility if dask is the right place for this to live. cc @hendrikmakait

@TomNicholas
Contributor Author

Thanks @honno! That's an interesting way of solving it.

I actually did manage to make my own version last night, which basically uses a while loop to keep drawing new chunks along one axis until we reach the end of the array. This approach was inspired by the "seed and diff" idea mentioned here, so it doesn't need to call assume.

Unsure whose approach is more likely to generate faster / simplify better though?

from typing import Optional, Tuple, Union

import hypothesis.strategies as st


@st.composite
def block_lengths(
    draw: st.DrawFn,
    ax_length: int,
    min_chunk_length: int = 1,
    max_chunk_length: Optional[int] = None,
) -> st.SearchStrategy[Tuple[int, ...]]:
    """Generate different chunking patterns along one dimension of an array."""

    chunks = []
    remaining_length = ax_length
    while remaining_length > 0:
        _max_chunk_length = (
            min(remaining_length, max_chunk_length)
            if max_chunk_length
            else remaining_length
        )

        if min_chunk_length > _max_chunk_length:
            # if we are at the end of the array we have no choice but to use a smaller chunk
            chunk = remaining_length
        else:
            chunk = draw(
                st.integers(min_value=min_chunk_length, max_value=_max_chunk_length)
            )

        chunks.append(chunk)
        remaining_length = remaining_length - chunk

    return tuple(chunks)


@st.composite
def chunks(
    draw: st.DrawFn,
    shape: Tuple[int, ...],
    axes: Optional[Union[int, Tuple[int, ...]]] = None,
    min_chunk_length: int = 1,
    max_chunk_length: Optional[int] = None,
) -> st.SearchStrategy[Tuple[Tuple[int, ...], ...]]:
    """
    Generates different chunking patterns for an N-D array with a given shape.

    Returns chunking structure as a tuple of tuples of ints, with each inner tuple containing
    the block lengths along one dimension of the array.

    You can limit chunking to specific axes using the `axes` kwarg, and specify minimum and
    maximum block lengths.

    Requires the hypothesis package to be installed.

    Parameters
    ----------
    shape : tuple of ints
        Shape of the array for which you want to generate a chunking pattern.
    axes : None or int or tuple of ints, optional
        ...
    min_chunk_length : int, default is 1
        Minimum chunk length to use along all axes.
    max_chunk_length: int, optional
        Maximum chunk length to use along all axes.
        Default is that the chunk can be as long as the length of the array along that axis.

    Examples
    --------
    Chunking along all axes by default

    >>> chunks(shape=(2, 3)).example()
    ((1, 1), (1, 2))

    Chunking only along the second axis

    >>> chunks(shape=(2, 3), axes=1).example()
    ((2,), (1, 1, 1))

    Minimum size chunks of length 2 along all axes

    >>> chunks(shape=(2, 3), min_chunk_length=2).example()
    ((2,), (2, 1))

    Smallest possible chunks along all axes

    >>> chunks(shape=(2, 3), max_chunk_length=1).example()
    ((1, 1), (1, 1, 1))

    Maximum size chunks along all axes

    >>> chunks(shape=(2, 3), axes=()).example()
    ((2,), (3,))
    """

    if not isinstance(min_chunk_length, int) or min_chunk_length < 1:
        raise ValueError("min_chunk_length must be an integer >= 1")

    if max_chunk_length is not None:
        if not isinstance(max_chunk_length, int) or max_chunk_length < 1:
            raise ValueError("max_chunk_length must be an integer >= 1")

    if axes is None:
        axes = tuple(range(len(shape)))
    elif isinstance(axes, int):
        axes = (axes,)

    chunks = []
    for axis, ax_length in enumerate(shape):

        _max_chunk_length = (
            min(max_chunk_length, ax_length) if max_chunk_length else ax_length
        )

        if axes is not None and axis in axes:
            block_lengths_along_ax = draw(
                block_lengths(
                    ax_length,
                    min_chunk_length=min_chunk_length,
                    max_chunk_length=_max_chunk_length,
                )
            )
        else:
            # don't chunk along this dimension
            block_lengths_along_ax = (ax_length,)

        chunks.append(block_lengths_along_ax)

    return tuple(chunks)
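
For reference, a quick property test using this strategy might look like the following sketch (assuming the chunks strategy above is in scope; the shape and bound here are arbitrary):

from hypothesis import given


@given(chunks(shape=(8, 6), max_chunk_length=4))
def test_block_lengths_sum_to_shape(chunk_pattern):
    # Each inner tuple of block lengths should sum to the corresponding axis length,
    # and every block should respect the requested bounds.
    assert tuple(sum(blocks) for blocks in chunk_pattern) == (8, 6)
    assert all(1 <= block <= 4 for axis_blocks in chunk_pattern for block in axis_blocks)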

@honno
Member

honno commented Aug 11, 2022

I actually did manage to make my own version last night, which basically uses a while loop to keep drawing new chunks along one axis until we reach the end of the array. This approach was inspired by the "seed and diff" idea mentioned here, so it doesn't need to call assume.

Ayup my use of assume was laziness heh (in my example, ideally you'd check the current chunksize first before taking from it). No glaring problems on your solution—it seems pretty thorough!

As it's drawing so much it might be problematic on larger arrays, but really that's only something to deal with if you start to notice it. In that vein, the API of specifying chunk-size bounds might become problematic for larger arrays too, where instead something like specifying bounds on the number of chunks could become preferable (a concern if you were to make this public API).
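
For illustration, a hypothetical shape such a number-of-chunks API could take for a single axis (the helper name and parameters below are made up for this sketch, not proposed anywhere above):

from typing import Optional, Tuple

import hypothesis.strategies as st


@st.composite
def block_lengths_by_count(
    draw: st.DrawFn,
    ax_length: int,
    min_chunks: int = 1,
    max_chunks: Optional[int] = None,
) -> st.SearchStrategy[Tuple[int, ...]]:
    """Bound the number of chunks per axis rather than the chunk lengths."""
    # There can be at most ax_length chunks of length >= 1 along this axis.
    upper = min(max_chunks, ax_length) if max_chunks is not None else ax_length
    n_chunks = draw(st.integers(min_chunks, upper))

    if n_chunks == 1:
        return (ax_length,)

    # Choose n_chunks - 1 distinct cut points inside the axis, then diff them to
    # recover block lengths that sum to ax_length by construction.
    cuts = draw(
        st.lists(
            st.integers(1, ax_length - 1),
            min_size=n_chunks - 1,
            max_size=n_chunks - 1,
            unique=True,
        ).map(sorted)
    )
    bounds = [0, *cuts, ax_length]
    return tuple(end - start for start, end in zip(bounds, bounds[1:]))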

@Zac-HD
Member

Zac-HD commented Aug 11, 2022

OK, more seriously:

  • These strategies look pretty good to me, but shrinking performance (i.e. number of attempted shrinks and quality of final result) is going to be a challenge, just because there are a lot of draws which depend on another shape which we're trying to shrink independently.
    • There are mitigations available via private Hypothesis internals, but it's a fundamental issue.
  • I'm pretty nervous about exposing strategies via Dask and Xarray, just because those packages have relatively slow update cycles and the strategy APIs you're talking about are far from mature.
    • How about an owned-by-maintainers "third party" extension, hypothesis-dask or hypothesis-xarray?
    • I'd also be very happy to adopt this as hypothesis.extra.___ once we work out what it should include
  • For choosing chunk sizes, I'd think about testing different representations and different chunk schemes separately - the first should only ever be relevant briefly before you transform to the fully-explicit representation, right?
    • Generate the number of chunks and the offset-from-uniform simultaneously, with lists(st.integers(-3, 3), ...) - then create a chunk per element (plus one) and adjust the size by the amount given. Do that for each dimension, and you're done (see the sketch after this list).
    • Might want to make no-staggering more likely (just(0)|), and use swarm-testing tricks to mess with the distribution some more, but these are after-it-works considerations
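
If it helps, a minimal sketch of that offset-from-uniform idea for a single axis could look something like this (the clamping details and letting the final chunk absorb the leftover are my own assumptions, not spelled out above):

from typing import Tuple

import hypothesis.strategies as st


@st.composite
def block_lengths_via_offsets(
    draw: st.DrawFn, ax_length: int
) -> st.SearchStrategy[Tuple[int, ...]]:
    """Offset-from-uniform chunking along one axis."""
    # One offset per extra chunk: an empty list means a single chunk spanning the axis.
    offsets = draw(st.lists(st.integers(-3, 3), max_size=max(ax_length - 1, 0)))
    n_chunks = len(offsets) + 1
    base = ax_length // n_chunks

    sizes = []
    remaining = ax_length
    for off in offsets:
        # Perturb the uniform size, but leave at least one element for every chunk
        # still to come (including the final one).
        cap = remaining - (n_chunks - len(sizes) - 1)
        size = min(max(base + off, 1), cap)
        sizes.append(size)
        remaining -= size
    sizes.append(remaining)  # the final chunk absorbs whatever is left
    return tuple(sizes)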

@Zac-HD added the question (not sure it's a bug? questions welcome) and new-feature (entirely novel capabilities or strategies) labels Aug 11, 2022
@TomNicholas
Contributor Author

Thanks for your input @Zac-HD!

I'm pretty nervous about exposing strategies via Dask and Xarray, just because those packages have relatively slow update cycles and the strategy APIs you're talking about are far from mature.

I think I was being overly-keen to contribute to hypothesis itself with my original suggestion, and there is no reason why these kinds of strategies can't just live in dask or xarray. (xref dask/dask#9374 and pydata/xarray#6908)

For choosing chunk sizes, I'd think about testing different representations and different chunk schemes separately - the first should only ever be relevant briefly before you transform to the fully-explicit representation, right?

I don't think I personally am that interested in testing different representations of chunking - here I've already chosen the most explicit representation in order to find bugs in computational code which acts on arrays that have been correctly chunked using that explicit representation.

  • Generate the number of chunks and the offset-from-uniform simultaneously, with lists(st.integers(-3, 3), ...) - then create a chunk per element (plus one) and adjust the size by the amount given. Do that for each dimension, and you're done.

That's an interesting idea... The purpose being that nesting builtin strategies directly should hopefully give better shrinking performance?

@Zac-HD
Member

Zac-HD commented Aug 11, 2022

  • Generate the number of chunks and the offset-from-uniform simultaneously, with lists(st.integers(-3, 3), ...) - then create a chunk per element (plus one) and adjust the size by the amount given. Do that for each dimension, and you're done.

That's an interesting idea... The purpose being that nesting builtin strategies directly should hopefully give better shrinking performance?

I'd think of this as "minimizing the number of dependencies between draws"; our guide to writing strategies with good shrinking behaviour calls it 'keeping things local'.

@jacobtomlinson

jacobtomlinson commented Aug 12, 2022

I'm pretty nervous about exposing strategies via Dask and Xarray, just because those packages have relatively slow update cycles and the strategy APIs you're talking about are far from mature.

Dask releases bi-weekly; is that too slow?

How about an owned-by-maintainers "third party" extension

I'm apprehensive about introducing more packages, dependencies, etc. to maintain. But this could be a good fallback if bi-weekly is too slow.

@Zac-HD
Member

Zac-HD commented Aug 12, 2022

I think I just underestimated the enthusiasm and expertise that y'all bring to bear, so shipping strategies directly as part of Dask or Xarray seems fine 👍

@Zac-HD
Member

Zac-HD commented Aug 17, 2022

I think I was being overly-keen to contribute to hypothesis itself with my original suggestion, and there is no reason why these kinds of strategies can't just live in dask or xarray. (xref dask/dask#9374 and pydata/xarray#6908)

Since these projects seem to be going well, I'll close this issue as "no action required in Hypothesis". Thread remains open for discussion if that's useful though!

@Zac-HD closed this as completed Aug 17, 2022