
Arrow Support #8329

Open · wiredfool opened this issue Aug 25, 2024 · 11 comments

wiredfool commented Aug 25, 2024

Following on to some of the discussion in #1888, specifically here: #1888 (comment)

Rationale

Arrow is the emerging memory layout for zero-copy sharing of data in the new data ecosystem. It is an uncompressed columnar format, specifically designed for interop between different implementations and languages. It can be viewed as the spiritual successor to the existing numpy array interface that we provide. The arrow format is supported by numpy 2, pandas 2, polars, pyarrow, arro3, and others in the python ecosystem.

What Support means

  • The ability to export an image to an arrow array and read/process that data with zero memory copies
  • The ability to read an image from arrow array storage with zero copies.

Technical Details

(Apache docs are here: https://arrow.apache.org/docs/format/Columnar.html)

An Arrow Schema is a set of metadata, containing type information, and potentially child schemas. An Arrow Array has an (implicitly) associated schema, metadata about the length of the storage, as well as a buffer of a contiguously allocated chunk of memory for the data. The Arrow Array will generally have the same parent/child arrangement as the schema structure.

  • obj.__arrow_c_schema__() must return a PyCapsule named arrow_schema containing an arrow schema struct.
  • obj.__arrow_c_array__(requested_schema=None) must return a tuple of the schema capsule above and a PyCapsule named arrow_array containing an arrow array struct. The schema argument is advisory; the caller may request a specific format.

The lifetime of the Schema and Array structures is dependent on the caller -- so there are release callbacks that must be called when the caller is done with the memory. This complicates the lifetime of our image storage.

We have two cases at the moment:

  1. single channel image
  2. multichannel image

A single channel image can be encoded as a single array of height*width items, using the type of the underlying storage. (e.g., uint8/int32/float32).

A multichannel image can be encoded in a similar manner, using 4*height*width items in the array. The caller would be responsible for knowing that there are 4 elements per pixel. It's also possible to use a parent type of a FixedSizeList of 4 elements, with a child array of 4*height*width elements. Fixed-size lists are statically defined, so the underlying array is still the same contiguous block of memory.

Flat:

<pyarrow.lib.FixedSizeListArray object at 0x106ad4280>
[
    20,
    21,
    67,
    255,
    17,
    18,
    62,
    255,
...

Nested:

<pyarrow.lib.FixedSizeListArray object at 0x106ad4280>
[
  [
    20,
    21,
    67,
    255
  ],
  [
    17,
    18,
    62,
    255
  ],
  ...

An alternate encoding of a multichannel image would be to use a struct of channels, e.g. Struct[r,g,b,a]. This would require 4 child arrays, each allocated in a contiguous chunk, as in planar image storage. This is not compatible with our current storage.
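As a plain-Python sketch (no Arrow required, names for illustration only): deriving struct-of-channels child arrays from Pillow's interleaved storage necessarily copies, since each band must end up in its own contiguous chunk:

```python
# Two RGBA pixels in interleaved storage, as Pillow keeps them.
interleaved = [20, 21, 67, 255, 17, 18, 62, 255]
channels = 4

# Struct-of-channels (planar) layout: one contiguous child array per
# band. Slicing with a stride produces copies, not views -- which is
# why this layout is incompatible with the current interleaved storage.
r, g, b, a = (interleaved[c::channels] for c in range(channels))

assert r == [20, 17]
assert b == [67, 62]
assert a == [255, 255]
```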

While our core storage is generally compatible with this layout, there are three issues:

  1. The block allocator in ImagingAllocateArray packs a number of scanlines into each 16mb block, leaving empty space at the end of the block. This limits zero-copy export to images whose data fits within a single 16mb block. This is not an issue with the single-chunk ImagingAllocateBlock, which allocates the image in one chunk. (Note: despite the naming, arrow arrays fully work with the block allocator; it's the array allocator that has this limit. Naming is hard.) It may be possible to work around this with the streaming interface.
  2. Some modes have line length padding (BGR;15, BGR;24), and will not work without copying.
  3. Some modes have ignored pixel bands (LA/PA). This is a documentation issue for consumers.
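To put a number on issue 1, back-of-envelope arithmetic (plain Python; the 16mb figure is from the allocator description above, and the actual packing in ImagingAllocateArray may differ slightly):

```python
# How many whole scanlines of a 4-byte-per-pixel image fit in one block,
# i.e. the tallest image at a given width that a single block can hold.
BLOCK_SIZE = 16 * 1024 * 1024   # 16mb, per the allocator described above

def lines_per_block(width, bytes_per_pixel=4):
    """Whole scanlines that fit in one 16mb block (illustrative helper,
    not a Pillow API)."""
    return BLOCK_SIZE // (width * bytes_per_pixel)

# A 2048-wide RGBA image fits 2048 scanlines per block, so anything
# taller spans multiple blocks and can't be exported as one array.
assert lines_per_block(2048) == 2048
assert lines_per_block(4096) == 1024
```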

Implementation Notes

The PR #8330 implements Pillow->Arrow for images that don't trip the above caveats.

There are no additional build or runtime dependencies. The arrow structures are designed to be copied into a header and used from there. (licensing is not an issue as those fragments are under an Apache License). There is an additional test dependency on PyArrow at the moment. In theory, numpy 2 could be used for this, but I'm not sure if we'd be testing the legacy array access or arrow access.

The lifetime of the core imaging struct is now separated from the imaging Python Object. There's effectively a refcount implemented for this -- there's an initial 1 for the image->im reference, every arrow array that references an image increments it, and calling ImagingDelete decrements it.
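A toy Python model of that lifetime scheme (the class and method names are illustrative, not Pillow's actual C API):

```python
class ImagingCore:
    """Toy model of the refcounted core struct: the count starts at 1
    for the image->im reference, each exported arrow array adds one,
    and the storage is freed only when the count reaches zero."""

    def __init__(self):
        self.refcount = 1     # the initial image->im reference
        self.freed = False

    def export_arrow_array(self):
        self.refcount += 1    # an arrow array now keeps the storage alive

    def release(self):        # ImagingDelete, or an array release callback
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True

im = ImagingCore()
im.export_arrow_array()       # exported to arrow
im.release()                  # Python image deleted; array still alive
assert not im.freed
im.release()                  # arrow release callback fires
assert im.freed
```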

Outstanding Questions

For consumers of data -- what's the most useful format?

  • Flat array arr[(y*(width)+x)*4 + channel]
  • or Fixed Pixel array arr[y*(width)+x][channel]?
  • Would it make sense to embed this into a set of FixedArrays that are a line length, arr[y][x][channel]?
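For comparison, the three candidate layouts address the same interleaved buffer as follows (plain-Python sketch over dummy data; illustrative names only):

```python
# A width x height RGBA image as one interleaved buffer of dummy samples.
width, height, n = 2, 2, 4
buf = list(range(width * height * n))   # 16 samples: 0..15

# 1. Flat array: arr[(y*width + x)*4 + channel]
def flat(y, x, c):
    return buf[(y * width + x) * n + c]

# 2. Fixed pixel array: arr[y*width + x][channel]
pixels = [buf[i:i + n] for i in range(0, len(buf), n)]

# 3. Nested line arrays: arr[y][x][channel]
lines = [pixels[y * width:(y + 1) * width] for y in range(height)]

# All three address the same sample in the same underlying buffer.
assert flat(1, 0, 3) == pixels[1 * width + 0][3] == lines[1][0][3] == 11
```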
Yay295 commented Aug 25, 2024

The Variable-size Binary View Layout supports multiple data buffers, though it seems like that's designed more for a list of strings, so I'm not sure how it would handle image data.

@wiredfool

I don't see where a variable length structure would really gain us anything -- we'd have to construct an offset buffer, we'd lose actual types, and we still wouldn't be able to splice multiple allocation blocks together.

Yay295 commented Aug 25, 2024

Well, like I said, I'm not sure how it would handle image data. I just noticed that that seems to be the only way to provide multiple data buffers. Arrow requiring all data to be in a single contiguous buffer just seems absurd to me.

Yay295 commented Aug 25, 2024

It looks like PyArrow has a way to handle that: https://arrow.apache.org/docs/python/data.html#tables

Also, it might not be efficient, but there's a way to convert a NumPy array to an Arrow array. Since Pillow already supports NumPy, this might be an easy way to get something working before doing things in C to make it faster.

wiredfool commented Aug 25, 2024

@Yay295 I think from a utility point of view, we'd want to be exposing band level values. Binary chunks aren't going to be nearly as useful if they have to be interpreted. There are also some alignment issues that would come from that, at least for large binaries (64 byte boundaries). It also wouldn't solve the core issue of the storage needing to be continuous.

At the moment, the np array calls require a memory copy, e.g. a tobytes call into a buffer that's then shared. The trouble here is that the memory copy is only required for the biggest images, which is kind of the wrong way to go. They'd already work if they were allocated using imaging._new_block().

It looks like what PyArrow is doing with the table is effectively the __arrow_c_stream__ which returns a sequence of arrow arrays, and copies them into a single arrow array for further export. It looks like the stream and array interfaces are effectively interchangeable, so we can implement one or both of them.

@fdintino

Would there ever be a future where we might account for chroma subsampling in ImagingMemoryInstance? If so, I imagine we might also use a null arrow_band_format for that?

@wiredfool

I'd think the best way to accomplish that would be with planar image storage. My understanding of subsampling is that the resolution of one of the channels is effectively 1/2 or 1/4 of the resolution of the other bands. If we did this with planar storage, chroma would just be a uint8 image with 1/4 of the pixels.

Alternately, it could be stored as a null mapping in the validity buffer (which we're not currently handling, but which would probably be appropriate for the two- and three-channel image modes (LA/PA/RGB/HSV)). For subsampling, we could null out every nth item in a particular channel.


fdintino commented Aug 28, 2024

I think the first approach might be complicated a bit for 10- and 12-bit images (or maybe not, besides the fact that it wouldn't be a uint8 image). In case it is at all useful or relevant: libavutil in ffmpeg uses two structs, AVPixFmtDescriptor and AVComponentDescriptor (see pixdesc.h and pixdesc.c), to describe the various pixel storage formats it supports.

@kylebarron

For multi-channel images (assuming each channel has the same data type and dimensions) you could represent that as an array with type Fixed Shape Tensor.

@wiredfool

I've just put in a comment on that in here: apache/arrow#43831 (comment) -- what I don't see in a tensor is how to represent that in the PyCapsule interface, unless it's a nested set of fixed-size list (+w).

@kylebarron

what I don't see in a tensor is how to represent that in the PyCapsule interface, unless it's a nested set of fixed-size list (+w).

Yeah that's it. Plus extra extension metadata on the field
