
Arrow Support #8329

Open · wiredfool opened this issue Aug 25, 2024 · 11 comments

wiredfool commented Aug 25, 2024

Following on to some of the discussion in #1888, specifically here: #1888 (comment)

Rationale

Arrow is the emerging memory layout for zero-copy sharing of data in the new data ecosystem. It is an uncompressed columnar format, specifically designed for interop between different implementations and languages. It can be viewed as the spiritual successor to the existing numpy array interface that we provide. The arrow format is supported by numpy 2, pandas 2, polars, pyarrow, arro3, and others in the python ecosystem.

What Support means

  • The ability to export an image to an arrow array and read/process that data with zero memory copies
  • The ability to read an image from arrow array storage with zero copies.

Technical Details

(Apache docs are here: https://arrow.apache.org/docs/format/Columnar.html)

An Arrow Schema is a set of metadata, containing type information, and potentially child schemas. An Arrow Array has an (implicitly) associated schema, metadata about the length of the storage, as well as a buffer of a contiguously allocated chunk of memory for the data. The Arrow Array will generally have the same parent/child arrangement as the schema structure.

  • obj.__arrow_c_schema__() must return a PyCapsule named arrow_schema containing an arrow schema struct.
  • obj.__arrow_c_array__(requested_schema=None) must return a tuple of the schema capsule above and a PyCapsule named arrow_array containing an arrow array struct. The schema argument is advisory; the caller may request a specific format.

The lifetime of the Schema and Array structures is dependent on the caller -- so there are release callbacks that must be called when the caller is done with the memory. This complicates the lifetime of our image storage.

We have two cases at the moment:

  1. single channel image
  2. multichannel image

A single channel image can be encoded as a single array of height*width items, using the type of the underlying storage. (e.g., uint8/int32/float32).

A multichannel image can be encoded in a similar manner, using 4*height*width items in the array. The caller would be responsible for knowing that there are 4 elements per pixel. It's also possible to use a parent type of a FixedSizeList of 4 elements, with a child array of 4*height*width elements. Fixed-size lists are statically defined, so the underlying array is still the same contiguous block of memory.

Flat:

<pyarrow.lib.FixedSizeListArray object at 0x106ad4280>
[
    20,
    21,
    67,
    255,
    17,
    18,
    62,
    255,
...

Nested:

<pyarrow.lib.FixedSizeListArray object at 0x106ad4280>
[
  [
    20,
    21,
    67,
    255
  ],
  [
    17,
    18,
    62,
    255
  ],
  ...

An alternate encoding of a multichannel image would be to use a struct of channels, e.g. Struct[r,g,b,a]. This would require 4 child arrays, each allocated in a contiguous chunk, as in planar image storage. This is not compatible with our current storage.
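As a plain-Python sketch (no Arrow required, names for illustration only): deriving struct-of-channels child arrays from Pillow's interleaved storage necessarily copies, since each band must end up in its own contiguous chunk:

```python
# Two RGBA pixels in interleaved storage, as Pillow keeps them.
interleaved = [20, 21, 67, 255, 17, 18, 62, 255]
channels = 4

# Struct-of-channels (planar) layout: one contiguous child array per
# band. Slicing with a stride produces copies, not views -- which is
# why this layout is incompatible with the current interleaved storage.
r, g, b, a = (interleaved[c::channels] for c in range(channels))

assert r == [20, 17]
assert b == [67, 62]
assert a == [255, 255]
```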

While our core storage is generally compatible with this layout, there are three issues:

  1. The block allocator in ImagingAllocateArray packs a number of scanlines into each 16mb block, leaving empty space at the end of the block. This limits zero-copy export to images whose data fits within a single 16mb block. This is not an issue with the single-chunk ImagingAllocateBlock, which allocates the image in one chunk. (Note: despite the naming, arrow arrays fully work with the block allocator; it's the array allocator that has this limit. Naming is hard.) It may be possible to work around this with the streaming interface.
  2. Some modes have line length padding (BGR;15, BGR;24), and will not work without copying.
  3. Some modes have ignored pixel bands (LA/PA). This is a documentation issue for consumers.
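To put a number on issue 1, back-of-envelope arithmetic (plain Python; the 16mb figure is from the allocator description above, and the actual packing in ImagingAllocateArray may differ slightly):

```python
# How many whole scanlines of a 4-byte-per-pixel image fit in one block,
# i.e. the tallest image at a given width that a single block can hold.
BLOCK_SIZE = 16 * 1024 * 1024   # 16mb, per the allocator described above

def lines_per_block(width, bytes_per_pixel=4):
    """Whole scanlines that fit in one 16mb block (illustrative helper,
    not a Pillow API)."""
    return BLOCK_SIZE // (width * bytes_per_pixel)

# A 2048-wide RGBA image fits 2048 scanlines per block, so anything
# taller spans multiple blocks and can't be exported as one array.
assert lines_per_block(2048) == 2048
assert lines_per_block(4096) == 1024
```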

Implementation Notes

The PR #8330 implements Pillow->Arrow for images that don't trip the above caveats.

There are no additional build or runtime dependencies. The arrow structures are designed to be copied into a header and used from there. (licensing is not an issue as those fragments are under an Apache License). There is an additional test dependency on PyArrow at the moment. In theory, numpy 2 could be used for this, but I'm not sure if we'd be testing the legacy array access or arrow access.

The lifetime of the core imaging struct is now separated from the imaging Python Object. There's effectively a refcount implemented for this -- there's an initial 1 for the image->im reference, every arrow array that references an image increments it, and calling ImagingDelete decrements it.
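A toy Python model of that lifetime scheme (the class and method names are illustrative, not Pillow's actual C API):

```python
class ImagingCore:
    """Toy model of the refcounted core struct: the count starts at 1
    for the image->im reference, each exported arrow array adds one,
    and the storage is freed only when the count reaches zero."""

    def __init__(self):
        self.refcount = 1     # the initial image->im reference
        self.freed = False

    def export_arrow_array(self):
        self.refcount += 1    # an arrow array now keeps the storage alive

    def release(self):        # ImagingDelete, or an array release callback
        self.refcount -= 1
        if self.refcount == 0:
            self.freed = True

im = ImagingCore()
im.export_arrow_array()       # exported to arrow
im.release()                  # Python image deleted; array still alive
assert not im.freed
im.release()                  # arrow release callback fires
assert im.freed
```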

Outstanding Questions

For consumers of data -- what's the most useful format?

  • Flat array arr[(y*(width)+x)*4 + channel]
  • or Fixed Pixel array arr[y*(width)+x][channel]?
  • Would it make sense to embed this into a set of FixedArrays that are a line length, arr[y][x][channel]?
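For comparison, the three candidate layouts address the same interleaved buffer as follows (plain-Python sketch over dummy data; illustrative names only):

```python
# A width x height RGBA image as one interleaved buffer of dummy samples.
width, height, n = 2, 2, 4
buf = list(range(width * height * n))   # 16 samples: 0..15

# 1. Flat array: arr[(y*width + x)*4 + channel]
def flat(y, x, c):
    return buf[(y * width + x) * n + c]

# 2. Fixed pixel array: arr[y*width + x][channel]
pixels = [buf[i:i + n] for i in range(0, len(buf), n)]

# 3. Nested line arrays: arr[y][x][channel]
lines = [pixels[y * width:(y + 1) * width] for y in range(height)]

# All three address the same sample in the same underlying buffer.
assert flat(1, 0, 3) == pixels[1 * width + 0][3] == lines[1][0][3] == 11
```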
Yay295 commented Aug 25, 2024

The Variable-size Binary View Layout supports multiple data buffers, though it seems like that's designed more for a list of strings, so I'm not sure how it would handle image data.

@wiredfool

I don't see where a variable length structure would really gain us anything -- we'd have to construct an offset buffer, we'd lose actual types, and we still wouldn't be able to splice multiple allocation blocks together.

Yay295 commented Aug 25, 2024

Well, like I said, I'm not sure how it would handle image data. I just noticed that that seems to be the only way to provide multiple data buffers. Arrow requiring all data to be in a single contiguous buffer just seems absurd to me.

Yay295 commented Aug 25, 2024

It looks like PyArrow has a way to handle that: https://arrow.apache.org/docs/python/data.html#tables

Also, it might not be efficient, but there's a way to convert a NumPy array to an Arrow array. Since Pillow already supports NumPy, this might be an easy way to get something working before doing things in C to make it faster.

wiredfool commented Aug 25, 2024

@Yay295 I think from a utility point of view, we'd want to be exposing band level values. Binary chunks aren't going to be nearly as useful if they have to be interpreted. There are also some alignment issues that would come from that, at least for large binaries (64 byte boundaries). It also wouldn't solve the core issue of the storage needing to be continuous.

At the moment, the np array calls require a memory copy, e.g. a tobytes call into a buffer that's then shared. The trouble here is that the memory copy is only required for the biggest images, which is kind of the wrong way to go. They'd already work if they were allocated using imaging._new_block().

It looks like what PyArrow is doing with the table is effectively the __arrow_c_stream__ which returns a sequence of arrow arrays, and copies them into a single arrow array for further export. It looks like the stream and array interfaces are effectively interchangeable, so we can implement one or both of them.

@fdintino

Would there ever be a future where we might account for chroma subsampling in ImagingMemoryInstance? If so, I imagine we might also use a null arrow_band_format for that?

@wiredfool

I'd think the best way to accomplish that would be with planar image storage. My understanding of subsampling is that the resolution of one of the channels is effectively 1/2 or 1/4 of the resolution of the other bands. If we did this with planar storage, chroma would just be a uint8 image with 1/4 of the pixels.

Alternately, it could be stored as a null mapping in the validity buffer (which we're not currently handling, but which would probably be appropriate for the two- and three-channel image modes (LA/PA/RGB/HSV)). For subsampling, we could null out every nth item in a particular channel.


fdintino commented Aug 28, 2024

I think the first approach might be complicated a bit for 10- and 12-bit images (or maybe not, besides the fact that it wouldn't be a uint8 image). In case it is at all useful or relevant: libavutil in ffmpeg uses two structs, AVPixFmtDescriptor and AVComponentDescriptor (see pixdesc.h and pixdesc.c), to describe the various pixel storage formats it supports.

@kylebarron

For multi-channel images (assuming each channel has the same data type and dimensions) you could represent that as an array with type Fixed Shape Tensor.

@wiredfool

I've just put in a comment on that in here: apache/arrow#43831 (comment) -- what I don't see in a tensor is how to represent that in the PyCapsule interface, unless it's a nested set of fixed-size list (+w).

@kylebarron

what I don't see in a tensor is how to represent that in the PyCapsule interface, unless it's a nested set of fixed-size list (+w).

Yeah that's it. Plus extra extension metadata on the field
