[Python] PyCapsule interface for Image/Raster Data #43831

Open
wiredfool opened this issue Aug 26, 2024 · 4 comments
Labels
Component: Python Type: usage Issue is a user question

Comments

@wiredfool

Describe the usage question you have. Please include as many useful details as possible.

I'm implementing support for the Arrow PyCapsule Protocol in Pillow, as referenced here: python-pillow/Pillow#8329, implementation here: python-pillow/Pillow#8330

There are a couple of implementation questions that arise from it:

Internally, we store images as binary chunks of full raster lines, up to 16 MB each. Above that, the image overflows into the next chunk. There's a variable amount of dead space between the end of the last scan line and the 16 MB boundary. So for the simple, small image case, we can just point at this memory as the array buffer.
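
The chunking arithmetic described above can be sketched as follows. This is not Pillow's code: the block size (16 MB read as 16 MiB), the 4 bytes per pixel, and the helper name `chunk_layout` are all assumptions made for illustration.

```python
# Sketch of the chunking arithmetic. The block size and the RGBA
# (4 bytes per pixel) layout are assumptions, not Pillow's actual values.
BLOCK_SIZE = 16 * 1024 * 1024  # assumed: "16 MB" taken as 16 MiB


def chunk_layout(width, height, bytes_per_pixel=4, block_size=BLOCK_SIZE):
    """Return (lines_per_block, n_blocks, dead_bytes_in_full_block)."""
    line_bytes = width * bytes_per_pixel
    if line_bytes > block_size:
        raise ValueError("a single scan line does not fit in one block")
    # Only whole scan lines are stored, so round down.
    lines_per_block = block_size // line_bytes
    # Ceiling division: number of blocks needed to hold every line.
    n_blocks = -(-height // lines_per_block)
    # Unused tail of a block that is completely filled with lines.
    dead_bytes = block_size - lines_per_block * line_bytes
    return lines_per_block, n_blocks, dead_bytes


small = chunk_layout(640, 480)    # fits in one block
large = chunk_layout(5000, 4000)  # spills over several blocks
```

Under these assumptions a 640x480 image occupies a single block, so its buffer can be exposed directly, while a 5000x4000 image spans several blocks, each with its own unused tail.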

Is an __arrow_c_stream__ the best way to implement what would effectively be chunked arrays? Is there a way in the protocol to fall back from the __arrow_c_array__ to the stream on err/null? For our purposes, a stream is likely as lightweight to provide as an array.
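
The thread does not answer the fallback question. One possibility, assuming consumers simply check which export methods an object defines, is sketched below. The classes and the `export_kind` helper are hypothetical, and the string return values stand in for the PyCapsule objects a real producer would build in C.

```python
# Hypothetical stand-ins: real implementations return PyCapsule objects,
# which need C-level code. Strings are used here only to trace the flow.
class SmallImage:
    def __arrow_c_array__(self, requested_schema=None):
        # The real method returns a (schema capsule, array capsule) pair.
        return ("schema-capsule", "array-capsule")

    def __arrow_c_stream__(self, requested_schema=None):
        return "stream-capsule"


class ChunkedImage:
    # Only the stream export is offered: the data spans several blocks.
    def __arrow_c_stream__(self, requested_schema=None):
        return "stream-capsule"


def export_kind(obj, prefer_stream=True):
    """Report which export a consumer would pick (hypothetical helper)."""
    order = ["__arrow_c_stream__", "__arrow_c_array__"]
    if not prefer_stream:
        order.reverse()
    for name in order:
        if hasattr(obj, name):
            return name
    raise TypeError("object does not expose an Arrow PyCapsule export")
```

In this sketch the fallback lives in the consumer, not in the protocol: an image that cannot be exposed as a single array just leaves `__arrow_c_array__` undefined.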

Is there a preferred array representation of Image raster data? There are a few possible, but I'd like to provide something that looks vaguely like a standard. FWIW, at the moment, the numpy array interface does return a shaped array, so the dimensions of the image are available.

  • Flat array arr[(y*(width)+x)*4 + channel]
  • or Fixed Pixel array arr[y*(width)+x][channel]?
  • Would it make sense to embed this into a set of FixedArrays that are a line length, arr[y][x][channel]?
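
The three candidate layouts address the same underlying values and differ only in how the indices are grouped. A small pure-Python sketch, using a made-up 3x2 RGBA image, shows the equivalence:

```python
# Pure-Python sketch: the three candidate layouts address the same values.
# The 3x2 RGBA image is made up for illustration.
width, height, channels = 3, 2, 4

# Layout 1: flat array, one entry per channel value.
flat = list(range(width * height * channels))

# Layout 2: one fixed-size entry per pixel.
pixels = [flat[i:i + channels] for i in range(0, len(flat), channels)]

# Layout 3: one fixed-size entry per scan line, each holding pixels.
lines = [pixels[row * width:(row + 1) * width] for row in range(height)]

x, y, channel = 2, 1, 3
a = flat[(y * width + x) * channels + channel]
b = pixels[y * width + x][channel]
c = lines[y][x][channel]
```

All three lookups return the same element, so the choice between them is about how much shape information the Arrow type itself carries, not about the memory layout.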

Component(s)

Python

@wiredfool wiredfool added the Type: usage Issue is a user question label Aug 26, 2024
@rok
Member

rok commented Aug 26, 2024

Is there a preferred array representation of Image raster data? There are a few possible, but I'd like to provide something that looks vaguely like a standard. FWIW, at the moment, the numpy array interface does return a shaped array, so the dimensions of the image are available.

  • Flat array arr[(y*(width)+x)*4 + channel]
  • or Fixed Pixel array arr[y*(width)+x][channel]?
  • Would it make sense to embed this into a set of FixedArrays that are a line length, arr[y][x][channel]?

FixedShapeTensorArray would probably fit your use case best; it carries the shape and supports zero-copy conversion to and from numpy.

Edit: are you looking to represent a single image or a series of images? If it's a single one, perhaps Tensor would make more sense. I'm not sure how it would handle chunks.

@wiredfool
Author

We'd definitely be starting with single images, though at some point we might look at the animated images as a series of images. They're not typically kept in memory at one time, so lightweight sharing is more difficult.

@wiredfool
Author

Had a quick look at the pyarrow tensor -- I'm not clear what it actually is, in terms of an arrow schema or array. It doesn't seem to support the PyCapsule interface anyway:

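>>> import numpy as np
>>> import pyarrow as pa
>>> x = np.zeros((2, 3))  # placeholder input; the original post does not show x
>>> ten = pa.Tensor.from_numpy(x, dim_names=["dim1","dim2"])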
>>> ten.__arrow_c_schema__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.Tensor' object has no attribute '__arrow_c_schema__'
>>> arr, schema = ten.__arrow_c_array__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'pyarrow.lib.Tensor' object has no attribute '__arrow_c_array__'

How could I create something that's compatible with a tensor through the __arrow_c_array__ interface?

(Note: I didn't mention this earlier, but we're aiming for native support, without PyArrow as a dependency.)

@kylebarron
Contributor

I haven't used the pyarrow.Tensor API myself. I'd guess you need to create a length-1 array out of that, and then it'll have the __arrow_c_array__ interface.

In terms of native support: the FixedShapeTensor extension is defined here. It uses nested FixedSizeLists plus extension metadata on the associated field that describes it as a FixedShapeTensor array.
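
For a producer without PyArrow, the schema side of this could look roughly like the sketch below. It assumes, from a reading of the canonical extension spec that should be checked against the spec itself, that the storage is a FixedSizeList whose length is the product of the shape, with the shape carried as JSON in the field metadata. The helper name and the plain-dict representation are hypothetical; a real producer fills a C ArrowSchema struct.

```python
import json


# Hypothetical helper: builds a plain-dict description of the schema a
# producer would export. A real producer fills a C ArrowSchema struct.
def fixed_shape_tensor_field(height, width, channels, value_format="C"):
    shape = [height, width, channels]
    n_values = height * width * channels
    return {
        # "+w:N" is the C Data Interface format string for FixedSizeList.
        "format": f"+w:{n_values}",
        "metadata": {
            "ARROW:extension:name": "arrow.fixed_shape_tensor",
            "ARROW:extension:metadata": json.dumps({"shape": shape}),
        },
        # "C" is the C Data Interface format string for uint8; the child
        # name "item" is a common default, assumed here.
        "children": [{"format": value_format, "name": "item"}],
    }


field = fixed_shape_tensor_field(480, 640, 4)
```

Each image is then one element of the array, so a single image becomes a length-1 array, which matches the suggestion above.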

5 participants
@wiredfool @rok @kylebarron and others