
ENH: use new columnar GetArrowStream if GDAL>=3.6 and pyarrow available #155

Merged

Conversation

@jorisvandenbossche (Member) commented Sep 8, 2022

This is an experiment to use the Arrow Stream API added to GDAL in the upcoming 3.6 release (RFC 86: Column-oriented read API for vector layers).

For one specific benchmark using the nz-buildings-outline.gpkg file, I noticed another 2x speed-up compared to base pyogrio (see slide).

This does not yet support all features, and behaves a bit differently in some cases. Some observations:

  • skip_features and max_features are not supported (this would need a change on the GDAL side, for example to set a batch size and then only fetch a single batch)
  • this always reads all columns at the moment, but that should be solvable by explicitly ignoring fields with OGR_L_SetIgnoredFields, as mentioned in ENH: avoid I/O when columns don't need to be returned  #153 (something we should do anyway) -> selecting columns is supported
  • this always reads the geometry column (not sure if that is something you can turn off with existing GDAL APIs) -> not reading the geometry column (or reading a layer without geometry) is supported
  • this always returns an "FID" column, although the name doesn't seem to be fixed (my quick observation is that for file formats that don't have this natively, it comes through as an "OGC_FID" column, while for geopackage it reads as "fid", at least for a gpkg file written by geopandas) -> by default this now also doesn't read an FID column (to be consistent with the non-arrow code path), and it follows the return_fids keyword to include it as a column or not
  • the name of the geometry column can vary, and OGR_L_GetGeometryColumn doesn't always seem to know it (maybe some formats don't require a name (or support a custom name), in which case this method also doesn't return anything?). But for now I am naming it "geometry" in all cases anyway, to be consistent with the normal read path.
  • This automatically supports all data types (eg also nested data types)
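For reference, a minimal sketch of how this looks from the user side with the keyword this PR wires through (use_arrow; the exact spelling of the option is still being discussed below):

import pyogrio

# use_arrow=True routes the read through GDAL's ArrowArrayStream
# instead of the feature-by-feature path
gdf = pyogrio.read_dataframe("nz-buildings-outline.gpkg", use_arrow=True)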

pyogrio/_io.pyx Outdated
Comment on lines 1026 to 1037
IF CTE_GDAL_VERSION >= (3, 6, 0):
    if not OGR_L_GetArrowStream(ogr_layer, &stream, NULL):
        raise RuntimeError("Failed to open ArrowArrayStream from Layer")
ELSE:
    raise RuntimeError("Need GDAL>=3.6 for Arrow support")

stream_ptr = <uintptr_t> &stream

import pyarrow as pa
table = pa.RecordBatchStreamReader._import_from_c(stream_ptr).read_all()
@jorisvandenbossche (Member, Author) commented:

This part is actually the main piece that differs from the existing ogr_read function. So it might be cleaner to integrate it into that one (to avoid too much code duplication) by passing the use_arrow keyword through to ogr_read.
On the other hand, that makes the return value a bit messy (unless we change it into a nested tuple of length 2, (meta, data), where data can be (fid_data, geometries, field_data) in the standard case and table in the arrow case). And the arrow code path does not handle all keywords for now, so it might also be cleaner as a separate function.
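To illustrate the nested return shape floated above (a hypothetical sketch, not what the PR currently implements):

meta, data = ogr_read(path, use_arrow=use_arrow)
if use_arrow:
    table = data  # a pyarrow.Table
else:
    fid_data, geometries, field_data = data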

@brendan-ward (Member) left a comment:

Thanks for working on this @jorisvandenbossche ! The potential of a 2x speedup is exciting!

A bit of code duplication is OK for now, especially if the return signature of the Cython function is different and the supported parameters are a bit different.

Question: if we detect GDAL >= 3.6 and pyarrow, and the parameters are compatible, would we want to automatically opt in to using arrow for I/O with GDAL to / from GeoDataFrames, instead of based only on a keyword? Though this would add complexity around code paths based on parameters...

Perhaps generalize use_arrow to something that allows "auto", "numpy", "arrow" values, where "auto" (default) picks based on what is available, falling back to "numpy" if the arrow path is not available.
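A rough sketch of that dispatch (hypothetical helper and names; assumes pyogrio exposes the GDAL version as a tuple, e.g. via __gdal_version__):

def _resolve_read_path(use_arrow="auto"):
    # "numpy" is always available
    if use_arrow == "numpy":
        return "numpy"

    from pyogrio import __gdal_version__  # version tuple, e.g. (3, 6, 0)
    has_gdal = __gdal_version__ >= (3, 6)
    try:
        import pyarrow  # noqa: F401
        has_pyarrow = True
    except ImportError:
        has_pyarrow = False

    if use_arrow == "arrow":
        if not (has_gdal and has_pyarrow):
            raise ValueError("'arrow' requires GDAL >= 3.6 and pyarrow")
        return "arrow"

    # "auto": prefer the arrow path when possible, fall back to numpy
    return "arrow" if (has_gdal and has_pyarrow) else "numpy"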

Do we also want to consider adding these as raw I/O functions, e.g., read_arrow / write_arrow?
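A possible shape for such a raw reading function (hypothetical sketch, following the (meta, data) return convention of pyogrio.raw.read):

from pyogrio.raw import read_arrow  # hypothetical import path

meta, table = read_arrow("nz-buildings-outline.gpkg")
# meta: dict with metadata such as crs and fields; table: a pyarrow.Table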

this always reads the geometry column

We should probably investigate this more. Some folks use pyogrio to get at non-geometry tables in ESRI FileGeodatabases, among others. Would be nice to use arrow to get those, if possible.

You can use

read_dataframe('zip://pyogrio/tests/fixtures/test_fgdb.gdb.zip', layer='basetable')

for testing tabular-only data

@jorisvandenbossche (Member, Author):

Question: if we detect GDAL >= 3.6 and pyarrow, and the parameters are compatible, would we want to automatically opt in to using arrow for I/O with GDAL to / from GeoDataFrames, instead of based only on a keyword?

Eventually we should indeed do that, I think, but maybe not to start with?

Generalizing the keyword seems good for that. But maybe we then need another name?

Do we also want to consider adding these as raw I/O functions, e.g., read_arrow / write_arrow?

Good idea; I think adding a read_arrow would be worthwhile (write_arrow is currently not in scope, since the GDAL interface only provides arrow-based reading).

this always reads the geometry column

We should probably investigate this more. Some folks use pyogrio to get at non-geometry tables in ESRI FileGeodatabases, among others. Would be nice to use arrow to get those, if possible.

Yes, that actually works (just tested, and it reads the table nicely without a geometry column). The main issue for this case is how we reliably detect that there is no geometry column.

But the issue is more that if your layer has a geometry column, it is always included in the resulting arrow table. Which means we might not be able to support the ignore_geometry=True option.

@jorisvandenbossche (Member, Author):

Which means we might not be able to support the ignore_geometry=True option.

Ah, it seems that OGR_L_SetIgnoredFields can also be used to ignore the geometry field using a special value (https://gdal.org/api/vector_c_api.html#_CPPv422OGR_L_SetIgnoredFields9OGRLayerHPPKc). Will give that a try.
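A minimal Cython sketch of that idea (assuming the usual cimports for the OGR and CPL string APIs; "OGR_GEOMETRY" is the documented special field name):

cdef char **fields_to_ignore = NULL

# CSLAddString builds the NULL-terminated string list that
# OGR_L_SetIgnoredFields expects
fields_to_ignore = CSLAddString(fields_to_ignore, b"OGR_GEOMETRY")
try:
    if OGR_L_SetIgnoredFields(ogr_layer, <const char**>fields_to_ignore) != OGRERR_NONE:
        raise RuntimeError("Failed to ignore geometry field on layer")
finally:
    CSLDestroy(fields_to_ignore)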

@jorisvandenbossche (Member, Author) commented Sep 22, 2022

Something I hadn't thought about before is the lifetime of the GDAL Dataset/Layer while pyarrow still has data backed by buffers owned by GDAL.

Quote from the GDAL docs for OGR_L_GetArrowStream about this:

There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing).

So according to that, we should only destroy the GDAL dataset after we have fully consumed the stream (the data can still be owned by GDAL though, we are still allowed to call the release callback).

Simplified, this PR currently implements the following:

import pyarrow as pa

def read(path, layer):
    cdef OGRDataSourceH ogr_dataset = NULL
    cdef OGRLayerH ogr_layer = NULL
    cdef ArrowArrayStream stream

    ogr_dataset = GDALOpenEx(path)
    try:
        ogr_layer = GDALDatasetGetLayer(ogr_dataset, layer)
        OGR_L_GetArrowStream(ogr_layer, &stream, NULL)
        # read_all() fully consumes the stream before the dataset is closed
        table = pa.RecordBatchReader._import_from_c(<uintptr_t> &stream).read_all()
    finally:
        GDALClose(ogr_dataset)

    return table

So because I am currently creating a RecordBatchReader and actually consuming the iterator by calling read_all() (which will call get_next() on the ArrowArrayStream until the end), I assume we are OK here?

Returning a materialized Table (from read_all()) is fine for our current use case, but I can imagine that in the future we might want to expose an iterative RecordBatchReader as well (e.g. as batch-wise input to a query engine).
When we want to do that, I assume we somehow need to keep the GDAL Dataset alive (wrapping it in a small Python object, or putting it in a PyCapsule with a destructor, and keeping a reference to that object from the RecordBatchReader). But RecordBatchReader._import_from_c doesn't currently have a way to do that?
(cc @pitrou)
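For reference, a rough Python-level sketch of such a wrapper (hypothetical; all names invented here):

import pyarrow as pa

class _GDALStreamReader:
    # Keep the GDAL dataset alive while batches are still being pulled,
    # and close it only once the stream is fully drained
    def __init__(self, reader: pa.RecordBatchReader, close_dataset):
        self._reader = reader
        self._close_dataset = close_dataset  # e.g. a callable wrapping GDALClose

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return self._reader.read_next_batch()
        except StopIteration:
            self.close()
            raise

    def close(self):
        if self._close_dataset is not None:
            self._close_dataset()
            self._close_dataset = None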

@paleolimbot:

I can imagine that in the future, we might want to expose an iterative RecordBatchReader as well (eg as batch-wise input to query engine).

This is fresh off the presses, so no guarantees that it's a sound design, but when drafting r-spatial/sf#2036 I made a little wrapper stream that keeps the dataset alive until release() is called: https://github.com/r-spatial/sf/pull/2036/files#diff-f71a5e86883208e12685329bb59c3d169dcfae62dfd8e37de5a58f3e47dbc007R12-R69

@paleolimbot:

Just something I noticed in the documentation that you may or may not be aware of (and that we should probably fix upstream since it is not consistent with general ArrowSchema/Array usage):

ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing).

@jorisvandenbossche (Member, Author) commented Nov 22, 2022

@paleolimbot see my comment above: #155 (comment)
I am not sure that's necessarily something that has to change in GDAL, but in any case, on the Python side we don't really have the tools to "import from c" with an additional object to be kept alive (apart from doing something like you did in r-spatial/sf#2036, setting private_data on the struct?).

@paleolimbot:

Oh sorry, I missed that comment with the exact quote 🤦.

I think the stream lifecycle being contained within the layer lifecycle is fine, easy to work around, and consistent with what ADBC does with the lifecycle of its "statement". But the Array lifecycle is problematic, because the assumptions around not copying the array data run deep, and it's pretty much impossible to guarantee that the ArrowArray object will not be accessed after the dataset is closed. (Right?)

@jorisvandenbossche (Member, Author) commented Nov 22, 2022

it's pretty much impossible to guarantee that the ArrowArray object will not be accessed after the dataset is closed. (Right?)

But I think that is OK? My assumption was that you can access the ArrowArray data after the dataset is closed, as long as the ArrowArray already exists (so you need to consume the stream, so that all ArrowArray structs are created, before the dataset is closed).

Although now rereading the quote, it indeed says that the ArrowArray (not just the stream) should no longer be "used" after the dataset is closed. It is not super clear to me what "should no longer be used (except for potentially being released)" means exactly, though, as I would assume that as long as you don't call the release callback, the actual buffers pointed to by the ArrowArray are guaranteed to still be alive.

But let's maybe open an issue on the GDAL side to clarify this.
