
ENH: use new columnar GetArrowStream if GDAL>=3.6 and pyarrow available #155

Merged

Conversation

@jorisvandenbossche (Member) commented Sep 8, 2022

This is an experiment to use the Arrow Stream API added to GDAL in the upcoming 3.6 release (RFC 86: Column-oriented read API for vector layers).

For one specific benchmark using the nz-buildings-outline.gpkg file, I noticed another 2x speed-up compared to base pyogrio (see slide).

This does not yet support all features, and behaves a bit differently in some cases. Some observations:

  • skip_features and max_features are not supported (this would need a change on the GDAL side, for example to set a batch size and then only fetch a single batch)
  • this always reads all columns at the moment, but that should be solvable by explicitly ignoring fields with OGR_L_SetIgnoredFields, as mentioned in ENH: avoid I/O when columns don't need to be returned  #153 (something we should do anyway) -> selecting columns is supported
  • this always reads the geometry column (not sure if that is something you can turn off with existing GDAL APIs) -> not reading the geometry column (or reading a layer without geometry) is supported
  • this always returns an "FID" column, although the name doesn't seem to be fixed (my quick observation is that for file formats that don't have this natively, it comes through as an "OGC_FID" column, while for geopackage it reads as "fid", at least for a gpkg file written by geopandas) -> by default this now also doesn't read an FID column (to be consistent with the non-arrow code path), and it follows the return_fids keyword to include it as a column or not
  • the name of the geometry column can vary, and OGR_L_GetGeometryColumn doesn't always seem to know it (maybe some formats don't require a name (or support a custom name), in which case this method also doesn't return anything?). But for now I am naming it "geometry" in all cases anyway, to be consistent with the normal read path.
  • This automatically supports all data types (eg also nested data types)
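For reference, a minimal sketch of how this looks from the user side with the keyword this PR wires through (use_arrow; the exact spelling of the option is still being discussed below):

import pyogrio

# use_arrow=True routes the read through GDAL's ArrowArrayStream
# instead of the feature-by-feature path
gdf = pyogrio.read_dataframe("nz-buildings-outline.gpkg", use_arrow=True)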

pyogrio/_io.pyx Outdated
Comment on lines 1026 to 1037
IF CTE_GDAL_VERSION >= (3, 6, 0):
    if not OGR_L_GetArrowStream(ogr_layer, &stream, NULL):
        raise RuntimeError("Failed to open ArrowArrayStream from Layer")
ELSE:
    raise RuntimeError("Need GDAL>=3.6 for Arrow support")

stream_ptr = <uintptr_t> &stream

import pyarrow as pa
table = pa.RecordBatchStreamReader._import_from_c(stream_ptr).read_all()
@jorisvandenbossche (Member, Author) commented:

This part is actually the main piece that differs from the existing ogr_read function. So it might be cleaner to integrate it into that one (to avoid too much code duplication) by passing the use_arrow keyword through to ogr_read.
On the other hand, that makes the return value a bit messy (unless we change it into a nested tuple of length 2, (meta, data), where data can be (fid_data, geometries, field_data) in the standard case and table in the arrow case). And the arrow code path does not handle all keywords for now, so it might also be cleaner as a separate function.
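To illustrate the nested return shape floated above (a hypothetical sketch, not what the PR currently implements):

meta, data = ogr_read(path, use_arrow=use_arrow)
if use_arrow:
    table = data  # a pyarrow.Table
else:
    fid_data, geometries, field_data = data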

@brendan-ward (Member) left a comment:

Thanks for working on this @jorisvandenbossche ! The potential of a 2x speedup is exciting!

A bit of code duplication is OK for now, especially if the return signature of the Cython function is different and the supported parameters are a bit different.

Question: if we detect GDAL >= 3.6 and pyarrow, and the parameters are compatible, would we want to automatically opt in to using arrow for I/O with GDAL to / from GeoDataFrames, instead of based only on a keyword? Though this would add complexity around code paths based on parameters...

Perhaps generalize use_arrow to something that allows "auto", "numpy", "arrow" values, where "auto" (default) picks based on what is available, falling back to "numpy" if the arrow path is not available.
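A rough sketch of that dispatch (hypothetical helper and names; assumes pyogrio exposes the GDAL version as a tuple, e.g. via __gdal_version__):

def _resolve_read_path(use_arrow="auto"):
    # "numpy" is always available
    if use_arrow == "numpy":
        return "numpy"

    from pyogrio import __gdal_version__  # version tuple, e.g. (3, 6, 0)
    has_gdal = __gdal_version__ >= (3, 6)
    try:
        import pyarrow  # noqa: F401
        has_pyarrow = True
    except ImportError:
        has_pyarrow = False

    if use_arrow == "arrow":
        if not (has_gdal and has_pyarrow):
            raise ValueError("'arrow' requires GDAL >= 3.6 and pyarrow")
        return "arrow"

    # "auto": prefer the arrow path when possible, fall back to numpy
    return "arrow" if (has_gdal and has_pyarrow) else "numpy"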

Do we also want to consider adding these as raw I/O functions, e.g., read_arrow / write_arrow?
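A possible shape for such a raw reading function (hypothetical sketch, following the (meta, data) return convention of pyogrio.raw.read):

from pyogrio.raw import read_arrow  # hypothetical import path

meta, table = read_arrow("nz-buildings-outline.gpkg")
# meta: dict with metadata such as crs and fields; table: a pyarrow.Table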

this always reads the geometry column

We should probably investigate this more. Some folks use pyogrio to get at non-geometry tables in ESRI FileGeodatabases, among others. Would be nice to use arrow to get those, if possible.

You can use

read_dataframe('zip://pyogrio/tests/fixtures/test_fgdb.gdb.zip', layer='basetable')

for testing tabular-only data

@jorisvandenbossche (Member, Author):

Question: if we detect GDAL >= 3.6 and pyarrow, and the parameters are compatible, would we want to automatically opt in to using arrow for I/O with GDAL to / from GeoDataFrames, instead of based only on a keyword?

Eventually we should indeed do that, I think, but maybe not to start with?

Generalizing the keyword seems good for that. But maybe we then need another name?

Do we also want to consider adding these as raw I/O functions, e.g., read_arrow / write_arrow?

Good idea; I think adding a read_arrow would be worthwhile (write_arrow is currently not in scope, since the GDAL interface only provides arrow-based reading).

this always reads the geometry column

We should probably investigate this more. Some folks use pyogrio to get at non-geometry tables in ESRI FileGeodatabases, among others. Would be nice to use arrow to get those, if possible.

Yes, that actually works (just tested, and it reads the table nicely without a geometry column). The main issue for this case is how we reliably detect that there is no geometry column.

But the issue is more that if your layer has a geometry column, it is always included in the resulting arrow table. Which means we might not be able to support the ignore_geometry=True option.

@jorisvandenbossche (Member, Author):

Which means we might not be able to support the ignore_geometry=True option.

Ah, it seems that OGR_L_SetIgnoredFields can also be used to ignore the geometry field using a special value (https://gdal.org/api/vector_c_api.html#_CPPv422OGR_L_SetIgnoredFields9OGRLayerHPPKc). Will give that a try.
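A minimal Cython sketch of that idea (assuming the usual cimports for the OGR and CPL string APIs; "OGR_GEOMETRY" is the documented special field name):

cdef char **fields_to_ignore = NULL

# CSLAddString builds the NULL-terminated string list that
# OGR_L_SetIgnoredFields expects
fields_to_ignore = CSLAddString(fields_to_ignore, b"OGR_GEOMETRY")
try:
    if OGR_L_SetIgnoredFields(ogr_layer, <const char**>fields_to_ignore) != OGRERR_NONE:
        raise RuntimeError("Failed to ignore geometry field on layer")
finally:
    CSLDestroy(fields_to_ignore)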

@jorisvandenbossche (Member, Author) commented Sep 22, 2022

Something I hadn't thought about before is the lifetime of the GDAL Dataset/Layer while pyarrow still has data backed by buffers owned by GDAL.

Quote from the GDAL docs for OGR_L_GetArrowStream about this:

There are extra precautions to take into account in a OGR context. Unless otherwise specified by a particular driver implementation, the ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing).

So according to that, we should only destroy the GDAL dataset after we have fully consumed the stream (the data can still be owned by GDAL though, we are still allowed to call the release callback).

Simplified, this PR currently implements the following:

import pyarrow as pa

def read(path, layer):
    cdef OGRDataSourceH ogr_dataset = NULL
    cdef OGRLayerH ogr_layer = NULL
    cdef ArrowArrayStream stream

    ogr_dataset = GDALOpenEx(path)
    try:
        ogr_layer = GDALDatasetGetLayer(ogr_dataset, layer)
        OGR_L_GetArrowStream(ogr_layer, &stream, NULL)
        # read_all() fully consumes the stream before the dataset is closed
        table = pa.RecordBatchReader._import_from_c(<uintptr_t> &stream).read_all()
    finally:
        GDALClose(ogr_dataset)

    return table

So because I am currently creating a RecordBatchReader and actually consuming the iterator by calling read_all() (which will call get_next() on the ArrowArrayStream until the end), I assume we are OK here?

Returning a materialized Table (from read_all()) is fine for our current use case, but I can imagine that in the future we might want to expose an iterative RecordBatchReader as well (e.g. as batch-wise input to a query engine).
When we want to do that, I assume we somehow need to keep the GDAL Dataset alive (wrapping it in a small Python object, or putting it in a PyCapsule with a destructor, and keeping a reference to that object from the RecordBatchReader). But RecordBatchReader._import_from_c doesn't currently have a way to do that?
(cc @pitrou)
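For reference, a rough Python-level sketch of such a wrapper (hypothetical; all names invented here):

import pyarrow as pa

class _GDALStreamReader:
    # Keep the GDAL dataset alive while batches are still being pulled,
    # and close it only once the stream is fully drained
    def __init__(self, reader: pa.RecordBatchReader, close_dataset):
        self._reader = reader
        self._close_dataset = close_dataset  # e.g. a callable wrapping GDALClose

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return self._reader.read_next_batch()
        except StopIteration:
            self.close()
            raise

    def close(self):
        if self._close_dataset is not None:
            self._close_dataset()
            self._close_dataset = None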

@paleolimbot:

I can imagine that in the future, we might want to expose an iterative RecordBatchReader as well (eg as batch-wise input to query engine).

This is fresh off the presses, so no guarantees that it's a sound design, but when drafting r-spatial/sf#2036 I made a little wrapper stream that keeps the dataset alive until release() is called: https://github.com/r-spatial/sf/pull/2036/files#diff-f71a5e86883208e12685329bb59c3d169dcfae62dfd8e37de5a58f3e47dbc007R12-R69

@paleolimbot:

Just something I noticed in the documentation that you may or may not be aware of (and that we should probably fix upstream since it is not consistent with general ArrowSchema/Array usage):

ArrowArrayStream structure, and the ArrowSchema or ArrowArray objects its callbacks have returned, should no longer be used (except for potentially being released) after the OGRLayer from which it was initialized has been destroyed (typically at dataset closing).

@jorisvandenbossche (Member, Author) commented Nov 22, 2022

@paleolimbot see my comment above: #155 (comment)
I am not sure that's necessarily something that has to change in GDAL, but in any case, on the Python side we don't really have the tools to "import from c" with an additional object to be kept alive (apart from doing something like you did in r-spatial/sf#2036, setting private_data on the struct?).

@paleolimbot:

Oh sorry, I missed that comment with the exact quote 🤦.

I think the stream lifecycle being contained within the layer lifecycle is fine, easy to work around, and consistent with what ADBC does with the lifecycle of its "statement". But the Array lifecycle is problematic, because the assumptions around not copying the array data run deep, and it's pretty much impossible to guarantee that the ArrowArray object will not be accessed after the dataset is closed. (Right?)

@jorisvandenbossche (Member, Author) commented Nov 22, 2022

it's pretty much impossible to guarantee that the ArrowArray object will not be accessed after the dataset is closed. (Right?)

But I think that is OK? My assumption was that you can access the ArrowArray data after the dataset is closed, as long as the ArrowArray already exists (so you need to consume the stream, so that all ArrowArray structs are created, before the dataset is closed).

Although now rereading the quote, it indeed says that the ArrowArray (not just the stream) should no longer be "used" after the dataset is closed. It is not super clear to me what "should no longer be used (except for potentially being released)" means exactly, though, as I would assume that as long as you don't call the release callback, the actual buffers pointed to by the ArrowArray are guaranteed to still be alive.

But let's maybe open an issue on the GDAL side to clarify this.
