
Yield batches from ogr_read_arrow #205

Closed
kylebarron opened this issue Jan 18, 2023 · 5 comments

@kylebarron
Contributor

I'm playing around more with reading Arrow tables from pyogrio and it's really exciting. It does feel like having some API to yield batches would be helpful for working with larger datasets. @jorisvandenbossche wrote in #155:

Returning a materialized Table (from read_all()) is fine for our current use case, but I can imagine that in the future, we might want to expose an iterative RecordBatchReader as well (eg as batch-wise input to query engine).
When we want to do that, I assume that we somehow need to keep the GDAL Dataset alive (putting it in a Python object (wrapping in a small class, or putting in a PyCapsule with destructor), and keeping a reference to that object from the RecordBatchReader).

I've never touched ogr bindings before, but naively it seems the easiest way to do this is by using a context manager:

with open_arrow("file.shp") as reader:
    for record_batch in reader:
        ...  # process each batch without materializing the full table

Would that work? Just putting a yield here?
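
For concreteness, a minimal sketch of that pattern (_FakeDataset is a hypothetical stand-in for the GDAL dataset handle, not pyogrio's actual internals; the point is that the generator's finally block keeps the dataset alive until the caller exits the with block):

from contextlib import contextmanager

import pyarrow as pa

class _FakeDataset:
    """Hypothetical stand-in for the GDAL dataset that must stay open."""

    def __init__(self, path):
        self.path = path
        self.closed = False
        # pretend GDAL parsed the file into an Arrow table
        self._table = pa.table({"wkb_geometry": [b"", b""], "name": ["a", "b"]})

    def batches(self):
        return self._table.to_batches(max_chunksize=1)

    def close(self):
        self.closed = True

@contextmanager
def open_arrow(path):
    dataset = _FakeDataset(path)  # must stay alive while batches are read
    try:
        yield iter(dataset.batches())  # "just putting a yield here"
    finally:
        dataset.close()  # released only after the with-block exits

Wrapping the open/close in a single generator is what ties the dataset's lifetime to the with block, which seems to be exactly the "keep the GDAL Dataset alive" problem from the quote above.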

@brendan-ward
Member

Do you envision that there would be a counterpart function to write data in batches, via the Arrow I/O API (once available)?

@kylebarron
Contributor Author

That seems entirely dependent on GDAL? https://gdal.org/development/rfc/rfc86_column_oriented_api.html says

Potential future work, not in the scope of this RFC, could be the addition of a column-oriented method to write new features, a WriteRecordBatch() method.

I would of course love for that to be added to GDAL, and getting greater adoption of RFC 86 seems very helpful for that.

#206 appeared to work on my local machine 🤷‍♂️. If a maintainer could enable CI on that PR, that would be helpful!

@kylebarron
Contributor Author

If GDAL were to add it, I think a similar API like

with write_ogr_batches("file.gpkg", arrow_schema) as writer:
    writer.write_batch(batch)

could make sense. But for my own needs I think I'm more likely to write only to GeoParquet and thus not use OGR as much for writing.
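
That batch-writing shape already exists for Parquet via pyarrow.parquet.ParquetWriter, which could serve as a model; a plain-Parquet sketch (file name and schema are illustrative, and this omits the GeoParquet "geo" file metadata):

import pyarrow as pa
import pyarrow.parquet as pq

# illustrative schema: WKB geometry plus one attribute column
schema = pa.schema([("wkb_geometry", pa.binary()), ("name", pa.string())])
batch = pa.record_batch([pa.array([b""], pa.binary()), pa.array(["a"])], schema=schema)

# ParquetWriter is the existing context-manager analogue of the
# hypothetical write_ogr_batches above
with pq.ParquetWriter("file.parquet", schema) as writer:
    writer.write_batch(batch)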

@jorisvandenbossche
Member

@kylebarron thanks for looking into this! That seems like a really nice idea; I hadn't thought about using a context manager to keep the dataset alive.

@jorisvandenbossche
Member

This was closed by #206
