
Yield batches from ogr_read_arrow #205

Closed
kylebarron opened this issue Jan 18, 2023 · 5 comments

@kylebarron
Contributor

I'm playing around more with reading Arrow tables from pyogrio and it's really exciting. It does feel like having some API to yield batches would be helpful for working with larger datasets. @jorisvandenbossche wrote in #155:

Returning a materialized Table (from read_all()) is fine for our current use case, but I can imagine that in the future, we might want to expose an iterative RecordBatchReader as well (eg as batch-wise input to query engine).
When we want to do that, I assume that we somehow need to keep the GDAL Dataset alive (putting it in a Python object (wrapping in a small class, or putting in a PyCapsule with destructor), and keeping a reference to that object from the RecordBatchReader).

I've never touched ogr bindings before, but naively it seems the easiest way to do this is by using a context manager:

with open_arrow("file.shp") as reader:
    for record_batch in reader:
        ...  # process each batch without materializing the full table

Would that work? Just putting a yield here?
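
For concreteness, a minimal sketch of that pattern (_FakeDataset is a hypothetical stand-in for the GDAL dataset handle, not pyogrio's actual internals; the point is that the generator's finally block keeps the dataset alive until the caller exits the with block):

from contextlib import contextmanager

import pyarrow as pa

class _FakeDataset:
    """Hypothetical stand-in for the GDAL dataset that must stay open."""

    def __init__(self, path):
        self.path = path
        self.closed = False
        # pretend GDAL parsed the file into an Arrow table
        self._table = pa.table({"wkb_geometry": [b"", b""], "name": ["a", "b"]})

    def batches(self):
        return self._table.to_batches(max_chunksize=1)

    def close(self):
        self.closed = True

@contextmanager
def open_arrow(path):
    dataset = _FakeDataset(path)  # must stay alive while batches are read
    try:
        yield iter(dataset.batches())  # "just putting a yield here"
    finally:
        dataset.close()  # released only after the with-block exits

Wrapping the open/close in a single generator is what ties the dataset's lifetime to the with block, which seems to be exactly the "keep the GDAL Dataset alive" problem from the quote above.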

@brendan-ward
Member

Do you envision that there would be a counterpart function to write data in batches, via the Arrow I/O API (once available)?

@kylebarron
Contributor Author

That seems entirely dependent on GDAL? https://gdal.org/development/rfc/rfc86_column_oriented_api.html says

Potential future work, not in the scope of this RFC, could be the addition of a column-oriented method to write new features, a WriteRecordBatch() method.

I would of course love for that to be added to GDAL, and getting greater adoption of RFC 86 seems very helpful for that.

#206 appeared to work on my local machine 🤷‍♂️. If a maintainer could enable CI on that PR, that would be helpful!

@kylebarron
Contributor Author

If GDAL were to add it, I think a similar API like

with write_ogr_batches("file.gpkg", arrow_schema) as writer:
    writer.write_batch(batch)

could make sense. But for my own needs I think I'm more likely to write only to GeoParquet and thus not use OGR as much for writing.
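
That batch-writing shape already exists for Parquet via pyarrow.parquet.ParquetWriter, which could serve as a model; a plain-Parquet sketch (file name and schema are illustrative, and this omits the GeoParquet "geo" file metadata):

import pyarrow as pa
import pyarrow.parquet as pq

# illustrative schema: WKB geometry plus one attribute column
schema = pa.schema([("wkb_geometry", pa.binary()), ("name", pa.string())])
batch = pa.record_batch([pa.array([b""], pa.binary()), pa.array(["a"])], schema=schema)

# ParquetWriter is the existing context-manager analogue of the
# hypothetical write_ogr_batches above
with pq.ParquetWriter("file.parquet", schema) as writer:
    writer.write_batch(batch)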

@jorisvandenbossche
Member

@kylebarron thanks for looking into this! That seems like a really nice idea; I hadn't thought about using a context manager to keep the dataset alive.

@jorisvandenbossche
Member

This was closed by #206
