Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add support for skip_features, max_features for read_arrow #282

Merged
merged 7 commits into from
Sep 28, 2023

Conversation

brendan-ward
Copy link
Member

Progress toward feature parity between Arrow and non-Arrow read interfaces.

This adds support for skip_features and max_features to read_arrow, which enables these to be passed through via read_dataframe(path, use_arrow=True, skip_features=10, max_features=2) so that the behavior is the same with and without use_arrow.

I added a note to the introduction to describe the overhead involved:use_arrow is True, skip_features and max_features will incur additional overhead because all features up to the next batch size above max_features (or size of data layer) will be read prior to slicing out the requested range of features. If max_features is less than the maximum Arrow batch size (65,536 features) only max_features will be read. All features up to skip_features are read from the data source and later discarded because the Arrow interface does not support randomly seeking a starting feature.

This overhead is relative to reading via Arrow; based on my limited tests so far, it is still generally a lot faster to use Arrow even with these parameters than without Arrow.

This also drops a validation requirement that skip_features be less than the number of features available to read (originally we raised a ValueError). Since using a .slice on a pyarrow Table with a value larger than the size of the original table happily returns an empty table, it made sense to take this approach throughout: if you ask for more features than available, you get back empty arrays / pyarrow Tables / (Geo)DataFrames.

@brendan-ward brendan-ward added the enhancement New feature or request label Sep 5, 2023
@brendan-ward brendan-ward added this to the 0.7.0 milestone Sep 5, 2023
@theroggy
Copy link
Member

I think this PR adds support for this in gdal 3.8: OSGeo/gdal#8306

@jorisvandenbossche
Copy link
Member

Specifically one commit (OSGeo/gdal@248cf60): " OGRLayer::GetArrowStream(): do not issue ResetReading() at beginning of iteration, but at end instead, so SetNextByIndex() can be honoured".
That's nice! (although of course will still take some time before we can use in released version, but we can already start using dependent on the GDAL version if we want)

@jorisvandenbossche
Copy link
Member

This also drops a validation requirement that skip_features be less than the number of features available to read

+1

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Sep 24, 2023

Something else: if max_features is small, you can set the MAX_FEATURES_IN_BATCH option to avoid to get too much data from GDAL. We still need the current code to iterate until we have enough data and slice the last batch, because it is not guaranteed that that option is honored (I think it can depend on the driver)

Forget what I said, we already do that through the batch_size keyword, and you already set that keyword to max_features if it is smaller.

pyogrio/raw.py Outdated Show resolved Hide resolved
)

elif skip_features > 0:
table = reader.read_all().slice(skip_features).combine_chunks()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side thought: we should have a better way to release the sliced memory, without having to copy the whole table (and with "we", I mean pyarrow should enable this).
Because right now, assume you have a million rows, and you skip the first 10,000, then only the first batch is sliced, and we unnecessarily copy all the other data.

@brendan-ward
Copy link
Member Author

Latest commit passes skip_features to GDAL >= 3.8 instead of handling on our end, which should cut down on some of the overhead.

@brendan-ward brendan-ward merged commit d53efb7 into main Sep 28, 2023
15 checks passed
@brendan-ward brendan-ward deleted the arrow_max_features branch September 28, 2023 22:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants