ENH: Add support for skip_features, max_features for read_arrow #282

brendan-ward · 2023-09-05T17:39:22Z

Progress toward feature parity between Arrow and non-Arrow read interfaces.

This adds support for skip_features and max_features to read_arrow, which enables these to be passed through via read_dataframe(path, use_arrow=True, skip_features=10, max_features=2) so that the behavior is the same with and without use_arrow.

I added a note to the introduction to describe the overhead involved:use_arrow is True, skip_features and max_features will incur additional overhead because all features up to the next batch size above max_features (or size of data layer) will be read prior to slicing out the requested range of features. If max_features is less than the maximum Arrow batch size (65,536 features) only max_features will be read. All features up to skip_features are read from the data source and later discarded because the Arrow interface does not support randomly seeking a starting feature.

This overhead is relative to reading via Arrow; based on my limited tests so far, it is still generally a lot faster to use Arrow even with these parameters than without Arrow.

This also drops a validation requirement that skip_features be less than the number of features available to read (originally we raised a ValueError). Since using a .slice on a pyarrow Table with a value larger than the size of the original table happily returns an empty table, it made sense to take this approach throughout: if you ask for more features than available, you get back empty arrays / pyarrow Tables / (Geo)DataFrames.

theroggy · 2023-09-22T04:10:10Z

I think this PR adds support for this in gdal 3.8: OSGeo/gdal#8306

jorisvandenbossche · 2023-09-22T20:04:19Z

Specifically one commit (OSGeo/gdal@248cf60): " OGRLayer::GetArrowStream(): do not issue ResetReading() at beginning of iteration, but at end instead, so SetNextByIndex() can be honoured".
That's nice! (although of course will still take some time before we can use in released version, but we can already start using dependent on the GDAL version if we want)

jorisvandenbossche · 2023-09-22T20:04:40Z

This also drops a validation requirement that skip_features be less than the number of features available to read

+1

jorisvandenbossche · 2023-09-24T14:40:47Z

Something else: if max_features is small, you can set the MAX_FEATURES_IN_BATCH option to avoid to get too much data from GDAL. We still need the current code to iterate until we have enough data and slice the last batch, because it is not guaranteed that that option is honored (I think it can depend on the driver)

Forget what I said, we already do that through the batch_size keyword, and you already set that keyword to max_features if it is smaller.

pyogrio/raw.py

jorisvandenbossche · 2023-09-26T06:31:42Z

pyogrio/raw.py

+            )
+
+        elif skip_features > 0:
+            table = reader.read_all().slice(skip_features).combine_chunks()


Side thought: we should have a better way to release the sliced memory, without having to copy the whole table (and with "we", I mean pyarrow should enable this).
Because right now, assume you have a million rows, and you skip the first 10,000, then only the first batch is sliced, and we unnecessarily copy all the other data.

pyogrio/tests/test_arrow.py

brendan-ward · 2023-09-26T17:37:44Z

Latest commit passes skip_features to GDAL >= 3.8 instead of handling on our end, which should cut down on some of the overhead.

Add support for skip_features, max_features for read_arrow

64f7f14

brendan-ward added the enhancement New feature or request label Sep 5, 2023

brendan-ward added this to the 0.7.0 milestone Sep 5, 2023

brendan-ward added 3 commits September 5, 2023 10:40

Update changelog

06a1938

Update docstrings

62ec69b

Update docstrings (again)

430e510

Merge remote-tracking branch 'origin/main' into arrow_max_features

d764930

jorisvandenbossche reviewed Sep 26, 2023

View reviewed changes

pyogrio/raw.py Outdated Show resolved Hide resolved

jorisvandenbossche reviewed Sep 26, 2023

View reviewed changes

pyogrio/tests/test_arrow.py Show resolved Hide resolved

brendan-ward added 2 commits September 26, 2023 08:58

Fix bug in reading arrow with both skip_features and max_features

ae34101

Support skipping features internally for Arrow on GDAL >= 3.8

ed8f5a7

jorisvandenbossche approved these changes Sep 28, 2023

View reviewed changes

brendan-ward merged commit d53efb7 into main Sep 28, 2023
15 checks passed

brendan-ward deleted the arrow_max_features branch September 28, 2023 22:07

brendan-ward mentioned this pull request Oct 3, 2023

Add support for force_2d with use_arrow=True in read_dataframe #300

Merged

brendan-ward mentioned this pull request Oct 10, 2023

FIX: Disallow negative skip_features #312

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Add support for skip_features, max_features for read_arrow #282

ENH: Add support for skip_features, max_features for read_arrow #282

brendan-ward commented Sep 5, 2023

theroggy commented Sep 22, 2023

jorisvandenbossche commented Sep 22, 2023

jorisvandenbossche commented Sep 22, 2023

jorisvandenbossche commented Sep 24, 2023 •

edited

Loading

jorisvandenbossche Sep 26, 2023

brendan-ward commented Sep 26, 2023

ENH: Add support for skip_features, max_features for read_arrow #282

ENH: Add support for skip_features, max_features for read_arrow #282

Conversation

brendan-ward commented Sep 5, 2023

theroggy commented Sep 22, 2023

jorisvandenbossche commented Sep 22, 2023

jorisvandenbossche commented Sep 22, 2023

jorisvandenbossche commented Sep 24, 2023 • edited Loading

jorisvandenbossche Sep 26, 2023

Choose a reason for hiding this comment

brendan-ward commented Sep 26, 2023

jorisvandenbossche commented Sep 24, 2023 •

edited

Loading