Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add test of Arrow roundtrip of boolean values, raise exception on GDAL <3.8.3 for bool values via Arrow API #335

Merged
merged 7 commits into from
Jan 25, 2024

Conversation

brendan-ward
Copy link
Member

@brendan-ward brendan-ward commented Dec 22, 2023

Test to reproduce #334 incorrect handling of boolean values using Arrow I/O (mostly to see if this also fails on GDAL >= 3.8.1).

As the root cause has been solved and released in GDAL, this PR now resolves #334

@theroggy
Copy link
Member

theroggy commented Jan 8, 2024

GDAL 3.8.3 fixes this and has just been released...

@brendan-ward
Copy link
Member Author

Because misreading boolean values is a bad error, this adds an exception to read_arrow if boolean values are detected in the pyarrow table read from GDAL. I didn't add an exception to open_arrow because we don't know the contents of the stream it returns. I'm assuming that given the severity of this bug, it is better to raise an exception and instruct users to avoid the arrow API for GDAL < 3.8.3 than to do something less intrusive like raise a warning that might be ignored by the user.

I used an xfail for the failing test case so that we can detect if this fix gets backported to other GDAL versions (e.g., if there is a 3.7.4 that includes it)

@brendan-ward brendan-ward changed the title TST: Add test of Arrow roundtrip of boolean values ENH: Add test of Arrow roundtrip of boolean values, raise exception on GDAL <3.8.3 for bool values via Arrow API Jan 23, 2024
@theroggy
Copy link
Member

theroggy commented Jan 23, 2024

Because misreading boolean values is a bad error, this adds an exception to read_arrow if boolean values are detected in the pyarrow table read from GDAL. I didn't add an exception to open_arrow because we don't know the contents of the stream it returns. I'm assuming that given the severity of this bug, it is better to raise an exception and instruct users to avoid the arrow API for GDAL < 3.8.3 than to do something less intrusive like raise a warning that might be ignored by the user.

Good idea!

I used an xfail for the failing test case so that we can detect if this fix gets backported to other GDAL versions (e.g., if there is a 3.7.4 that includes it)

I'm not sure if the xfail is that interesting as the test testing for the exception for GDAL < 3.8 will fail anyway. It feels like it's a pity that it is not possible to specify a match= in the xfail decorator. If that would be possible I think a single test with xfail could get the following behaviour which is I think the ideal behaviour to get?

  • if GDAL >= 3.8.3 and no exception is thrown, test passes
  • if GDAL >= 3.8.3 and an exception is thrown, test fails
  • if GDAL < 3.8.3 and the exception message matches the xfail match=, test "passes" with xfail
  • if GDAL < 3.8.3 and no exception, test "passes" with xpass
  • if GDAL < 3.8.3 and the exception message doesn't match the xfail match=, test fails

Or am I missing something?

@brendan-ward
Copy link
Member Author

No, you're right: I didn't have my logic correct for the xfail because we also specifically check the GDAL version in the implementation and always raise an exception. I don't want to make that more granular by preemptively adding GDAL versions that don't yet exist to try and catch this, e.g., in read_arrow:

if not (gdal_version >= (3, 8, 3) or gdal_version > (3,7,3)) and any(
        [col_type == "bool" for col_type in table.schema.types]
    )

Instead, we probably just have to watch the changelog for GDAL and update the implementation if the fix gets backported (it may not; I recall that some of the other Arrow fixes were not going to be backported from the 3.8.x lineage).

Will remove the xfail cases...

@brendan-ward brendan-ward merged commit 9deaba5 into main Jan 25, 2024
19 checks passed
@brendan-ward brendan-ward deleted the add_arrow_bool_roundtrip_test branch January 25, 2024 14:09
@jorisvandenbossche
Copy link
Member

According to the GDAL PR fixing this, this only affects reading GPKG/FlatGeoBuf files. Would we have a way to check for that? (we don't know the driver because that gets inferred while reading, except if we explicitly get that info (read_info returns it?). But maybe we could then only do that when we detect there is a boolean column)

@brendan-ward
Copy link
Member Author

Indeed; I had assumed it applied to all the bool drivers because it was in OGRArrowArrayHelper::SetBoolOn() (even though it did specifically mention only those drivers).

It looks like boolean values roundtripped properly for SQLite / GeoJSON / GML (not able to easily test postgres) - so our exception would be raised when the functionality is just fine. Will add a fix for this.

@jorisvandenbossche
Copy link
Member

Maybe it is because that utility function is only used by drivers that have a custom implementation of GetArrowStream that sidesteps the internal GDAL data model (which are only GPKG and FlatGeoBuf, AFAIK, apart from Parquet/Arrow, but those already are already in arrow format after reading, of course). Other drivers fall back to a default implementation based on the standard APIs to read data (like we also use)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: use_arrow=True changes boolean values when reading Geopackage
3 participants