ENH: Support pandas nullable dtypes such as boolean and string #219

Closed
jtmiclat opened this issue Feb 13, 2023 · 5 comments · Fixed by #232

Comments

@jtmiclat

jtmiclat commented Feb 13, 2023

Hi, I get an error when calling write_dataframe() on GeoDataFrames that contain booleans returned by google-cloud-bigquery:

from google.cloud import bigquery
import pyogrio
client = bigquery.Client()

gdf = client.query("""SELECT ST_GEOGPOINT(1, 1) as geometry, True as bool_field """).to_geodataframe()
pyogrio.write_dataframe(gdf, "test2.json", driver="GeoJSON" )

# usr/local/lib/python3.8/dist-packages/pyogrio/_io.pyx in pyogrio._io.infer_field_types()
# 
# NotImplementedError: field type is not supported boolean (field index: 0)

After some digging I figured out that the dtype returned for bool_field was pandas.BooleanDtype (boolean) instead of bool, and I was able to replicate the error without bigquery. Writing the same data works fine with fiona but breaks with pyogrio:

import geopandas as gpd
from shapely import Point
from pandas import BooleanDtype
import pyogrio

gdf = gpd.GeoDataFrame([{'geometry': Point(1,1), "bool_field":True}])
gdf.dtypes

# geometry      geometry
# bool_field        bool
# dtype: object

gdf2 = gdf.astype({'bool_field': BooleanDtype()})
gdf2.dtypes

# geometry      geometry
# bool_field     boolean
# dtype: object

# Works with bool dtype 
pyogrio.write_dataframe(gdf, "1.json", driver="GeoJSON" )

# Works with boolean dtype when written through fiona (GeoDataFrame.to_file)
gdf2.to_file("test.json", driver="GeoJSON")

# This throws the same error
pyogrio.write_dataframe(gdf2, "test2.json", driver="GeoJSON" )

# usr/local/lib/python3.8/dist-packages/pyogrio/_io.pyx in pyogrio._io.infer_field_types()
# 
# NotImplementedError: field type is not supported boolean (field index: 0)

My hunch is to add boolean to the DTYPE_OGR_FIELD_TYPES mapping in

pyogrio/pyogrio/_io.pyx

Lines 60 to 84 in 75e8f13

DTYPE_OGR_FIELD_TYPES = {
    'int8': (OFTInteger, OFSTInt16),
    'int16': (OFTInteger, OFSTInt16),
    'int32': (OFTInteger, OFSTNone),
    'int': (OFTInteger64, OFSTNone),
    'int64': (OFTInteger64, OFSTNone),
    # unsigned ints have to be converted to ints; these are converted
    # to the next largest integer size
    'uint8': (OFTInteger, OFSTInt16),
    'uint16': (OFTInteger, OFSTNone),
    'uint32': (OFTInteger64, OFSTNone),
    # TODO: these might get truncated, check maximum value and raise error
    'uint': (OFTInteger64, OFSTNone),
    'uint64': (OFTInteger64, OFSTNone),
    # bool is handled as integer with boolean subtype
    'bool': (OFTInteger, OFSTBoolean),
    'float32': (OFTReal, OFSTFloat32),
    'float': (OFTReal, OFSTNone),
    'float64': (OFTReal, OFSTNone),
    'datetime64[D]': (OFTDate, OFSTNone),
    'datetime64': (OFTDateTime, OFSTNone),
}

Thanks for the wonderful work!
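
For readers following along, the mismatch can be seen by comparing dtype names. This is a minimal sketch, not pyogrio's actual lookup code: SUPPORTED_DTYPE_NAMES is a hypothetical stand-in for a few keys of the mapping quoted above, and it assumes the lookup is keyed on the dtype name.

```python
import pandas as pd

# Hypothetical stand-in for a few keys of DTYPE_OGR_FIELD_TYPES (illustration only).
SUPPORTED_DTYPE_NAMES = {"int32", "int64", "bool", "float32", "float64"}

numpy_bool = pd.Series([True, False])                      # NumPy-backed bool
nullable_bool = pd.Series([True, False], dtype="boolean")  # pandas BooleanDtype

for series in (numpy_bool, nullable_bool):
    name = series.dtype.name
    # The NumPy dtype is named "bool"; the nullable dtype is named "boolean".
    print(name, name in SUPPORTED_DTYPE_NAMES)
```

The nullable column reports the name boolean, which has no entry among the keys, consistent with the "field type is not supported boolean" message in the traceback.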

@jorisvandenbossche
Member

@jtmiclat Thanks for the report!
In general, we don't yet support the pandas nullable dtypes such as boolean.

As long as there are no missing values, adding boolean to the DTYPE_OGR_FIELD_TYPES mapping might be sufficient, but to handle missing values we will certainly need to recognize pd.NA as a missing value. It might also be more efficient to add support for passing field data as a values array plus a mask array.
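
The "values + mask" idea can be illustrated on the pandas side. This is only a sketch under assumptions: split_values_and_mask is a hypothetical helper, not part of pyogrio, and how pyogrio would consume the two arrays is not specified in this thread.

```python
import pandas as pd

def split_values_and_mask(series, fill_value=False):
    """Split a nullable boolean Series into (values, mask).

    mask is True where the value is missing; values holds fill_value there.
    Hypothetical helper for illustration only.
    """
    mask = series.isna().to_numpy()
    values = series.to_numpy(dtype=bool, na_value=fill_value)
    return values, mask

col = pd.Series([True, pd.NA, False], dtype="boolean")
values, mask = split_values_and_mask(col)
print(values)  # plain NumPy bool array, missing slot filled with False
print(mask)    # True only at the position of pd.NA
```

A writer could then skip setting the field wherever the mask is True instead of inspecting each value for pd.NA.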

@jtmiclat
Author

@jorisvandenbossche I did some initial testing: adding boolean to DTYPE_OGR_FIELD_TYPES does address my issue, but writing fails when the column contains a pd.NA. The error message isn't very clear for the user:

>   OGR_F_SetFieldInteger(ogr_feature, field_idx, field_value)
E   TypeError: an integer is required

pyogrio/_io.pyx:1631: TypeError

I think it is best to wait for support for recognizing pd.NA. Thanks!
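
Until pd.NA is recognized, a caller-side check could fail early with a clearer message than the TypeError above. The following is a rough sketch, assuming only the boolean and string nullable dtypes matter; columns_with_missing_nullable_values is a hypothetical helper, not a pyogrio function.

```python
import pandas as pd

# Nullable dtypes discussed in this issue; extend as needed (assumption).
NULLABLE_DTYPES = (pd.BooleanDtype, pd.StringDtype)

def columns_with_missing_nullable_values(df):
    """Return names of nullable-dtype columns that contain missing values."""
    return [
        name
        for name, column in df.items()
        if isinstance(column.dtype, NULLABLE_DTYPES) and column.isna().any()
    ]

df = pd.DataFrame(
    {
        "flag": pd.array([True, pd.NA], dtype="boolean"),   # has a missing value
        "ok": pd.array([True, False], dtype="boolean"),     # no missing values
        "n": [1, 2],                                        # NumPy int column
    }
)
problem_columns = columns_with_missing_nullable_values(df)
if problem_columns:
    print(f"Columns with pd.NA that may fail to write: {problem_columns}")
```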

@m-richards
Member

I imagine that writing dataframes with dtype="string" falls into a similar category, as that dtype is also nullable? I've been introducing pyogrio to some colleagues, who are super impressed at the speed difference compared to fiona for reading large networks, and we came across this difference in behaviour from fiona.

@Oreilles

It seems that the string dtype and its variants (string[python], string[pyarrow]), as well as category, don't work out of the box and need to be cast to object.

Maybe we should change the title of this issue to indicate that it is a broader issue, or open another one?

Some documentation in that regard would be welcome too.
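
The cast-to-object workaround mentioned above could look roughly like this. It is a sketch only: cast_unsupported_to_object is a hypothetical helper name, and whether missing values in the resulting object columns are then written as expected is not confirmed in this thread.

```python
import pandas as pd

def cast_unsupported_to_object(df):
    """Return a copy with string and category columns cast to object dtype."""
    to_cast = {
        name: object
        for name, dtype in df.dtypes.items()
        if isinstance(dtype, (pd.StringDtype, pd.CategoricalDtype))
    }
    return df.astype(to_cast)

df = pd.DataFrame(
    {
        "name": pd.array(["a", "b"], dtype="string"),
        "kind": pd.Categorical(["x", "y"]),
        "n": [1, 2],
    }
)
converted = cast_unsupported_to_object(df)
print(converted.dtypes)
```

The same function should apply to a GeoDataFrame, since only the string and category columns are selected for casting; the result would then be passed to pyogrio.write_dataframe().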

jtmiclat changed the title from "BUG: Cannot write column with boolean dtype" to "ENH: Support pandas nullable dtypes such as boolean and string" on Mar 11, 2023
@jtmiclat
Author

jtmiclat commented Mar 11, 2023

@Oreilles I renamed the issue to an ENH request to support nullable fields. I think category and other custom dtype support would be a separate issue!
