
Get dataset bounds via pyogrio.read_info #274

Closed
kylebarron opened this issue Aug 21, 2023 · 8 comments · Fixed by #281
Labels: enhancement (New feature or request)

@kylebarron (Contributor)

Is it possible to access a dataset's spatial extent for supported drivers without fetching all the features with read_bounds? Some drivers, like FlatGeobuf, know this in advance. E.g. if I run

ogrinfo /vsicurl/https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb FR_2018_EC21 -so

then I see

...
Extent: (-6.047022, -3.916365) - (68.890504, 51.075101)
...

without waiting for the 6.5 GB file to download. Can we expose that information through pyogrio.read_info?

@theroggy (Member)

This is the GDAL function to get that info: OGR_L_GetExtent.

Something to think about is what to do with the bForce flag: for file types where getting this info isn't fast, the info is only retrieved if bForce=1. However, in pyogrio.read_info it isn't clear whether the user needs the bounds/extent or not... So it is possibly better to expose a corresponding flag (force_bounds/force_extent) in pyogrio.read_info as well, defaulting to False or similar.
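
For illustration, a minimal sketch of the bForce behaviour through GDAL's own Python bindings, where Layer.GetExtent wraps OGR_L_GetExtent (the path is hypothetical):

from osgeo import gdal

gdal.UseExceptions()

ds = gdal.OpenEx("example.fgb")  # hypothetical file
layer = ds.GetLayer(0)

# force=0: only consult header/metadata; with can_return_null=True this
# returns None when the driver has no cheap way to determine the extent.
cheap = layer.GetExtent(force=0, can_return_null=True)

# force=1 (the default): fall back to scanning the features if needed.
# Returns a tuple in (xmin, xmax, ymin, ymax) order.
full = layer.GetExtent(force=1)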

@kylebarron (Contributor, Author)

Possibly the better approach is to have pyogrio.read_bounds use the metadata if it exists? I used read_bounds on that file but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.

It would be nice to have some way to say "get me the dataset bounds but only if it isn't expensive". Maybe include_bounds=True in read_info is the best way to achieve that?

@theroggy (Member)

theroggy commented Aug 21, 2023

> Possibly the better approach is to have pyogrio.read_bounds use the metadata if it exists? I used read_bounds on that file but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.
>
> It would be nice to have some way to say "get me the dataset bounds but only if it isn't expensive". Maybe include_bounds=True in read_info is the best way to achieve that?

That's exactly what OGR_L_GetExtent does. If bForce=0 it will only use metadata/header info/... (if implemented properly in the GDAL driver) and return ~None if there is no "cheap" way to get the info. If bForce=1 it will try to use metadata as well, but will calculate the extent if there is no "cheap" way to get it.

I double-checked the FlatGeobuf driver code and it looks OK at first sight.

@theroggy (Member)

I did a small test: using the GDAL Python bindings, which call the same function under the hood, it takes 3.5 seconds.

from datetime import datetime
from osgeo import gdal

gdal.UseExceptions()

url = "/vsicurl/https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb"

start = datetime.now()
datasource = gdal.OpenEx(url)
datasource_layer = datasource.GetLayer("FR_2018_EC21")
# GetExtent wraps OGR_L_GetExtent; with the FlatGeobuf driver this reads
# the extent from the file header instead of scanning all features.
print(datasource_layer.GetExtent())
print(f"took {datetime.now() - start}")
datasource = None  # close the dataset

@brendan-ward (Member)

We may also want to set this property in the layer's capabilities returned by read_info via OGR_L_TestCapability(ogr_layer, OLCFastGetExtent).

It looks like GPKG, FGB, SHP, and even OSM support fast retrieval of bounds; GeoJSON does not.

I wonder if we want to make the two potentially computationally expensive values opt-in rather than forced by default in read_info. We could make the feature count return -1 if it can't be determined via a fast count (unless forced), and likewise return None for the bounds if they can't be determined quickly (unless forced, e.g. force_bounds=True). If so, we'd want to add fast_feature_count to the capabilities returned by read_info, and then the user could check those capabilities to decide whether to force either of these operations.
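
As a rough sketch, the capability check looks like this via the GDAL Python bindings (hypothetical path; OLCFastGetExtent and OLCFastFeatureCount are the relevant capability constants):

from osgeo import gdal, ogr

gdal.UseExceptions()

ds = gdal.OpenEx("example.fgb")  # hypothetical file
layer = ds.GetLayer(0)

# True when the driver can answer without scanning all features.
print(layer.TestCapability(ogr.OLCFastGetExtent))
print(layer.TestCapability(ogr.OLCFastFeatureCount))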

@jorisvandenbossche (Member)

> Possibly the better approach is to have pyogrio.read_bounds use the metadata if it exists? I used read_bounds on that file but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.

Note that read_bounds gives you the bounds of every individual geometry, not the bounds of the full dataset (see the sketch below). So for that use case, I don't think there is a way around reading the actual data?

But +1 to Brendan's last comment as well.
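
A minimal sketch of that distinction (hypothetical path; read_bounds returns per-feature bounds as a (4, n) array, so a dataset extent still has to be reduced from all of them):

import pyogrio

# fids: feature IDs; bounds: shape (4, n) with rows xmin, ymin, xmax, ymax.
fids, bounds = pyogrio.read_bounds("example.fgb")  # hypothetical file

# Reducing to the dataset extent requires having read every feature's box.
total_bounds = (bounds[0].min(), bounds[1].min(), bounds[2].max(), bounds[3].max())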

@theroggy (Member)

> > Possibly the better approach is to have pyogrio.read_bounds use the metadata if it exists? I used read_bounds on that file but after a minute I assumed that it was downloading and iterating over all the features instead of reading the FlatGeobuf metadata.
>
> Note that read_bounds gives you the bounds of every individual geometry, not the bounds of the full dataset. So for that use case, I don't think there is a way around reading the actual data?

No indeed, everything I wrote only applies to getting the "total_bounds" of the layer, to use geopandas terminology. Getting the bounds of each feature individually requires reading a lot more data, and the bForce/force_bounds/force_total_bounds parameter I mentioned doesn't apply to those per-feature bounds either.

@kylebarron (Contributor, Author)

> Note that read_bounds gives you the bounds of every individual geometry, not the bounds of the full dataset

Yes, you're right; I had forgotten that. For this issue I'm concerned with the total bounds, in order to provide sensible bbox arguments to a later read_dataframe call.
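
For example, the workflow this would enable might look like the following (a sketch, assuming read_info gains a total_bounds entry as implemented by #281; the bbox shown is arbitrary):

import pyogrio

url = "/vsicurl/https://data.source.coop/cholmes/eurocrops/unprojected/flatgeobuf/FR_2018_EC21.fgb"

# Assumed: read_info exposes the dataset extent read cheaply from the header.
info = pyogrio.read_info(url)
xmin, ymin, xmax, ymax = info["total_bounds"]

# Use the extent to build a sensible bbox for the actual read,
# e.g. only the western half of the dataset.
df = pyogrio.read_dataframe(url, bbox=(xmin, ymin, (xmin + xmax) / 2, ymax))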

This issue was closed.