Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data #271

Merged
merged 9 commits into from
Aug 22, 2023

Conversation

brendan-ward
Copy link
Member

@brendan-ward brendan-ward commented Aug 15, 2023

Resolves #269, #272

GDAL will return -1 for some drivers if it has specifically disabled the ability to get a feature count (e.g., OSM driver).

Since we need to know this number in order to allocate arrays before reading features, we need to get this count in order to be able to read from certain drivers.

pyogrio/_io.pyx Outdated Show resolved Hide resolved
@theroggy
Copy link
Member

Ideally there is also a test to cover this code path? But possibly there is no write support for a driver behaving this way?

pyogrio/_io.pyx Outdated Show resolved Hide resolved
@brendan-ward
Copy link
Member Author

The fix for #272 can be verified manually using a sufficiently large (probably > 100 MB) OSM file from Geofabrik, e.g., Greece; it will return a count of 0 without the fix added in 31d7df3

from pyogrio import read_info

assert read_info('greece-latest.osm.pbf', layer='lines')['features'] > 0

It fixes both reading with use_arrow=True and without.

@brendan-ward brendan-ward mentioned this pull request Aug 17, 2023
Comment on lines +103 to +104
We recommend the following to sidestep performance issues:

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also mention use_arrow as an option. Based on my 1 test this is still not fast, but half as slow as without use_arrow=True. Or is it still too exprimental?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm seeing different results for Alabama (https://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf) reading the lines layer:

local, with arrow: 4.49s
local, without arrow: 3.46s
remote, with arrow: 23.55s
remote, without arrow: 20.26s

times are greater for multipolygons layer but follow similar trend locally but not for remote (likely because of much more network I/O, but it is remote enough from me that there could be huge variability in network I/O):

local, with arrow: 6.82s
local, without arrow: 5.50s
remote, with arrow: 33.85s
remote, without arrow: 39.06s

whereas for points, the first and fastest layer:
local, with arrow 1.12s
local, without arrow: 0.68s
remote, with arrow: 40.36s
remote, without arrow: 20.09s

I had expected the arrow option to be somewhat faster, but that doesn't seem to be the case for me.

Copy link
Member

@theroggy theroggy Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... my timings are indeed different. They are consistently faster for use_arrow, and most of the time twice as fast.

Then again, the timings are also spectacularly longer for me for local files, most likely the difference between Windows and Linux... because CI tests also run a lot slower on Windows vs Linux :-(... To conclude, arrow seems less dramatically slow on Windows ;-).

from datetime import datetime
import itertools
import pyogrio

url_de_bw = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
path_de_bw = "C:/temp/baden-wuerttemberg-latest.osm.pbf"

url_us_al = "https://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf"
path_us_al = "C:/temp/alabama-latest.osm.pbf"

urls = [path_de_bw, path_us_al]
layers = ["multipolygons", "points"]

for url, layer, use_arrow in itertools.product(urls, layers, [True, False]):
    start = datetime.now()
    pgdf = pyogrio.read_dataframe(
        url, use_arrow=use_arrow, layer=layer, encoding="utf8"
    )
    print(f"url={url}, layer={layer}, use_arrow={use_arrow}: {datetime.now()-start}")

Results:

url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:01:46.250570
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:03:40.812138
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:18.412509
url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:24.491693
url=C:/temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:16.943871
url=C:/temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:30.859146
url=C:/temp/alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:07.560858
url=C:/temp/alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:08.725899

Copy link
Member

@theroggy theroggy Aug 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS: I had to specify encoding="utf8" to be able to run it with use_arrow=False, probably because it is German data and they use quite some accents and stuff. Based on the following post and others, it seems osm data is always in "UTF-8":
https://help.openstreetmap.org/questions/5308/character-encoding

So I think it might be better to default the encoding to "UTF-8" for .osm and .pbf?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a fix to always set UTF-8.

It looks like using the interleaved encoding option has an impact on performance:

With default (no interleaved reading):

url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:01:12.460372
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:03:11.110123
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:08.757978
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:15.286818
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:06.517222
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:15.456438
url=alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:01.148906
url=alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:02.335114

with pyogrio.set_gdal_config_options({"OGR_INTERLEAVED_READING": True}):

url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:52.878035
url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:01:05.556191
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:06.217804
url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:04.641645
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:06.525850
url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:05.703674
url=alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:00.852143
url=alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:00.724238

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... interleaved reading is consitently quite a bit faster... and often use_arrow=False becomes faster than use_arrow=True because of it. Kind of weird.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On windows, a similar pattern:

url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=True: 0:01:49.743053
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=False: 0:01:51.280849
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=True: 0:01:28.155260
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=False: 0:03:56.258161
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True, interleaved=True: 0:00:20.615849
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True, interleaved=False: 0:00:15.530900
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False, interleaved=True: 0:00:10.094106
url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False, interleaved=False: 0:00:28.546472
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=True: 0:00:19.117897
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=False: 0:00:19.194775
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=True: 0:00:17.296830
url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=False: 0:00:35.047402
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=True, interleaved=True: 0:00:07.781301
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=True, interleaved=False: 0:00:07.374248
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=False, interleaved=True: 0:00:06.801308
url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=False, interleaved=False: 0:00:09.139016

Copy link
Member

@theroggy theroggy Aug 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a reason why interleaved isn't the default... maybe it is also useful to mention that it can improve performance?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am seeing the same behaviour. It's indeed a bit strange that with the interleaved reading turned on, reading with arrow is slower than without. Might be worth reporting to GDAL

@brendan-ward brendan-ward changed the title ENH: Calculate feature count ourselves if GDAL returns unknown count (-1) ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data Aug 19, 2023
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice investigations!

Comment on lines +198 to +199
# if the driver is OSM, we need to execute SQL to set the layer to read in
# order to read it properly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something to report to GDAL? Or is it expected we need to set this differently?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ogr2ogr does this (which is how I figured out how to do this), so I think it is fair to say that GDAL knows about this and that it is intentional it is left for the user to call instead of being embedded in the driver. It is something we could suggest be added to the documentation for the driver, however.

@brendan-ward brendan-ward merged commit 94c2bb4 into main Aug 22, 2023
17 checks passed
@brendan-ward brendan-ward deleted the force_feature_count branch August 22, 2023 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read osm.pbf
3 participants