ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data #271

brendan-ward · 2023-08-15T19:16:46Z

GDAL will return -1 for some drivers if it has specifically disabled the ability to get a feature count (e.g., OSM driver).

Since we need to know this number in order to allocate arrays before reading features, we need to get this count in order to be able to read from certain drivers.

pyogrio/_io.pyx

theroggy · 2023-08-16T10:44:45Z

Ideally there is also a test to cover this code path? But possibly there is no write support for a driver behaving this way?

pyogrio/_io.pyx

brendan-ward · 2023-08-17T20:30:37Z

The fix for #272 can be verified manually using a sufficiently large (probably > 100 MB) OSM file from Geofabrik, e.g., Greece; it will return a count of 0 without the fix added in 31d7df3

from pyogrio import read_info

assert read_info('greece-latest.osm.pbf', layer='lines')['features'] > 0

It fixes both reading with use_arrow=True and without.

theroggy · 2023-08-17T20:30:35Z

docs/source/known_issues.md

+We recommend the following to sidestep performance issues:
+


You could also mention use_arrow as an option. Based on my 1 test this is still not fast, but half as slow as without use_arrow=True. Or is it still too exprimental?

I'm seeing different results for Alabama (https://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf) reading the lines layer:

local, with arrow: 4.49s
local, without arrow: 3.46s
remote, with arrow: 23.55s
remote, without arrow: 20.26s

times are greater for multipolygons layer but follow similar trend locally but not for remote (likely because of much more network I/O, but it is remote enough from me that there could be huge variability in network I/O):

local, with arrow: 6.82s
local, without arrow: 5.50s
remote, with arrow: 33.85s
remote, without arrow: 39.06s

whereas for points, the first and fastest layer:
local, with arrow 1.12s
local, without arrow: 0.68s
remote, with arrow: 40.36s
remote, without arrow: 20.09s

I had expected the arrow option to be somewhat faster, but that doesn't seem to be the case for me.

Hmm... my timings are indeed different. They are consistently faster for use_arrow, and most of the time twice as fast.

Then again, the timings are also spectacularly longer for me for local files, most likely the difference between Windows and Linux... because CI tests also run a lot slower on Windows vs Linux :-(... To conclude, arrow seems less dramatically slow on Windows ;-).

from datetime import datetime import itertools import pyogrio url_de_bw = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf" path_de_bw = "C:/temp/baden-wuerttemberg-latest.osm.pbf" url_us_al = "https://download.geofabrik.de/north-america/us/alabama-latest.osm.pbf" path_us_al = "C:/temp/alabama-latest.osm.pbf" urls = [path_de_bw, path_us_al] layers = ["multipolygons", "points"] for url, layer, use_arrow in itertools.product(urls, layers, [True, False]): start = datetime.now() pgdf = pyogrio.read_dataframe( url, use_arrow=use_arrow, layer=layer, encoding="utf8" ) print(f"url={url}, layer={layer}, use_arrow={use_arrow}: {datetime.now()-start}")

Results:

url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:01:46.250570 url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:03:40.812138 url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:18.412509 url=C:/temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:24.491693 url=C:/temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:16.943871 url=C:/temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:30.859146 url=C:/temp/alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:07.560858 url=C:/temp/alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:08.725899

PS: I had to specify encoding="utf8" to be able to run it with use_arrow=False, probably because it is German data and they use quite some accents and stuff. Based on the following post and others, it seems osm data is always in "UTF-8":
https://help.openstreetmap.org/questions/5308/character-encoding

So I think it might be better to default the encoding to "UTF-8" for .osm and .pbf?

Added a fix to always set UTF-8.

It looks like using the interleaved encoding option has an impact on performance:

With default (no interleaved reading):

url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:01:12.460372 url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:03:11.110123 url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:08.757978 url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:15.286818 url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:06.517222 url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:15.456438 url=alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:01.148906 url=alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:02.335114

with pyogrio.set_gdal_config_options({"OGR_INTERLEAVED_READING": True}):

url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:52.878035 url=baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:01:05.556191 url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True: 0:00:06.217804 url=baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False: 0:00:04.641645 url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True: 0:00:06.525850 url=alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False: 0:00:05.703674 url=alabama-latest.osm.pbf, layer=points, use_arrow=True: 0:00:00.852143 url=alabama-latest.osm.pbf, layer=points, use_arrow=False: 0:00:00.724238

Hmm... interleaved reading is consitently quite a bit faster... and often use_arrow=False becomes faster than use_arrow=True because of it. Kind of weird.

On windows, a similar pattern:

url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=True: 0:01:49.743053 url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=False: 0:01:51.280849 url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=True: 0:01:28.155260 url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=False: 0:03:56.258161 url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True, interleaved=True: 0:00:20.615849 url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=True, interleaved=False: 0:00:15.530900 url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False, interleaved=True: 0:00:10.094106 url=C:/Temp/baden-wuerttemberg-latest.osm.pbf, layer=points, use_arrow=False, interleaved=False: 0:00:28.546472 url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=True: 0:00:19.117897 url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=True, interleaved=False: 0:00:19.194775 url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=True: 0:00:17.296830 url=C:/Temp/alabama-latest.osm.pbf, layer=multipolygons, use_arrow=False, interleaved=False: 0:00:35.047402 url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=True, interleaved=True: 0:00:07.781301 url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=True, interleaved=False: 0:00:07.374248 url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=False, interleaved=True: 0:00:06.801308 url=C:/Temp/alabama-latest.osm.pbf, layer=points, use_arrow=False, interleaved=False: 0:00:09.139016

I wonder if there is a reason why interleaved isn't the default... maybe it is also useful to mention that it can improve performance?

I am seeing the same behaviour. It's indeed a bit strange that with the interleaved reading turned on, reading with arrow is slower than without. Might be worth reporting to GDAL

jorisvandenbossche

Nice investigations!

jorisvandenbossche · 2023-08-19T21:18:47Z

pyogrio/_io.pyx

+    # if the driver is OSM, we need to execute SQL to set the layer to read in
+    # order to read it properly


Is this something to report to GDAL? Or is it expected we need to set this differently?

ogr2ogr does this (which is how I figured out how to do this), so I think it is fair to say that GDAL knows about this and that it is intentional it is left for the user to call instead of being embedded in the driver. It is something we could suggest be added to the documentation for the driver, however.

brendan-ward added 2 commits August 15, 2023 12:09

Calculate feature count ourselves if GDAL returns -1

ee02917

Update changelog

cfa3409

theroggy reviewed Aug 16, 2023

View reviewed changes

pyogrio/_io.pyx Outdated Show resolved Hide resolved

Add test for OSM PBF feature count

d8711b3

brendan-ward commented Aug 16, 2023

View reviewed changes

pyogrio/_io.pyx Outdated Show resolved Hide resolved

brendan-ward added 2 commits August 16, 2023 11:47

Remove obviated check for SQL error in feature iteration

f1ca42e

Handle GPKG where clause error in get_feature_count

eaf75e9

theroggy approved these changes Aug 16, 2023

View reviewed changes

brendan-ward mentioned this pull request Aug 17, 2023

Large OSM files require different approach to reading features #272

Open

brendan-ward added 2 commits August 17, 2023 11:25

set interest layers for OSM, release features read when counting

31d7df3

Update changelog / docs, fix tests

03e0e77

brendan-ward mentioned this pull request Aug 17, 2023

read osm.pbf #269

Closed

theroggy reviewed Aug 17, 2023

View reviewed changes

brendan-ward changed the title ~~ENH: Calculate feature count ourselves if GDAL returns unknown count (-1)~~ ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data Aug 19, 2023

Always set encoding to UTF-8 for OSM data

d4630f3

theroggy approved these changes Aug 19, 2023

View reviewed changes

jorisvandenbossche approved these changes Aug 19, 2023

View reviewed changes

Merge branch 'main' into force_feature_count

4701cc2

brendan-ward merged commit 94c2bb4 into main Aug 22, 2023
17 checks passed

brendan-ward deleted the force_feature_count branch August 22, 2023 01:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data #271

ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data #271

brendan-ward commented Aug 15, 2023 •

edited

Loading

theroggy commented Aug 16, 2023

brendan-ward commented Aug 17, 2023

theroggy Aug 17, 2023

brendan-ward Aug 17, 2023

theroggy Aug 18, 2023 •

edited

Loading

theroggy Aug 18, 2023 •

edited

Loading

brendan-ward Aug 19, 2023

theroggy Aug 19, 2023

theroggy Aug 19, 2023

theroggy Aug 19, 2023 •

edited

Loading

jorisvandenbossche Aug 19, 2023

jorisvandenbossche left a comment

jorisvandenbossche Aug 19, 2023

brendan-ward Aug 21, 2023

		# if the driver is OSM, we need to execute SQL to set the layer to read in
		# order to read it properly

ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data #271

ENH: Calculate feature count ourselves if GDAL returns unknown count (-1), better support OSM data #271

Conversation

brendan-ward commented Aug 15, 2023 • edited Loading

theroggy commented Aug 16, 2023

brendan-ward commented Aug 17, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

theroggy Aug 18, 2023 • edited Loading

Choose a reason for hiding this comment

theroggy Aug 18, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

theroggy Aug 19, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brendan-ward commented Aug 15, 2023 •

edited

Loading

theroggy Aug 18, 2023 •

edited

Loading

theroggy Aug 18, 2023 •

edited

Loading

theroggy Aug 19, 2023 •

edited

Loading