ENH: support writing geopackage files without index #62

Closed
theroggy opened this issue Mar 26, 2022 · 4 comments · Fixed by #67

Comments

@theroggy
Member

Writing a GeoPackage without a spatial index was about 20% faster in a test I did using GDAL. So if you don't need an index in the output file, this is an interesting option.

Also, in my experience, if you need to merge several GeoPackages by appending them one by one to a new file, it is faster to first append them all without an index and then add the index in one go. Because pyogrio doesn't support appending yet, this isn't a usable use case yet, but it might become one in the future.
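For illustration, adding the index "in one go" after all the data has been written could look roughly like the sketch below. It does not use pyogrio: it assumes the GDAL Python bindings (`osgeo`) are installed and relies on the `CreateSpatialIndex` SQL function documented for GDAL's GPKG driver. The helper name `add_spatial_index`, the layer name you pass in, and the default geometry column name `geom` are assumptions for the example.

```python
def add_spatial_index(path, layer, geom_column="geom"):
    """Create the spatial index on an existing GeoPackage layer.

    Hypothetical helper: `layer` and `geom_column` must match the file.
    """
    # Imported lazily so that defining the helper does not require GDAL.
    from osgeo import ogr

    ds = ogr.Open(path, 1)  # 1 = open in update mode
    if ds is None:
        raise OSError(f"could not open {path!r} for update")

    # SQL function provided by GDAL's GPKG driver.
    sql = f"SELECT CreateSpatialIndex('{layer}', '{geom_column}')"
    result = ds.ExecuteSQL(sql)
    if result is not None:
        ds.ReleaseResultSet(result)

    ds = None  # drop the reference to close the dataset and flush changes
```

A call such as `add_spatial_index("merged.gpkg", "parcels")` would then be the final step after the append loop; both names here are placeholders.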

@jorisvandenbossche
Member

jorisvandenbossche commented Apr 2, 2022

It's already supported in general to pass through GDAL options, so according to https://gdal.org/drivers/vector/gpkg.html we can do:

df = pyogrio.read_dataframe("/tmp/geobenchmark/agriprc_2018.gpkg")

In [16]: %time pyogrio.write_dataframe(df, "test_index1.gpkg", SPATIAL_INDEX="YES")
CPU times: user 15.6 s, sys: 3.78 s, total: 19.3 s
Wall time: 21.9 s

In [17]: %time pyogrio.write_dataframe(df, "test_index2.gpkg", SPATIAL_INDEX="NO") # or `spatial_index=False`
CPU times: user 13.6 s, sys: 2.53 s, total: 16.1 s
Wall time: 16.5 s

and that indeed seems to give a significant difference in this case.

So it might mostly be a matter of better documenting this general ability, and giving a few specific useful examples like this one.

@jorisvandenbossche
Member

What I find unexpected is that the files without an index also seem to be quite a bit faster to read:

In [2]: %time pyogrio.read_dataframe("test_index1.gpkg")
CPU times: user 5.15 s, sys: 1.05 s, total: 6.2 s
Wall time: 12.3 s

In [3]: %time pyogrio.read_dataframe("test_index2.gpkg")
CPU times: user 3.56 s, sys: 419 ms, total: 3.98 s
Wall time: 4 s

@theroggy
Member Author

theroggy commented Apr 3, 2022

Ah, great to be able to just pass on GDAL options!

I wasn't able to reproduce it immediately though: the index was still created. So I had a quick look at the code and it seems like the kwargs parameter is missing in the OGR write() call. I created a pull request with a proposal to fix (#67). After the fix, the SPATIAL_INDEX="NO" indeed does the trick!
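One way to check whether an index really ended up in the output file, without going through GDAL, is to look inside the GeoPackage with Python's built-in `sqlite3` module. This is only a sketch: it assumes the RTree spatial index is registered in the `gpkg_extensions` table under the extension name `gpkg_rtree_index`, as described in the GeoPackage specification, and the helper name `has_spatial_index` is made up for this example.

```python
import sqlite3
from pathlib import Path


def has_spatial_index(path, layer=None):
    """Return True if the GeoPackage registers an RTree spatial index.

    If `layer` is given, only that table is considered.
    """
    # Open read-only so a mistyped path does not create an empty file.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        # gpkg_extensions is optional, so it may not exist at all.
        found = con.execute(
            "SELECT 1 FROM sqlite_master "
            "WHERE type = 'table' AND name = 'gpkg_extensions'"
        ).fetchone()
        if found is None:
            return False

        sql = (
            "SELECT COUNT(*) FROM gpkg_extensions "
            "WHERE extension_name = 'gpkg_rtree_index'"
        )
        params = ()
        if layer is not None:
            sql += " AND table_name = ?"
            params = (layer,)
        return con.execute(sql, params).fetchone()[0] > 0
    finally:
        con.close()
```

Calling `has_spatial_index("test_index2.gpkg")` on a file written with `SPATIAL_INDEX="NO"` should then return `False` once the option is actually passed through.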

On my Windows system, writing the agriprc_2018.gpkg file takes 15 s without index and 40 s with index. Only double the time you get on Linux :-) (~20 s with index).

@jorisvandenbossche
Member

Ah, good catch, so my timings were just nonsense :) (I think the difference in writing was because one of the runs included the time to "overwrite", and thus delete, an existing file. The reading time just seems quite unstable.)

Doing it again with your fix, there is now an even bigger difference (and now the resulting file size is actually different as well):

In [2]: df = pyogrio.read_dataframe("/tmp/geobenchmark/agriprc_2018.gpkg")

In [3]: %time pyogrio.write_dataframe(df, "test_index_with.gpkg", SPATIAL_INDEX="YES")
CPU times: user 15 s, sys: 3.43 s, total: 18.4 s
Wall time: 18.7 s

In [4]: %time pyogrio.write_dataframe(df, "test_index_without.gpkg", SPATIAL_INDEX="NO")
CPU times: user 6.85 s, sys: 419 ms, total: 7.27 s
Wall time: 7.4 s

And for reading there isn't a significant difference (as can be expected when reading the full file without a spatial query).
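To keep the cost of deleting an existing output file (the "overwrite" effect mentioned earlier) out of such comparisons, the write can be timed only after the old file has been removed. The helper below is a minimal sketch using just the standard library; `time_fresh_write` is a hypothetical name, and `write_func` stands for any callable that writes to the given path.

```python
import os
import time


def time_fresh_write(write_func, path):
    """Time write_func(path) after removing any existing file at path.

    Returns (elapsed_seconds, file_size_in_bytes).
    """
    if os.path.exists(path):
        os.remove(path)  # deletion happens outside the timed region

    start = time.perf_counter()
    write_func(path)
    elapsed = time.perf_counter() - start

    # The file size gives a second hint about whether an index was written.
    return elapsed, os.path.getsize(path)


# Possible usage with pyogrio (assumes `df` was read as shown above):
# time_fresh_write(
#     lambda p: pyogrio.write_dataframe(df, p, SPATIAL_INDEX="NO"),
#     "test_index_without.gpkg",
# )
```

Returning the file size alongside the timing makes it easy to spot the case where both runs silently produced identical files.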

This issue was closed.