
Add param arrow_to_pandas_kwargs to read_dataframe + decrease memory usage #273

Merged

Conversation

@theroggy (Member) commented Aug 17, 2023

Related to #262 and resolves #241

@theroggy theroggy marked this pull request as draft August 17, 2023 15:13
@theroggy theroggy marked this pull request as ready for review August 17, 2023 15:30
@jorisvandenbossche (Member) commented:
There are a lot of options in the arrow -> pandas to_pandas conversion that users might want to tweak, so maybe a more general solution is to provide a way to pass keywords through to to_pandas.

For this split_blocks keyword specifically, I would prefer to follow pyarrow's default (although maybe pyarrow should consider changing that default).

It should indeed be possible to always call self_destruct, although I wonder how much difference that makes: the table variable goes out of scope anyway when leaving the function, so I would expect it to get cleaned up that way.

@theroggy (Member, Author) commented Aug 17, 2023

> For this split_blocks keyword specifically, I would prefer to follow pyarrow's default (although maybe pyarrow should consider changing that default).
>
> It should indeed be possible to always call self_destruct, although I wonder how much difference that makes: the table variable goes out of scope anyway when leaving the function, so I would expect it to get cleaned up that way.

I did a quick test to see the impact of this change on the peak memory usage of read_dataframe. I restarted the Python process between each run to avoid any influence from the order in which I ran the tests.

Obviously it is just one specific case, so I am not sure how comparable it is for other files, but it is better to have one test than no test :-).

I used the following script/file, as it was the motivation for this change: the script crashed with memory errors on my laptop, so I ran the test on a more powerful computer.
The file being read is an OpenStreetMap (.pbf) file of 540 MB with 5.732.130 rows and 26 columns.

```python
import psutil
import pyogrio

url = "https://download.geofabrik.de/europe/germany/baden-wuerttemberg-latest.osm.pbf"
pgdf = pyogrio.read_dataframe(url, use_arrow=True, sql="SELECT * FROM multipolygons")
print(psutil.Process().memory_info())
```

Results:

```
# split_blocks=True,  self_destruct=True:  peak_wset=9.296.629.760  -> 9,3 GB
# split_blocks=False, self_destruct=True:  peak_wset=10.430.320.640 -> 10,4 GB
# split_blocks=False, self_destruct=False: peak_wset=12.530.679.808 -> 12,5 GB
```

@jorisvandenbossche do you see any disadvantages to using split_blocks=True? It does seem to make a measurable difference in peak memory usage.

@theroggy (Member, Author) commented Aug 17, 2023

> There are a lot of options in the arrow -> pandas to_pandas conversion that users might want to tweak, so maybe a more general solution is to provide a way to pass keywords through to to_pandas.

Something like an extra parameter for read_dataframe called e.g. arrow_to_pandas_kwargs: Dict[str, Any] that we pass through to the to_pandas call?

I gave it a try like that in the latest commit.

@theroggy theroggy changed the title Decrease memory usage in read_dataframe with use_arrow=true Add param arrow_to_pandas_kwargs to read_dataframe to pass kwargs to arrow.to_pandas + decrease memory usage Aug 17, 2023
@theroggy theroggy changed the title Add param arrow_to_pandas_kwargs to read_dataframe to pass kwargs to arrow.to_pandas + decrease memory usage Add param arrow_to_pandas_kwargs to read_dataframe + decrease memory usage Aug 17, 2023
@jorisvandenbossche (Member) left a comment


Sorry for the late reply, and thanks for the update!

> do you see any disadvantages to using split_blocks=True? It does seem to make a measurable difference in peak memory usage.

For typical usage I don't expect much difference, but with many columns there can be a benefit to having consolidated columns in pandas (so to actually benchmark the impact, you also need to consider potential follow-up operations on the pandas DataFrame...). I personally think we should consider switching the default, but I would prefer to follow pyarrow on this for consistency, and see if we want to change this on the pyarrow side.

@theroggy (Member, Author) commented Sep 23, 2023

> Sorry for the late reply, and thanks for the update!
>
> > do you see any disadvantages to using split_blocks=True? It does seem to make a measurable difference in peak memory usage.
>
> For typical usage I don't expect much difference, but with many columns there can be a benefit to having consolidated columns in pandas (so to actually benchmark the impact, you also need to consider potential follow-up operations on the pandas DataFrame...). I personally think we should consider switching the default, but I would prefer to follow pyarrow on this for consistency, and see if we want to change this on the pyarrow side.

OK, I removed the override of the split_blocks param.

@jorisvandenbossche jorisvandenbossche added this to the 0.7.0 milestone Sep 28, 2023
@jorisvandenbossche (Member) left a comment


Thanks, looks good to me!

cc @brendan-ward, are you OK with the name arrow_to_pandas_kwargs? It's quite long, but explicit.

@brendan-ward (Member) left a comment


Thanks @theroggy !

arrow_to_pandas_kwargs is fine by me.

@theroggy theroggy merged commit dffb502 into geopandas:main Sep 30, 2023
15 checks passed
@theroggy theroggy deleted the Decrease-memory-usage-for-use_arrow=True branch September 30, 2023 06:01
Development

Successfully merging this pull request may close these issues.

Load GeoDataFrame with arrow attribute dtypes?