
Micro optimizations to improve indexing #9002

Merged: 20 commits merged into pydata:main from fastestpath, Jun 11, 2024
Conversation

@hmaarrfk (Contributor) commented May 5, 2024

These are some micro optimizations to improve indexing within xarray objects.

Generally speaking, the use of isinstance in Python is pretty slow. It's pretty hard to avoid given the amount of "niceness" that xarray provides.
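
As a rough, machine-dependent illustration of the kind of overhead being described (this is not code from the PR, just a sketch):

```python
# Rough illustration of per-call isinstance overhead on a hot path;
# timings are machine-dependent and only show the order of magnitude.
import timeit

key = (0, slice(None), slice(2, 10))

def validated(key):
    # validation-style branch, as commonly found on hot indexing paths
    for k in key:
        if not isinstance(k, (int, slice)):
            raise TypeError(type(k))
    return key

def fastpath(key):
    return key

print("validated:", timeit.timeit(lambda: validated(key), number=1_000_000))
print("fastpath: ", timeit.timeit(lambda: fastpath(key), number=1_000_000))
```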

However, I feel like these changes are "ok" and kick up the performance of
#9001

from

h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:11<00:00, 8951.20it/s]

to

h5netcdf  lazy      : 100%|███████████| 100000/100000 [00:09<00:00, 10079.44it/s]

With my "controversial" fix for _apply_indexes_fast:

h5netcdf  lazy      : 100%|█████████████████| 100000/100000 [00:06<00:00, 14410.26it/s]

without it:

h5netcdf  lazy      : 100%|█████████████████| 100000/100000 [00:07<00:00, 12895.45it/s]

a nice 40% boost!

Benchmarked on

model name      : AMD Ryzen Threadripper 2950X 16-Core Processor

It's not the best processor, but it doesn't throttle!

I'll mark this as non-draft when it is ready for review. Done!

Concepts used (sketched in code after this list):

  • Fastpath added for BasicIndexer construction.
  • Tuple extension used instead of list + tuple cast. I believe CPython improved tuple performance somewhere around the 3.9/3.10 era, so this is now "faster" than the old 3.6 tricks.
  • LazilyIndexedArray now stores the shape at creation time. Otherwise the shape was recomputed many times during indexing operations that use len or .shape to broadcast the indexing tuple.
  • isel now avoids _id_coord_names by avoiding xindexes.group_by_index(), giving a 20% speedup.
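
Roughly, the first three items look like this; a minimal sketch with hypothetical class names, not xarray's actual implementation:

```python
# Minimal sketch of the patterns above; class names are hypothetical
# stand-ins, not xarray's actual code.

class BasicIndexerSketch:
    __slots__ = ("key",)

    def __init__(self, key: tuple, fastpath: bool = False):
        if fastpath:
            # Caller guarantees `key` is already a validated tuple of
            # ints/slices, so skip the per-element isinstance checks.
            self.key = key
            return
        new_key = ()
        for k in key:
            if not isinstance(k, (int, slice)):
                raise TypeError(f"unexpected indexer type: {type(k)}")
            new_key += (k,)  # tuple extension instead of list.append + tuple()
        self.key = new_key


class LazilyIndexedArraySketch:
    def __init__(self, array, key: tuple):
        self.array = array
        self.key = key
        # Compute the shape once at construction time; recomputing it on
        # every len()/.shape access is what showed up in the profiles.
        self._shape = tuple(
            len(range(*k.indices(size)))
            for size, k in zip(array.shape, key)
            if isinstance(k, slice)
        )

    @property
    def shape(self) -> tuple:
        return self._shape
```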

Findings

  • _id_coord_names has terrible performance... I had to avoid it to get isel to speed up. It is likely that other code paths can be sped up the same way. I just don't know how to exercise those code paths, so I'm hesitant to go down that rabbit hole myself.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

xref: #2799
xref: #7045

@max-sixty (Collaborator) left a comment


Looks great!

@hmaarrfk force-pushed the fastestpath branch 4 times, most recently from d975fba to c2a065d, May 5, 2024 18:39
        return self.__id_coord_names

    def _create_id_coord_names(self) -> None:
@hmaarrfk (Contributor Author) commented:

It's strange that _create_id_coord_names gets hit all the time by the profiler. I would have thought that once the dataset is created, this function would only be called once.
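
For readers following along, the caching pattern under discussion is roughly the following; a hypothetical sketch, not the actual Indexes class in xarray/core/indexes.py:

```python
# Hypothetical sketch of the lazily built cache discussed above; the real
# Indexes class in xarray/core/indexes.py is more involved.
class IndexesSketch:
    def __init__(self, indexes: dict):
        self._indexes = indexes
        self.__id_coord_names = None  # built on first access, then reused

    @property
    def _id_coord_names(self) -> dict:
        if self.__id_coord_names is None:
            self._create_id_coord_names()
        return self.__id_coord_names

    def _create_id_coord_names(self) -> None:
        # Group coordinate names by the id() of their index object; ideally
        # this runs once per Indexes object, not once per isel call.
        id_coord_names: dict = {}
        for name, idx in self._indexes.items():
            id_coord_names[id(idx)] = id_coord_names.get(id(idx), ()) + (name,)
        self.__id_coord_names = id_coord_names
```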

@hmaarrfk force-pushed the fastestpath branch 2 times, most recently from 8c49aba to 2b5e936, May 5, 2024 19:27
@kmuehlbauer marked this pull request as ready for review, May 5, 2024 19:49
@kmuehlbauer (Contributor) commented:

@hmaarrfk Sorry Marc, moving to ready-for-review was an accident. I still have trouble using the GitHub Android app properly.

@hmaarrfk marked this pull request as draft, May 5, 2024 19:52
@hmaarrfk (Contributor Author) commented May 5, 2024

> GitHub Android app.

The GitHub app is terrible... I keep making similar mistakes on my phone.

@hmaarrfk force-pushed the fastestpath branch 8 times, most recently from 5d4e9b2 to f958953, May 5, 2024 22:48
@hmaarrfk (Contributor Author) commented May 5, 2024

OK, enough for today.

Generally speaking, I think the main problem, broad compatibility with Pandas, is a known issue.

However, some things are simply strange, in the sense that certain isel operations just seem too slow for comfort. It may be that we can add enough fastpath variables to remove a lot of the expensive validation.

[screenshot: profiler output showing the timing numbers referenced below]

You can see some of the numbers that I use.

I use 100_000 selections because it amplifies the problem and helps the profiler surface the problematic hotspots.
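
A minimal sketch of the kind of tqdm loop behind the "it/s" numbers quoted above; the file name and the "time" dimension here are hypothetical stand-ins:

```python
# Minimal sketch of the benchmark harness implied by the tqdm-style output;
# the file path and variable layout are hypothetical.
import xarray as xr
from tqdm import tqdm

ds = xr.open_dataset("recording.nc", engine="h5netcdf")  # lazily indexed
for i in tqdm(range(100_000), desc="h5netcdf  lazy"):
    ds.isel(time=i % ds.sizes["time"])
```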

The function _apply_indexes_fast might be controversial. I am happy to remove it and just document it (hmaarrfk#29) for when I have time again.

Removing it drops the performance on my Threadripper 2950X from 14k it/s to 12k it/s, still an improvement over the 9k it/s where pull request #9001 leaves off.

@hmaarrfk force-pushed the fastestpath branch 4 times, most recently from 2c202ff to 1a659bc, May 5, 2024 23:01
@hmaarrfk marked this pull request as ready for review, May 5, 2024 23:01
@hmaarrfk (Contributor Author) commented May 5, 2024

Feel free to push any cleanups.

hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request May 6, 2024
This targets optimization for datasets with many "scalar" variables
(that is variables without any dimensions). This can happen in the
context where you have many pieces of small metadata that relate to
various facts about an experimental condition.

For example, we have about 80 of these in our datasets (and I want to
increase this number).

Our datasets are quite large (on the order of 1 TB uncompressed), so we
often have one dimension in the tens of thousands.

However, it has become quite slow to index into the dataset.

We therefore often "carefully slice out the metadata we need" prior to
doing anything with our dataset, but that isn't quite possible when you
want to orchestrate things with a parent application.

These optimizations are likely "minor" but considering the results of
the benchmark, I think they are quite worthwhile:

* main (as of pydata#9001) - 2.5k its/s
* With pydata#9002 - 4.2k its/s
* With this Pull Request (on top of pydata#9002) -- 6.1k its/s

Thanks for considering.
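
For context, the dataset shape described in this commit message looks roughly like the following synthetic example (not the author's actual data):

```python
# Synthetic example of the shape described above: one long dimension plus
# many dimensionless "scalar" metadata variables.
import numpy as np
import xarray as xr

data_vars = {"signal": ("time", np.zeros(50_000))}
# ~80 scalar variables carrying experimental metadata
data_vars.update({f"meta_{i}": ((), float(i)) for i in range(80)})
ds = xr.Dataset(data_vars)

ds.isel(time=0)  # each isel touches all ~81 variables, scalars included
```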
@@ -2624,6 +2624,10 @@ def __init__(self, dims, data, attrs=None, encoding=None, fastpath=False):
        if self.ndim != 1:
            raise ValueError(f"{type(self).__name__} objects must be 1-dimensional")

        # Avoid further checks if fastpath is True
        if fastpath:
            return
@hmaarrfk (Contributor Author) commented:

This is causing:

=============================================================================== FAILURES ===============================================================================
___________________________________________________________________ test_field_access[365_day-year] ____________________________________________________________________

data = <xarray.DataArray 'data' (lon: 10, lat: 10, time: 100)> Size: 80kB
array([[[0.5488135 , 0.71518937, 0.60276338, ..., 0....222 4.444 6.667 ... 13.33 15.56 17.78 20.0
  * time     (time) object 800B 2000-01-01 00:00:00 ... 2000-01-05 03:00:00
field = 'year'

    @requires_cftime
    @pytest.mark.parametrize(
        "field", ["year", "month", "day", "hour", "dayofyear", "dayofweek"]
    )
    def test_field_access(data, field) -> None:
        result = getattr(data.time.dt, field)
        expected = xr.DataArray(
            getattr(xr.coding.cftimeindex.CFTimeIndex(data.time.values), field),
            name=field,
            coords=data.time.coords,
            dims=data.time.dims,
        )

>       assert_equal(result, expected)
E       AssertionError: Left and right DataArray objects are not equal

/home/mark/git/xarray/xarray/tests/test_accessor_dt.py:442: AssertionError

@hmaarrfk (Contributor Author):

With the removal of the fastpath:

| Change   | Before [95c1ef72] <main>   | After [4a7411b9] <fastestpath>   |   Ratio | Benchmark (Parameter)                                         |
|----------|----------------------------|----------------------------------|---------|---------------------------------------------------------------|
| -        | 167±0.7μs                  | 149±0.9μs                        |    0.89 | indexing.Assignment.time_assignment_basic('1scalar')          |
| -        | 164±1μs                    | 146±0.4μs                        |    0.89 | indexing.Assignment.time_assignment_basic('2slicess-1scalar') |
| -        | 182±1μs                    | 162±0.9μs                        |    0.89 | indexing.Indexing.time_indexing_basic('1slice-1scalar')       |
| -        | 161±0.5μs                  | 142±0.4μs                        |    0.88 | indexing.Assignment.time_assignment_basic('1slice-1scalar')   |
| -        | 149±0.6μs                  | 129±1μs                          |    0.86 | indexing.Indexing.time_indexing_basic('1slice')               |
| -        | 250±1μs                    | 209±1μs                          |    0.84 | indexing.Indexing.time_indexing_basic('2slicess-1scalar')     |
| -        | 165±0.6μs                  | 133±1μs                          |    0.81 | indexing.Assignment.time_assignment_basic('1slice')           |
| -        | 86.9±0.1μs                 | 64.4±0.9μs                       |    0.74 | indexing.HugeAxisSmallSliceIndexing.time_indexing             |
| -        | 607±3μs                    | 326±2μs                          |    0.54 | indexing.AssignmentOptimized.time_assign_identical_indexes    |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

We get most of the speedups; we seem to lose out on '2d-1scalar'.

Honestly, I'll take it. I understand that fastpaths are not always good to add, for maintainability's sake.

Let me clean up and rerun the benchmarks.

@hmaarrfk (Contributor Author):

Running the benchmarks again:

| Change   | Before [95c1ef72] <main>   | After [d526ae43] <fastestpath>   |   Ratio | Benchmark (Parameter)                                         |
|----------|----------------------------|----------------------------------|---------|---------------------------------------------------------------|
| -        | 244±0.7μs                  | 222±0.8μs                        |    0.91 | indexing.Assignment.time_assignment_outer('2d-1scalar')       |
| -        | 183±1μs                    | 163±0.8μs                        |    0.89 | indexing.Indexing.time_indexing_basic('1slice-1scalar')       |
| -        | 171±0.7μs                  | 150±0.8μs                        |    0.88 | indexing.Assignment.time_assignment_basic('1scalar')          |
| -        | 168±0.4μs                  | 147±0.3μs                        |    0.88 | indexing.Assignment.time_assignment_basic('2slicess-1scalar') |
| -        | 166±0.8μs                  | 144±0.6μs                        |    0.87 | indexing.Assignment.time_assignment_basic('1slice-1scalar')   |
| -        | 150±0.5μs                  | 131±0.4μs                        |    0.87 | indexing.Indexing.time_indexing_basic('1slice')               |
| -        | 253±1μs                    | 220±1μs                          |    0.87 | indexing.Indexing.time_indexing_basic('2slicess-1scalar')     |
| -        | 168±0.6μs                  | 138±0.5μs                        |    0.82 | indexing.Assignment.time_assignment_basic('1slice')           |
| -        | 87.2±0.5μs                 | 66.7±0.3μs                       |    0.76 | indexing.HugeAxisSmallSliceIndexing.time_indexing             |
| -        | 617±5μs                    | 331±2μs                          |    0.54 | indexing.AssignmentOptimized.time_assign_identical_indexes    |

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@hmaarrfk (Contributor Author):

Thanks for the fix; I guess that was a merge conflict?

@dcherian (Contributor):

Ah, one last request: can you add a whats-new entry?

@hmaarrfk (Contributor Author):

Done.

@dcherian added the "plan to merge (Final call for comments)" label, Jun 11, 2024
@hmaarrfk closed this, Jun 11, 2024
@hmaarrfk reopened this, Jun 11, 2024
@dcherian merged commit d9e4de6 into pydata:main, Jun 11, 2024
40 of 47 checks passed
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request Jun 12, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request Jun 12, 2024
dcherian added a commit to dcherian/xarray that referenced this pull request Jun 13, 2024
* upstream/main:
  [skip-ci] Try fixing hypothesis CI trigger (pydata#9112)
  Undo custom padding-top. (pydata#9107)
  add remaining core-dev citations [skip-ci][skip-rtd] (pydata#9110)
  Add user survey announcement to docs (pydata#9101)
  skip the `pandas` datetime roundtrip test with `pandas=3.0` (pydata#9104)
  Adds Matt Savoie to CITATION.cff (pydata#9103)
  [skip-ci] Fix skip-ci for hypothesis (pydata#9102)
  open_datatree performance improvement on NetCDF, H5, and Zarr files (pydata#9014)
  Migrate datatree io.py and common.py into xarray/core (pydata#9011)
  Micro optimizations to improve indexing (pydata#9002)
  (fix): don't handle time-dtypes as extension arrays in `from_dataframe` (pydata#9042)
andersy005 pushed a commit that referenced this pull request Jun 14, 2024
* conda instead of mamba

* Make speedups using fastpath

* Change core logic to apply_indexes_fast

* Always have fastpath=True in one path

* Remove basicindexer fastpath=True

* Duplicate a comment

* Add comments

* revert asv changes

* Avoid fastpath=True assignment

* Remove changes to basicindexer

* Do not do fast fastpath for IndexVariable

* Remove one unnecessary change

* Remove one more fastpath

* Revert unneeded change to PandasIndexingAdapter

* Update xarray/core/indexes.py

* Update whats-new.rst

* Update whats-new.rst

* fix whats-new

---------

Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Deepak Cherian <deepak@cherian.net>
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request Jun 19, 2024
hmaarrfk added a commit to hmaarrfk/xarray that referenced this pull request Jun 22, 2024
Labels: plan to merge (Final call for comments), run-benchmark (Run the ASV benchmark workflow)

5 participants