
Multi-file and Parquet-aware prefetching from remote storage #16657

Merged — 12 commits into rapidsai:branch-24.10 on Sep 4, 2024

Conversation


@rjzamora rjzamora commented Aug 26, 2024

Description

Follow up to #16613
Supersedes #16166

Improves remote-IO read performance when multiple files are read at once. Also enables partial IO for remote Parquet files (previously removed in 24.10 by #16589).
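Partial IO for remote Parquet files generally means fetching only the byte ranges the reader will actually touch (footer, then the required column chunks), coalescing nearby ranges so fewer remote requests are issued. A minimal illustrative sketch of that coalescing step — not the PR's actual implementation; the helper name and gap threshold are assumptions:

```python
def coalesce_ranges(ranges, max_gap=4096):
    """Merge sorted (start, end) byte ranges whose gaps are <= max_gap,
    trading a little extra data transfer for far fewer remote requests."""
    if not ranges:
        return []
    ranges = sorted(ranges)
    merged = [list(ranges[0])]
    for start, end in ranges[1:]:
        if start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Three column chunks; the first two are close enough to fetch together.
print(coalesce_ranges([(0, 100), (150, 300), (10_000, 12_000)]))
# [(0, 300), (10000, 12000)]
```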

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added 2 - In Progress Currently a work in progress Performance Performance related issue non-breaking Non-breaking change labels Aug 26, 2024
@rjzamora rjzamora self-assigned this Aug 26, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 26, 2024
@rjzamora rjzamora added the improvement Improvement / enhancement to an existing function label Aug 26, 2024
@rjzamora rjzamora marked this pull request as ready for review August 28, 2024 15:00
@rjzamora rjzamora requested a review from a team as a code owner August 28, 2024 15:00
Review threads:
  • python/cudf/cudf/io/parquet.py (resolved)
  • python/cudf/cudf/io/parquet.py (outdated)
  • python/cudf/cudf/utils/ioutils.py (outdated, ×7)
  • python/cudf/cudf/utils/ioutils.py (resolved)
rapids-bot bot pushed a commit to rapidsai/dask-cuda that referenced this pull request Aug 30, 2024
Adds new benchmark for parquet read performance using a `LocalCUDACluster`. The user can pass in `--key` and `--secret` options to specify S3 credentials.

E.g.
```
$ python ./local_read_parquet.py --devs 0,1,2,3,4,5,6,7 --filesystem fsspec --type gpu --file-count 48 --aggregate-files

Parquet read benchmark
--------------------------------------------------------------------------------
Path                      | s3://dask-cudf-parquet-testing/dedup_parquet
Columns                   | None
Backend                   | cudf
Filesystem                | fsspec
Blocksize                 | 244.14 MiB
Aggregate files           | True
Row count                 | 372066
Size on disk              | 1.03 GiB
Number of workers         | 8
================================================================================
Wall clock                | Throughput
--------------------------------------------------------------------------------
36.75 s                   | 28.78 MiB/s
21.29 s                   | 49.67 MiB/s
17.91 s                   | 59.05 MiB/s
================================================================================
Throughput                | 41.77 MiB/s +/- 7.81 MiB/s
Bandwidth                 | 0 B/s +/- 0 B/s
Wall clock                | 25.32 s +/- 8.20 s
================================================================================
...
```

**Notes**:
- S3 performance generally scales with the number of workers (multiplied by the number of threads per worker)
- The example shown above was not executed from an EC2 instance
- The example shown above *should* perform better after rapidsai/cudf#16657
- Using `--filesystem arrow` together with `--type gpu` performs well, but depends on rapidsai/cudf#16684

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #1371
@wence- (Contributor) left a comment:

A few small questions

Review threads:
  • python/cudf/cudf/io/parquet.py (resolved)
  • python/cudf/cudf/utils/ioutils.py (outdated, ×2)
@vyasr (Contributor) left a comment:

Some small suggestions but overall this LGTM now.

Review threads:
  • python/cudf/cudf/io/parquet.py (outdated)
  • python/cudf/cudf/utils/ioutils.py (outdated, ×2)
```python
        "all": _get_remote_bytes_all,
    }[method]
except KeyError:
    raise NotImplementedError(
```
A reviewer (Contributor) commented:

nit: Since this is an internal function I wouldn't bother with exception handling. The only callers should be internal, so if we provide an invalid method we can be responsible for tracking down the problem when the KeyError is observed. Alternatively, convert the method to an enum.
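The enum alternative the reviewer mentions could look like the following sketch — the names are illustrative, not taken from the PR:

```python
from enum import Enum


class PrefetchMethod(Enum):
    ALL = "all"
    PARQUET = "parquet"


def get_remote_bytes(paths, method=PrefetchMethod.ALL):
    # Enum construction raises ValueError for an unknown string,
    # so invalid methods fail with a clear message automatically.
    method = PrefetchMethod(method)
    if method is PrefetchMethod.ALL:
        return [f"fetched all of {p}" for p in paths]
    return [f"fetched parquet ranges of {p}" for p in paths]


print(get_remote_bytes(["a.parquet"], method="parquet"))
# ['fetched parquet ranges of a.parquet']
```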

@rjzamora (Member, Author) replied:

The user can technically pass in prefetch_options={"method": "foo"}, and it's probably best to return a clear error message. (Though, ValueError seems better than NotImplementedError in this case)
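The dict-dispatch-with-clear-error pattern being discussed, sketched with hypothetical stand-in fetchers and the suggested ValueError:

```python
def _get_remote_bytes_all(paths):
    # Hypothetical stand-in: pretend to fetch each file in full.
    return [b"all:" + p.encode() for p in paths]


def _get_remote_bytes_parquet(paths):
    # Hypothetical stand-in: pretend to fetch only Parquet byte ranges.
    return [b"pq:" + p.encode() for p in paths]


def get_remote_bytes(paths, method="all"):
    try:
        fetcher = {
            "all": _get_remote_bytes_all,
            "parquet": _get_remote_bytes_parquet,
        }[method]
    except KeyError:
        # A clear error for user-supplied prefetch_options={"method": ...}
        raise ValueError(f"{method} is not a supported remote-IO method")
    return fetcher(paths)


print(get_remote_bytes(["a.parquet"], method="parquet"))
# [b'pq:a.parquet']
```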

```python
unique_count = dict(zip(*np.unique(paths, return_counts=True)))
offset = np.cumsum([0] + [unique_count[p] for p in remote_paths])
buffers = [
    functools.reduce(operator.add, chunks[offset[i] : offset[i + 1]])
```
A reviewer (Contributor) commented:

nit (non-blocking): I thought reduce(add, foo) is just sum(foo), what am I missing?

@rjzamora (Member, Author) replied:

Yeah, this had me a bit confused as well. It turns out that `operator.add` will concatenate byte strings, but `sum` requires the intermediate values to be numeric, because its default start value is the integer `0`:

```python
import operator

assert operator.add(b"asdf", b"jkl;") == b"asdfjkl;"  # Assertion passes

sum([b"asdf", b"jkl;"])
# TypeError: unsupported operand type(s) for +: 'int' and 'bytes'
```
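Put together, the grouping logic quoted above can be exercised end-to-end on toy data. This sketch assumes, as the snippet does, that `chunks` arrives grouped by path in `remote_paths` order; the byte strings stand in for fetched blocks:

```python
import functools
import operator

import numpy as np

# Two unique paths; "a" was fetched as two chunks, "b" as one.
paths = ["a", "a", "b"]
chunks = [b"hello ", b"world", b"!"]
remote_paths = ["a", "b"]  # order in which reassembled buffers are wanted

# Count chunks per path, then turn the counts into slice offsets.
unique_count = dict(zip(*np.unique(paths, return_counts=True)))
offset = np.cumsum([0] + [unique_count[p] for p in remote_paths])

# Concatenate each path's chunks back into a single buffer.
buffers = [
    functools.reduce(operator.add, chunks[offset[i] : offset[i + 1]])
    for i in range(len(remote_paths))
]
print(buffers)
# [b'hello world', b'!']
```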

@wence- (Contributor) left a comment:

To the best of my understanding, looks good

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Sep 4, 2024
@rjzamora
Copy link
Member Author

rjzamora commented Sep 4, 2024

/merge

@rapids-bot rapids-bot bot merged commit 1b6f02d into rapidsai:branch-24.10 Sep 4, 2024
92 checks passed
@rjzamora rjzamora deleted the prefetch-multi-files branch September 4, 2024 17:02
res-life pushed a commit to res-life/cudf that referenced this pull request Sep 11, 2024
…i#16657)

Follow up to rapidsai#16613
Supersedes rapidsai#16166

Improves remote-IO read performance when multiple files are read at once. Also enables partial IO for remote Parquet files (previously removed in `24.10` by rapidsai#16589).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#16657
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue Python Affects Python cuDF API.
3 participants