
Make calls to smart_open.open() for GCS 1000x faster by avoiding unnecessary GCS API call #788

Merged
merged 2 commits into piskvorky:develop on Feb 20, 2024

Conversation

JohnHBrock
Contributor

Motivation

Bucket.get_blob requires a roundtrip to GCS, but Bucket.blob doesn't, so let's use that instead. In my benchmarking, this change reduces the time taken to call smart_open.open() from milliseconds to microseconds. This is especially nice when you're repeatedly calling smart_open.open for a large list of files.
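For context, a minimal sketch of the client-library difference the change relies on; the bucket and object names are placeholders, and this is not smart_open's own code path:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

# Bucket.get_blob() issues a GET request to GCS for the object's metadata
# and returns None if the object does not exist.
fetched = bucket.get_blob("path/to/key.txt")

# Bucket.blob() only constructs a Blob object locally; no request is sent,
# so nothing about the object is verified until you actually read or write.
local = bucket.blob("path/to/key.txt")
```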

Tests

To run the new benchmark test:

pytest integration-tests/test_gcs.py::test_gcs_performance_open

pytest-benchmark comparison for old code vs. new code:
[screenshot: pytest-benchmark comparison table]
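The actual benchmark lives in integration-tests/test_gcs.py; as a rough illustration only (placeholder URI, not the repository's exact test), a pytest-benchmark test of open() latency could look like this:

```python
import smart_open


def test_gcs_performance_open(benchmark):
    # Placeholder URI: the object does not need to exist just to exercise
    # smart_open.open() in write mode.
    uri = "gs://some-benchmark-bucket/benchmark-key.txt"

    def open_for_write():
        # The handle is discarded without reading or writing, so only the
        # cost of open() itself is measured.
        return smart_open.open(uri, "wb")

    benchmark(open_for_write)
```

Running it once on the old code with --benchmark-save=<name> and again on the new code with --benchmark-compare produces a comparison table like the one in the screenshot.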

`Bucket.get_blob` requires a roundtrip to GCS, but `Bucket.blob` doesn't, so let's use that instead. In my benchmarking, this change reduces the time taken to call `smart_open.open` by >1000x.
@cadnce
Contributor

cadnce commented Jan 9, 2024

Reasonable suggestion, and the performance increase under the benchmark conditions is indeed significant! However, I would question the accuracy of the benchmark compared to real-life usage (as noted by your comment in the tests, "we don't need to use a uri that actually exists in order to call GCS's open()"). To make it more accurate, you should read or write a byte or two so that the same work happens in both scenarios.

Ignoring the benchmark for a second, there were two relevant considerations when making this change:

This means that, to ensure there is no unexpected behaviour with this change, you would need to get the generation id for the blob. As I understand it, this is typically done using the blob.reload() function, which is the overhead you're noticing in the Bucket.get_blob() call: https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L756-L800 vs https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L1176-L1281

I guess my point is, based on the docs, I don't believe that the GCS API call is unnecessary.
But please correct me if you think my understanding is wrong!
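To illustrate the distinction being drawn here, a sketch with placeholder names (not smart_open's code):

```python
from google.cloud import storage

bucket = storage.Client().bucket("my-bucket")  # placeholder bucket name

# get_blob() pays the round trip up front: the returned Blob already
# carries the object's current generation, and the call returns None if
# the object is missing.
fetched = bucket.get_blob("key.txt")
if fetched is not None:
    print(fetched.generation)

# blob() is purely local: generation stays None until reload(), which is
# the same metadata GET that get_blob() performs (and which raises
# NotFound for a missing object).
local = bucket.blob("key.txt")
print(local.generation)  # None
local.reload()
print(local.generation)
```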

@mpenkov
Collaborator

mpenkov left a comment


LGTM, thank you!

mpenkov merged commit bcc2335 into piskvorky:develop on Feb 20, 2024
21 checks passed
ddelange added a commit to ddelange/smart_open that referenced this pull request Feb 21, 2024
…open into patch-2

* 'develop' of https://github.com/RaRe-Technologies/smart_open:
  Propagate __exit__ call to underlying filestream (piskvorky#786)
  Retry finalizing multipart s3 upload (piskvorky#785)
  Fix `KeyError: 'ContentRange'` when received full content from S3 (piskvorky#789)
  Add support for SSH connection via aliases from `~/.ssh/config` (piskvorky#790)
  Make calls to smart_open.open() for GCS 1000x faster by avoiding unnecessary GCS API call (piskvorky#788)
  Add zstandard compression feature (piskvorky#801)
  Support moto 4 & 5 (piskvorky#802)
  Secure the connection using SSL when connecting to the FTPS server (piskvorky#793)
  upgrade dev status classifier to stable (piskvorky#798)
  Fix formatting of python code (piskvorky#795)
@SimonBiggs

SimonBiggs commented Mar 11, 2024

In my testing, this change results in opening a file that doesn't exist no longer raising an error.

@ddelange
Contributor

@SimonBiggs what do you get instead?

@ddelange
Contributor

I would tend to agree with @cadnce:

I guess my point is, based on the docs, I don't believe that the GCS API call is unnecessary.

@mpenkov
Collaborator

mpenkov commented Mar 21, 2024

In retrospect, merging this may have been premature. Here is the trade-off we're dealing with:

Pro: calling smart_open.open on a GCS URL is 1000x faster (*)
Con: breaking API change, as opening a non-existing file no longer raises an error (like the built-in open function does)

I think the benefit of the open function being 1000x faster is questionable, too. Presumably, the user is calling open because they want to do something with the returned file-like object, like reading or writing. These subsequent operations will require network access and thus consume time. So really, this PR isn't speeding things up by improving performance; it is postponing the cost of the network transfer from open-time to write-time.

I think the con is greater than the pro here. Yes, in general, faster is better, but being correct is far more important (if you take the behavior of the built-in open as the definition of "correct").
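For comparison, a small sketch of the behaviour being used as the yardstick; the gs:// URI is a placeholder and the exception type on the GCS side is an assumption, not a documented guarantee:

```python
import smart_open

# The built-in open() fails eagerly when the file is missing:
try:
    open("does_not_exist.txt")
except FileNotFoundError as err:
    print("built-in open raised at open time:", err)

# A drop-in replacement is expected to fail just as eagerly for a missing
# GCS object, rather than deferring the error to the first read. The exact
# exception type depends on the transport (assumption: a NotFound-style
# error from the google-cloud-storage client), so the catch here is broad.
try:
    smart_open.open("gs://some-bucket/does-not-exist.txt", "rb")  # placeholder URI
except Exception as err:
    print("smart_open raised at open time:", type(err).__name__, err)
```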

What do you think @piskvorky? Should we keep this behavior or roll it back?

@piskvorky
Owner

piskvorky commented Mar 21, 2024

Consistency with open is smart_open's "selling point". Our documentation explicitly promises:

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.

So to me the answer is clear :)

mpenkov added a commit that referenced this pull request Mar 21, 2024
…ing unnecessary GCS API call (#788)"

This reverts commit bcc2335.
@mpenkov
Collaborator

mpenkov commented Mar 21, 2024

OK, reverted.

@mpenkov
Collaborator

mpenkov commented Mar 21, 2024

Thank you @SimonBiggs for pointing out the issue. I've made a bugfix release 7.0.3 that reverts to the original behavior.

@cadnce
Contributor

cadnce commented Mar 21, 2024

Glad y’all agree. We’ve been holding off upgrading until this was reverted :)
