
Make calls to smart_open.open() for GCS 1000x faster by avoiding unnecessary GCS API call #788

Merged
merged 2 commits into piskvorky:develop on Feb 20, 2024

Conversation

JohnHBrock
Contributor

Motivation

Bucket.get_blob requires a roundtrip to GCS, but Bucket.blob doesn't, so let's use that instead. In my benchmarking, this change reduces the time taken to call smart_open.open() from milliseconds to microseconds. This is especially nice when you're repeatedly calling smart_open.open for a large list of files.
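For context, a minimal sketch of the client-library difference the change relies on; the bucket and object names are placeholders, and this is not smart_open's own code path:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")  # placeholder bucket name

# Bucket.get_blob() issues a GET request to GCS for the object's metadata
# and returns None if the object does not exist.
fetched = bucket.get_blob("path/to/key.txt")

# Bucket.blob() only constructs a Blob object locally; no request is sent,
# so nothing about the object is verified until you actually read or write.
local = bucket.blob("path/to/key.txt")
```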

Tests

To run the new benchmark test:

pytest integration-tests/test_gcs.py::test_gcs_performance_open

pytest-benchmark comparison for old code vs. new code:
[screenshot: pytest-benchmark comparison table]
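The actual benchmark lives in integration-tests/test_gcs.py; as a rough illustration only (placeholder URI, not the repository's exact test), a pytest-benchmark test of open() latency could look like this:

```python
import smart_open


def test_gcs_performance_open(benchmark):
    # Placeholder URI: the object does not need to exist just to exercise
    # smart_open.open() in write mode.
    uri = "gs://some-benchmark-bucket/benchmark-key.txt"

    def open_for_write():
        # The handle is discarded without reading or writing, so only the
        # cost of open() itself is measured.
        return smart_open.open(uri, "wb")

    benchmark(open_for_write)
```

Running it once on the old code with --benchmark-save=<name> and again on the new code with --benchmark-compare produces a comparison table like the one in the screenshot.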

`Bucket.get_blob` requires a roundtrip to GCS, but `Bucket.blob` doesn't, so let's use that instead. In my benchmarking, this change reduces the time taken to call `smart_open.open` by >1000x.
@cadnce
Contributor

cadnce commented Jan 9, 2024

Reasonable suggestion, and the performance increase under the benchmark conditions is indeed significant! However, I would question the accuracy of the benchmark compared to real-life usage (as noted by your comment in the tests, "we don't need to use a uri that actually exists in order to call GCS's open()"). To make it more accurate, you should read or write a byte or two so that the same work happens in both scenarios.

Ignoring the benchmark for a second, there were two relevant considerations when making this change:

This means that, to ensure there is no unexpected behaviour with this change, you would need to get the generation id for the blob. As I understand it, this is typically done using the blob.reload() function, which is the overhead you're noticing in the Bucket.get_blob() call: https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L756-L800 vs https://github.com/googleapis/python-storage/blob/main/google/cloud/storage/bucket.py#L1176-L1281

I guess my point is, based on the docs, I don't believe that the GCS API call is unnecessary.
But please correct me if you think my understanding is wrong!
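To illustrate the distinction being drawn here, a sketch with placeholder names (not smart_open's code):

```python
from google.cloud import storage

bucket = storage.Client().bucket("my-bucket")  # placeholder bucket name

# get_blob() pays the round trip up front: the returned Blob already
# carries the object's current generation, and the call returns None if
# the object is missing.
fetched = bucket.get_blob("key.txt")
if fetched is not None:
    print(fetched.generation)

# blob() is purely local: generation stays None until reload(), which is
# the same metadata GET that get_blob() performs (and which raises
# NotFound for a missing object).
local = bucket.blob("key.txt")
print(local.generation)  # None
local.reload()
print(local.generation)
```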

@mpenkov
Collaborator

mpenkov left a comment


LGTM, thank you!

mpenkov merged commit bcc2335 into piskvorky:develop on Feb 20, 2024
21 checks passed
ddelange added a commit to ddelange/smart_open that referenced this pull request Feb 21, 2024
…open into patch-2

* 'develop' of https://github.com/RaRe-Technologies/smart_open:
  Propagate __exit__ call to underlying filestream (piskvorky#786)
  Retry finalizing multipart s3 upload (piskvorky#785)
  Fix `KeyError: 'ContentRange'` when received full content from S3 (piskvorky#789)
  Add support for SSH connection via aliases from `~/.ssh/config` (piskvorky#790)
  Make calls to smart_open.open() for GCS 1000x faster by avoiding unnecessary GCS API call (piskvorky#788)
  Add zstandard compression feature (piskvorky#801)
  Support moto 4 & 5 (piskvorky#802)
  Secure the connection using SSL when connecting to the FTPS server (piskvorky#793)
  upgrade dev status classifier to stable (piskvorky#798)
  Fix formatting of python code (piskvorky#795)
@SimonBiggs

SimonBiggs commented Mar 11, 2024

In my testing, this change results in opening a file that doesn't exist no longer raising an error.

@ddelange
Contributor

@SimonBiggs what do you get instead?

@ddelange
Contributor

I would tend to agree with @cadnce:

I guess my point is, based on the docs, I don't believe that the GCS API call is unnecessary.

@mpenkov
Collaborator

mpenkov commented Mar 21, 2024

In retrospect, merging this may have been premature. Here is the trade-off we're dealing with:

Pro: calling smart_open.open on a GCS URL is 1000x faster (*)
Con: breaking API change, as opening a non-existing file no longer raises an error (like the built-in open function does)

I think the benefit of the open function being 1000x faster is questionable, too. Presumably, the user is calling open because they want to do something with the returned file-like object, like reading or writing. These subsequent operations will require network access and thus consume time. So really, this PR isn't speeding things up by improving performance; it is postponing the cost of the network transfer from open-time to write-time.

I think the con is greater than the pro here. Yes, in general, faster is better, but being correct is far more important (if you take the behavior of the built-in open as the definition of "correct").
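For comparison, a small sketch of the behaviour being used as the yardstick; the gs:// URI is a placeholder and the exception type on the GCS side is an assumption, not a documented guarantee:

```python
import smart_open

# The built-in open() fails eagerly when the file is missing:
try:
    open("does_not_exist.txt")
except FileNotFoundError as err:
    print("built-in open raised at open time:", err)

# A drop-in replacement is expected to fail just as eagerly for a missing
# GCS object, rather than deferring the error to the first read. The exact
# exception type depends on the transport (assumption: a NotFound-style
# error from the google-cloud-storage client), so the catch here is broad.
try:
    smart_open.open("gs://some-bucket/does-not-exist.txt", "rb")  # placeholder URI
except Exception as err:
    print("smart_open raised at open time:", type(err).__name__, err)
```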

What do you think @piskvorky? Should we keep this behavior or roll it back?

@piskvorky
Owner

piskvorky commented Mar 21, 2024

Consistency with open is smart_open's "selling point". Our documentation explicitly promises:

smart_open is a drop-in replacement for Python's built-in open(): it can do anything open can (100% compatible, falls back to native open wherever possible), plus lots of nifty extra stuff on top.

So to me the answer is clear :)

mpenkov added a commit that referenced this pull request Mar 21, 2024
…ing unnecessary GCS API call (#788)"

This reverts commit bcc2335.
@mpenkov
Collaborator

mpenkov commented Mar 21, 2024

OK, reverted.

@mpenkov
Collaborator

mpenkov commented Mar 21, 2024

Thank you @SimonBiggs for pointing out the issue. I've made a bugfix release 7.0.3 that reverts to the original behavior.

@cadnce
Contributor

cadnce commented Mar 21, 2024

Glad y’all agree. We’ve been holding off upgrading until this was reverted :)
