Make calls to smart_open.open() for GCS 1000x faster by avoiding unnecessary GCS API call #788
Conversation
`Bucket.get_blob` requires a roundtrip to GCS, but `Bucket.blob` doesn't, so let's use that instead. In my benchmarking, this change reduces the time taken to call `smart_open.open` by >1000x.
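The difference between the two calls can be sketched with a toy in-memory model (the `FakeBucket`/`FakeBlob` classes below are hypothetical stand-ins for the `google-cloud-storage` objects, just to show why one costs a roundtrip and the other doesn't):

```python
# Toy model of the relevant semantics. In the real library, Bucket.get_blob
# performs an HTTP request to fetch object metadata, while Bucket.blob merely
# constructs a local Blob reference.

class FakeBlob:
    def __init__(self, name, exists):
        self.name = name
        self._exists = exists

class FakeBucket:
    def __init__(self, server_side_blobs):
        self._server = server_side_blobs  # simulates what actually exists in GCS
        self.api_calls = 0                # counts simulated network roundtrips

    def get_blob(self, name):
        """Like Bucket.get_blob: one roundtrip; returns None if missing."""
        self.api_calls += 1
        return FakeBlob(name, True) if name in self._server else None

    def blob(self, name):
        """Like Bucket.blob: builds a local reference, no network at all."""
        return FakeBlob(name, name in self._server)

bucket = FakeBucket({"existing.txt"})

# get_blob hits the (simulated) network and can detect a missing object...
assert bucket.get_blob("missing.txt") is None
assert bucket.api_calls == 1

# ...while blob is free, but happily returns a reference to a missing object.
ref = bucket.blob("missing.txt")
assert ref is not None
assert bucket.api_calls == 1  # no extra roundtrip
```

The second property (a reference to a possibly nonexistent object) is what the rest of this conversation turns on.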
Reasonable suggestion; the performance increase under the benchmark conditions is indeed significant! However, I would question the accuracy of the benchmark compared to real-life usage (as noted by your comment in the tests, "we don't need to use a uri that actually exists in order to call GCS's open()"). To make it more accurate, you should read/write a byte or two so that the same work happens in both scenarios. Ignoring the benchmark for a second, when making this change there were 2 relevant considerations:
This means that to ensure there is no unexpected behaviour with this change, you would need to get the generation id for the blob. As I understand it, this is typically done using the `Bucket.get_blob` call. I guess my point is, based on the docs, I don't believe that the GCS API call is unnecessary.
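To illustrate why the generation id matters, here is a toy sketch (the `VersionedStore` class is hypothetical; real GCS exposes generations via object metadata and generation-match preconditions). A multi-request read that doesn't pin a generation can mix bytes from two versions if the object is overwritten mid-read, while pinning the generation keeps the read consistent:

```python
# Hypothetical in-memory model of GCS object generations.
class VersionedStore:
    def __init__(self):
        self.generations = {}   # generation number -> content
        self.latest = None

    def upload(self, content):
        gen = (self.latest or 0) + 1
        self.generations[gen] = content
        self.latest = gen
        return gen

    def read_range(self, start, end, generation=None):
        # With no generation pinned, each ranged request sees the latest version.
        gen = generation if generation is not None else self.latest
        return self.generations[gen][start:end]

store = VersionedStore()
gen1 = store.upload(b"AAAAAA")

# Unpinned reader fetches the first half, then the object is overwritten.
first = store.read_range(0, 3)             # b"AAA" from generation 1
store.upload(b"BBBBBB")                    # concurrent overwrite -> generation 2
second = store.read_range(3, 6)            # b"BBB" from generation 2!
assert first + second == b"AAABBB"         # a torn, inconsistent read

# Pinning the generation (metadata that get_blob fetches) stays consistent.
first = store.read_range(0, 3, generation=gen1)
second = store.read_range(3, 6, generation=gen1)
assert first + second == b"AAAAAA"
```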
LGTM, thank you!
…open into patch-2

* 'develop' of https://github.com/RaRe-Technologies/smart_open:
  Propagate __exit__ call to underlying filestream (piskvorky#786)
  Retry finalizing multipart s3 upload (piskvorky#785)
  Fix `KeyError: 'ContentRange'` when received full content from S3 (piskvorky#789)
  Add support for SSH connection via aliases from `~/.ssh/config` (piskvorky#790)
  Make calls to smart_open.open() for GCS 1000x faster by avoiding unnecessary GCS API call (piskvorky#788)
  Add zstandard compression feature (piskvorky#801)
  Support moto 4 & 5 (piskvorky#802)
  Secure the connection using SSL when connecting to the FTPS server (piskvorky#793)
  upgrade dev status classifier to stable (piskvorky#798)
  Fix formatting of python code (piskvorky#795)
In my testing, this change results in a file that doesn't exist no longer returning an error when opened.
@SimonBiggs what do you get instead?
I would tend to agree with @cadnce:
In retrospect, merging this may have been premature. Here is the trade-off we're dealing with:

Pro: calling smart_open.open on a GCS URL is 1000x faster (*)

Con: opening a nonexistent GCS URL no longer raises an error at open time.

(*) I think the benefit of the open function being 1000x faster is questionable, too. Presumably, the user is calling open because they want to do something with the returned file-like object, like reading or writing. These subsequent operations will require network access and thus consume time. So really, this PR isn't about speeding things up by improving performance; it is about postponing the cost of the network transfer from open-time to write-time.

I think the con is greater than the pro here. Yes, in general, faster is better, but being correct is far more important (if you take the behavior of the built-in open as the definition of "correct"). What do you think @piskvorky? Should we keep this behavior or roll it back?
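For reference, the built-in open surfaces a missing file immediately at open time, which is the behaviour being treated as "correct" above (the temporary path below is just for illustration):

```python
import os
import tempfile

# Build a path that is guaranteed not to exist: a fresh empty directory
# plus a filename we never create.
missing = os.path.join(tempfile.mkdtemp(), "does-not-exist.txt")

try:
    open(missing)  # built-in open: the error surfaces here, at open time
except FileNotFoundError:
    print("open() failed fast")
```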
Consistency with the built-in open is more important than the speed-up. So to me the answer is clear :)
OK, reverted.
Thank you @SimonBiggs for pointing out the issue. I've made a bugfix release 7.0.3 that reverts to the original behavior. |
Glad y’all agree. We’ve been holding off upgrading until this was reverted :) |
Motivation

`Bucket.get_blob` requires a roundtrip to GCS, but `Bucket.blob` doesn't, so let's use that instead. In my benchmarking, this change reduces the time taken to call `smart_open.open()` from milliseconds to microseconds. This is especially nice when you're repeatedly calling `smart_open.open` for a large list of files.

Tests

To run the new benchmark test:

`pytest-benchmark` comparison for old code vs. new code:
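The shape of such a comparison can be sketched with a plain `timeit` micro-benchmark (the 10 ms sleep is a hypothetical stand-in for a GCS roundtrip; real numbers would come from the `pytest-benchmark` run against actual GCS):

```python
import time
import timeit

def open_with_roundtrip():
    # Stand-in for the old code path: Bucket.get_blob costs one network roundtrip.
    time.sleep(0.01)

def open_without_roundtrip():
    # Stand-in for the new code path: Bucket.blob is a purely local operation.
    pass

old = timeit.timeit(open_with_roundtrip, number=20)
new = timeit.timeit(open_without_roundtrip, number=20)
print(f"old: {old:.3f}s  new: {new:.3f}s  speedup: {old / new:.0f}x")
```

The large speedup ratio falls out of the fact that one path includes network latency and the other does not, which is also why the benefit disappears once the subsequent read/write has to touch the network anyway.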