
Blob.download_to_file() downloads to RAM instead #3703

Closed
dw opened this issue Jul 31, 2017 · 2 comments
Labels: api: storage (Issues related to the Cloud Storage API), performance

Comments

dw commented Jul 31, 2017

Hello,

Blob.download_to_file() with no chunk size set causes the file to be pulled entirely into RAM before starting to write it to the supplied file object. This is due to google/resumable_media/download.py's "sans I/O" approach.

I can work around it by setting the chunk size parameter; however, this changes both what I am billed for (multiple GETs rather than a single GET) and the application's latency profile, all to paper over an obvious deficiency in the client library. So if you don't consider this a design flaw, then at least consider it a performance bug.
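For reference, a minimal sketch of that workaround (the bucket and object names here are placeholders, and this library requires the chunk size to be a multiple of 256 KiB):

```python
# Chunk-size workaround sketch: "my-bucket" and "big.bin" are placeholders.
# Setting chunk_size makes the library issue multiple ranged GETs instead
# of buffering the entire payload in memory before writing it out.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")

# chunk_size must be a multiple of 256 KiB; 10 MiB = 40 * 256 KiB.
blob = bucket.blob("big.bin", chunk_size=10 * 1024 * 1024)

with open("/tmp/big.bin", "wb") as fh:
    blob.download_to_file(fh)
```

Each chunk becomes its own ranged GET, which is exactly the billing and latency trade-off described above.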

Either way, I've lost 24 hours of data to an OOM triggered by this. The docstring should at a minimum be updated to warn the user that the library cares more about some trend from last week than about doing what you'd expect the simplest possible HTTP client talking to Gcloud could have done in the early 90s.

Apologies for the angst, but I'm leaving it in here, because you really need to know how disappointing it is to have a crash of this sort caused by a library function so conceptually simple.

pip freeze:

```
gapic-google-cloud-pubsub-v1==0.15.4
gapic-google-cloud-speech-v1==0.15.3
google-api-python-client==1.6.2
google-auth==1.0.1
google-auth-httplib2==0.0.2
google-cloud-core==0.25.0
google-cloud-language==0.25.0
google-cloud-pubsub==0.26.0
google-cloud-speech==0.25.1
google-cloud-storage==1.1.1
google-cloud-translate==0.24.0
google-gax==0.15.13
google-resumable-media==0.0.2
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.1
proto-google-cloud-pubsub-v1==0.15.4
proto-google-cloud-speech-v1==0.15.3
```
tseaver added the api: storage and performance labels Jul 31, 2017

dhermes (Contributor) commented Jul 31, 2017

> the library cares more about some trend from last week than doing what you'd expect the simplest possible HTTP client

I don't think that's called for, though I do appreciate you taking the time to file the issue. I designed google-resumable-media-python, so you can put the blame 100% on my shoulders.

The sans-I/O approach has nothing to do with the "missing feature" of piping the HTTP payload directly into a file. What's more, the behavior of the old code (i.e. google.cloud.streaming) is mostly the same, and that was written 3+ years ago.

The real issue is that the google-cloud-storage library doesn't do some kind of "smart chunking". It's definitely worth discussing. The reason there is no smart chunking is that we may not know the size of the file a priori, and it would take an extra round trip to determine the size before the first request. (Right now the google-cloud-storage package does a chunked or a full download, depending only on the chunk size.)
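To make that concrete, here is a hypothetical sketch of what such smart chunking could look like, paying the extra metadata round trip up front (the 32 MiB threshold and the smart_download helper are illustrative inventions, not library API):

```python
# Hypothetical "smart chunking" sketch -- not current library behavior.
# One extra request fetches the object's metadata (including its size),
# and the size then decides between a single-shot and a chunked download.
from google.cloud import storage

THRESHOLD = 32 * 1024 * 1024  # arbitrary cutoff, for illustration only
CHUNK = 10 * 1024 * 1024      # must be a multiple of 256 KiB

def smart_download(bucket, blob_name, file_obj):
    blob = bucket.blob(blob_name)
    blob.reload()  # the extra round trip: populates blob.size
    if blob.size is not None and blob.size > THRESHOLD:
        # Large object: recreate the blob with a chunk size so the
        # download is split into multiple ranged requests.
        blob = bucket.blob(blob_name, chunk_size=CHUNK)
    blob.download_to_file(file_obj)
```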

We have discussed strategies to reduce the amount of a file that ends up in RAM, but haven't had a chance to implement them yet.

dw (Author) commented Jul 31, 2017

It seems the sans-I/O approach requires a logical abstraction for the entire response, which necessitates it living temporarily on disk or in RAM.

I'm confused about this concept of smart chunking -- both the producer and the consumer of the stream already apply chunking (libc / the Python io library's buffering on the consumer side, and the kernel's TCP windowing on the producer side). It doesn't seem like you should be implementing any kind of policy in your code, considering both the kernel and the IO library are trivial to adjust as needs require.
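For comparison, a generic sketch of the streaming behavior being described (using the requests library; the URL and output path are placeholders), where memory use stays flat regardless of payload size:

```python
# Stream an HTTP payload to disk in fixed-size buffers. iter_content()
# yields data as it arrives, so nothing close to the whole body is ever
# held in memory; upstream flow control is the kernel's TCP windowing.
import requests

resp = requests.get("https://example.com/big.bin", stream=True)
resp.raise_for_status()
with open("/tmp/big.bin", "wb") as fh:
    for buf in resp.iter_content(chunk_size=64 * 1024):
        fh.write(buf)
resp.close()
```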

To repeat, your library should not be buffering whatsoever, and the current implementation should never have been checked in.

Apologies for the grumpy bug report earlier, but this has cost me dearly and most of that code should not have existed in the first instance.
