
Blob.download_to_file() downloads to RAM instead #3703

Closed
dw opened this issue Jul 31, 2017 · 2 comments
Labels: api: storage (Issues related to the Cloud Storage API), performance

Comments

dw commented Jul 31, 2017

Hello,

Blob.download_to_file() with no chunk size set causes the file to be pulled entirely into RAM before starting to write it to the supplied file object. This is due to google/resumable_media/download.py's "sans I/O" approach.

I can work around it by setting the chunk size parameter; however, this changes both what I am billed for (multiple GETs rather than a single GET) and the application's latency profile, all to paper over an obvious deficiency in the client library. So if you don't consider this a design flaw, then at least consider it a performance bug.
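For reference, a minimal sketch of that workaround (the bucket and object names here are placeholders, and this library requires the chunk size to be a multiple of 256 KiB):

```python
# Chunk-size workaround sketch: "my-bucket" and "big.bin" are placeholders.
# Setting chunk_size makes the library issue multiple ranged GETs instead
# of buffering the entire payload in memory before writing it out.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-bucket")

# chunk_size must be a multiple of 256 KiB; 10 MiB = 40 * 256 KiB.
blob = bucket.blob("big.bin", chunk_size=10 * 1024 * 1024)

with open("/tmp/big.bin", "wb") as fh:
    blob.download_to_file(fh)
```

Each chunk becomes its own ranged GET, which is exactly the billing and latency trade-off described above.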

Either way, I've lost 24 hours of data to an OOM triggered by this. The docstring should at a minimum be updated to warn the user that the library cares more about some trend from last week than about doing what you'd expect the simplest possible HTTP client talking to Gcloud could have done in the early 90s.

Apologies for the angst, but I'm leaving it in here, because you really need to know how disappointing it is to have a crash of this sort caused by a library function so conceptually simple.

pip freeze:

```
gapic-google-cloud-pubsub-v1==0.15.4
gapic-google-cloud-speech-v1==0.15.3
google-api-python-client==1.6.2
google-auth==1.0.1
google-auth-httplib2==0.0.2
google-cloud-core==0.25.0
google-cloud-language==0.25.0
google-cloud-pubsub==0.26.0
google-cloud-speech==0.25.1
google-cloud-storage==1.1.1
google-cloud-translate==0.24.0
google-gax==0.15.13
google-resumable-media==0.0.2
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.1
proto-google-cloud-pubsub-v1==0.15.4
proto-google-cloud-speech-v1==0.15.3
```
tseaver added the api: storage and performance labels Jul 31, 2017

dhermes (Contributor) commented Jul 31, 2017

> the library cares more about some trend from last week than doing what you'd expect the simplest possible HTTP client

I don't think that's called for, though I do appreciate you taking the time to file the issue. I designed google-resumable-media-python, so you can put the blame 100% on my shoulders.

The sans-I/O approach has nothing to do with the "missing feature" of piping the HTTP payload directly into a file. What's more, the behavior of the old code (i.e. google.cloud.streaming) is mostly the same, and that was written 3+ years ago.

The real issue is that the google-cloud-storage library doesn't do some kind of "smart chunking". It's definitely worth discussing. The reason there is no smart chunking is that we may not know the size of the file a priori, and it would take an extra round trip to determine the size before the first request. (Right now the google-cloud-storage package does a chunked or a full download, depending only on the chunk size.)
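To make that concrete, here is a hypothetical sketch of what such smart chunking could look like, paying the extra metadata round trip up front (the 32 MiB threshold and the smart_download helper are illustrative inventions, not library API):

```python
# Hypothetical "smart chunking" sketch -- not current library behavior.
# One extra request fetches the object's metadata (including its size),
# and the size then decides between a single-shot and a chunked download.
from google.cloud import storage

THRESHOLD = 32 * 1024 * 1024  # arbitrary cutoff, for illustration only
CHUNK = 10 * 1024 * 1024      # must be a multiple of 256 KiB

def smart_download(bucket, blob_name, file_obj):
    blob = bucket.blob(blob_name)
    blob.reload()  # the extra round trip: populates blob.size
    if blob.size is not None and blob.size > THRESHOLD:
        # Large object: recreate the blob with a chunk size so the
        # download is split into multiple ranged requests.
        blob = bucket.blob(blob_name, chunk_size=CHUNK)
    blob.download_to_file(file_obj)
```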

We have discussed strategies to reduce the amount of a file that ends up in RAM, but haven't had a chance to implement them yet.

dw (Author) commented Jul 31, 2017

It seems the sans-I/O approach requires a logical abstraction for the entire response, which necessitates it living temporarily on disk or in RAM.

I'm confused about this concept of smart chunking -- both the producer and the consumer of the stream already apply chunking (libc / the Python io library's buffering on the consumer side, and the kernel's TCP windowing on the producer side). It doesn't seem like you should be implementing any kind of policy in your code, considering both the kernel and the IO library are trivial to adjust as needs require.
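For comparison, a generic sketch of the streaming behavior being described (using the requests library; the URL and output path are placeholders), where memory use stays flat regardless of payload size:

```python
# Stream an HTTP payload to disk in fixed-size buffers. iter_content()
# yields data as it arrives, so nothing close to the whole body is ever
# held in memory; upstream flow control is the kernel's TCP windowing.
import requests

resp = requests.get("https://example.com/big.bin", stream=True)
resp.raise_for_status()
with open("/tmp/big.bin", "wb") as fh:
    for buf in resp.iter_content(chunk_size=64 * 1024):
        fh.write(buf)
resp.close()
```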

To repeat, your library should not be buffering whatsoever, and the current implementation should never have been checked in.

Apologies for the grumpy bug report earlier, but this has cost me dearly and most of that code should not have existed in the first instance.
