S3 streaming with unknown content size #139

Closed
shorea opened this issue Sep 5, 2017 · 24 comments
Labels
2.0 New closed-for-staleness feature-request A feature should be added or improved. p3 This is a minor priority issue

Comments

@shorea
Contributor

shorea commented Sep 5, 2017

It's unclear exactly what this would mean; some investigation is required. For uploads, I imagine we'd buffer the content: if it's greater than 5 MB, we would do a multipart upload with retries of individual parts. For downloads, this might be a special input stream returned to the caller that can reconnect on failure by issuing a range GET to retrieve the remaining bytes.
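
A minimal sketch of the upload half of that idea against the v2 low-level S3Client (bucket, key, and the input stream are placeholders): buffer up to one 5 MB part; if the stream ends inside the buffer, do a single PutObject with a known length, otherwise fall back to multipart, one buffered part at a time.

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
import software.amazon.awssdk.services.s3.model.CompletedPart;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnknownLengthUploader {
    private static final int PART_SIZE = 5 * 1024 * 1024; // S3's minimum part size

    static void upload(S3Client s3, String bucket, String key, InputStream in) throws IOException {
        byte[] buffer = new byte[PART_SIZE];
        int filled = fill(in, buffer);
        if (filled < PART_SIZE) {
            // Stream ended inside the first buffer: a single PutObject with a known length.
            s3.putObject(b -> b.bucket(bucket).key(key),
                    RequestBody.fromBytes(Arrays.copyOf(buffer, filled)));
            return;
        }
        // Bigger than one part: multipart upload, one buffered part at a time.
        String uploadId = s3.createMultipartUpload(b -> b.bucket(bucket).key(key)).uploadId();
        List<CompletedPart> parts = new ArrayList<>();
        for (int partNumber = 1; filled > 0; partNumber++) {
            int n = partNumber; // effectively final copy for the lambda
            String eTag = s3.uploadPart(
                    b -> b.bucket(bucket).key(key).uploadId(uploadId).partNumber(n),
                    RequestBody.fromBytes(Arrays.copyOf(buffer, filled))).eTag();
            parts.add(CompletedPart.builder().partNumber(n).eTag(eTag).build());
            filled = fill(in, buffer);
        }
        s3.completeMultipartUpload(b -> b.bucket(bucket).key(key).uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build()));
    }

    // Read until the buffer is full or the stream ends; returns the byte count.
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        for (int r; off < buf.length && (r = in.read(buf, off, buf.length - off)) >= 0; ) {
            off += r;
        }
        return off;
    }
}

Per-part retries would then just be a retry loop (or the SDK's own retry policy) around each uploadPart call.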

@rdifalco

rdifalco commented Sep 6, 2017

I'd like to vote for automatic reconnect on download, so you'd have something that looks like a BufferedInputStream. In one mode, each buffer-sized chunk would be a ranged GET that goes straight into the byte array; that way, a client that is slow to process the stream won't hit connection reset errors. With a good buffer size (maybe even an array of buffers loaded in parallel) it should be very fast while remaining stable. If a connection read error occurs in the short time it takes to copy the range into the byte buffer, I'd want it to retry using the client's configured retry configuration.
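
A sketch of what that reconnecting reader could look like on top of the v2 S3Client, in single-buffer mode only, with no parallel prefetch (class and parameter names are illustrative). Each refill is an independent ranged GET, so the client's configured retry policy applies to every chunk:

import software.amazon.awssdk.core.ResponseBytes;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectResponse;
import software.amazon.awssdk.services.s3.model.S3Exception;

import java.io.InputStream;

class RangedGetInputStream extends InputStream {
    private final S3Client s3;
    private final String bucket;
    private final String key;
    private final int chunkSize;
    private long pos = 0;
    private byte[] buf = new byte[0];
    private int bufPos = 0;
    private boolean eof = false;

    RangedGetInputStream(S3Client s3, String bucket, String key, int chunkSize) {
        this.s3 = s3;
        this.bucket = bucket;
        this.key = key;
        this.chunkSize = chunkSize;
    }

    @Override
    public int read() {
        if (bufPos == buf.length && !refill()) {
            return -1;
        }
        return buf[bufPos++] & 0xFF;
    }

    // Fetch the next chunk with a fresh ranged GET; a dropped connection costs
    // at most one chunk, and the SDK retry policy covers each request.
    private boolean refill() {
        if (eof) {
            return false;
        }
        String range = "bytes=" + pos + "-" + (pos + chunkSize - 1);
        try {
            ResponseBytes<GetObjectResponse> chunk =
                    s3.getObjectAsBytes(b -> b.bucket(bucket).key(key).range(range));
            buf = chunk.asByteArrayUnsafe();
        } catch (S3Exception e) {
            if (e.statusCode() == 416) { // requested range starts past the end of the object
                eof = true;
                return false;
            }
            throw e;
        }
        bufPos = 0;
        pos += buf.length;
        if (buf.length < chunkSize) {
            eof = true;
        }
        return buf.length > 0;
    }
}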

@MikeFHay

Support for retrying with non-resettable InputStreams would be nice. See aws/aws-sdk-java#427

@MikeFHay

Additionally, it looks to me like TransferManager doesn't support multipart uploads of data with an unknown size. This makes it pretty useless for big data.

@laymain

laymain commented Sep 28, 2018

Any updates on that issue?

@varunnvs92
Contributor

@laymain
We are focusing on getting the SDK 2.0 ready with low-level clients. Most of the high level libraries (TransferManager, DynamoDB mapper) won't be available until early 2019.

@laymain

laymain commented Sep 28, 2018

I understand, thank you for the answer.

@justnance justnance added feature-request A feature should be added or improved. and removed Feature Request labels Apr 19, 2019
@millems
Contributor

millems commented Jul 8, 2019

Tracking in #37

@millems millems closed this as completed Jul 8, 2019
@zoewangg zoewangg reopened this Jun 21, 2022
@zoewangg zoewangg changed the title Better support for InputStream in TransferManager S3 streaming with unknown content size Jun 21, 2022
@djchapm

djchapm commented Jun 21, 2022

Adding a use case. The code below is based on what's currently available in TransferManager 2.17.210-PREVIEW; there is a fromPublisher method that accepts Publisher<ByteBuffer>, so below is a simple demo along with my environment details.

Another use case, and more of a priority: uploading compressed content (or anything else) via a java.io.OutputStream. That is, we would create a Java GZIPOutputStream, write to it with ByteBuffer writes, and when we close the output stream, expect TransferManager to wrap things up regarding parts, content length, etc.

Hoping for code like:
... .requestBody(AsyncRequestBody.fromOutputStream(myOutputStream))

Here is a demo of the native error we get (shown in the commented-out code); it's caused by the missing content length in the put object request.

This works when using an AsyncRequestBody.fromXXX variant other than a Publisher (File or String). So the goal would be to update the content length on close(), and upload the parts based on the S3ClientConfiguration settings.

/**
 * Java 17
 * s3-transfer-manager-2.17.210-PREVIEW.jar
 * sdk-core-2.17.210.jar
 * reactor-core-3.4.13.jar
 * Native CLIv2 on macOS Monterey:
 *    aws-cli/2.7.7 Python/3.9.13 Darwin/21.5.0 source/x86_64
 */

// Imports assume the 2.17.x-PREVIEW package layout.
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.crt.Log;
import software.amazon.awssdk.transfer.s3.CompletedUpload;
import software.amazon.awssdk.transfer.s3.S3ClientConfiguration;
import software.amazon.awssdk.transfer.s3.S3TransferManager;
import software.amazon.awssdk.transfer.s3.Upload;
import software.amazon.awssdk.transfer.s3.UploadRequest;
import software.amazon.awssdk.transfer.s3.progress.LoggingTransferListener;
import reactor.core.publisher.Flux;

import java.nio.ByteBuffer;

class AWSTransferManagerTest {

    public static void main(String[] args) {
        S3ClientConfiguration s3TransferConfig = S3ClientConfiguration.builder()
                .targetThroughputInGbps(20.0)
                .minimumPartSizeInBytes(1024L)
                .build();
        S3TransferManager transferManager = S3TransferManager.builder()
                .s3ClientConfiguration(s3TransferConfig)
                .build();
        Flux<ByteBuffer> flux = Flux.just("one", "two", "three")
                .map(val -> ByteBuffer.wrap(val.getBytes()));
        // verify flux:
        // flux.subscribe(System.out::println);

        Log.initLoggingToFile(Log.LogLevel.Trace, "s3.logs");
        // output in s3.logs:
        // [INFO] [2022-06-16T15:19:49Z] [00007000056fd000] [S3Client] - id=0x7fab24928d20 Initiating making of meta request
        // [ERROR] [2022-06-16T15:19:49Z] [00007000056fd000] [S3MetaRequest] - Could not create auto-ranged-put meta request; there is no Content-Length header present.
        // [ERROR] [2022-06-16T15:19:49Z] [00007000056fd000] [S3Client] - id=0x7fab24928d20: Could not create new meta request.
        // [WARN] [2022-06-16T15:19:50Z] [0000700005d0f000] [JavaCrtGeneral] - Not all native threads were successfully joined during gentle shutdown.  Memory may be leaked.

        // Fails at the CRT layer (see the log lines above) because the publisher
        // cannot report a Content-Length for the put object request.
        Upload upload =
                transferManager.upload(UploadRequest.builder()
                        .putObjectRequest(b -> b.bucket("bucket").key("tmp/flux.bin"))
                        .requestBody(AsyncRequestBody.fromPublisher(flux))
                        .overrideConfiguration(b -> b.addListener(LoggingTransferListener.create()))
                        .build());
        CompletedUpload completedUpload = upload.completionFuture().join();
    }
}

@yasminetalby yasminetalby added the p3 This is a minor priority issue label Nov 28, 2022
@djchapm

djchapm commented May 12, 2023

Hey check this out - awslabs/aws-c-s3#285
So using the AWS S3 v2 API, S3AsyncClient.crtBuilder() and S3TransferManager, we may soon be able to close out this 8-year-old ticket! (8 years due to ticket chaining; the chained tickets all ask for a way to stream data to S3.)

@yasminetalby / @zoewangg Can we maybe move up the priority on this considering the action on the CRT Native Client?

Thanks. Also, I have some updated code; the snippet above was written against the S3 v2 beta, and locally I have updated it for v2.20.51.

Regards!

@zoewangg
Contributor

Hey @djchapm, yes! We are aware of this change and will bump CRT Java binding version once it's available. There should be no additional change needed on the Java SDK side.

@djchapm

djchapm commented May 23, 2023

Looks like awslabs/aws-c-s3#285 is merged. @zoewangg, can you confirm which version of the SDK will contain this change?

@zoewangg
Contributor

Hi @djchapm, the change is not available in the Java binding yet, we'll let you know once it's ready.

@TingDaoK

TingDaoK commented Jun 5, 2023

@zoewangg The latest release of aws-crt-java@0.22.1 supports the unknown content-length now.

@zoewangg
Contributor

zoewangg commented Jun 5, 2023

Submitted PR to bump CRT version #4065

@djchapm

djchapm commented Jun 8, 2023

Hey @zoewangg - it's working great! I haven't done any performance comparisons but will report. Definitely reduces a ton of code though. So far I've only tried with a few MBs for relatively short durations. I'll want to try and keep things open for a few Gigs and hours before we conclude anything.

What do you think about adding the ability to update the metadata just before we trigger the completion of the output stream? I've tried digging into this and obtaining a handle to the metadata map, but it is converted to an unmodifiable map in the UploadRequest, and I'm not sure where in the process the metadata is actually uploaded. If it happens at the end, though, do you think this would be a possibility?

Our only alternative right now is to do a copy to update the metadata, which in our experience can take from 1 to 20 minutes.
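
For reference, that copy-based workaround amounts to copying the object onto itself with a REPLACE metadata directive. A rough sketch against the v2 S3Client (S3 re-copies the bytes server-side, which is why it can take minutes on large objects):

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.MetadataDirective;

import java.util.Map;

class MetadataRewrite {
    // Self-copy with REPLACE swaps in the new metadata; without REPLACE,
    // S3 rejects a copy of an object onto itself.
    static void replaceMetadata(S3Client s3, String bucket, String key,
                                Map<String, String> metadata) {
        s3.copyObject(b -> b
                .sourceBucket(bucket).sourceKey(key)
                .destinationBucket(bucket).destinationKey(key)
                .metadataDirective(MetadataDirective.REPLACE)
                .metadata(metadata));
    }
}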

@djchapm

djchapm commented Jun 15, 2023

I have an implementation that works, but I can't figure out what is taking so long once I mark the stream complete. I have a TransferListener hooked up and there is maybe only one more call to listener.bytesTransferred after I have closed. But then it seems to go into a black hole for about a minute before calling the transferComplete method.

I don't see any way to forcefully tell the Upload/s3Client/TransferManager that we're done and to get it over with. I can't really debug because it's all happening async in native-land.

Any thoughts? I'm only dealing with a 140 MB upload. For our solution's metrics this amounts to about 7,000 "messages" per second, whereas the old solution on AWS SDK v1 (where we manage the multiparts ourselves, keep track of the content size, and then update the metadata after the fact using a copy) achieves 224,000 messages per second in the same exact test. So this bottleneck at the end seems like a pretty big issue. :(

@djchapm

djchapm commented Jun 16, 2023

Having some random failures and some slowness in completing uploads. I uploaded a Java test project and created issue awslabs/aws-c-s3#317.
@zoewangg It would be great if you could check the zip I uploaded; I'm not sure whether the issue is in the TransferManager implementation or in aws-c-s3.

@zoewangg
Contributor

Hi @djchapm, yeah, we will work with the CRT team to investigate this issue.

To answer your previous question (sorry for the delay)

What do you think about adding the ability to update the metadata just before we trigger the completion of the output stream?

Just for my understanding, is there any reason you can't set the metadata in the original request? I think metadata is being sent in CreateMultipartUploadRequest, so it happens before streaming starts.
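
For example, reusing the transferManager and flux from the demo earlier in this thread, metadata attached to the original request goes out with the CreateMultipartUpload call, before any bytes are streamed (the key/value pair here is a placeholder):

import java.util.Map;

Upload upload = transferManager.upload(UploadRequest.builder()
        .putObjectRequest(b -> b.bucket("bucket").key("tmp/flux.bin")
                .metadata(Map.of("processed-count", "0"))) // placeholder metadata
        .requestBody(AsyncRequestBody.fromPublisher(flux))
        .build());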

@djchapm

djchapm commented Jun 20, 2023

Yeah, we're attaching metadata/indicators to our files based on what's been processed in the file, for disaster recovery scenarios in streaming. Those indicators are typically cumulative counts or 'last entry' markers for the object, which can't be calculated ahead of time.

@djchapm

djchapm commented Jun 20, 2023

Opened a feature request to support using this via a java.io.OutputStream implementation by opening access to BlockingOutputStreamAsyncRequestBody: #4119

My bad... it is open.

@djchapm

djchapm commented Jul 6, 2023

Hey @zoewangg, I can confirm that this is working. I've done several rounds of testing with streaming to S3OutputStream, up to 17 GB in size. A potential drawback for some people is that the content-size metadata is not attached to the object in S3; that might be a good thing to add as a feature request.

For anyone looking for example code you can see a temp solution I posted on related ticket awslabs/aws-c-s3#317

Thanks!
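
For anyone landing here later, a minimal sketch of the pattern described above, using the GA APIs (roughly SDK 2.20.x with a current aws-crt; bucket and key are placeholders). AsyncRequestBody.forBlockingOutputStream(null) declares an unknown content length, and the CRT-based client multiparts the stream as the bytes arrive:

import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.core.async.BlockingOutputStreamAsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectResponse;

import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.zip.GZIPOutputStream;

class S3OutputStreamDemo {
    public static void main(String[] args) throws Exception {
        // Unknown-length uploads need the CRT-based client; the default
        // Java-based async client still requires a Content-Length.
        try (S3AsyncClient s3 = S3AsyncClient.crtBuilder().build()) {
            // null = content length not known up front
            BlockingOutputStreamAsyncRequestBody body =
                    AsyncRequestBody.forBlockingOutputStream(null);

            CompletableFuture<PutObjectResponse> response =
                    s3.putObject(req -> req.bucket("my-bucket").key("tmp/stream.gz"), body);

            // GZIPOutputStream wrapping matches the compressed-upload use case above.
            try (OutputStream out = new GZIPOutputStream(body.outputStream())) {
                out.write("streamed with no known size".getBytes(StandardCharsets.UTF_8));
            }
            response.join();
        }
    }
}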

@debora-ito
Member

@djchapm glad that this is working for you.

Marking this to auto-close soon. Please let us know if there are any further questions.

@debora-ito debora-ito added the closing-soon This issue will close in 4 days unless further comments are made. label Aug 28, 2023
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will close in 4 days unless further comments are made. labels Aug 30, 2023
@guss77

guss77 commented Sep 2, 2023

Is this issue supposed to be fixed already in the current AWS Java SDK (2.20.140)? In my tests today I still get the error S3Exception: You must provide the Content-Length HTTP header. (Service: S3, Status Code: 411... when not providing the content length in the PutObjectRequest.

@AndreasChristianson

I'm still seeing this error on software.amazon.awssdk:s3:2.20.63
