Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store MD5 in a separate metadata field and use it when ETag is not an MD5 hash #924

Merged
merged 2 commits into from
Mar 2, 2021

Conversation

andrewshadura
Copy link
Contributor

Description of the Change

The S3 backend relies on ETag S3 returns being equal to the MD5 of the object, but it’s not necessarily true. For that purpose we store the MD5 object in a separate metadata field as well to make sure it isn’t lost. When the value returned clearly doesn’t look like a valid MD5 hash (length isn’t exactly 32 characters), attempt to retrieve the MD5 hash possibly stored in the metadata.

We cannot always do this since user-defined metadata isn’t returned by the ListObjects call, so verifying it for each object is expensive as it requires one HEAD request per each object.

This pull request fixes #923

Requirements

All new code should be covered with tests, documentation should be updated. CI should pass.

Checklist

  • unit-test added (if change is algorithm)
  • functional test added/updated (if change is functional)
  • man page updated (if applicable)
  • bash completion updated (if applicable)
  • documentation updated
  • author name in AUTHORS

@andrewshadura
Copy link
Contributor Author

andrewshadura commented Feb 20, 2021

Ping? @smira, @lbolla?

s3/public.go Outdated Show resolved Hide resolved
s3/public.go Show resolved Hide resolved
@lbolla
Copy link
Contributor

lbolla commented Mar 2, 2021

Can you also rebase against the latest master please, so tests should now pass.

@codecov
Copy link

codecov bot commented Mar 2, 2021

Codecov Report

Merging #924 (599e67a) into master (af2564c) will decrease coverage by 0.12%.
The diff coverage is 30.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #924      +/-   ##
==========================================
- Coverage   61.41%   61.28%   -0.13%     
==========================================
  Files          54       54              
  Lines        5911     5928      +17     
==========================================
+ Hits         3630     3633       +3     
- Misses       1757     1770      +13     
- Partials      524      525       +1     
Impacted Files Coverage Δ
s3/public.go 58.05% <30.00%> (-3.14%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af2564c...599e67a. Read the comment docs.

The S3 backend relies on ETag S3 returns being equal to the MD5 of the
object, but it’s not necessarily true. For that purpose we store the MD5
object in a separate metadata field as well to make sure it isn’t lost.

From https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html:

> The entity tag is a hash of the object. The ETag reflects changes only
> to the contents of an object, not its metadata. The ETag may or may not
> be an MD5 digest of the object data. Whether or not it depends on how
> the object was created and how it is encrypted as described below:
>
> Objects created by the PUT Object, POST Object, or Copy operation,
> or through the AWS Management Console, and are encrypted by SSE-S3 or
> plaintext, have ETags that are an MD5 digest of their object data.
>
> Objects created by the PUT Object, POST Object, or Copy operation,
> or through the AWS Management Console, and are encrypted by SSE-C or
> SSE-KMS, have ETags that are not an MD5 digest of their object data.
>
> If an object is created by either the Multipart Upload or Part Copy
> operation, the ETag is not an MD5 digest, regardless of the method
> of encryption.

Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
The S3 backend relies on ETag S3 returns being equal to the MD5 of the
object, but it’s not necessarily true. When the value returned clearly
doesn’t look like a valid MD5 hash (length isn’t exactly 32 characters),
attempt to retrieve the MD5 hash possibly stored in the metadata.

We cannot always do this since user-defined metadata isn’t returned by
the ListObjects call, so verifying it for each object is expensive as it
requires one HEAD request per each object.

This commit fixes aptly-dev#923.

Signed-off-by: Andrej Shadura <andrew.shadura@collabora.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

S3 backend relies on ETag being MD5
3 participants