
Paper over GCS backend corruption issues #5804

Conversation

@KJTsanaktsidis (Contributor) commented Nov 16, 2018

We've been getting occasional issues with corruption in our GCS storage backend for Vault, probably as a result of a pathological workload that cancels in-flight lease renewal requests, which I described in this comment: #5419 (comment)

One thing I noticed about our workload is that on one host, we have an instance of consul-template in a restart loop. The task authenticates to Vault with a GCE metadata token, hands the resulting Vault token to consul-template, and uses consul-template to fetch a dynamic MongoDB username/password secret. consul-template then tries to render the template to disk, but fails because it doesn't have write permission on the target directory.

consul-template appears to renew the secret in the background once, immediately after it's fetched. I see trace messages like [TRACE] vault.read(db/mongodb/creds/mongodb-out00-instance-monitor): starting renewer from the consul-template process, but never see the corresponding successfully renewed log message. In most cases the process crashes on the failed template render before the renewal can complete.

We noticed the corruption because our Vault instance failed to unseal, printing error messages like the following:

[ERROR] expiration: error restoring leases: error="failed to read lease entry: decryption failed: cipher: message authentication failed"

Upon inspection, we found an entry in Google Cloud Storage that seemed too small to be correctly formed: whilst most files in this directory of the bucket were around 2 KB, this one was only 512 bytes:

gsutil stat gs://vaultgcs_backend_us1_staging/sys/expire/id/db/mongodb/creds/mongodb-out00-instance-monitor/2AfcHOiE7v8oVSEWKocer2Fm
gs://vaultgcs_backend_us1_staging/sys/expire/id/db/mongodb/creds/mongodb-out00-instance-monitor/2AfcHOiE7v8oVSEWKocer2Fm:
    Creation time:          Thu, 15 Nov 2018 03:14:21 GMT
    Update time:            Thu, 15 Nov 2018 03:14:21 GMT
    Storage class:          REGIONAL
    Content-Length:         512
    Content-Type:           application/octet-stream
    Hash (crc32c):          mQECHw==
    Hash (md5):             UX1Rnde7/sB0KWA5ouv5Tg==
    ETag:                   CKb3sJO31d4CEAE=
    Generation:             1542251661245350
    Metageneration:         1

I thought the 512-byte file size was extremely suspicious. Deleting the file from GCS made Vault start up correctly again.

I couldn't find out exactly what was causing the upload to behave this way, but I'm making this PR with two improvements that should make it less of an issue:

  • Print out which lease ID failed to be read, so operators can take some action on the corrupt entry.
  • Provide GCS with the MD5 of the data being uploaded; GCS will fail the upload if what it receives does not match the MD5 that was provided.

We're having issues with leases in the GCS backend storage being corrupted and failing MAC checking. When that happens, we need to know the lease ID so we can address the corruption by hand and take appropriate action.
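
A rough, self-contained sketch of that first change, assuming hypothetical names (loadLease, readLease) rather than Vault's actual identifiers; the point is simply that the lease ID is included in the error so an operator can locate the corrupt entry in storage:

package main

import (
    "errors"
    "fmt"
)

// readLease stands in for the storage read that fails when the stored entry
// cannot be decrypted ("cipher: message authentication failed").
// It is a placeholder, not Vault's actual code path.
func readLease(leaseID string) ([]byte, error) {
    return nil, errors.New("decryption failed: cipher: message authentication failed")
}

// loadLease wraps the underlying error with the lease ID, so the log line
// tells an operator exactly which entry to inspect or delete.
func loadLease(leaseID string) ([]byte, error) {
    raw, err := readLease(leaseID)
    if err != nil {
        return nil, fmt.Errorf("failed to read lease entry %s: %v", leaseID, err)
    }
    return raw, nil
}

func main() {
    _, err := loadLease("db/mongodb/creds/mongodb-out00-instance-monitor/2AfcHOiE7v8oVSEWKocer2Fm")
    if err != nil {
        fmt.Println(err)
    }
}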
Providing the MD5 will hopefully prevent any instances of incomplete data being sent to GCS.
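
A minimal sketch of the second change, using the cloud.google.com/go/storage client; this is not the backend's actual code, and the object key below is made up. The writer is told the MD5 of the payload up front, so the server rejects the object if the bytes it receives don't match:

package main

import (
    "context"
    "crypto/md5"
    "log"

    "cloud.google.com/go/storage"
)

// putWithChecksum uploads value to bucket/key, supplying the MD5 of the
// payload so GCS fails the upload on a mismatch instead of storing a
// truncated or corrupted object.
func putWithChecksum(ctx context.Context, client *storage.Client, bucket, key string, value []byte) error {
    sum := md5.Sum(value)

    w := client.Bucket(bucket).Object(key).NewWriter(ctx)
    w.MD5 = sum[:] // server-side integrity check

    if _, err := w.Write(value); err != nil {
        w.Close()
        return err
    }
    // Close finalizes the upload; an MD5 mismatch surfaces as an error here.
    return w.Close()
}

func main() {
    ctx := context.Background()
    client, err := storage.NewClient(ctx) // uses ambient GCP credentials
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    err = putWithChecksum(ctx, client, "vaultgcs_backend_us1_staging", "sys/expire/id/example", []byte("lease entry bytes"))
    if err != nil {
        log.Fatal(err)
    }
}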
@jefferai (Member) commented:

This seems like a good thing to have generally.

@jefferai jefferai merged commit 7bf3c14 into hashicorp:master Nov 16, 2018
@jefferai jefferai added this to the 1.0 milestone Nov 16, 2018