
Paper over GCS backend corruption issues #5804

Conversation

@KJTsanaktsidis (Contributor) commented Nov 16, 2018

We've been getting occasional issues with corruption in our GCS storage backend for Vault, probably as a result of a pathological workload that cancels in-flight lease renewal requests, which I described in this comment: #5419 (comment)

One thing I noticed about our workload is that on one host, we have an instance of consul-template in a restart loop. The task authenticates to Vault with a GCE metadata token, hands the resulting Vault token to consul-template, and uses consul-template to fetch a dynamic MongoDB username/password secret. consul-template then tries to render the template to disk, but fails because it doesn't have write permission on the target directory.

consul-template appears to renew the secret in the background once, immediately after it's fetched. I see trace messages like [TRACE] vault.read(db/mongodb/creds/mongodb-out00-instance-monitor): starting renewer from the consul-template process, but never see the corresponding successfully renewed log message. In most cases the process crashes on the failed template render before the renewal can complete.

We noticed the corruption because our Vault instance failed to unseal, printing error messages like the following:

[ERROR] expiration: error restoring leases: error="failed to read lease entry: decryption failed: cipher: message authentication failed"

Upon inspection, we found an entry in Google Cloud Storage that seemed too small to be correctly formed: whilst most files in this directory of the bucket were around 2 KB, this one was only 512 bytes:

gsutil stat gs://vaultgcs_backend_us1_staging/sys/expire/id/db/mongodb/creds/mongodb-out00-instance-monitor/2AfcHOiE7v8oVSEWKocer2Fm
gs://vaultgcs_backend_us1_staging/sys/expire/id/db/mongodb/creds/mongodb-out00-instance-monitor/2AfcHOiE7v8oVSEWKocer2Fm:
    Creation time:          Thu, 15 Nov 2018 03:14:21 GMT
    Update time:            Thu, 15 Nov 2018 03:14:21 GMT
    Storage class:          REGIONAL
    Content-Length:         512
    Content-Type:           application/octet-stream
    Hash (crc32c):          mQECHw==
    Hash (md5):             UX1Rnde7/sB0KWA5ouv5Tg==
    ETag:                   CKb3sJO31d4CEAE=
    Generation:             1542251661245350
    Metageneration:         1

I thought the 512-byte file size was extremely suspicious. Deleting the file from GCS made Vault start up correctly again.

I couldn't find out exactly what was causing the upload to behave this way, but I'm making this PR with two improvements that should make it less of an issue:

  • Print out which lease ID failed to be read, so operators can take some action on the corrupt entry.
  • Provide GCS with the MD5 of the data being uploaded; GCS will fail the upload if what it receives does not match the MD5 that was provided.

We're having issues with leases in the GCS backend storage being corrupted and failing MAC checking. When that happens, we need to know the lease ID so we can address the corruption by hand and take appropriate action.
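
A rough, self-contained sketch of that first change, assuming hypothetical names (loadLease, readLease) rather than Vault's actual identifiers; the point is simply that the lease ID is included in the error so an operator can locate the corrupt entry in storage:

package main

import (
    "errors"
    "fmt"
)

// readLease stands in for the storage read that fails when the stored entry
// cannot be decrypted ("cipher: message authentication failed").
// It is a placeholder, not Vault's actual code path.
func readLease(leaseID string) ([]byte, error) {
    return nil, errors.New("decryption failed: cipher: message authentication failed")
}

// loadLease wraps the underlying error with the lease ID, so the log line
// tells an operator exactly which entry to inspect or delete.
func loadLease(leaseID string) ([]byte, error) {
    raw, err := readLease(leaseID)
    if err != nil {
        return nil, fmt.Errorf("failed to read lease entry %s: %v", leaseID, err)
    }
    return raw, nil
}

func main() {
    _, err := loadLease("db/mongodb/creds/mongodb-out00-instance-monitor/2AfcHOiE7v8oVSEWKocer2Fm")
    if err != nil {
        fmt.Println(err)
    }
}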
Providing the MD5 will hopefully prevent any instances of incomplete data being sent to GCS.
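
A minimal sketch of the second change, using the cloud.google.com/go/storage client; this is not the backend's actual code, and the object key below is made up. The writer is told the MD5 of the payload up front, so the server rejects the object if the bytes it receives don't match:

package main

import (
    "context"
    "crypto/md5"
    "log"

    "cloud.google.com/go/storage"
)

// putWithChecksum uploads value to bucket/key, supplying the MD5 of the
// payload so GCS fails the upload on a mismatch instead of storing a
// truncated or corrupted object.
func putWithChecksum(ctx context.Context, client *storage.Client, bucket, key string, value []byte) error {
    sum := md5.Sum(value)

    w := client.Bucket(bucket).Object(key).NewWriter(ctx)
    w.MD5 = sum[:] // server-side integrity check

    if _, err := w.Write(value); err != nil {
        w.Close()
        return err
    }
    // Close finalizes the upload; an MD5 mismatch surfaces as an error here.
    return w.Close()
}

func main() {
    ctx := context.Background()
    client, err := storage.NewClient(ctx) // uses ambient GCP credentials
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    err = putWithChecksum(ctx, client, "vaultgcs_backend_us1_staging", "sys/expire/id/example", []byte("lease entry bytes"))
    if err != nil {
        log.Fatal(err)
    }
}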
@jefferai (Member) commented:

This seems like a good thing to have generally.

@jefferai jefferai merged commit 7bf3c14 into hashicorp:master Nov 16, 2018
@jefferai jefferai added this to the 1.0 milestone Nov 16, 2018