
[VAULT-17827] Rollback manager worker pool #22567

Merged: 6 commits into main from miagilepner/rollback-manager-worker-pool on Sep 4, 2023

Conversation

@miagilepner (Contributor) commented Aug 25, 2023

(Description is still a WIP)

This PR adds a worker pool to the rollback manager with a default size of 256. The size of the worker pool can be adjusted with the environment variable VAULT_ROLLBACK_WORKERS.

Considerations:

  • The worker pool removes the goroutine scheduling pressure:
    [Scheduler latency profile with unlimited workers, 9000 mounts: image omitted]
    [Scheduler latency profile with 256 workers, 9000 mounts: image omitted]
  • The worker pool queue is limited by the number of mounts, because the rollback manager ensures that there's never more than one operation submitted to the worker pool per mount (a simplified sketch of this de-duplication follows this list).
  • If backends take longer than 60 seconds to complete their rollback operation, then the number of workers isn't able to keep up. The queue remains stable in size, but rollbacks are triggered less often. Rollback operations have a request context timeout of 90 seconds, which means that if all of the mounts are timing out, rollbacks could end up triggering only every (# mounts / # workers) * 90 seconds rather than every 60 seconds. For example, with 9000 mounts and 256 workers that works out to roughly 53 minutes between rollbacks for a given mount.
  • Rollback operations can cause backends to do two things: trigger their PeriodicFunc and call WALRollback with a collection of WAL entries. To be clear, these WAL entries are not the same WAL that Vault uses for replication. This is a separate, namespace/mount-scoped storage location, and the path is only written to by plugins via framework.PutWAL. By default, the WAL entries that get passed to the WALRollback function are any entries older than 10 minutes (a minimal plugin-side sketch of these APIs follows this list).
  • Unmount and remount operations trigger a rollback through the rollback manager, then wait for the rollback to complete before continuing. Because we're now using a worker pool, it's possible that unmounts and remounts will take longer to complete. Note that unmount and remount can be called by replication invalidation operations.
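As referenced in the list above, here is a simplified sketch of the per-mount de-duplication idea. This is illustrative Go, not the PR's actual code; the type, field, and function names are assumptions.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative sketch only; names and structure are assumptions, not Vault's real code.
type rollbackManager struct {
	mu       sync.Mutex
	inflight map[string]struct{} // mounts with a rollback queued or running
	tasks    chan func()         // queue feeding a fixed set of workers
}

func newRollbackManager(workers, maxQueue int) *rollbackManager {
	m := &rollbackManager{
		inflight: make(map[string]struct{}),
		tasks:    make(chan func(), maxQueue),
	}
	for i := 0; i < workers; i++ {
		go func() {
			for task := range m.tasks {
				task()
			}
		}()
	}
	return m
}

func (m *rollbackManager) startRollback(mountPath string) {
	m.mu.Lock()
	if _, running := m.inflight[mountPath]; running {
		m.mu.Unlock()
		return // already queued or running: at most one rollback per mount
	}
	m.inflight[mountPath] = struct{}{}
	m.mu.Unlock()

	m.tasks <- func() {
		defer func() {
			m.mu.Lock()
			delete(m.inflight, mountPath)
			m.mu.Unlock()
		}()
		// ... issue the rollback operation for mountPath here ...
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	m := newRollbackManager(4, 16)
	for _, p := range []string{"secret/", "database/", "secret/"} {
		m.startRollback(p) // the duplicate "secret/" is skipped
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("queued rollbacks complete")
}

The invariant that matters is that a mount path is only re-submitted after its previous rollback finishes, so the queue length stays bounded by the number of mounts.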
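And for the WAL mechanics mentioned above, a minimal plugin-side sketch using the SDK's framework.PutWAL, framework.DeleteWAL, and a WALRollback handler. The backend, kind, and field names here are made up for illustration.

package myplugin

import (
	"context"
	"time"

	"github.com/hashicorp/vault/sdk/framework"
	"github.com/hashicorp/vault/sdk/logical"
)

// Illustrative backend wiring: WALRollback receives entries older than
// WALRollbackMinAge (10 minutes by default) each time the rollback manager runs.
func backend() *framework.Backend {
	return &framework.Backend{
		WALRollback:       walRollback,
		WALRollbackMinAge: 10 * time.Minute,
	}
}

// createUser is a made-up operation: it records intent in the WAL before doing
// external work, so a partial failure can be cleaned up later by walRollback.
func createUser(ctx context.Context, req *logical.Request) error {
	walID, err := framework.PutWAL(ctx, req.Storage, "new_user", map[string]interface{}{
		"username": "example-user",
	})
	if err != nil {
		return err
	}
	// ... call the external system to create the user ...
	// On success, remove the WAL entry so it is never handed to walRollback.
	return framework.DeleteWAL(ctx, req.Storage, walID)
}

func walRollback(ctx context.Context, req *logical.Request, kind string, data interface{}) error {
	if kind != "new_user" {
		return nil
	}
	// ... best-effort cleanup of the partially created user ...
	return nil
}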

@github-actions (bot) added the hashicorp-contributed-pr label (HashiCorp, i.e. not-community, contributed) Aug 25, 2023
@github-actions (bot) commented Aug 25, 2023

CI Results:
All Go tests succeeded! ✅

@miagilepner miagilepner added this to the 1.15 milestone Aug 28, 2023
@miagilepner miagilepner requested a review from banks August 29, 2023 09:24
@banks (Member) left a comment

This looks awesome Mia!

Unmount and remount operations trigger a rollback through the rollback manager, then wait for the rollback to complete before continuing. Because we're now using a worker pool, it's possible that unmounts and remounts will take longer to complete. Note that unmount and remount can be called by replication invalidation operations.

This is such a good callout. I had not appreciated the interaction with replication which is certainly an additional side-effect it's good to consider and anticipate.

I guess one additional factor this makes me think about is that rollbacks might be making calls to external systems, e.g. cleaning up external database credentials. In that case limiting concurrency could be more of a big deal. Say for example there are 10,000 database secret mounts but the Vault active node suddenly has a few seconds of latency caused by network issues to that database provider. If we also assume that these mounts are all active, and that the increased latency causes an elevated failure rate, then rollback operations are needed every minute. Now with just 256 concurrent rollbacks, if each one takes say 10 seconds due to the latency to the provider, it will take about 6.5 minutes to get through all of them. Even then that's probably not the end of the world unless it blocks replication for that time due to an unmount.

Can you confirm if it would? My mental model is that in this case the secondary would not have anything to rollback at least with an external provider because it's not the primary - i.e. the primary might get in the state above and have slow rollbacks, but on the secondary they'd only ever be rolling back internal state in our own store right? If so then that seems to mitigate the worst risks I could think of above.

Code Considerations

The worker pool package you found looks solid to me and beats re-inventing one or working around the quirks of fairshare for this use case. I wonder if eventually we might need yet another worker pool implementation that has dynamic pool sizing so we can do adaptive concurrency control for request handling (@mpalmi is working on this). (We already have a few other implicit pools in the codebase, e.g. the pool of flushers in the Consul backend.)

I guess we can cross that bridge when we get to it though. It would be kind of a shame to end up with so many different worker pool variants but then I don't think we should scope creep and this one looks great for the task at hand.

I left a few comments inline. I think overall my biggest feedback is that I wonder how important it is to keep the 0 == no pool behaviour at the expense of more code and more things to test. In practice, just setting the number of workers to 999999999 seems to mitigate virtually all the risks I can think of, especially given the pool implementation chosen executes virtually the same lines of code as the "no pool" option right up to the point it hits the limit 🤷. That would simplify the code a tiny bit but also remove test cases and things to maintain in the future.

Comment on lines 101 to 103
func (g *goroutineRollbackRunner) StopWait() {
g.inflightAll.Wait()
}
banks (Member) commented:

I'm wondering if StopWait should do something that prevents any further Submit calls? We could leave it up to the calling code to ensure it stops calling Submit after this but it seems like it could be subtle and potentially cause live locks where some shutdown process is waiting on this but hasn't yet stopped some other process from submitting new things? Not read all of this yet so it might not be important, just a thought.

miagilepner (Contributor, Author) replied:

Good point, there is a risk of a panic if a submit happens after StopWait(). I've added this to prevent that: https://github.com/hashicorp/vault/pull/22567/files#diff-067fef428afca5813d7e1a68bf8b43f371d6106b77056e97cc07fbb825be65baR219-R235
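The linked diff isn't reproduced here, but one way such a guard can look, as a rough sketch (illustrative only, not necessarily the exact change):

package rollback

import (
	"errors"
	"sync"
)

// Rough sketch of guarding Submit after StopWait; illustrative only, not the exact
// linked change. Once stopped, new submissions are rejected instead of racing the
// WaitGroup during shutdown.
type goroutineRollbackRunner struct {
	mu          sync.Mutex
	stopped     bool
	inflightAll sync.WaitGroup
}

func (g *goroutineRollbackRunner) Submit(task func()) error {
	g.mu.Lock()
	if g.stopped {
		g.mu.Unlock()
		return errors.New("rollback runner is stopped")
	}
	g.inflightAll.Add(1)
	g.mu.Unlock()

	go func() {
		defer g.inflightAll.Done()
		task()
	}()
	return nil
}

func (g *goroutineRollbackRunner) StopWait() {
	g.mu.Lock()
	g.stopped = true
	g.mu.Unlock()
	g.inflightAll.Wait()
}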

(Additional inline review comments on vault/rollback.go and vault/rollback_test.go were marked resolved.)
@miagilepner (Contributor, Author) commented Aug 31, 2023

Can you confirm if it would? My mental model is that in this case the secondary would not have anything to rollback at least with an external provider because it's not the primary - i.e. the primary might get in the state above and have slow rollbacks, but on the secondary they'd only ever be rolling back internal state in our own store right? If so then that seems to mitigate the worst risks I could think of above.

DR secondaries don't start the rollback manager.
Performance secondaries do have a rollback manager, and if a mount is replicated then the WALs for that mount get replicated. That means that both the primary and the secondary cluster have their WALRollback function triggered. Some backends (like the AWS secrets backend) check if they're running on a performance secondary and if so, do nothing. But the database backends don't have any such check, and they use the rollback to clear their connections and update user credentials for the database. I'm not sure if this behavior is intended, but I imagine it's not, since it would mean multiple clusters trying to set the creds in the database.
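For illustration, the kind of performance-secondary guard described above might look roughly like this in a plugin (a sketch, not the AWS backend's actual code):

package myplugin

import (
	"context"

	"github.com/hashicorp/vault/sdk/framework"
	"github.com/hashicorp/vault/sdk/helper/consts"
	"github.com/hashicorp/vault/sdk/logical"
)

// Sketch of the kind of guard mentioned above; illustrative only. External cleanup
// is skipped on performance secondaries so only the primary talks to the external system.
func walRollbackGuarded(b *framework.Backend) framework.WALRollbackFunc {
	return func(ctx context.Context, req *logical.Request, kind string, data interface{}) error {
		if b.System().ReplicationState().HasState(consts.ReplicationPerformanceSecondary) {
			return nil // leave external cleanup to the primary cluster
		}
		// ... perform the external cleanup here ...
		return nil
	}
}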

Actually, many backends exclude the WAL prefix from being replicated: https://github.com/hashicorp/vault/blob/main/builtin/logical/database/backend.go#L97. This means that a performance replica's WAL rollbacks would only be duplicated effort if the backend doesn't set the path to local.
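For reference, a sketch of what that exclusion typically looks like in a backend definition, modeled on the linked database backend (not a verbatim copy):

package myplugin

import (
	"github.com/hashicorp/vault/sdk/framework"
	"github.com/hashicorp/vault/sdk/logical"
)

// Typical pattern (modeled on the linked backend): marking the WAL prefix as local
// storage keeps WAL entries out of replication, so performance secondaries don't
// repeat the same rollback work.
func backendWithLocalWAL() *framework.Backend {
	return &framework.Backend{
		PathsSpecial: &logical.Paths{
			LocalStorage: []string{framework.WALPrefix},
		},
	}
}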

The Azure secrets and auth backends are the only builtin plugins that have WALRollbackFuncs but don't set the WAL path to local.

@miagilepner miagilepner force-pushed the miagilepner/rollback-manager-worker-pool branch from eb18aee to c03fe11 Compare August 31, 2023 14:09
@miagilepner miagilepner marked this pull request as ready for review August 31, 2023 14:14
@miagilepner miagilepner requested a review from a team as a code owner August 31, 2023 14:14
@github-actions (bot) commented:

Build Results:
All builds succeeded! ✅

@schavis (Contributor) commented Aug 31, 2023

@miagilepner Question for the purposes of the doc review:

and if that variable is set to a value less than 0, then no worker pool is use

If a value less than 0 means "no pool" and a value greater than 0 defines the pool size, what does a value of 0 do?

Also, where were you planning on documenting the new environment variable? I only see metric partials in the docs atm.

@schavis schavis requested review from schavis and removed request for a team August 31, 2023 17:58
@miagilepner (Contributor, Author) replied:

If a value less than 0 means "no pool" and a value greater than 0 defines the pool size, what does a value of 0 do?

I've updated the PR description and the code. VAULT_ROLLBACK_WORKERS must be greater than or equal to 1. If it's less than 1, we use the default value of 256.
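For illustration, a hypothetical helper matching these semantics (not the PR's actual code) could look like:

package rollback

import (
	"os"
	"strconv"
)

// Hypothetical helper (not the PR's actual code) matching the described semantics:
// unset, unparsable, or sub-1 values fall back to the default of 256 workers.
func rollbackWorkerCount() int {
	const defaultRollbackWorkers = 256
	n, err := strconv.Atoi(os.Getenv("VAULT_ROLLBACK_WORKERS"))
	if err != nil || n < 1 {
		return defaultRollbackWorkers
	}
	return n
}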

Also, where were you planning on documenting the new environment variable? I only see metric partials in the docs atm.

I'm not planning to document it. The vast majority of Vault users shouldn't need to be aware of the worker pool. The environment variable is there as an escape hatch if someone does run into performance difficulties.

@schavis added the content-lgtm label (Content changes approved. Merge depends on code review) Sep 1, 2023
@schavis (Contributor) commented Sep 1, 2023

Content LGTM, feel free to merge once the code review is complete.

@banks (Member) left a comment

This looks awesome Mia, great job.

I think it's ready to go!

timeout, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()
got := make(map[string]bool)
gotLock := sync.RWMutex{}
banks (Member) commented:

This is absolutely fine and does not need to change, but just in case you've not seen this before I thought I'd take the opportunity to share some "wisdom" from the Go authors, which is basically "don't use RWMutex unless you've profiled and are pretty sure it makes an important difference vs. Mutex": https://github.com/golang/go/wiki/CodeReviewConcurrency#rwmutex

This is a test so none of this matters even a small amount and I don't think you should change it, but in general I tend to suspect almost all new usages of RWMutex are likely better off as the simpler Mutex!
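For comparison, the same bookkeeping with the plainer sync.Mutex would just be (sketch):

package rollback

import "sync"

// Sketch of the suggestion: the same test bookkeeping with a plain sync.Mutex,
// which is simpler and typically no slower than sync.RWMutex for this kind of use.
var (
	got     = make(map[string]bool)
	gotLock sync.Mutex
)

func record(path string) {
	gotLock.Lock()
	defer gotLock.Unlock()
	got[path] = true
}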

miagilepner (Contributor, Author) replied:

good to know!

@miagilepner miagilepner merged commit 4e3b91d into main Sep 4, 2023
103 checks passed
@miagilepner miagilepner deleted the miagilepner/rollback-manager-worker-pool branch September 4, 2023 13:48
miagilepner added a commit that referenced this pull request Oct 17, 2023
* workerpool implementation

* rollback tests

* website documentation

* add changelog

* fix failing test
miagilepner added a commit that referenced this pull request Oct 17, 2023
* backport of commit 4e3b91d (#22567)

* workerpool implementation

* rollback tests

* website documentation

* add changelog

* fix failing test

* backport of commit de043d6 (#22754)

* fix flaky rollback test

* better fix

* switch to defer

* add comment

---------

Co-authored-by: miagilepner <mia.epner@hashicorp.com>