
[VAULT-17827] Rollback manager worker pool #22567

Merged: 6 commits into main from miagilepner/rollback-manager-worker-pool on Sep 4, 2023

Conversation

@miagilepner (Contributor) commented Aug 25, 2023

(Description is still a WIP)

This PR adds a worker pool to the rollback manager with a default size of 256. The size of the worker pool can be adjusted with the environment variable VAULT_ROLLBACK_WORKERS.

Considerations:

  • The worker pool removes the goroutine scheduling pressure:
    [Scheduler latency profile with unlimited workers, 9000 mounts: image omitted]
    [Scheduler latency profile with 256 workers, 9000 mounts: image omitted]
  • The worker pool queue is limited by the number of mounts, because the rollback manager ensures that there's never more than one operation submitted to the worker pool per mount (a simplified sketch of this de-duplication follows this list).
  • If backends take longer than 60 seconds to complete their rollback operation, then the number of workers isn't able to keep up. The queue remains stable in size, but rollbacks are triggered less often. Rollback operations have a request context timeout of 90 seconds, which means that if all of the mounts are timing out, rollbacks could end up triggering only every (# mounts / # workers) * 90 seconds rather than every 60 seconds. For example, with 9000 mounts and 256 workers that works out to roughly 53 minutes between rollbacks for a given mount.
  • Rollback operations can cause backends to do two things: trigger their PeriodicFunc and call WALRollback with a collection of WAL entries. To be clear, these WAL entries are not the same WAL that Vault uses for replication. This is a separate, namespace/mount-scoped storage location, and the path is only written to by plugins via framework.PutWAL. By default, the WAL entries that get passed to the WALRollback function are any entries older than 10 minutes (a minimal plugin-side sketch of these APIs follows this list).
  • Unmount and remount operations trigger a rollback through the rollback manager, then wait for the rollback to complete before continuing. Because we're now using a worker pool, it's possible that unmounts and remounts will take longer to complete. Note that unmount and remount can be called by replication invalidation operations.
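As referenced in the list above, here is a simplified sketch of the per-mount de-duplication idea. This is illustrative Go, not the PR's actual code; the type, field, and function names are assumptions.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative sketch only; names and structure are assumptions, not Vault's real code.
type rollbackManager struct {
	mu       sync.Mutex
	inflight map[string]struct{} // mounts with a rollback queued or running
	tasks    chan func()         // queue feeding a fixed set of workers
}

func newRollbackManager(workers, maxQueue int) *rollbackManager {
	m := &rollbackManager{
		inflight: make(map[string]struct{}),
		tasks:    make(chan func(), maxQueue),
	}
	for i := 0; i < workers; i++ {
		go func() {
			for task := range m.tasks {
				task()
			}
		}()
	}
	return m
}

func (m *rollbackManager) startRollback(mountPath string) {
	m.mu.Lock()
	if _, running := m.inflight[mountPath]; running {
		m.mu.Unlock()
		return // already queued or running: at most one rollback per mount
	}
	m.inflight[mountPath] = struct{}{}
	m.mu.Unlock()

	m.tasks <- func() {
		defer func() {
			m.mu.Lock()
			delete(m.inflight, mountPath)
			m.mu.Unlock()
		}()
		// ... issue the rollback operation for mountPath here ...
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	m := newRollbackManager(4, 16)
	for _, p := range []string{"secret/", "database/", "secret/"} {
		m.startRollback(p) // the duplicate "secret/" is skipped
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("queued rollbacks complete")
}

The invariant that matters is that a mount path is only re-submitted after its previous rollback finishes, so the queue length stays bounded by the number of mounts.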
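And for the WAL mechanics mentioned above, a minimal plugin-side sketch using the SDK's framework.PutWAL, framework.DeleteWAL, and a WALRollback handler. The backend, kind, and field names here are made up for illustration.

package myplugin

import (
	"context"
	"time"

	"github.com/hashicorp/vault/sdk/framework"
	"github.com/hashicorp/vault/sdk/logical"
)

// Illustrative backend wiring: WALRollback receives entries older than
// WALRollbackMinAge (10 minutes by default) each time the rollback manager runs.
func backend() *framework.Backend {
	return &framework.Backend{
		WALRollback:       walRollback,
		WALRollbackMinAge: 10 * time.Minute,
	}
}

// createUser is a made-up operation: it records intent in the WAL before doing
// external work, so a partial failure can be cleaned up later by walRollback.
func createUser(ctx context.Context, req *logical.Request) error {
	walID, err := framework.PutWAL(ctx, req.Storage, "new_user", map[string]interface{}{
		"username": "example-user",
	})
	if err != nil {
		return err
	}
	// ... call the external system to create the user ...
	// On success, remove the WAL entry so it is never handed to walRollback.
	return framework.DeleteWAL(ctx, req.Storage, walID)
}

func walRollback(ctx context.Context, req *logical.Request, kind string, data interface{}) error {
	if kind != "new_user" {
		return nil
	}
	// ... best-effort cleanup of the partially created user ...
	return nil
}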

@github-actions (bot) added the hashicorp-contributed-pr label (HashiCorp, i.e. not-community, contributed) Aug 25, 2023
@github-actions (bot) commented Aug 25, 2023

CI Results:
All Go tests succeeded! ✅

@miagilepner miagilepner added this to the 1.15 milestone Aug 28, 2023
@miagilepner miagilepner requested a review from banks August 29, 2023 09:24
@banks (Member) left a comment

This looks awesome Mia!

Unmount and remount operations trigger a rollback through the rollback manager, then wait for the rollback to complete before continuing. Because we're now using a worker pool, it's possible that unmounts and remounts will take longer to complete. Note that unmount and remount can be called by replication invalidation operations.

This is such a good callout. I had not appreciated the interaction with replication which is certainly an additional side-effect it's good to consider and anticipate.

I guess one additional factor this makes me think about is that rollbacks might be making calls to external systems, e.g. cleaning up external database credentials. In that case limiting concurrency could be more of a big deal. Say for example there are 10,000 database secret mounts but the Vault active node suddenly has a few seconds of latency caused by network issues to that database provider. If we also assume that these mounts are all active, and that the increased latency causes an elevated failure rate, then rollback operations are needed every minute. Now with just 256 concurrent rollbacks, if each one takes say 10 seconds due to the latency to the provider, it will take about 6.5 minutes to get through all of them. Even then that's probably not the end of the world unless it blocks replication for that time due to an unmount.

Can you confirm if it would? My mental model is that in this case the secondary would not have anything to rollback at least with an external provider because it's not the primary - i.e. the primary might get in the state above and have slow rollbacks, but on the secondary they'd only ever be rolling back internal state in our own store right? If so then that seems to mitigate the worst risks I could think of above.

Code Considerations

The worker pool package you found looks solid to me and beats re-inventing one or working around the quirks of fairshare for this use case. I wonder if eventually we might need yet another worker pool implementation that has dynamic pool sizing so we can do adaptive concurrency control for request handling (@mpalmi is working on this). (We already have a few other implicit pools in the codebase, e.g. the pool of flushers in the Consul backend.)

I guess we can cross that bridge when we get to it though. It would be kind of a shame to end up with so many different worker pool variants but then I don't think we should scope creep and this one looks great for the task at hand.

I left a few comments inline. I think overall my biggest feedback is that I wonder how important it is to keep the 0 == no pool behaviour at the expense of more code and more things to test. In practice, just setting the number of workers to 999999999 seems to mitigate virtually all the risks I can think of, especially given the pool implementation chosen executes virtually the same lines of code as the "no pool" option right up to the point it hits the limit 🤷. That would simplify the code a tiny bit but also remove test cases and things to maintain in the future.

Comment on lines 101 to 103
func (g *goroutineRollbackRunner) StopWait() {
g.inflightAll.Wait()
}
banks (Member) commented:

I'm wondering if StopWait should do something that prevents any further Submit calls? We could leave it up to the calling code to ensure it stops calling Submit after this but it seems like it could be subtle and potentially cause live locks where some shutdown process is waiting on this but hasn't yet stopped some other process from submitting new things? Not read all of this yet so it might not be important, just a thought.

miagilepner (Contributor, Author) replied:

Good point, there is a risk of a panic if a submit happens after StopWait(). I've added this to prevent that: https://github.com/hashicorp/vault/pull/22567/files#diff-067fef428afca5813d7e1a68bf8b43f371d6106b77056e97cc07fbb825be65baR219-R235
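The linked diff isn't reproduced here, but one way such a guard can look, as a rough sketch (illustrative only, not necessarily the exact change):

package rollback

import (
	"errors"
	"sync"
)

// Rough sketch of guarding Submit after StopWait; illustrative only, not the exact
// linked change. Once stopped, new submissions are rejected instead of racing the
// WaitGroup during shutdown.
type goroutineRollbackRunner struct {
	mu          sync.Mutex
	stopped     bool
	inflightAll sync.WaitGroup
}

func (g *goroutineRollbackRunner) Submit(task func()) error {
	g.mu.Lock()
	if g.stopped {
		g.mu.Unlock()
		return errors.New("rollback runner is stopped")
	}
	g.inflightAll.Add(1)
	g.mu.Unlock()

	go func() {
		defer g.inflightAll.Done()
		task()
	}()
	return nil
}

func (g *goroutineRollbackRunner) StopWait() {
	g.mu.Lock()
	g.stopped = true
	g.mu.Unlock()
	g.inflightAll.Wait()
}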

(Additional inline review comments on vault/rollback.go and vault/rollback_test.go were marked resolved.)
@miagilepner (Contributor, Author) commented Aug 31, 2023

Can you confirm if it would? My mental model is that in this case the secondary would not have anything to rollback at least with an external provider because it's not the primary - i.e. the primary might get in the state above and have slow rollbacks, but on the secondary they'd only ever be rolling back internal state in our own store right? If so then that seems to mitigate the worst risks I could think of above.

DR secondaries don't start the rollback manager.
Performance secondaries do have a rollback manager, and if a mount is replicated then the WALs for that mount get replicated. That means that both the primary and the secondary cluster have their WALRollback function triggered. Some backends (like the AWS secrets backend) check if they're running on a performance secondary and if so, do nothing. But the database backends don't have any such check, and they use the rollback to clear their connections and update user credentials for the database. I'm not sure if this behavior is intended, but I imagine it's not, since it would mean multiple clusters trying to set the creds in the database.
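For illustration, the kind of performance-secondary guard described above might look roughly like this in a plugin (a sketch, not the AWS backend's actual code):

package myplugin

import (
	"context"

	"github.com/hashicorp/vault/sdk/framework"
	"github.com/hashicorp/vault/sdk/helper/consts"
	"github.com/hashicorp/vault/sdk/logical"
)

// Sketch of the kind of guard mentioned above; illustrative only. External cleanup
// is skipped on performance secondaries so only the primary talks to the external system.
func walRollbackGuarded(b *framework.Backend) framework.WALRollbackFunc {
	return func(ctx context.Context, req *logical.Request, kind string, data interface{}) error {
		if b.System().ReplicationState().HasState(consts.ReplicationPerformanceSecondary) {
			return nil // leave external cleanup to the primary cluster
		}
		// ... perform the external cleanup here ...
		return nil
	}
}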

Actually, many backends exclude the WAL prefix from being replicated: https://github.com/hashicorp/vault/blob/main/builtin/logical/database/backend.go#L97. This means that a performance replica's WAL rollbacks would only be duplicated effort if the backend doesn't set the path to local.
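For reference, a sketch of what that exclusion typically looks like in a backend definition, modeled on the linked database backend (not a verbatim copy):

package myplugin

import (
	"github.com/hashicorp/vault/sdk/framework"
	"github.com/hashicorp/vault/sdk/logical"
)

// Typical pattern (modeled on the linked backend): marking the WAL prefix as local
// storage keeps WAL entries out of replication, so performance secondaries don't
// repeat the same rollback work.
func backendWithLocalWAL() *framework.Backend {
	return &framework.Backend{
		PathsSpecial: &logical.Paths{
			LocalStorage: []string{framework.WALPrefix},
		},
	}
}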

The Azure secrets and auth backends are the only builtin plugins that have WALRollbackFuncs but don't set the WAL path to local.

@miagilepner miagilepner force-pushed the miagilepner/rollback-manager-worker-pool branch from eb18aee to c03fe11 Compare August 31, 2023 14:09
@miagilepner miagilepner marked this pull request as ready for review August 31, 2023 14:14
@miagilepner miagilepner requested a review from a team as a code owner August 31, 2023 14:14
@github-actions (bot) commented:

Build Results:
All builds succeeded! ✅

@schavis (Contributor) commented Aug 31, 2023

@miagilepner Question for the purposes of the doc review:

and if that variable is set to a value less than 0, then no worker pool is use

If a value less than 0 means "no pool" and a value greater than 0 defines the pool size, what does a value of 0 do?

Also, where were you planning on documenting the new environment variable? I only see metric partials in the docs atm.

@schavis schavis requested review from schavis and removed request for a team August 31, 2023 17:58
@miagilepner (Contributor, Author) replied:

If a value less than 0 means "no pool" and a value greater than 0 defines the pool size, what does a value of 0 do?

I've updated the PR description and the code. VAULT_ROLLBACK_WORKERS must be greater than or equal to 1. If it's less than 1, we use the default value of 256.
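For illustration, a hypothetical helper matching these semantics (not the PR's actual code) could look like:

package rollback

import (
	"os"
	"strconv"
)

// Hypothetical helper (not the PR's actual code) matching the described semantics:
// unset, unparsable, or sub-1 values fall back to the default of 256 workers.
func rollbackWorkerCount() int {
	const defaultRollbackWorkers = 256
	n, err := strconv.Atoi(os.Getenv("VAULT_ROLLBACK_WORKERS"))
	if err != nil || n < 1 {
		return defaultRollbackWorkers
	}
	return n
}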

Also, where were you planning on documenting the new environment variable? I only see metric partials in the docs atm.

I'm not planning to document it. The vast majority of Vault users shouldn't need to be aware of the worker pool. The environment variable is there as an escape hatch if someone does run into performance difficulties.

@schavis added the content-lgtm label (Content changes approved. Merge depends on code review) Sep 1, 2023
@schavis (Contributor) commented Sep 1, 2023

Content LGTM, feel free to merge once the code review is complete.

@banks (Member) left a comment

This looks awesome Mia, great job.

I think it's ready to go!

timeout, cancel := context.WithTimeout(context.Background(), 20*time.Second)
defer cancel()
got := make(map[string]bool)
gotLock := sync.RWMutex{}
banks (Member) commented:

This is absolutely fine and does not need to change, but just in case you've not seen this before I thought I'd take the opportunity to share some "wisdom" from the Go authors, which is basically "don't use RWMutex unless you've profiled and are pretty sure it makes an important difference vs. Mutex": https://github.com/golang/go/wiki/CodeReviewConcurrency#rwmutex

This is a test so none of this matters even a small amount and I don't think you should change it, but in general I tend to suspect almost all new usages of RWMutex are likely better off as the simpler Mutex!
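For comparison, the same bookkeeping with the plainer sync.Mutex would just be (sketch):

package rollback

import "sync"

// Sketch of the suggestion: the same test bookkeeping with a plain sync.Mutex,
// which is simpler and typically no slower than sync.RWMutex for this kind of use.
var (
	got     = make(map[string]bool)
	gotLock sync.Mutex
)

func record(path string) {
	gotLock.Lock()
	defer gotLock.Unlock()
	got[path] = true
}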

miagilepner (Contributor, Author) replied:

good to know!

@miagilepner miagilepner merged commit 4e3b91d into main Sep 4, 2023
103 checks passed
@miagilepner miagilepner deleted the miagilepner/rollback-manager-worker-pool branch September 4, 2023 13:48
miagilepner added a commit that referenced this pull request Oct 17, 2023
* workerpool implementation

* rollback tests

* website documentation

* add changelog

* fix failing test
miagilepner added a commit that referenced this pull request Oct 17, 2023
* backport of commit 4e3b91d (#22567)

* workerpool implementation

* rollback tests

* website documentation

* add changelog

* fix failing test

* backport of commit de043d6 (#22754)

* fix flaky rollback test

* better fix

* switch to defer

* add comment

---------

Co-authored-by: miagilepner <mia.epner@hashicorp.com>