Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Early exit auth check on lease puts #16005

Merged
merged 2 commits into from
Jun 16, 2023
Merged

Conversation

tjungblu
Copy link
Contributor

@tjungblu tjungblu commented Jun 5, 2023

Mitigates #15993 by not checking each key individually for permission when auth is entirely disabled or admin user is calling the method.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

Copy link
Member

@jmhbnz jmhbnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @tjungblu nice work looking into this!

Any chance you could replicate the metrics in the linked issue and post here so we could compare before and after?

@@ -125,6 +125,11 @@ func (aa *authApplierV3) LeaseRevoke(lc *pb.LeaseRevokeRequest) (*pb.LeaseRevoke
func (aa *authApplierV3) checkLeasePuts(leaseID lease.LeaseID) error {
l := aa.lessor.Lookup(leaseID)
if l != nil {
// early return for most-common scenario of either disabled auth or admin user
if aa.as.IsAdminPermitted(&aa.authInfo) == nil {
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test covering this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither apply.go, nor apply_auth.go have unit test coverage- there are no unit tests as far as I can tell.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion would be to add ArePutsPermitted method that will for the range and add tests for it.

@@ -125,6 +125,11 @@ func (aa *authApplierV3) LeaseRevoke(lc *pb.LeaseRevokeRequest) (*pb.LeaseRevoke
func (aa *authApplierV3) checkLeasePuts(leaseID lease.LeaseID) error {
l := aa.lessor.Lookup(leaseID)
if l != nil {
// early return for most-common scenario of either disabled auth or admin user
if aa.as.IsAdminPermitted(&aa.authInfo) == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not obvious from proposed check that it covers case that auth is disabled. From code design it seems strange for me that authApplierV3 will be used at all if auth is disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was surprised too, it even goes through an rw mutex everytime it checks...

#15993 (comment)

I reckon we should discuss a bigger refactoring for the auth system, potentially?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, depends on whether we want to backport this fix. This seems like a trivial but important change. Appliers changed a lot between v3.5 and main.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From code design it seems strange for me that authApplierV3 will be used at all if auth is disabled.

I reckon we should discuss a bigger refactoring for the auth system, potentially?

Because the existing chain of applier is static. If we want to get rid of authApplier completely when auth isn't enabled, then we need to make the chain dynamic. We don't have to do it, please feel free to evaluate the effort and impact separately if anyone is interested.

It's not obvious from proposed check that it covers case that auth is disabled.

It's a common "issue" in all existing (*authStore) IsXXXPermitted(...) error methods, when auth isn't enabled, they return nil.

depends on whether we want to backport this fix

I think we need to backport the fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to make the chain dynamic.

I haven't researched too deeply yet, but the whole system is already dynamic because the whole auth goes through raft (even enable/disable). I would've expected the enable/disable methods to be a configuration for startup, not a dynamic apply request. Hence the static implementation, I guess.

I'll open a discussion issue up as a feature request, will do some benchmarks in the meantime around the impact of either approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

turns out we're already swapping out the applier for corruption and nospace alerts:
https://github.com/etcd-io/etcd/blob/release-3.5/server/etcdserver/apply.go#L752-L760

ofc the whole thing isn't locked 🗡️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link you provide just replaces the whole applier chain with applierV3Capped or applierV3Corrupt. But you can't change (swap out) part of the chain: authApplier -> quotaApplier -> backendApplier.

It's similar to string in golang, it's immutable. You can't modify part of a string; instead, you can only replace the whole string.

Just as I mentioned previously, we don't have to make the applier chain dynamic. But of course, it's open to discussion in a separate session.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No doubt on the immutability of the chain, but the pointer to the applier struct needs to be guarded by a mutex when you swap it out from another goroutine?

https://github.com/etcd-io/etcd/blob/release-3.5/server/etcdserver/server.go#L252

let me try to tease the race detector on this case more specifically. Not that it matters much, if you're corrupted/OOS it's game over anyway, it just tickled my inner race detector.

@ahrtr
Copy link
Member

ahrtr commented Jun 5, 2023

The PR looks good to me. @tjungblu did you compare the performance before and after applying this PR?

@tjungblu
Copy link
Contributor Author

tjungblu commented Jun 5, 2023

running this entirely local with the test supplied from the issue, I can see the p99 latency to be consistently good:

image

here's the same result without the patch right afterwards in comparison:

image

The p99 looks very low in absolute by prometheus because of the scraping interval and one minute resolution, it actually degrades almost immediately into 100ms+ latency after 10k entries:

MAX latency: 103.208013ms, entries: 11405

which is basically a few seconds into the unit test.


Just for completeness with a comparison of the same metric used in the issue report also stays flat over the whole execution time:
image

Even though I'm not entirely sure whether it makes sense to sum over the rate duration sum here.

@@ -125,6 +125,11 @@ func (aa *authApplierV3) LeaseRevoke(lc *pb.LeaseRevokeRequest) (*pb.LeaseRevoke
func (aa *authApplierV3) checkLeasePuts(leaseID lease.LeaseID) error {
l := aa.lessor.Lookup(leaseID)
if l != nil {
// early return for most-common scenario of either disabled auth or admin user
if aa.as.IsAdminPermitted(&aa.authInfo) == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make it explicit that we check the returned error

Suggested change
if aa.as.IsAdminPermitted(&aa.authInfo) == nil {
if err := aa.as.IsAdminPermitted(&aa.authInfo); err == nil {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, let me update the existing commit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Mitigates etcd-io#15993 by not checking each key individually for permission
when auth is entirely disabled or admin user is calling the method.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks @tjungblu

It's a straightforward change, and also a low-hang fruit.

Could you please double check also backport this PR to 3.5 and 3.4 if needed? @tjungblu Of course, waiting for other maintainers approve and merge this PR.

@tjungblu
Copy link
Contributor Author

tjungblu commented Jun 6, 2023

@marseel I assume you would need this at the very least in the next 3.5.x release?

@marseel
Copy link

marseel commented Jun 6, 2023

3.5.x release would be great.

@tjungblu
Copy link
Contributor Author

tjungblu commented Jun 6, 2023

cool, then I'll get to write a backport. Thanks everyone!

@serathius
Copy link
Member

One think before merging. @tjungblu I know we don't have tests, however I would feel much safer for backports if we could add some even trivial test.

tjungblu added a commit to tjungblu/etcd that referenced this pull request Jun 6, 2023
Mitigates etcd-io#15993 by not checking each key individually for permission
when auth is entirely disabled or admin user is calling the method.

Backport of etcd-io#16005

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
tjungblu added a commit to tjungblu/etcd that referenced this pull request Jun 6, 2023
Mitigates etcd-io#15993 by not checking each key individually for permission
when auth is entirely disabled or admin user is calling the method.

Backport of etcd-io#16005

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@tjungblu
Copy link
Contributor Author

tjungblu commented Jun 6, 2023

Totally makes sense, let me try to add one right here as a separate commit. I'll cherry-pick it down for the backport PRs then.

Copy link
Member

@serathius serathius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a trivial test to make backports safer?

@tjungblu
Copy link
Contributor Author

tjungblu commented Jun 6, 2023

Added a new test, it's a little more involved to mock out the required parts to test this properly. Anybody, if there's a slightly easier way to enable that test (while fixing the perf issue and not introducing cyclical dependencies) - please let me know.

@tjungblu
Copy link
Contributor Author

@serathius can we move this forward? Is the testcase sufficient for you or too much refactoring already?

@@ -14,7 +14,39 @@

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't move test file to normal file. Don't share mocks outside of package.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted

)

func TestCheckLeasePutsKeys(t *testing.T) {
aa := authApplierV3{as: auth.NewAuthStore(zaptest.NewLogger(t), auth.NewBackendMock(), &auth.TokenNop{}, 10)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't borrow mocks just use normal backend.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -18,18 +18,18 @@ import (
"context"
)

type tokenNop struct{}
type TokenNop struct{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't expose internal testing code from package! Use proper TokenProvider in external tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted

@serathius
Copy link
Member

Overall looks very good! Tests you added are awesome! Left some small comments.

This contains a slight refactoring to expose enough information
to write meaningful tests for auth applier v3.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@tjungblu
Copy link
Contributor Author

tjungblu commented Jun 15, 2023

fixed the goimport manually, I reckon make fix doesn't actually do anything?
edit: nvm, the linter just went out of memory 🤣

@serathius
Copy link
Member

fixed the goimport manually, I reckon make fix doesn't actually do anything?

Not all make fix-* are implemented for all make verify-* so yea, make fix needs some attention. cc @jmhbnz

@serathius serathius merged commit cb3730a into etcd-io:main Jun 16, 2023
@tjungblu tjungblu deleted the putauthshort branch June 16, 2023 06:55
wenjiaswe pushed a commit to wenjiaswe/etcd that referenced this pull request Jul 7, 2023
Mitigates etcd-io#15993 by not checking each key individually for permission
when auth is entirely disabled or admin user is calling the method.

Backport of etcd-io#16005

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@serathius serathius mentioned this pull request Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants