Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clientv3: Add retry for round robin load balancer based on grpc-middleware's retry interceptor #16

Merged

Conversation

jpbetz
Copy link

@jpbetz jpbetz commented May 14, 2018

Per discussion with grpc team, this introduces an grpc interceptor based on go-grpc-middleware/retry for etcd client side retry.

This retry uses grpc interceptors to retry "retriable" errors returned by the etcd server. We've also introduced a backoff routine that attempts a call to quorum count (2 of 3, 3 of 5, ...) of etcd servers between each backoff interval.

This does not yet entirely replace the clientv3/retry.go implementation, which is still required for "auth retry" and for retry of grpc streaming requests that utilize client streams, since go-grpc-middleware/retry does not support client streams, but it does supersede it for the bulk of intermittent call failures, and while adding this we've teased out the future work required to replace clientv3/retry.go entirely.

Future work

  • Move auth-retry into the interceptor chain
  • Adopt grpc A6-client-retries when it's ready
  • Reorganize code for retry of non-trivial clientv3 functions that are currently wrapped with retry
  • Remove clientv3/retry.go once the above are complete..

@@ -461,6 +469,21 @@ func newClient(cfg *Config) (*Client, error) {
return client, nil
}

// roundRobinQuoremBackoff retries against quorem between each backoff.
// This is indended for use with a round robin load balancer.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo s/indended/intended/?

quorem := (n/2 + 1)
if attempt%quorem == 0 {
return backoffutils.JitterUp(waitBetween, jitterFraction)
} else {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need } else {? Just return 0 next line?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I keep forgetting this particular go convention.

// This is indended for use with a round robin load balancer.
func (c *Client) roundRobinQuoremBackoff(waitBetween time.Duration, jitterFraction float64) grpc_retry.BackoffFunc {
return func(attempt uint) time.Duration {
// after each round robin across quorem, backoff for our wait between duration
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo s/quorem/quorum :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also s/roundRobinQuoremBackoff/roundRobinQuorumBackoff

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂️ Thanks!

@@ -461,6 +469,21 @@ func newClient(cfg *Config) (*Client, error) {
return client, nil
}

// roundRobinQuoremBackoff retries against quorem between each backoff.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/quorem/quorum

@jpbetz jpbetz force-pushed the new-balancer-april-2018-grpc_retry branch 4 times, most recently from de74069 to 365286d Compare May 18, 2018 06:38
@jpbetz jpbetz changed the title [WIP] clientv3: Add retry for round robin load balancer using grpc-middleware's retry interceptor clientv3: Add retry for round robin load balancer using grpc-middleware's retry interceptor May 18, 2018
@jpbetz
Copy link
Author

jpbetz commented May 18, 2018

@gyuho Got two remaining issues (see description) on this and then it should be ready.

I'm thinking of implementing auth retry in an interceptor. There are a couple ways I could do this. But I'm a bit unclear why this is needed. It looks like a band-aid for something.

Retry of clientv3 functions that do more than just a remote call is messy. I don't see any way around keeping a subset of clientv3/retry.go around to retain this functionality. Any thoughts?

@gyuho
Copy link
Owner

gyuho commented May 18, 2018

I'm thinking of implementing auth retry in an interceptor. There are a couple ways I could do this. But I'm a bit unclear why this is needed. It looks like a band-aid for something.

It was introduced to retry on expired tokens and to get assigned a new token, whether it's a simple or JWT token (ref. etcd-io#7110).

Retry of clientv3 functions that do more than just a remote call is messy. I don't see any way around keeping a subset of clientv3/retry.go around to retain this functionality. Any thoughts?

Hmm, I will also take a look what the middleware does today. Maybe just replace manual retries on transient error, and keep rest of the retry logic? As long as the interceptor can cover transient error handling, middleware should be worth.

I agree that the current retry logic is quite messy.

What subset of retry.go is hard to replace with the interceptor? If we can't replace now, I feel like we would still need manual retry logic even with https://github.com/grpc/proposal/blob/master/A6-client-retries.md#retry-policy?

@jpbetz jpbetz force-pushed the new-balancer-april-2018-grpc_retry branch from 365286d to 226d32a Compare May 18, 2018 17:18
@jpbetz
Copy link
Author

jpbetz commented May 18, 2018

It was introduced to retry on expired tokens and to get assigned a new token, whether it's a simple or JWT token (ref. coreos#7110).

Gotcha. Thanks for the ref.

What subset of retry.go is hard to replace with the interceptor? If we can't replace now, I feel like we would still need manual retry logic even with https://github.com/grpc/proposal/blob/master/A6-client-retries.md#retry-policy?

The main gaps:

  1. Retry of clientv3 functions that do more than just a remote call (e.g. LeaseKeepAlive).
  2. Anything requiring auth-retry. This seems do-able. It would just require a bit of time/effort to figure out how/where to incorporate the needed getToken call into the interceptor chain.
  3. grpc calls using client streams (also LeaseKeepAlive). This appears to be a gap in grpc_middleware/retry functionality, I would hope A6-client-retries to handle this case (although I haven't reasoned through feasibility carefully).

#1 will always be our responsibility but there isn't all that much of it, so I don't consider that a big problem.

For now, adding back the manual retry logic does seem to be the safest path forward. So I've done that. The MaxRetries and BackoffInterval configuration options I added to the client will no longer make any sense so I'll remove them. With those changes this seems to be strictly better than what we had before. The main follow up tasks we could take on to improve this over time are:

  • Move auth-retry into the interceptor chain
  • Adopt grpc A6-client-retries when it's ready
  • Reorganize code for retry of non-trivial clientv3 functions that are currently wrapped with retry
  • Remove manual retry once the above are complete..

@jpbetz jpbetz force-pushed the new-balancer-april-2018-grpc_retry branch from 226d32a to 7c6dd55 Compare May 18, 2018 18:05
@gyuho
Copy link
Owner

gyuho commented May 18, 2018

1 will always be our responsibility but there isn't all that much of it, so I don't consider that a big problem.

Move auth-retry into the interceptor chain

Sounds good.

Adopt grpc A6-client-retries when it's ready

Yeah, let's keep it as TODO for now, since it's not available in gRPC upstream yet.

Reorganize code for retry of non-trivial clientv3 functions that are currently wrapped with retry

Yes. Feel free to refactor our codebase as needed! It's too tightly coupled with old balancer interface, so need to be reworked anyway.

@jpbetz jpbetz force-pushed the new-balancer-april-2018-grpc_retry branch 2 times, most recently from 9c962cb to 763b92a Compare May 18, 2018 21:02
@jpbetz
Copy link
Author

jpbetz commented May 18, 2018

@gyuho This is ready for review.

Copy link
Owner

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

While we ended up keeping old retry code, we can always refactor and remove once official gRPC retry policy is ready. Just having built-in interceptor for transient error and backoffs would still help us a lot.

@jpbetz jpbetz force-pushed the new-balancer-april-2018-grpc_retry branch from 763b92a to 9256282 Compare May 21, 2018 17:43
@jpbetz jpbetz force-pushed the new-balancer-april-2018-grpc_retry branch from 9256282 to 6ed45be Compare May 21, 2018 19:05
@jpbetz
Copy link
Author

jpbetz commented May 21, 2018

@gyuho This is ready for another pass. I've removed the dependency on go-grpc-middleware/retry and instead introduced a custom retry interceptor based on it that we can tune to our specific needs, and can be adapted to support auth retry in a subsequent PR (I'll can own this). Tests are all passing.

@jpbetz jpbetz changed the title clientv3: Add retry for round robin load balancer using grpc-middleware's retry interceptor clientv3: Add retry for round robin load balancer based on grpc-middleware's retry interceptor May 21, 2018
@gyuho
Copy link
Owner

gyuho commented May 21, 2018

I've removed the dependency on go-grpc-middleware/retry and instead introduced a custom retry interceptor based on it that we can tune to our specific needs,

+1

support auth retry in a subsequent PR (I'll can own this). Tests are all passing.

Sounds good. Let's do it in a separate PR.

LGTM. Thanks!

Copy link
Owner

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@gyuho gyuho merged commit efdc55f into gyuho:new-balancer-april-2018 May 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants