
clientv3 doesn't close the existing TCP session when a network failure happens with streams #9212

Closed
yudai opened this issue Jan 24, 2018 · 9 comments

yudai (Contributor) commented Jan 24, 2018

What I observed

I found that some of my client processes had two TCP sessions to their etcd endpoints (both of them were in ESTABLISHED state).

I think this is an unexpected state, since each instance of clientv3's Client is presumably supposed to keep only one TCP connection at a time. If clients are allowed to keep two TCP sessions each, the cluster can consume more resources than expected.

Reproduction

(with the master branch at 64713e5927)

With some debugging, I discovered that a client starts holding two TCP sessions when:

  • The client has an ongoing stream request (e.g. Watch())
  • The client sends one-shot requests (e.g. Put()) with some context timeout in another goroutine
  • The connectivity to the endpoint is lost for some reason and the client starts retrying, triggered by a one-shot request

It seems the balancer tries to establish a new TCP session (and fails) when a one-shot request fails due to a connectivity issue (you get context deadline exceeded). After the connectivity has recovered, the balancer succeeds in establishing the new TCP session it wanted. However, it does not close the previous TCP session at that point when there is an ongoing stream request, and it keeps the old session open as long as the stream request is alive (usually until the process itself dies). Eventually, both the new and the previous (existing) TCP sessions persist until the process dies.
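
For illustration, here is a minimal hand-written sketch of the three conditions above. This is not the reproduction script itself (that is linked below); the endpoint address, key names, and timing values are placeholders, and the import path assumes the pre-module github.com/coreos/etcd layout used at the time of this issue.

package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Condition 1: an ongoing stream request (Watch) that stays open for
	// the lifetime of the process.
	go func() {
		for resp := range cli.Watch(context.Background(), "foo") {
			log.Printf("watch: %d event(s)", len(resp.Events))
		}
	}()

	// Conditions 2 and 3: one-shot requests with a context timeout in
	// another goroutine. While the endpoint is unreachable (e.g. packets
	// dropped by iptables), these fail with "context deadline exceeded"
	// and trigger the balancer retry described above.
	for {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		if _, err := cli.Put(ctx, "foo", "bar"); err != nil {
			log.Println("put:", err)
		}
		cancel()
		time.Sleep(time.Second)
	}
}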

I created a script to reproduce the issue by emulating a network trouble with iptables: https://gist.github.com/yudai/a5e1269ef3d484217b10415af6dd08c4

You should see output like the one below, with two ESTABLISHED TCP sessions at the end:
https://gist.github.com/yudai/a5e1269ef3d484217b10415af6dd08c4#gistcomment-2330187

Is this a bug or expected behavior?

If each client keeps two TCP sessions, endpoint processes need to maintain twice as many of some resources, such as file descriptors and memory. I therefore think it's not a good idea to let clients hold two TCP sessions in any case, even after network troubles.

Given that the stream APIs have their own internal retry mechanism and users are not responsible for handling endpoint failures, I think the balancer should close the previous TCP session even when there are ongoing stream requests.

@yudai yudai changed the title clientv3 doesn't close the existing TCP session when an endpoint failover happens with streams clientv3 doesn't close the existing TCP session when a network failure happens with streams Jan 24, 2018
@gyuho gyuho self-assigned this Jan 24, 2018
gyuho (Contributor) commented Jan 24, 2018

Right now, we let gRPC handle load balancing between endpoints (passing v1 grpc.WithBalancer). We will make sure this is addressed in our balancer rework.

gyuho (Contributor) commented Jan 24, 2018

Also asked the gRPC team whether we need manual handling of the *grpc.ClientConn: grpc/grpc-go#1831.

yudai (Contributor, Author) commented Jan 24, 2018

@gyuho Thank you for the info and for working on the issue.

yudai (Contributor, Author) commented Jan 24, 2018

This issue happens with etcd version 3.2.10. Up to 3.2.9, clients keep a single TCP session in the same situation.

gyuho (Contributor) commented Jul 20, 2018

@yudai Thanks for the reproduction steps. I was able to see the same issue.

I was seeing lots of "handle subconn state change 0xc4200a3e50, SHUTDOWN" from grpc/balancer_v1_wrapper.go, but somehow the connections were not being closed even after the iptables rules were removed.

Will debug further.

We may have the same issue in master, since the underlying connection interface is the same.

gyuho (Contributor) commented Jul 22, 2018

I can confirm that enabling client-side keep-alive in v3.2 and v3.3 resolves the issue.

Same code as @yudai's but with keepalives:

cli, err := clientv3.New(clientv3.Config{
	Endpoints:            []string{"localhost:2379"},
	DialTimeout:          5 * time.Second,
	// Client-side keepalive: ping the server every 2 seconds and close the
	// connection if a ping is not acknowledged within 2 seconds.
	DialKeepAliveTime:    2 * time.Second,
	DialKeepAliveTimeout: 2 * time.Second,
})

Once the keepalive times out, the http2 client gets closed by gRPC, so there is no lingering TCP connection. If client-side keepalive is not enabled, gRPC keeps the previous connections. This only happens when etcd clientv3 uses the old gRPC balancer wrapper, which in this case asynchronously resets the transport on connection failures (WithBlock is not set).

In the master branch, we create our own sub-connections and do not have this issue even without keepalive.

@gyuho gyuho removed the type/bug label Jul 22, 2018
gyuho (Contributor) commented Jul 24, 2018

@yudai Can you try with etcd client keep-alive parameters?

yudai (Contributor, Author) commented Jul 25, 2018

@gyuho Confirmed that the stale session is closed when I set DialKeepAliveTime and DialKeepAliveTimeout. Thank you very much for taking a look at this issue.

Closing as resolved!

@yudai yudai closed this as completed Jul 25, 2018
gyuho (Contributor) commented Jul 25, 2018

@yudai Thanks for the report!
