etcd go client fails when querying a cluster with a down node #9949

Closed
gnuoy opened this issue Jul 21, 2018 · 21 comments · Fixed by #10911

gnuoy commented Jul 21, 2018

Describe the bug
The etcd go client fails if multiple https endpoints are specified when the client is initialised and the first etcd endpoint is unavailable.

To Reproduce
I set up an etcd cluster with http (port 2378) and https (port 2379) listeners, then used the etcd go client library to query the cluster. I then took down the unit listed first when the client was established (in my case the unit at 10.53.82.119). The http go client continues to work but the https one fails.

https client:

	// Imports needed (paths as of etcd 3.3): "context", "log", "time",
	// "github.com/coreos/etcd/clientv3", "github.com/coreos/etcd/pkg/transport".
	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel() // the original discarded the CancelFunc, which leaks the context

	cfg := clientv3.Config{
		Endpoints:   []string{"https://10.53.82.119:2379", "https://10.53.82.150:2379", "https://10.53.82.157:2379"},
		DialTimeout: 5 * time.Second,
	}

	cert := "/home/liam/tls_vault_certs/etcd-cert.pem"
	key := "/home/liam/tls_vault_certs/etcd.key"
	ca := "/home/liam/tls_vault_certs/etcd-ca.pem"
	tls := transport.TLSInfo{
		TrustedCAFile: ca,
		CertFile:      cert,
		KeyFile:       key,
	}

	tlscfg, err := tls.ClientConfig()
	if err != nil {
		log.Fatal(err) // this error was dropped in the original snippet
	}
	cfg.TLS = tlscfg

	cli, err := clientv3.New(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	kv := clientv3.NewKV(cli) // requests made via kv (with ctx) then fail as below

Fails with:
2018-07-21 11:34:05.728613 I | context deadline exceeded

http client:

	ctx, cancel := context.WithTimeout(context.Background(), requestTimeout)
	defer cancel()

	cfg := clientv3.Config{
		Endpoints:   []string{"http://10.53.82.119:2378", "http://10.53.82.150:2378", "http://10.53.82.157:2378"},
		DialTimeout: 5 * time.Second,
	}

	cli, err := clientv3.New(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	kv := clientv3.NewKV(cli)
mitsutaka pushed a commit to cybozu-go/cke that referenced this issue Sep 4, 2018
The etcd go client fails if multiple https endpoints are specified when
the client is initialised and the first etcd endpoint is unavailable.

etcd-io/etcd#9949
gnuoy commented Sep 4, 2018

In my setup each etcd unit has an individual server certificate that includes an IP SAN with the IP address of the unit.

If I generate a new certificate that has IP SAN entries for all units in the cluster and then use that certificate on all units, the problem goes away. I don't like this workaround though, as it means I need to reissue the certificate every time I scale out.
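For illustration, here is a minimal sketch (not part of the original report) of what that workaround amounts to with Go's crypto/x509: one server certificate whose IP SAN list enumerates every unit. The IPs are the ones from the reproduction above; the certificate is self-signed for brevity, whereas a real deployment would have the cluster CA sign a CSR.

	package main

	import (
		"crypto/ecdsa"
		"crypto/elliptic"
		"crypto/rand"
		"crypto/x509"
		"crypto/x509/pkix"
		"encoding/pem"
		"math/big"
		"net"
		"os"
		"time"
	)

	func main() {
		key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
		if err != nil {
			panic(err)
		}
		tmpl := x509.Certificate{
			SerialNumber: big.NewInt(1),
			Subject:      pkix.Name{CommonName: "etcd"},
			NotBefore:    time.Now(),
			NotAfter:     time.Now().Add(365 * 24 * time.Hour),
			KeyUsage:     x509.KeyUsageDigitalSignature,
			ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
			// The workaround: every unit's IP goes into the one shared cert,
			// which is why it must be reissued whenever the cluster scales out.
			IPAddresses: []net.IP{
				net.ParseIP("10.53.82.119"),
				net.ParseIP("10.53.82.150"),
				net.ParseIP("10.53.82.157"),
			},
		}
		der, err := x509.CreateCertificate(rand.Reader, &tmpl, &tmpl, &key.PublicKey, key)
		if err != nil {
			panic(err)
		}
		pem.Encode(os.Stdout, &pem.Block{Type: "CERTIFICATE", Bytes: der})
	}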

gnuoy commented Sep 4, 2018

In my current setup I have three etcd nodes: 10.53.82.141, 10.53.82.217 and 10.53.82.233
I have shut down etcd on 10.53.82.141, which is the first entry in my Endpoints list. Running the client and capturing the error message from https://github.com/etcd-io/etcd/blob/master/vendor/google.golang.org/grpc/credentials/credentials.go#L166 gives:

{"level":"info","ts":1536056536.818413,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bo4mljc94cxt","addresses":["https://10.53.82.141:2379","https://10.53.82.217:2379","https://10.53.82.233:2379"]}
{"level":"info","ts":1536056536.8184912,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396550","address":"https://10.53.82.141:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8185225,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396570","address":"https://10.53.82.233:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8185527,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396590","address":"https://10.53.82.217:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"info","ts":1536056536.8217986,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396550","address":"https://10.53.82.141:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}

x509: certificate is valid for 10.53.82.233, not 10.53.82.141
{"level":"info","ts":1536056536.8290203,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396570","address":"https://10.53.82.233:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}

x509: certificate is valid for 10.53.82.217, not 10.53.82.141
{"level":"info","ts":1536056536.8310082,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bo4mljc94cxt","connected":false,"subconn":"0xc420396590","address":"https://10.53.82.217:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}```

gnuoy commented Sep 4, 2018

It looks like this is the chain of events (when ServerName is not set):

  1. newClient is called from clientv3/client.go. The first endpoint is
    extracted from the Endpoints list and used to call client.dial:
func newClient(cfg *Config) (*Client, error) {
...
	dialEndpoint := cfg.Endpoints[0]
...
	conn, err := client.dial(dialEndpoint, grpc.WithBalancerName(roundRobinBalancerName))
  2. client.dial in turn calls DialContext from ./vendor/google.golang.org/grpc/clientconn.go:
func (c *Client) dial(ep string, dopts ...grpc.DialOption) (*grpc.ClientConn, error) {
...
	conn, err := grpc.DialContext(dctx, target, opts...)
  3. DialContext sets the authority to the URL of the first endpoint (./vendor/google.golang.org/grpc/clientconn.go):
func DialContext(ctx context.Context, target string, opts ...DialOption) (conn *ClientConn, err error) {
...
        cc := &ClientConn{
                target: target,
                csMgr:  &connectivityStateManager{},
                conns:  make(map[*addrConn]struct{}),
                blockingpicker: newPickerWrapper(),
        }
...
        creds := cc.dopts.copts.TransportCredentials
        if creds != nil && creds.Info().ServerName != "" {
                cc.authority = creds.Info().ServerName
        } else if cc.dopts.insecure && cc.dopts.copts.Authority != "" {
                cc.authority = cc.dopts.copts.Authority
        } else {
                // Use endpoint from "scheme://authority/endpoint" as the default
                // authority for ClientConn.
                cc.authority = cc.parsedTarget.Endpoint
        }
...
  4. Connection attempts to any endpoint other than the first now fail, because
    their IPs do not match the IP of the first endpoint (./vendor/google.golang.org/grpc/credentials/credentials.go):
func (c *tlsCreds) ClientHandshake(ctx context.Context, authority string, rawConn net.Conn) (_ net.Conn, _ AuthInfo, err error) {
	// use local cfg to avoid clobbering ServerName if using multiple endpoints
	cfg := cloneTLSConfig(c.config)
	if cfg.ServerName == "" {
		colonPos := strings.LastIndex(authority, ":")
		if colonPos == -1 {
			colonPos = len(authority)
		}
		cfg.ServerName = authority[:colonPos]

@gyuho added this to the etcd-v3.4 milestone Sep 4, 2018
jsok commented Dec 13, 2018

Experiencing the same issue - however with SRV discovery, which uses FQDNs for the endpoint list rather than IP addresses.

Similarly only the first endpoint will ever successfully be balanced.

My client is HashiCorp Vault, so this is also related to hashicorp/vault#4961.
Setting ETCD_CLIENT_DEBUG=on GRPC_GO_LOG_VERBOSITY_LEVEL=INFO in the environment helped with debugging.

Highlights from the log lines follow...

All endpoints are successfully enumerated:

{"level":"info","ts":1544668240.554902,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com.:2379","https://etcd2.nodes.example.com.:2379","https://etcd3.nodes.example.com.:2379"]}

However only a single endpoint will result in a READY subconn (etcd3 in this case):

Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.581273,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5928783,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5934908,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5935388,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com:2379 (0xc0008c6990)"],"subconn-size":1}

It will generally pick a different etcd[1,2,3] server each time I restart Vault, depending on the order in which the SRV records are returned.

TLS CN and SANs

As per the TLS advice for SRV discovery, each server certificate covers the SRV domain, the node FQDN and the IP address; etcd1's is shown below, and the etcd[2,3] servers' certificates are analogous.

Subject: CN = etcd1.nodes.example.com
X509v3 Subject Alternative Name:
    DNS:etcd1.nodes.example.com, DNS:*.etcd.example.com, DNS:etcd.example.com, IP Address:10.xx.xx.xx

Full startup logs

Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5091121,"caller":"balancer/balancer.go:39","msg":"registered balancer","policy":"etcd-client-roundrobin-balanced","name":"etcd-etcd-client-roundrobin-balanced"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5548,"caller":"balancer/balancer.go:82","msg":"built balancer","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","resolver-target":"endpoint://client-bqhf6xf64qh7/etcd3.nodes.example.com.:2379"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.554902,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com.:2379","https://etcd2.nodes.example.com.:2379","https://etcd3.nodes.example.com.:2379"]}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5549786,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5550096,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5550244,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.572446,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5726738,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.573283,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5733666,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com.:2379 (0xc000868b90)"],"subconn-size":1}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5749798,"caller":"balancer/balancer.go:134","msg":"resolved","balancer-id":"bqhf6xf67v83","addresses":["https://etcd1.nodes.example.com:2379","https://etcd2.nodes.example.com:2379","https://etcd3.nodes.example.com:2379"]}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5750618,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd3.nodes.example.com.:2379","subconn":"0xc000868b90"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575098,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd2.nodes.example.com.:2379","subconn":"0xc000868bb0"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.57512,"caller":"balancer/balancer.go:161","msg":"removed subconn","balancer-id":"bqhf6xf67v83","address":"https://etcd1.nodes.example.com.:2379","subconn":"0xc000868bd0"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5751421,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575167,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5751889,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"IDLE","new-state":"CONNECTING"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.57521,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868b90","address":"https://etcd3.nodes.example.com.:2379","old-state":"READY","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5752327,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575254,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bb0","address":"https://etcd2.nodes.example.com.:2379","old-state":"TRANSIENT_FAILURE","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.575295,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc000868bd0","address":"https://etcd1.nodes.example.com.:2379","old-state":"TRANSIENT_FAILURE","new-state":"SHUTDOWN"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.581273,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5928783,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5934908,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":true,"subconn":"0xc0008c6990","address":"https://etcd3.nodes.example.com:2379","old-state":"CONNECTING","new-state":"READY"}
Dec 13 02:30:40 runc[31079]: {"level":"info","ts":1544668240.5935388,"caller":"balancer/balancer.go:261","msg":"generated picker","balancer-id":"bqhf6xf67v83","policy":"etcd-client-roundrobin-balanced","subconn-ready":["https://etcd3.nodes.example.com:2379 (0xc0008c6990)"],"subconn-size":1}
Dec 13 02:30:40 runc[31079]: ==> Vault server configuration:
Dec 13 02:30:40 runc[31079]:              Api Address: https://etcd2.nodes.example.com:8200
Dec 13 02:30:40 runc[31079]:                      Cgo: disabled
Dec 13 02:30:40 runc[31079]:          Cluster Address: https://etcd2.nodes.example.com:8201
Dec 13 02:30:40 runc[31079]:               Listener 1: tcp (addr: "0.0.0.0:8200", cluster address: "0.0.0.0:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
Dec 13 02:30:40 runc[31079]:                Log Level: (not set)
Dec 13 02:30:40 runc[31079]:                    Mlock: supported: true, enabled: true
Dec 13 02:30:40 runc[31079]:                  Storage: etcd (HA available)
Dec 13 02:30:40 runc[31079]:                  Version: Vault v0.11.5
Dec 13 02:30:40 runc[31079]:              Version Sha: a59ffa4a0f09bbf198241fe6793a96722789b639
Dec 13 02:30:40 runc[31079]: ==> Vault server started! Log data will stream in below:
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.575287,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5754151,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5860226,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69e0","address":"https://etcd2.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:41 runc[31079]: {"level":"info","ts":1544668241.5864868,"caller":"balancer/balancer.go:192","msg":"state changed","balancer-id":"bqhf6xf67v83","connected":false,"subconn":"0xc0008c69c0","address":"https://etcd1.nodes.example.com:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
Dec 13 02:30:42 runc[31079]: 2018-12-13T02:30:42.721Z [INFO]  core: vault is unsealed
Dec 13 02:30:42 runc[31079]: 2018-12-13T02:30:42.721Z [INFO]  core: entering standby mode

jsok commented Dec 14, 2018

The code tracing done by @gnuoy is illuminating; the actual sequence of events is:

  1. 3 subconns are spawned
  2. They race to ClientHandshake; the winner sets the ServerName on the TLS config, e.g. etcd1.example.com
  3. The remaining subconns (which seem to share the parent clientconn, and therefore its TLS config) are now forced to verify against the same ServerName

So I can only conclude that using the same authority for all subconnections in the balancer is fundamentally flawed when used with TLS.

Under the current implementation there are 3 possible solutions:

  1. Disable TLS verification - Giant 👎
  2. Add *.example.com as a SAN to each member - not great (a client-side sketch of this follows below)
  3. All members must list every other member as a SAN - defeats the purpose of an HA cluster, since every member's certs must be rolled each time the cluster is reconfigured
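A sketch of what option 2 looks like from the client side, assuming every member's certificate already carries a shared SAN such as etcd.example.com (all names and file paths below are illustrative; imports as in the reproduction snippets above): pinning tls.Config.ServerName makes every subconn verify against that one shared name, sidestepping the shared-authority problem.

	// Illustrative only: requires a SAN common to all members' certificates.
	tlsInfo := transport.TLSInfo{
		TrustedCAFile: "/etc/ssl/etcd-ca.pem",
		CertFile:      "/etc/ssl/client.pem",
		KeyFile:       "/etc/ssl/client-key.pem",
	}
	tlsConfig, err := tlsInfo.ClientConfig()
	if err != nil {
		log.Fatal(err)
	}
	// Pin verification to the shared SAN; with ServerName set, the handshake
	// no longer falls back to the first endpoint's authority.
	tlsConfig.ServerName = "etcd.example.com"

	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{
			"https://etcd1.nodes.example.com:2379",
			"https://etcd2.nodes.example.com:2379",
			"https://etcd3.nodes.example.com:2379",
		},
		DialTimeout: 5 * time.Second,
		TLS:         tlsConfig,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()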

@alexbrand
/subscribe

xiang90 commented Feb 6, 2019

/cc @gyuho @jpbetz

@jsok Is the TLS config on the etcd client side or on the gRPC side? Can we switch to a fresh config when the endpoint differs from the previous one, via the balancer? Can you take a look at whether we can fix this problem from the etcd client side?

gyuho commented Feb 19, 2019

@xiang90 @jpbetz I can reproduce this. Let me see if I can fix this on the etcd client side.

dunjut commented Feb 22, 2019

We're experiencing exactly the same issue as @gnuoy :

One certificate per server IP address, and the etcdv3 client fails if the first endpoint is unavailable.

@pravinsinghal
@gyuho I am also running into the same issue. From the comments above it looks like you have been working on a patch - any update on it?

@timothysc
@gyuho - It's difficult to decipher whether this has been fixed or is still pending, as there are about a dozen cross-referenced issues and PRs.

@alexbrand
I looked at the status of this last week. I believe this is still pending on #10489. I could be wrong though.

bahar-p commented May 15, 2019

Any updates on this? We're still running into this with Kubernetes v1.14.1 and etcd 3.3.12

@timothysc
We should really get an update on this for k8s - v1.15

gyuho commented Jun 13, 2019

Just discussed this with the gRPC team, and got some good feedback.
I will rework this in the next few weeks.

@Protopopys
Any updates on this? We're still running into this with Kubernetes v1.15.0 and etcd 3.3.13

ymmt2005 commented Jul 17, 2019

FWIW, we work around this problem by placing TCP reverse proxies
on each node that connects to etcd. Each client can then connect to etcd
via localhost:12379. Since the TLS certificates of the etcd servers have a "localhost"
SAN and a "127.0.0.1" IP-SAN, the problem is avoided.

A possibly better workaround would be to place TLS-terminating
TCP reverse proxies: that is, a proxy that terminates TLS both for client connections
and towards the etcd servers, while validating the server certificates against their
public IP addresses.
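For reference, a minimal sketch of the raw-TCP variant of that workaround (illustrative only, not the proxy cybozu-go actually uses): a passthrough listener on localhost:12379 forwarding to a single member. TLS passes through untouched, so the client ends up verifying the etcd server certificate against the localhost / 127.0.0.1 SANs mentioned above. A real proxy would balance across members and handle failover; the upstream address is assumed here.

	package main

	import (
		"io"
		"log"
		"net"
	)

	func main() {
		const upstream = "10.53.82.150:2379" // one etcd member; illustrative
		ln, err := net.Listen("tcp", "127.0.0.1:12379")
		if err != nil {
			log.Fatal(err)
		}
		for {
			c, err := ln.Accept()
			if err != nil {
				log.Fatal(err)
			}
			go func(c net.Conn) {
				defer c.Close()
				s, err := net.Dial("tcp", upstream)
				if err != nil {
					log.Print(err)
					return
				}
				defer s.Close()
				// Shuttle bytes in both directions; TLS passes through
				// intact, so the client verifies the etcd server cert
				// against its localhost / 127.0.0.1 SANs.
				go io.Copy(s, c)
				io.Copy(c, s)
			}(c)
		}
	}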

gyuho commented Jul 27, 2019

The fix #10911 has been merged, and will be released in 3.4 #10943.

@frittentheke
@bahar-p @Protopopys ... I was just wondering the same thing.
The question still is: will Kubernetes / kubeadm or other installer tools move to etcd 3.4 (when released), or will etcd backport the fix to etcd 3.3.xx so the tools just have to bump the point release?

See my question at #10911 (comment)

abezard commented Aug 7, 2019

@gyuho I tried this code on my staging environment and I still face the same issue:

ETCD_CLIENT_DEBUG=1 ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints 10.10.10.242:2379,10.10.10.252:2379,10.10.10.139:2379 --cacert /etc/ssl/etcd-root-ca.pem --cert /etc/ssl/client.pem --key /etc/ssl/client-key.pem put foo bar
{"level":"debug","ts":"2019-08-07T22:54:09.070Z","caller":"balancer/balancer.go:60","msg":"registered balancer","policy":"picker-roundrobin-balanced","name":"etcd-picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:09.074Z","caller":"balancer/balancer.go:97","msg":"built balancer","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","resolver-target":"endpoint://client-3d04f4c4-1cb3-4ad3-87c1-6e7577af040f/10.10.10.242:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:148","msg":"resolved","picker":"picker-error","balancer-id":"bw3rmsvwfygd","addresses":["10.10.10.139:2379","10.10.10.242:2379","10.10.10.252:2379"]}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.242:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.252:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:166","msg":"created subconn","address":"10.10.10.139:2379"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.075Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"IDLE","new-state":"CONNECTING"}
{"level":"warn","ts":"2019-08-07T22:54:09.075Z","caller":"connectivity/connectivity.go:81","msg":"connectivity recorder received unknown state","connectivity-state":"IDLE"}
{"level":"info","ts":"2019-08-07T22:54:09.076Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.086Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.088Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:09.088Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:10.076Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.076Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:10.077Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.077Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:10.086Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.086Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:10.088Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:10.098Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.101Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:10.101Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.405Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.405Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.416Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.416Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.601Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.601Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.602Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.603Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:11.809Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:11.809Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:11.822Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3d0","subconn-size":3,"address":"10.10.10.252:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:11.822Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:13.497Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:13.497Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:13.509Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3f0","subconn-size":3,"address":"10.10.10.139:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:13.509Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"info","ts":"2019-08-07T22:54:14.044Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-error","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"TRANSIENT_FAILURE","new-state":"CONNECTING"}
{"level":"info","ts":"2019-08-07T22:54:14.044Z","caller":"balancer/balancer.go:278","msg":"updated picker","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced","subconn-ready":[],"subconn-size":0}
{"level":"info","ts":"2019-08-07T22:54:14.045Z","caller":"balancer/balancer.go:214","msg":"state changed","picker":"picker-roundrobin-balanced","balancer-id":"bw3rmsvwfygd","connected":false,"subconn":"0xc00029f3b0","subconn-size":3,"address":"10.10.10.242:2379","old-state":"CONNECTING","new-state":"TRANSIENT_FAILURE"}
{"level":"info","ts":"2019-08-07T22:54:14.045Z","caller":"balancer/balancer.go:256","msg":"updated picker to transient error picker","picker":"picker-error","balancer-id":"bw3rmsvwfygd","policy":"picker-roundrobin-balanced"}
{"level":"warn","ts":"2019-08-07T22:54:14.074Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-3d04f4c4-1cb3-4ad3-87c1-6e7577af040f/10.10.10.242:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.10.10.242:2379: connect: connection refused\""}

10.10.10.242 is down on purpose.

And from the same machine I can check the health of those etcd nodes just fine:

root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.252:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint health
https://10.10.10.252:2379 is healthy: successfully committed proposal: took = 18.506072ms
root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.139:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint health
https://10.10.10.139:2379 is healthy: successfully committed proposal: took = 16.068855ms

So I'm pretty confused here. I double-checked to make sure I was working off a branch with your changes, and I am:

root@etcd-client-1:/etc/ssl# etcdctl --endpoints https://10.10.10.139:2379 --cacert=/etc/ssl/etcd-root-ca.pem --cert=/etc/ssl/client.pem  --key=/etc/ssl/client-key.pem --debug endpoint status
[...]
https://10.10.10.139:2379, 146b0b16a4192d16, 3.5.0-pre, 20 kB, false, false, 48, 71, 71, 
~/go/src/github.com/etcd-io/etcd (release-3.4)$ make build
GO_BUILD_FLAGS="-v" ./build
./bin/etcd --version
etcd Version: 3.5.0-pre
Git SHA: 524278c18
Go Version: go1.12
Go OS/Arch: linux/amd64
./bin/etcdctl version
etcdctl version: 3.5.0-pre
API version: 3.5

(sorry for commenting on a closed ticket)
