Bug 1698456 *: use wildcard domain in DNS: SAN for etcd server certs #676

hexfusion · 2019-04-27T19:19:58Z

This PR resolves an issue with the etcd client balancer. The balancer is populated with a list of etcd peer endpoints. When we dial, endpoint[0] is used as the target and the other endpoints are dialed as subconnections. Using Wireshark, I have verified that every connection, the target and the subconnections alike, performs a proper TLS handshake.

The issue, which the logs below illustrate well, is that when etcd-0 fails and the balancer fails over to etcd-1, the connection fails because the balancer's TLS context still assumes the target (etcd-0).

1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0 }. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...

The solution, for now, is to populate the DNS: SAN of the server certs with a wildcard. This allows TLS authentication to complete successfully so that the balancer can work properly, because the target name etcd-0 will now validate against the *.clustername.domain.com entry in the SAN.
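To illustrate why the wildcard helps, here is a minimal sketch using only the Go standard library. It is not the code from this PR: `selfSignedCert` is a hypothetical helper, the certificates are throwaway self-signed ones, and the SAN lists are assumptions modeled on the log output. It checks the target name (etcd-0) against a cert that only names etcd-1, then against a cert that also carries a wildcard SAN.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCert is a hypothetical helper (not part of this PR) that builds a
// throwaway self-signed certificate carrying the given DNS SANs.
func selfSignedCert(dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "etcd-server-example"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		DNSNames:     dnsNames,
	}
	// Self-signed: the template acts as its own parent.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	// The name the balancer's TLS context expects (the dial target).
	target := "etcd-0.jliu-demo.qe.devcluster.openshift.com"

	// SANs as etcd-1 presents them in the logs: only its own peer name.
	perNode, err := selfSignedCert([]string{
		"localhost",
		"etcd.kube-system.svc",
		"etcd.kube-system.svc.cluster.local",
		"etcd-1.jliu-demo.qe.devcluster.openshift.com",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("per-node SAN:", perNode.VerifyHostname(target))

	// Same cert plus a wildcard covering every peer in the cluster domain.
	wildcard, err := selfSignedCert([]string{
		"localhost",
		"etcd-1.jliu-demo.qe.devcluster.openshift.com",
		"*.jliu-demo.qe.devcluster.openshift.com",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("wildcard SAN:", wildcard.VerifyHostname(target))
}
```

The first check returns a hostname mismatch error and the second returns nil. Note that an x509 wildcard only matches a single left-most label, so it covers etcd-N.<cluster domain> but not deeper subdomains.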

Check openshift-kube-apiserver logs
I0423 06:53:53.612060       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}]
I0423 06:53:53.612145       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.654818       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.660209       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.660243       1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
I0423 06:53:53.686866       1 master.go:228] Using reconciler: lease
I0423 06:53:53.687309       1 clientconn.go:551] parsed scheme: ""
I0423 06:53:53.687324       1 clientconn.go:557] scheme "" not registered, fallback to default scheme
I0423 06:53:53.687354       1 resolver_conn_wrapper.go:116] ccResolverWrapper: sending new addresses to cc: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}]
I0423 06:53:53.687387       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 <nil>} {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.688678       1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
W0423 06:53:53.693322       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
W0423 06:53:53.693536       1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.jliu-demo.qe.devcluster.openshift.com:2379 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.jliu-demo.qe.devcluster.openshift.com, not etcd-0.jliu-demo.qe.devcluster.openshift.com". Reconnecting...
I0423 06:53:53.733572       1 balancer_v1_wrapper.go:125] balancerWrapper: got update addr from Notify: [{etcd-0.jliu-demo.qe.devcluster.openshift.com:2379 <nil>}]
W0423 06:53:53.733612       1 clientconn.go:953] Failed to dial etcd-1.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.
W0423 06:53:53.733621       1 clientconn.go:953] Failed to dial etcd-2.jliu-demo.qe.devcluster.openshift.com:2379: context canceled; please retry.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1698456

Ref:

/hold

hexfusion · 2019-04-27T19:21:52Z

/cc @deads2k @abhinavdahiya @smarterclayton

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>

hexfusion · 2019-04-27T19:28:22Z

/cc @ericavonb

smarterclayton · 2019-04-28T04:16:43Z

This is reasonable to me, and the failure scenario is bad.

/approve

for 4.1 (the code looks correct to me, but want others to review deeper)

smarterclayton · 2019-04-28T04:17:28Z

As a side note, we should make sure you are in OWNERS under cmd/setup-etcd-environment and a few other dirs.

runcom · 2019-04-28T11:58:00Z

/approve

from the MCO point of view; will leave to others as well

smarterclayton · 2019-04-29T16:30:49Z

This lgtm, and I've heard no comment, and this is a huge blocker

/lgtm

openshift-ci-robot · 2019-04-29T16:30:57Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hexfusion, runcom, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smarterclayton · 2019-04-29T16:31:13Z

/hold cancel

openshift-bot · 2019-04-29T17:00:48Z

/retest

Please review the full test history for this PR and help us cut down flakes.

smarterclayton · 2019-04-29T17:13:53Z

/retest

hexfusion · 2019-04-29T17:37:44Z

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route.to_nat_gw[1]: 1 error occurred:"
level=error msg="\t* aws_route.to_nat_gw.1: Error creating route: timeout while waiting for state to become 'success' (timeout: 20m0s)"

openshift-bot · 2019-04-29T17:52:57Z

/retest

Please review the full test history for this PR and help us cut down flakes.

hexfusion · 2019-04-29T18:03:46Z

/test e2e-aws

hexfusion · 2019-04-29T19:19:52Z

/test e2e-aws

hexfusion · 2019-04-29T19:56:38Z

level=error msg="1 error occurred:"
level=error msg="\t* module.vpc.aws_route_table_association.route_net[1]: 1 error occurred:"
level=error msg="\t* aws_route_table_association.route_net.1: timeout while waiting for state to become 'success' (timeout: 5m0s)"

/test e2e-aws

openshift-ci-robot added the do-not-merge/hold label Apr 27, 2019

openshift-ci-robot requested review from ashcrow and runcom April 27, 2019 19:20

openshift-ci-robot added the size/S label Apr 27, 2019

openshift-ci-robot requested review from abhinavdahiya, deads2k and smarterclayton April 27, 2019 19:21

*: use wildcard domain in DNS: SAN for etcd server certs

a583be1

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>

hexfusion force-pushed the fx_balancer branch from 35e1370 to a583be1 Compare April 27, 2019 19:27

openshift-ci-robot requested a review from ericavonb April 27, 2019 19:28

smarterclayton added the approved label Apr 28, 2019

openshift-ci-robot assigned smarterclayton Apr 29, 2019

openshift-ci-robot added the lgtm label Apr 29, 2019

openshift-ci-robot removed the do-not-merge/hold label Apr 29, 2019

openshift-merge-robot merged commit 03ff4af into openshift:master Apr 29, 2019

russellb mentioned this pull request May 14, 2019

When a single master goes down the api is no longer available (virt) openshift-metal3/dev-scripts#534

Closed
