Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Network component should refresh certificates if they expire #17135

Merged

Conversation

smarterclayton
Copy link
Contributor

A single process openshift start node has both kubelet and network.
Kubelet rotates its client certs - network does not. Until we split out
network we need to do something minimal to ensure the cert is refreshed.
Use the same code as the kubelet, but when the cert expires check disk
and see if it was refreshed. That should work for bootstrapped
environments because the file is updated.

@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 1, 2017
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 1, 2017
@php-coder
Copy link
Contributor

/unassign

@juanvallejo
Copy link
Contributor

/unassign
/assign mfojtik

@smarterclayton
Copy link
Contributor Author

@deads2k as discussed

@smarterclayton
Copy link
Contributor Author

/retest

@smarterclayton smarterclayton force-pushed the refresh_transport branch 3 times, most recently from 9740cf7 to 64084b5 Compare November 2, 2017 04:36
@openshift-merge-robot openshift-merge-robot added the vendor-update Touching vendor dir or related files label Nov 2, 2017
@smarterclayton
Copy link
Contributor Author

Tested on a bootstrap cluster with cert rotation. This is GTG

Nov 02 04:38:11 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:38:11.381806   90662 certificate_manager.go:319] Rotating certificates
Nov 02 04:38:11 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:38:11.627859   90662 certificate_manager.go:396] Certificate expiration is 2017-11-02 04:53:00 +0000 UTC, rotation deadline is 2017-11-02 04:49:07.034873136 +0000 UTC
Nov 02 04:38:11 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:38:11.627921   90662 certificate_manager.go:256] Waiting 10m55.406954913s for next certificate rotation
Nov 02 04:38:19 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:38:19.438724   90662 transport.go:100] certificate rotation detected, shutting down client connections to start using new credentials
Nov 02 04:42:59 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:42:59.986424   90662 container_manager_linux.go:398] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Nov 02 04:43:09 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:43:09.283743   90662 transport.go:139] Current client certificate is expired, checking from disk
Nov 02 04:43:09 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:43:09.283778   90662 transport.go:148] Refreshing client certificate from store
Nov 02 04:43:09 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:43:09.283835   90662 certificate_store.go:119] Loading cert/key pair from "openshift.local.config/node/certificates/kubelet-client-current.pem".
Nov 02 04:43:09 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:43:09.284246   90662 transport.go:88] certificate rotation detected, shutting down client connections to start using new credentials
Nov 02 04:45:44 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:45:44.996433   90662 certificate_manager.go:319] Rotating certificates
Nov 02 04:45:45 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:45:45.248787   90662 certificate_manager.go:396] Certificate expiration is 2017-11-02 05:01:00 +0000 UTC, rotation deadline is 2017-11-02 04:58:47.19376793 +0000 UTC
Nov 02 04:45:45 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:45:45.248860   90662 certificate_manager.go:256] Waiting 13m1.944917118s for next certificate rotation
Nov 02 04:47:59 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:47:59.986808   90662 container_manager_linux.go:398] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Nov 02 04:49:07 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:49:07.035015   90662 certificate_manager.go:319] Rotating certificates
Nov 02 04:49:07 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:49:07.292141   90662 certificate_manager.go:396] Certificate expiration is 2017-11-02 05:04:00 +0000 UTC, rotation deadline is 2017-11-02 04:59:44.636713952 +0000 UTC
Nov 02 04:49:07 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:49:07.292613   90662 certificate_manager.go:256] Waiting 10m37.34410661s for next certificate rotation
Nov 02 04:49:09 ci-claytontest2-ig-m-q1lk origin-node[90662]: I1102 04:49:09.449191   90662 transport.go:100] certificate rotation detected, shutting down client connections to start using new credentials

@smarterclayton smarterclayton added the kind/bug Categorizes issue or PR as related to a bug. label Nov 2, 2017
@smarterclayton
Copy link
Contributor Author

/retest

}

lastCert := manager.Current()
go wait.Until(func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this closes all connection some indeterminant amount of time after the tlsConfig.GetClientCertificate has started returning a different value. However, since there is no synchronization between client users and this, it is possible that an old client certificate was returned, the gofunc is suspended, the gofunc is much later unsuspended, and used for a dial after the connections were closed. Will the connection fail properly and re-establish in that case?

It's not very likely, so as long as it eventually recovers (on some reasonable timescale) it seems ok, but it does seem like a risk.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking here:

  1. tlsConfig.GetClientCertificate is called after a connection has been accepted, and therefore any cert returned by this method must also have a connection in the connTracker before it returns a cert
  2. The monitor loop is guaranteed to close any connections registered before the new cert shows up (once the monitor loop sees a new cert, you can only ever close all old certs and some new certs)
  3. A connection opened (and thus registered) which hasn't yet asked for a cert because it is delayed is still cleaned up whenever a new cert shows up

So I think we're ok, but we may want to talk more tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 is not true in this case. Going to do the thing we discussed in person - guarantee a minimum window between cert rotation and refresh.

}, period, stopCh)

clientConfig.Transport = utilnet.SetTransportDefaults(&http.Transport{
Proxy: http.ProxyFromEnvironment,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we used a custom proxy function to handle CIDRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we used a custom proxy function to handle CIDRs.

Yeah, apimachinery/pkg/util/net.NewProxierWithNoProxyCIDR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, why not SetTransportDefaults?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pure copy of the upstream - agree it should be fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, I'll follow up upstream for the other place.

if now.After(cert.Leaf.NotAfter) {
if now.Sub(m.lastCheck) > m.minimumRefresh {
glog.V(2).Infof("Current client certificate is expired, checking from disk")
cert = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After you've set this, you're setting yourself up to hotloop on this aren't you? I'm not seeing another check against lastCheck after the cert is nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sets function cert, not m.cert. We always return the last cert we loaded.

Symlinks must be to absolute paths, or relative to the target. Absolute
is easier here.
A single process openshift start node has both kubelet and network.
Kubelet rotates its client certs - network does not. Until we split out
network we need to do something minimal to ensure the cert is refreshed.
Use the same code as the kubelet, but when the cert expires check disk
and see if it was refreshed. That should work for bootstrapped
environments because the file is updated.
@smarterclayton
Copy link
Contributor Author

All comments addressed, going to test in a real cluster soon and then label

@smarterclayton
Copy link
Contributor Author

/retest

Real bootstrapped cluster test successful, all comments addressed, labelling

@smarterclayton
Copy link
Contributor Author

Nov 03 23:24:41 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:24:41.454758   11150 certificate_manager.go:319] Rotating certificates
Nov 03 23:24:41 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:24:41.709342   11150 certificate_manager.go:396] Certificate expiration is 2017-11-03 23:40:00 +0000 UTC, rotation deadline is 2017-11-03 23:35:16.515412987 +0000 UTC
Nov 03 23:24:41 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:24:41.709417   11150 certificate_manager.go:256] Waiting 10m34.805999278s for next certificate rotation
Nov 03 23:24:48 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:24:48.858200   11150 transport.go:100] certificate rotation detected, shutting down client connections to start using new credentials
Nov 03 23:24:52 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:24:52.440611   11150 container_manager_linux.go:398] [ContainerManager]: Discovered runtime cgroups name: /system.slice/docker.service
Nov 03 23:27:08 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:27:08.737425   11150 transport.go:151] Current client certificate is about to expire, checking from disk
Nov 03 23:27:08 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:27:08.737466   11150 transport.go:160] Refreshing client certificate from store
Nov 03 23:27:08 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:27:08.737537   11150 certificate_store.go:119] Loading cert/key pair from "openshift.local.config/node/certificates/kubelet-client-current.pem".
Nov 03 23:27:18 ci-claytontest-ig-n-d363 origin-node[11150]: I1103 23:27:18.738160   11150 transport.go:93] certificate rotation detected, shutting down client connections to start using new credentials

Note the 10s gap between detection and refresh

@smarterclayton smarterclayton added the lgtm Indicates that a PR is ready to be merged. label Nov 3, 2017
@smarterclayton
Copy link
Contributor Author

/retest

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit ceadf12 into openshift:master Nov 4, 2017
@deads2k
Copy link
Contributor

deads2k commented Nov 6, 2017

@smarterclayton SetTransportDefaults might screw you in a real environment. We make use of that CIDR respect in ansible.

@smarterclayton
Copy link
Contributor Author

smarterclayton commented Nov 6, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. vendor-update Touching vendor dir or related files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants