
Update probe settings #1028

Merged
merged 1 commit into aws:master from remove-readiness on Jun 24, 2020

Conversation

@mogren (Contributor) commented Jun 15, 2020

Description of changes:

  • Reduce the readiness probe startup delay to 1 second
  • Increase the liveness probe startup delay to 60 seconds, since it will only fail if the CNI fails to start or has crashed
  • Reduce the shutdown grace period to 10 seconds (a sketch of the resulting settings follows)
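
For orientation, a rough sketch of what these settings map to on the aws-node DaemonSet; the surrounding manifest structure and the container name are assumptions for illustration, not copied from this PR's diff:

# Hypothetical excerpt of the aws-node DaemonSet pod spec with the new values;
# image, env, and volume settings are omitted.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 10   # down from the 30s Kubernetes default
      containers:
        - name: aws-node                  # container name assumed for illustration
          readinessProbe:
            exec:
              command: ["/app/grpc-health-probe", "-addr=:50051"]
            initialDelaySeconds: 1        # down from 35
          livenessProbe:
            exec:
              command: ["/app/grpc-health-probe", "-addr=:50051"]
            initialDelaySeconds: 60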

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mogren mogren requested a review from anguslees June 15, 2020 05:39
@anguslees (Contributor) left a comment

I think updateStrategy.rollingUpdate.maxUnavailable=10% still requires the readiness probe to determine "Available"
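
For context, a hedged sketch of the DaemonSet update strategy being referred to; only the 10% figure comes from the comment above, the rest is standard Kubernetes layout rather than anything taken from this PR:

# A DaemonSet rolling update only proceeds while at most maxUnavailable pods are
# unavailable, and a replaced pod only counts as available again once its
# readinessProbe has passed.
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: "10%"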

@mogren (Contributor Author) commented Jun 15, 2020

Oh, you are right! We might have to revisit the time it takes, though; rolling updates currently take a really long time.

@mogren mogren marked this pull request as draft June 15, 2020 07:06
@mogren mogren changed the title Remove readiness probe Update probe settings Jun 15, 2020
@mogren mogren force-pushed the remove-readiness branch 2 times, most recently from 46c810c to 2adff93 Compare June 15, 2020 16:46
@mogren mogren marked this pull request as ready for review June 15, 2020 16:50
@mogren mogren requested a review from anguslees June 15, 2020 17:03
@anguslees (Contributor) commented:

We might have to revisit the time it takes though

How were the new values arrived at?
(So we can point back at this PR and our original assumptions when we next want to change these values.)

@@ -154,6 +155,7 @@
 "name": "cni-bin-dir"
 "priorityClassName": "system-node-critical"
 "serviceAccountName": "aws-node"
+"terminationGracePeriodSeconds": 10
mogren (Contributor Author):

The default terminationGracePeriodSeconds is 30 seconds. The CNI does not really need any time to shut down, so this reduces it to 10s.

@anguslees (Contributor), Jun 17, 2020:

Why not zero (SIGKILL the pod immediately)?

@mogren (Contributor Author), Jun 17, 2020:

Here I set it to 10s on purpose, to avoid leaking ENIs. There is a shutdown hook, and that flag is checked before trying to free or add an ENI. If the shutdown happens just when ipamd has detached an ENI but before it has deleted it, we would "leak" it. It usually takes around 2 seconds for the detach to complete, but the p99 is 10s according to EC2. The same goes for creating ENIs: we want the create to complete and the ENI to be attached to the node (with the delete-on-termination flag set) so that it gets cleaned up if the node terminates.

That said, I don't think we have done enough testing on this.

Contributor:

(:rocket: This is exactly the sort of analysis I was hoping to capture)

(Two now-outdated review comments on config/master/aws-k8s-cni-cn.yaml were marked resolved.)
@mogren mogren force-pushed the remove-readiness branch 2 times, most recently from 3d8d14a to 3c24f2b Compare June 17, 2020 22:44
@@ -121,7 +121,7 @@
 "command":
 - "/app/grpc-health-probe"
 - "-addr=:50051"
-"initialDelaySeconds": 35
+"initialDelaySeconds": 1
mogren (Contributor Author):

Helped a lot! I tested on a cluster with 50 pods on 3 nodes:

kube-system   aws-node-2ph8t              0/1     ContainerCreating   0          1s      
kube-system   aws-node-2ph8t              1/1     Running             0          3s 

@haouc (Contributor) left a comment

Looks good to me. Thanks.

@mogren mogren dismissed anguslees’s stale review June 22, 2020 16:21

Changes addressed 5 days ago :)

exec: {
  command: ["/app/grpc-health-probe", "-addr=:50051"],
},
initialDelaySeconds: 60,
Contributor:

The PR description says "Increase liveness polling period". I think this PR increases the liveness initial delay, however; which one was intended? (I suspect this should have been periodSeconds=60.)

(Separately) jsonnet style tip: You can "merge" values with +. Since liveness/readiness probes are very similar, I would do this as:

livenessProbe: self.readinessProbe + {
    initialDelaySeconds: 60, // If desired, but see above comment
    periodSeconds: 60,
},

mogren (Contributor Author):

I'll fix the comment. I think, like you said in #1028 (comment), that we should just increase the initial timeout but keep the period at the default, in order to catch issues that happen after startup faster. That should give aws-node about 90 seconds to get started: a 60 s startup delay plus three failed liveness probes at the default 10 s period.
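
To spell that arithmetic out, a sketch of the liveness probe with the Kubernetes defaults written explicitly; periodSeconds and failureThreshold are the documented defaults (10 and 3), not values set by this PR:

livenessProbe:
  exec:
    command: ["/app/grpc-health-probe", "-addr=:50051"]
  initialDelaySeconds: 60   # no liveness probing for the first 60s
  periodSeconds: 10         # Kubernetes default, left unchanged here
  failureThreshold: 3       # Kubernetes default, left unchanged here
# Worst case before a restart: 60s delay + 3 failed probes x 10s period, roughly 90s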

@mogren (Contributor Author) commented Jun 23, 2020

We will need to bump this time as well:

# Checks for IPAM connectivity on localhost port 50051, retrying connectivity
# check with a timeout of 36 seconds
wait_for_ipam() {
    local __sleep_time=0
    until [ $__sleep_time -eq 8 ]; do
        sleep $((__sleep_time++))
        if ./grpc-health-probe -addr 127.0.0.1:50051 >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

* Reduce readiness probe startup delay
* Increase liveness polling period
* Reduce shutdown grace period to 10 seconds
@mogren mogren merged commit 9fea153 into aws:master Jun 24, 2020
@mogren mogren deleted the remove-readiness branch June 24, 2020 17:07