
Support Custom Pod Status for PodReadinessGate to Block Premature Pod Termination #905

Closed
BrianChristie opened this issue Apr 1, 2019 · 29 comments · Fixed by #955

Comments

@BrianChristie

Presently, when using IP target mode on the ALB, it's difficult to guarantee zero-downtime rolling updates. This is because there is additional state in the ALB Target Group which is not propagated back to Kubernetes: initial, draining, and healthy.

This was discussed in: #660 & #814

This issue is to discuss adding support to the ALB controller for setting a custom Pod status that mirrors the ALB Target Group state. This can then be used with a PodReadinessGate and maxUnavailable / PodDisruptionBudget to ensure a rolling update does not proceed faster than the new Pods become healthy on the ALB.
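For illustration, a rough sketch of how such a feature might be consumed once it exists. The condition type, resource names, and image below are placeholders and assumptions, not the controller's actual API:

    # Sketch only: the readiness gate condition type is a placeholder; the real
    # name would be whatever the controller ends up setting on the pod.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 3
      strategy:
        rollingUpdate:
          maxUnavailable: 0      # never go below the desired replica count
          maxSurge: 1
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          readinessGates:
            # The pod only counts as Ready once the controller flips this
            # condition to True, i.e. once the target is healthy in the ALB.
            - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
          containers:
            - name: app
              image: my-app:1.0  # placeholder image
              ports:
                - containerPort: 80
    ---
    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: my-app
    spec:
      maxUnavailable: 1          # limits voluntary disruptions such as node drains
      selector:
        matchLabels:
          app: my-app

With maxUnavailable: 0 on the rolling update, the Deployment can only remove an old pod after a new pod has turned fully Ready, which with the readiness gate in place would mean healthy in the target group.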

@alfredkrohmer
Contributor

I'm working on this.

@alfredkrohmer
Contributor

I won't have time to continue working on this in the next two weeks, but I have pushed an initial implementation here: https://github.com/devkid/aws-alb-ingress-controller/tree/feature/pod-readiness-gate (it compiles and has documentation, but I haven't tested it yet nor adjusted the unit tests).

It works by reconciling the pod conditions in the same step in which it reconciles the ALB target group state. I don't think it would work in its current state, though, because right now only the Ready endpoints of a service are registered with the ALB. When a pod declares the ALB readiness gate, it would never turn Ready, because it would wait for the readiness gate to be fulfilled (i.e. for the target to turn healthy in the ALB), which would never happen because the pod was never registered with the ALB (since it was not Ready yet), so you get an endless loop. To solve this, the ALB ingress controller would need to support the publishNotReadyAddresses attribute of a Kubernetes Service, and it would need to be set to true if one wanted to use this pod readiness gate feature. Then all pods, including unready ones, would be registered with the target group, pass the health check, and only then appear Ready in Kubernetes.
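For reference, the Service attribute mentioned above is spec.publishNotReadyAddresses; a minimal sketch, with names and ports as placeholders:

    # Sketch: a Service that also publishes the addresses of not-yet-ready pods,
    # so the controller could register them with the target group early.
    apiVersion: v1
    kind: Service
    metadata:
      name: my-service
    spec:
      type: ClusterIP                  # IP target mode addresses pods directly
      publishNotReadyAddresses: true   # include unready endpoints as well
      selector:
        app: my-app
      ports:
        - port: 80
          targetPort: 80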

If someone feels like picking up from where I left, feel free to do so. Otherwise I would probably continue in ~2-3 weeks.

@tecnobrat

tecnobrat commented Jun 17, 2019

Hey @devkid -- just wanted to check in to see if you have had a chance to get back to this change at all.

This is impacting us, and we're open to helping with any PRs that need to be done.

Thanks!

@alfredkrohmer
Contributor

Hi @tecnobrat. I didn't find the time to continue with this significantly, but the functionality should be usable as far as I can tell. I created a draft PR here: #955 and added some TODOs. The controller itself builds, but the target group unit test is broken because we need to pass another parameter (the k8s client) into the targets controller. It's also still missing unit tests for the new functionality.

I'll be gone for another 2 weeks now. Maybe you can pick this up if you want to.

@alfredkrohmer alfredkrohmer mentioned this issue Jun 18, 2019
@boynux

boynux commented Jul 1, 2019

Hi,
We hit this issue in our new EKS cluster and it prevents us from moving forward with our production workload. Any progress on the proposed PR?

@alfredkrohmer
Contributor

Well, the functionality is there; I'm still waiting for reviews. You can try the code from my branch if you want to, that would already be helpful feedback.
I'll be meeting some EKS people tomorrow and will try to push to get this reviewed by some of them.

@alfredkrohmer
Contributor

alfredkrohmer commented Jul 1, 2019

By the way, a workaround would be to have at least 2+n replicas, maxUnavailable = 0+n, and an initialDelaySeconds large enough to cover the time it takes to register a target with a target group and for it to become healthy. This should prevent the last old pod from terminating until this delay has elapsed for the second new pod (which is not yet registered with the target group, but in the meantime the first new pod has been ready long enough to be healthy in the target group).
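A rough sketch of that workaround; the 60-second delay and the probe path are assumptions and would need to be tuned to the actual registration and health-check timing:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 2                  # the "at least 2+n" above
      strategy:
        rollingUpdate:
          maxUnavailable: 0        # the "0+n" above
          maxSurge: 1
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: app
              image: my-app:1.0    # placeholder image
              readinessProbe:
                httpGet:
                  path: /healthz   # placeholder path
                  port: 80
                # Must cover ALB target registration plus the time it takes
                # for the target to become healthy.
                initialDelaySeconds: 60
                periodSeconds: 10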

@gzzo

gzzo commented Jul 1, 2019

@devkid Would a pod be registered on the ALB before it's ready?

I had a similar hypothesis a while ago and I tried with 6 replicas, 0 maxUnavailable, and 60 minReadySeconds but I still saw 504s with IP target mode.

@nirnanaaa

nirnanaaa commented Jul 2, 2019

@devkid there are still some issues with your approach. I'm trying to come up with a follow-up PR to yours so we can maybe share the work on this.

Especially the endpoint and reconciliation part is very much open for discussion: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/compare/master...nirnanaaa:feature/pod-readiness-gate#diff-c6555fa828feb3fa0daf4ff60736047dR130

One question, @devkid: how do we handle ALB target group changes that Kubernetes does not receive immediately (where no reconcile will be triggered)? ingress-gce uses a poller to query their NEGs; should we follow a similar strategy?

@wirescrossed

We are having a similar issue as well. When pods are terminated, the ALB Ingress Controller rapidly updates the ALB target group; however, the changes take a few seconds to take effect, which causes a short period where clients may get 504s.

We are using IP mode, meaning each Pod is registered in the target group directly, since we require the real source IP as seen by the ALB, not the source IP kube-proxy substitutes, in order to use NetworkPolicies.

To work around this we are taking the approach of running kube-proxy without source NAT (https://kubernetes.io/docs/tutorials/services/source-ip/), though this has drawbacks as well and only masks the issue to a degree.
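The tutorial linked above relies on setting externalTrafficPolicy: Local on the Service so kube-proxy skips source NAT; a minimal sketch, with names as placeholders:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-service
    spec:
      type: NodePort
      # Route only to pods on the node that received the traffic; kube-proxy
      # then does not SNAT, so the original client IP is preserved. The
      # trade-off is uneven spreading and failed health checks on nodes
      # without a local pod.
      externalTrafficPolicy: Local
      selector:
        app: my-app
      ports:
        - port: 80
          targetPort: 80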

What I would really like to see is one of two things:

  • the ALB to keep pace with the rapid changes in k8s, or
  • k8s to give the ALB time to catch up by delaying termination of the pod.

@nirnanaaa

I would be willing to continue @devkid's and my work on this PR, maybe in collaboration with someone from AWS who could give us insight on the polling approach I mentioned earlier.

/cc @andrewtandoc

@iverberk

iverberk commented Jul 2, 2019

We are hitting this as well and it's a blocker for production workloads. @devkid @nirnanaaa I was going to review and finish this PR as well, so let's explore the problem space and collaborate on a design that we can implement. It is actually very similar to what Google needed to solve for their load balancers (see 23:00 in this video: https://youtu.be/Vw9GmSeomFg).

@nirnanaaa

nirnanaaa commented Jul 2, 2019

@iverberk you're absolutely right. We were able to solve most of the challenges by taking a look at ingress-gce (this is the relevant piece: https://github.com/kubernetes/ingress-gce/tree/5b2fd1df8634ca02aaa10072d7769a2544d39cb5/pkg/neg/readiness).

Except for the poller, everything works just fine (we didn't figure out a way to make that work).

Maybe you already have something to overcome that issue? See @devkid's PR and my changes here: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/compare/master...nirnanaaa:feature/pod-readiness-gate#diff-c6555fa828feb3fa0daf4ff60736047dR130

@iverberk

iverberk commented Jul 2, 2019

Not necessarily starting off fresh; it seems the implementation is almost there, and I just didn't want to duplicate efforts. I didn't know about that ingress-gce code, but I will check it out.

I guess you're suggesting polling to detect changes to target groups (targets becoming unhealthy, etc.) that happen outside the aws-alb-ingress-controller's purview and need reconciling with Pod readiness gates?

@alfredkrohmer
Contributor

@nirnanaaa @iverberk do we maybe want to meet to discuss the next steps? I would send out an invite tomorrow from my company email.

@nirnanaaa

@devkid

@iverberk and I were already in talks on Slack; I could not find you on the k8s Slack, though?

@wirescrossed

We were able to slow down the termination of pods to let the ALB catch up, thus preventing 504s, by using a lifecycle preStop hook:

        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              # bash -c executes only the single script string that follows,
              # so the sleep and the graceful stop are combined into one.
              - sleep 25 && apachectl -k graceful-stop

@tecnobrat

@nirnanaaa maybe try reaching out to @M00nF1sh? They are on the k8s Slack as well.

@nirnanaaa

will do, thanks!

@iverberk

iverberk commented Aug 5, 2019

@nirnanaaa @devkid were you able to make some progress on this? I think we need to pick this up, as the ingress controller is unable to do zero-downtime deployments in its current state.

@nirnanaaa

No progress yet. I was also away on vacation and had some internal work to do. I talked to @M00nF1sh to find out whether there is something planned in terms of an external-resource watch, and he said there's a similar issue they're facing with the pull request when dealing with security group attachments.

I asked him if it would make sense to work on this and he said (quote):

i have been planning to have such controller to deal with security group attachment(currently under ip mode, the security group is detached immediately when pods is terminating, and breaks on-going connections)
if we have it, we can use it to deal with security group attachment and pod readiness gate together

Not sure what that refers to: creating the controller on our own, or waiting until the controller is there.

@nirnanaaa

I will set up a new PR to propose such a controller.

@billyshambrook

Hey, we have run into the issue described here and have also been testing how to handle it when rotating our AMIs, which requires rotating all of our EC2 instances. We rely on a CloudFormation update policy to rotate the instances in an Auto Scaling group. This adds the complication that the ALB controller itself is getting rescheduled at the same time as the other pods. We have currently found that setting the preStop hook sleep to 60 seconds mostly avoids 5xx errors.

@arminc

arminc commented Aug 26, 2019

@nirnanaaa, @iverberk, and I have been trying to implement something outside of the ALB controller (the code is here: https://github.com/nirnanaaa/kube-readiness) to test the theories, and this is what we found out:

  1. You need to set publishNotReadyAddresses, otherwise the pods are not registered with the ALB and they will never become ready.
  2. When using the PodReadinessGate, the pod can look ready in its status, but it is not actually Ready until the gate is set to true. To know whether the pod is ready, you can't rely on a single status; you need to look at all of its conditions.
  3. You need to put a terminating Pod into draining in the ALB to make sure no new calls are sent to it before the Pod is gone.
  4. To achieve number 3, your Pod needs to go into the Terminating state and stay in it long enough for the ALB to finish draining.
  5. Most applications don't do number 4 properly, so you end up adding a preStop hook.
  6. Because of number 1, the ALB ingress controller will re-add the target after you put it into draining.
  7. After fixing all of this, we still noticed that the ALB keeps sending traffic to the pod (for at least 15 seconds) even while the pod is in the draining state.

After doing all of this, we were eventually able to make it work with consistent behaviour, achieving a zero-downtime rolling update for our end-to-end test.

Unsolved issues:

  1. Why does the ALB send traffic to draining pods? We think it might have something to do with keep-alive connections between the ALB and the pods, but we won't know for sure until AWS explains what is happening. @M00nF1sh did you already have contact with the ALB team about this?

Conclusion:
If the above issue is resolved, it should be possible to tackle the problem.

Alternative:
Use the current ALB ingress controller and add a long enough preStop hook, typically around 30 seconds. Because the pod is removed from the healthy targets in the Endpoints object, the ingress controller will put the pod into draining in the ALB. Because the pod is not removed immediately (thanks to the preStop hook), in-flight connections can drain.
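A rough sketch of that alternative, assuming the durations (the 30-second sleep and the grace period) are tuned to the target group's deregistration delay:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          # Must exceed the preStop sleep plus the app's own shutdown time,
          # otherwise the pod is killed before the ALB has finished draining.
          terminationGracePeriodSeconds: 60
          containers:
            - name: app
              image: my-app:1.0    # placeholder image
              lifecycle:
                preStop:
                  exec:
                    # Keep the pod running while the controller moves the
                    # target to draining and the ALB stops sending requests.
                    command: ["/bin/sh", "-c", "sleep 30"]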

@stevenpall

@nirnanaaa et al.: were you able to get a new PR set up? It seems like this issue has stalled and it's a pretty large blocker for anyone wanting to use IP target mode in a production environment. Thanks!

@nirnanaaa

@stevenpall unfortunately not. At this point I am uncertain where to continue. The repo that @arminc mentioned is, in theory, working just fine; with the controller and a preStop hook it works quite well.
If someone has an idea, I'd love to team up with them.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2020
@alfredkrohmer
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2020
@alfredkrohmer
Contributor

Hi all,

I have updated my pull request: #955. It now starts a goroutine per target group to reconcile the condition status (it quits when all targets are healthy; see the code for more details).

Testing would be very welcome.
