
Support Custom Pod Status for PodReadinessGate to Block Premature Pod Termination #905

Closed
BrianChristie opened this issue Apr 1, 2019 · 29 comments · Fixed by #955

Comments

@BrianChristie

Presently, when using IP target mode on the ALB, it's difficult to guarantee zero-downtime rolling updates. This is because there is additional state in the ALB Target Group which is not propagated back to Kubernetes: initial, draining, and healthy.

This was discussed in: #660 & #814

This issue is to discuss adding support to the ALB controller for setting a custom Pod status that mirrors the ALB Target Group state. This can then be used with a PodReadinessGate and maxUnavailable / PodDisruptionBudget to ensure a rolling update does not proceed faster than the new Pods become healthy on the ALB.
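For illustration, a rough sketch of how such a feature might be consumed once it exists. The condition type, resource names, and image below are placeholders and assumptions, not the controller's actual API:

    # Sketch only: the readiness gate condition type is a placeholder; the real
    # name would be whatever the controller ends up setting on the pod.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 3
      strategy:
        rollingUpdate:
          maxUnavailable: 0      # never go below the desired replica count
          maxSurge: 1
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          readinessGates:
            # The pod only counts as Ready once the controller flips this
            # condition to True, i.e. once the target is healthy in the ALB.
            - conditionType: target-health.alb.ingress.k8s.aws/my-ingress_my-service_80
          containers:
            - name: app
              image: my-app:1.0  # placeholder image
              ports:
                - containerPort: 80
    ---
    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: my-app
    spec:
      maxUnavailable: 1          # limits voluntary disruptions such as node drains
      selector:
        matchLabels:
          app: my-app

With maxUnavailable: 0 on the rolling update, the Deployment can only remove an old pod after a new pod has turned fully Ready, which with the readiness gate in place would mean healthy in the target group.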

@alfredkrohmer
Contributor

I'm working on this.

@alfredkrohmer
Contributor

I won't have time to continue working on this in the next two weeks, but I have pushed an initial implementation here: https://github.com/devkid/aws-alb-ingress-controller/tree/feature/pod-readiness-gate (it compiles and has documentation, but I haven't tested it yet nor adjusted the unit tests).

It works by reconciling the pod conditions in the same step in which it reconciles the ALB target group state. I don't think it would work in its current state, though, because right now only the Ready endpoints of a service are registered with the ALB. When a pod declares the ALB readiness gate, it would never turn Ready, because it would wait for the readiness gate to be fulfilled (i.e. for the target to turn healthy in the ALB), which would never happen because the pod was never registered with the ALB (since it was not Ready yet), so you get an endless loop. To solve this, the ALB ingress controller would need to support the publishNotReadyAddresses attribute of a Kubernetes Service, and it would need to be set to true if one wanted to use this pod readiness gate feature. Then all pods, including unready ones, would be registered with the target group, pass the health check, and only then appear Ready in Kubernetes.
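For reference, the Service attribute mentioned above is spec.publishNotReadyAddresses; a minimal sketch, with names and ports as placeholders:

    # Sketch: a Service that also publishes the addresses of not-yet-ready pods,
    # so the controller could register them with the target group early.
    apiVersion: v1
    kind: Service
    metadata:
      name: my-service
    spec:
      type: ClusterIP                  # IP target mode addresses pods directly
      publishNotReadyAddresses: true   # include unready endpoints as well
      selector:
        app: my-app
      ports:
        - port: 80
          targetPort: 80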

If someone feels like picking up from where I left, feel free to do so. Otherwise I would probably continue in ~2-3 weeks.

@tecnobrat

tecnobrat commented Jun 17, 2019

Hey @devkid -- just wanted to check in to see if you have had a chance to get back to this change at all.

This is impacting us, and we're open to helping with any PRs that need to be done.

Thanks!

@alfredkrohmer
Contributor

Hi @tecnobrat. I didn't find the time to continue with this significantly, but the functionality should be usable as far as I can tell. I created a draft PR here: #955 and added some TODOs. The controller itself builds, but the target group unit test is broken because we need to pass another parameter (the k8s client) into the targets controller. It's also still missing unit tests for the new functionality.

I'll be gone for another 2 weeks now. Maybe you can pick this up if you want to.

@alfredkrohmer alfredkrohmer mentioned this issue Jun 18, 2019
@boynux

boynux commented Jul 1, 2019

Hi,
We hit this issue in our new EKS cluster and it prevents us from moving forward with our production workload. Any progress on the proposed PR?

@alfredkrohmer
Contributor

Well, the functionality is there; I'm still waiting for reviews. You can try the code from my branch if you want to, that would already be helpful feedback.
I'll be meeting some EKS people tomorrow and will try to push to get this reviewed by some of them.

@alfredkrohmer
Contributor

alfredkrohmer commented Jul 1, 2019

By the way, a workaround would be to have at least 2+n replicas, maxUnavailable = 0+n, and an initialDelaySeconds large enough to cover the time it takes to register a target with a target group and for it to become healthy. This should prevent the last old pod from terminating until this delay has elapsed for the second new pod (which is not yet registered with the target group, but in the meantime the first new pod has been ready long enough to be healthy in the target group).
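A rough sketch of that workaround; the 60-second delay and the probe path are assumptions and would need to be tuned to the actual registration and health-check timing:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 2                  # the "at least 2+n" above
      strategy:
        rollingUpdate:
          maxUnavailable: 0        # the "0+n" above
          maxSurge: 1
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          containers:
            - name: app
              image: my-app:1.0    # placeholder image
              readinessProbe:
                httpGet:
                  path: /healthz   # placeholder path
                  port: 80
                # Must cover ALB target registration plus the time it takes
                # for the target to become healthy.
                initialDelaySeconds: 60
                periodSeconds: 10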

@gzzo

gzzo commented Jul 1, 2019

@devkid Would a pod be registered on the ALB before it's ready?

I had a similar hypothesis a while ago and I tried with 6 replicas, 0 maxUnavailable, and 60 minReadySeconds but I still saw 504s with IP target mode.

@nirnanaaa

nirnanaaa commented Jul 2, 2019

@devkid there are still some issues with your approach. I'm trying to come up with a follow-up PR to yours so we can maybe share the work on this.

Especially the endpoint and reconciliation part is very much open for discussion: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/compare/master...nirnanaaa:feature/pod-readiness-gate#diff-c6555fa828feb3fa0daf4ff60736047dR130

One question, @devkid: how do we handle ALB target group changes that Kubernetes does not receive immediately (where no reconcile will be triggered)? ingress-gce uses a poller to query their NEGs; should we follow a similar strategy?

@wirescrossed

We are having a similar issue as well. When pods are terminated, the ALB Ingress Controller rapidly updates the ALB target group; however, the changes take a few seconds to take effect, which causes a short period where clients may get 504s.

We are using IP mode, meaning each Pod is registered in the target group directly, since we require the real source IP as seen by the ALB, not the source IP kube-proxy substitutes, in order to use NetworkPolicies.

To work around this we are taking the approach of running kube-proxy without source NAT (https://kubernetes.io/docs/tutorials/services/source-ip/), though this has drawbacks as well and only masks the issue to a degree.
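The tutorial linked above relies on setting externalTrafficPolicy: Local on the Service so kube-proxy skips source NAT; a minimal sketch, with names as placeholders:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-service
    spec:
      type: NodePort
      # Route only to pods on the node that received the traffic; kube-proxy
      # then does not SNAT, so the original client IP is preserved. The
      # trade-off is uneven spreading and failed health checks on nodes
      # without a local pod.
      externalTrafficPolicy: Local
      selector:
        app: my-app
      ports:
        - port: 80
          targetPort: 80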

What I would really like to see is one of two things:

  • the ALB to keep pace with the rapid changes in k8s, or
  • k8s to give the ALB time to catch up by delaying termination of the pod.

@nirnanaaa

I would be willing to continue @devkid's and my work on this PR, maybe in collaboration with someone from AWS who could give us insight on the polling approach I mentioned earlier.

/cc @andrewtandoc

@iverberk

iverberk commented Jul 2, 2019

We are hitting this as well and it's a blocker for production workloads. @devkid @nirnanaaa I was going to review and finish this PR as well, so let's explore the problem space and collaborate on a design that we can implement. It is actually very similar to what Google needed to solve for their load balancers (see 23:00 in this video: https://youtu.be/Vw9GmSeomFg).

@nirnanaaa

nirnanaaa commented Jul 2, 2019

@iverberk you're absolutely right. We were able to solve most of the challenges by taking a look at ingress-gce (this is the relevant piece: https://github.com/kubernetes/ingress-gce/tree/5b2fd1df8634ca02aaa10072d7769a2544d39cb5/pkg/neg/readiness).

Except for the poller, everything works just fine (we didn't figure out a way to make that work).

Maybe you already have something to overcome that issue? See @devkid's PR and my changes here: https://github.com/kubernetes-sigs/aws-alb-ingress-controller/compare/master...nirnanaaa:feature/pod-readiness-gate#diff-c6555fa828feb3fa0daf4ff60736047dR130

@iverberk

iverberk commented Jul 2, 2019

Not necessarily starting off fresh; it seems the implementation is almost there, and I just didn't want to duplicate efforts. I didn't know about that ingress-gce code, but I will check it out.

I guess you're suggesting polling to detect changes to target groups (targets becoming unhealthy, etc.) that happen outside the aws-alb-ingress-controller's purview and need reconciling with Pod readiness gates?

@alfredkrohmer
Contributor

@nirnanaaa @iverberk do we maybe want to meet to discuss the next steps? I would send out an invite tomorrow from my company email.

@nirnanaaa

@devkid

@iverberk and I were already in talks on Slack; I could not find you on the k8s Slack, though?

@wirescrossed

We were able to slow down the termination of pods to let the ALB catch up, thus preventing 504s, by using a lifecycle preStop hook:

        lifecycle:
          preStop:
            exec:
              command:
              - /bin/bash
              - -c
              # bash -c executes only the single script string that follows,
              # so the sleep and the graceful stop are combined into one.
              - sleep 25 && apachectl -k graceful-stop

@tecnobrat

@nirnanaaa maybe try reaching out to @M00nF1sh? They are on the k8s Slack as well.

@nirnanaaa

will do, thanks!

@iverberk

iverberk commented Aug 5, 2019

@nirnanaaa @devkid were you able to make some progress on this? I think we need to pick this up, as the ingress controller is unable to do zero-downtime deployments in its current state.

@nirnanaaa

No progress yet. I was also away on vacation and had some internal work to do. I talked to @M00nF1sh to find out whether there is something planned in terms of an external-resource watch, and he said there's a similar issue they're facing with the pull request when dealing with security group attachments.

I asked him if it would make sense to work on this and he said (quote):

i have been planning to have such controller to deal with security group attachment(currently under ip mode, the security group is detached immediately when pods is terminating, and breaks on-going connections)
if we have it, we can use it to deal with security group attachment and pod readiness gate together

Not sure what that refers to: creating the controller on our own, or waiting until the controller is there.

@nirnanaaa

I will set up a new PR to propose such a controller.

@billyshambrook

Hey, we have run into the issue described here and have also been testing how to handle it when rotating our AMIs, which requires rotating all of our EC2 instances. We rely on a CloudFormation update policy to rotate the instances in an Auto Scaling group. This adds the complication that the ALB controller itself is getting rescheduled at the same time as the other pods. We have currently found that setting the preStop hook sleep to 60 seconds mostly avoids 5xx errors.

@arminc

arminc commented Aug 26, 2019

@nirnanaaa, @iverberk, and I have been trying to implement something outside of the ALB controller (the code is here: https://github.com/nirnanaaa/kube-readiness) to test the theories, and this is what we found out:

  1. You need to set publishNotReadyAddresses, otherwise the pods are not registered with the ALB and they will never become ready.
  2. When using the PodReadinessGate, the pod can look ready in its status, but it is not actually Ready until the gate is set to true. To know whether the pod is ready, you can't rely on a single status; you need to look at all of its conditions.
  3. You need to put a terminating Pod into draining in the ALB to make sure no new calls are sent to it before the Pod is gone.
  4. To achieve number 3, your Pod needs to go into the Terminating state and stay in it long enough for the ALB to finish draining.
  5. Most applications don't do number 4 properly, so you end up adding a preStop hook.
  6. Because of number 1, the ALB ingress controller will re-add the target after you put it into draining.
  7. After fixing all of this, we still noticed that the ALB keeps sending traffic to the pod (for at least 15 seconds) even while the pod is in the draining state.

After doing all of this, we were eventually able to make it work with consistent behaviour, achieving a zero-downtime rolling update for our end-to-end test.

Unsolved issues:

  1. Why does the ALB send traffic to draining pods? We think it might have something to do with keep-alive connections between the ALB and the pods, but we won't know for sure until AWS explains what is happening. @M00nF1sh did you already have contact with the ALB team about this?

Conclusion:
If the above issue is resolved, it should be possible to tackle the problem.

Alternative:
Use the current ALB ingress controller and add a long enough preStop hook, typically around 30 seconds. Because the pod is removed from the healthy targets in the Endpoints object, the ingress controller will put the pod into draining in the ALB. Because the pod is not removed immediately (thanks to the preStop hook), in-flight connections can drain.
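A rough sketch of that alternative, assuming the durations (the 30-second sleep and the grace period) are tuned to the target group's deregistration delay:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-app
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: my-app
      template:
        metadata:
          labels:
            app: my-app
        spec:
          # Must exceed the preStop sleep plus the app's own shutdown time,
          # otherwise the pod is killed before the ALB has finished draining.
          terminationGracePeriodSeconds: 60
          containers:
            - name: app
              image: my-app:1.0    # placeholder image
              lifecycle:
                preStop:
                  exec:
                    # Keep the pod running while the controller moves the
                    # target to draining and the ALB stops sending requests.
                    command: ["/bin/sh", "-c", "sleep 30"]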

@stevenpall

@nirnanaaa et al.: were you able to get a new PR set up? It seems like this issue has stalled and it's a pretty large blocker for anyone wanting to use IP target mode in a production environment. Thanks!

@nirnanaaa

@stevenpall unfortunately not. At this point I am uncertain where to continue. The repo that @arminc mentioned is, in theory, working just fine; with the controller and a preStop hook it works quite well.
If someone has an idea, I'd love to team up with them.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2020
@alfredkrohmer
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 28, 2020
@alfredkrohmer
Contributor

Hi all,

I have updated my pull request: #955. It now starts a goroutine per target group to reconcile the condition status (it quits when all targets are healthy; see the code for more details).

Testing would be very welcome.
