
Upgrading to Flagger built on main (commit id #133fdec) causes canary rollouts when no change in deployments #1673

jdgeisler opened this issue Jun 28, 2024 · 6 comments

@jdgeisler

Describe the bug

I've been working on an MR for issue #1646 and ran into the following bug when testing Flagger in my personal Kubernetes cluster. I've also reached out in the Slack channel here.

When running my own Docker image built from main (commit id #133fdec), I am seeing canary rollouts triggered even though there were no changes to my canary deployment spec. As soon as Flagger was upgraded to this image, the canaries detected a new revision and began the analysis.

{"level":"info","ts":"2024-06-25T15:01:58.687Z","caller":"controller/controller.go:307","msg":"Synced fortio/fortio-server-deployment-2"}
{"level":"info","ts":"2024-06-25T15:02:08.802Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up fortio-server-deployment-2.fortio","canary":"fortio-server-deployment-2.fortio"}
{"level":"info","ts":"2024-06-25T15:02:38.791Z","caller":"controller/events.go:33","msg":"Starting canary analysis for fortio-server-deployment-2.fortio","canary":"fortio-server-deployment-2.fortio"}

To confirm, I also compared the deployment specs 1:1 and nothing changed. This should mean the calculated hash stays the same, but for some reason the lastAppliedSpec hash in the canary was different.

No matter what, the Docker image from v1.37 results in this hash in the canary:
lastAppliedSpec: 5454669676
lastPromotedSpec: 5454669676

Then, when upgraded to the image built from main, it results in this new hash in the canary:
lastAppliedSpec: 5475b95f98
lastPromotedSpec: 5475b95f98

For a sanity check, I also built a custom image from the latest tag, v1.37, and confirmed that the canary analysis is not triggered after upgrading and that the hash remains the same.
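For reference, the hashes above come from the Canary status and can be inspected with a command along these lines (a sketch; the canary name and namespace are taken from the logs above):

kubectl -n fortio get canary fortio-server-deployment-2 \
  -o jsonpath='{.status.lastAppliedSpec}{"\n"}{.status.lastPromotedSpec}{"\n"}'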

To Reproduce

  1. Build a Docker image from main (commit id #133fdec)
  2. Upgrade Flagger to the new image in a Kubernetes cluster with canary deployments
  3. See that the canary analysis is triggered

Expected behavior

It is expected that upgrading Flagger does not cause canary rollouts to be triggered if nothing changes in the canary deployments.

Additional context

  • Flagger version: main (commit id #133fdec)
  • Kubernetes version: 1.27
  • Service Mesh provider: istio 1.19.7
@aryan9600
Member

hey, this is probably because of #1638. The upgrade to the Kubernetes 1.30 libraries means that sidecar containers are now handled via .initContainers[].restartPolicy. Thus, when Flagger restarts, it reads this field and produces a different checksum. It's an unavoidable side effect that isn't caused by changes in Flagger itself, but by the upstream Kubernetes libraries. The release notes for the next version will include a warning about this.

cc: @stefanprodan
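For illustration, here is a minimal Go sketch (not Flagger's actual hashing code, and assuming the k8s.io/api/core/v1 types are available) of why the checksum changes: serializing the same pod spec with libraries that populate .initContainers[].restartPolicy produces different bytes, and therefore a different hash.

// Minimal sketch, not Flagger's actual code: hash a serialized pod spec
// before and after .initContainers[].restartPolicy is populated.
package main

import (
    "encoding/json"
    "fmt"
    "hash/fnv"

    corev1 "k8s.io/api/core/v1"
)

// checksum serializes the pod spec to JSON and hashes it, similar in spirit
// to how a controller derives a revision hash from the applied spec.
func checksum(spec corev1.PodSpec) string {
    b, _ := json.Marshal(spec)
    h := fnv.New32a()
    h.Write(b)
    return fmt.Sprintf("%x", h.Sum32())
}

func main() {
    // How an init container looked to older client libraries (placeholder names).
    before := corev1.PodSpec{
        InitContainers: []corev1.Container{{Name: "sidecar", Image: "example/sidecar"}},
    }

    // The Kubernetes 1.30 libraries surface native sidecars via
    // restartPolicy: Always on the init container, so the same workload
    // round-trips with an extra field set.
    always := corev1.ContainerRestartPolicyAlways
    after := corev1.PodSpec{
        InitContainers: []corev1.Container{{
            Name:          "sidecar",
            Image:         "example/sidecar",
            RestartPolicy: &always,
        }},
    }

    // The serialized specs differ, so the checksums differ as well.
    fmt.Println(checksum(before), checksum(after))
}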

@jdgeisler
Author

jdgeisler commented Jul 1, 2024

Hey @aryan9600, thank you for the follow-up.

We have hundreds of workloads using Flagger, so upgrading would cause all of the canaries to be triggered, which is not ideal. Is there any way this can be avoided in Flagger's hash calculation so that a dependency upgrade doesn't trigger a false rollout?

@aryan9600
Member

I can't think of any clean way to avoid this.

@stefanprodan
Member

You could pause all canaries with .spec.suspend and enable them one by one when there is a change to the actual deployment. If you use Flux and everything is in Git, the commit that bumps the app version could also remove the suspend field.
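A minimal sketch of that workaround, assuming an existing Canary resource (names taken from the logs above, everything else elided):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: fortio-server-deployment-2
  namespace: fortio
spec:
  # Pause analysis and promotion; remove (or set to false) once the next
  # real change to the deployment should roll out.
  suspend: true
  # ...rest of the existing canary spec stays unchanged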

@lcooper40

lcooper40 commented Jul 30, 2024

Hi @stefanprodan @aryan9600

Is there nothing that can be done around this issue?

I've got the exact same problem with hundreds of canaries that can't all be started at once, as they exceed a rate limit for the external metrics provider used for the analysis.

I've tried setting suspend: true on a canary, but we're using ArgoCD, which doesn't see the suspend field as something it needs to remove when resyncing.

The only option we have at the moment is to release the canaries in batches following an upgrade, which would be a very time-consuming process to repeat regularly, as we tend to take the latest release as soon as it becomes available.

@snuggie12

I'm not sure if there are other obvious causes, but we've experienced this through plenty of other upgrades in the past. Maybe they also contained library updates similar to the 1.30 one, but I wanted to provide that data point in case it's helpful.

Is there anything that could be done on the controller side to perform something similar to #1673 (comment)?
