
Upgrading to Flagger built on main (commit id #133fdec) causes canary rollouts when no change in deployments #1673

jdgeisler opened this issue Jun 28, 2024 · 6 comments

@jdgeisler

Describe the bug

I've been working on an MR for issue #1646 and ran into the following bug when testing Flagger in my personal Kubernetes cluster. I've also reached out in the Slack channel here.

When running my own Docker image built from main (commit id #133fdec), I am seeing canary rollouts triggered even though there were no changes to my canary deployment spec. As soon as Flagger was upgraded to this image, the canaries detected a new revision and began the analysis.

{"level":"info","ts":"2024-06-25T15:01:58.687Z","caller":"controller/controller.go:307","msg":"Synced fortio/fortio-server-deployment-2"}
{"level":"info","ts":"2024-06-25T15:02:08.802Z","caller":"controller/events.go:33","msg":"New revision detected! Scaling up fortio-server-deployment-2.fortio","canary":"fortio-server-deployment-2.fortio"}
{"level":"info","ts":"2024-06-25T15:02:38.791Z","caller":"controller/events.go:33","msg":"Starting canary analysis for fortio-server-deployment-2.fortio","canary":"fortio-server-deployment-2.fortio"}

To confirm, I also compared the deployment specs 1:1 and nothing changed. This should mean the calculated hash stays the same, but for some reason the lastAppliedSpec hash in the canary was different.

No matter what, the Docker image from v1.37 results in this hash in the canary:
lastAppliedSpec: 5454669676
lastPromotedSpec: 5454669676

Then, when upgraded to the image built from main, it results in this new hash in the canary:
lastAppliedSpec: 5475b95f98
lastPromotedSpec: 5475b95f98

For a sanity check, I also built a custom image from the latest tag, v1.37, and confirmed that the canary analysis is not triggered after upgrading and that the hash remains the same.
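For reference, the hashes above come from the Canary status and can be inspected with a command along these lines (a sketch; the canary name and namespace are taken from the logs above):

kubectl -n fortio get canary fortio-server-deployment-2 \
  -o jsonpath='{.status.lastAppliedSpec}{"\n"}{.status.lastPromotedSpec}{"\n"}'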

To Reproduce

  1. Build a Docker image from main (commit id #133fdec)
  2. Upgrade Flagger to the new image in a Kubernetes cluster with canary deployments
  3. See that the canary analysis is triggered

Expected behavior

It is expected that upgrading Flagger does not cause canary rollouts to be triggered if nothing changes in the canary deployments.

Additional context

  • Flagger version: main (commit id #133fdec)
  • Kubernetes version: 1.27
  • Service Mesh provider: istio 1.19.7
@aryan9600
Member

hey, this is probably because of #1638. The upgrade to the Kubernetes 1.30 libraries means that sidecar containers are now handled via .initContainers[].restartPolicy. Thus, when Flagger restarts, it reads this field and produces a different checksum. It's an unavoidable side effect that isn't caused by changes in Flagger itself, but by the upstream Kubernetes libraries. The release notes for the next version will include a warning about this.

cc: @stefanprodan
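For illustration, here is a minimal Go sketch (not Flagger's actual hashing code, and assuming the k8s.io/api/core/v1 types are available) of why the checksum changes: serializing the same pod spec with libraries that populate .initContainers[].restartPolicy produces different bytes, and therefore a different hash.

// Minimal sketch, not Flagger's actual code: hash a serialized pod spec
// before and after .initContainers[].restartPolicy is populated.
package main

import (
    "encoding/json"
    "fmt"
    "hash/fnv"

    corev1 "k8s.io/api/core/v1"
)

// checksum serializes the pod spec to JSON and hashes it, similar in spirit
// to how a controller derives a revision hash from the applied spec.
func checksum(spec corev1.PodSpec) string {
    b, _ := json.Marshal(spec)
    h := fnv.New32a()
    h.Write(b)
    return fmt.Sprintf("%x", h.Sum32())
}

func main() {
    // How an init container looked to older client libraries (placeholder names).
    before := corev1.PodSpec{
        InitContainers: []corev1.Container{{Name: "sidecar", Image: "example/sidecar"}},
    }

    // The Kubernetes 1.30 libraries surface native sidecars via
    // restartPolicy: Always on the init container, so the same workload
    // round-trips with an extra field set.
    always := corev1.ContainerRestartPolicyAlways
    after := corev1.PodSpec{
        InitContainers: []corev1.Container{{
            Name:          "sidecar",
            Image:         "example/sidecar",
            RestartPolicy: &always,
        }},
    }

    // The serialized specs differ, so the checksums differ as well.
    fmt.Println(checksum(before), checksum(after))
}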

@jdgeisler
Author

jdgeisler commented Jul 1, 2024

Hey @aryan9600, thank you for the follow-up.

We have hundreds of workloads using Flagger, so upgrading would cause all of the canaries to be triggered, which is not ideal. Is there any way this can be avoided in Flagger's hash calculation so that a dependency upgrade doesn't trigger a false rollout?

@aryan9600
Member

I can't think of any clean way to avoid this.

@stefanprodan
Member

You could pause all canaries with .spec.suspend and enable them one by one when there is a change to the actual deployment. If you use Flux and everything is in Git, the commit that bumps the app version could also remove the suspend field.
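A minimal sketch of that workaround, assuming an existing Canary resource (names taken from the logs above, everything else elided):

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: fortio-server-deployment-2
  namespace: fortio
spec:
  # Pause analysis and promotion; remove (or set to false) once the next
  # real change to the deployment should roll out.
  suspend: true
  # ...rest of the existing canary spec stays unchanged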

@lcooper40

lcooper40 commented Jul 30, 2024

Hi @stefanprodan @aryan9600

Is there nothing that can be done around this issue?

I've got the exact same problem with hundreds of canaries that can't all be started at once, as they exceed a rate limit for the external metrics provider used for the analysis.

I've tried setting suspend: true on a canary, but we're using ArgoCD, which doesn't see the suspend field as something it needs to remove when resyncing.

The only option we have at the moment is to release the canaries in batches following an upgrade, which would be a very time-consuming process to repeat regularly, as we tend to take the latest release as soon as it becomes available.

@snuggie12

I'm not sure if there are other obvious causes, but we've experienced this through plenty of other upgrades in the past. Maybe they also contained library updates similar to the 1.30 one, but I wanted to provide that data point in case it's helpful.

Is there anything that could be done on the controller side to perform something similar to #1673 (comment)?
