Rollout can sometimes run indefinitely when new pods are in a crash loop #3765

andrii-korotkov-verkada · 2024-08-04T17:11:15Z

Checklist:

I've included steps to reproduce the bug.
I've included the version of argo rollouts.

Describe the bug

Occasionally, an ArgoCD sync runs for more than a day waiting for a rollout to become healthy while the new version's pods are crash looping. The rollout is stuck at the first canary step: it neither progresses to the next step nor rolls back. The number of ready replicas keeps going up and down by a few (92, then 91 in the logs below), which might be misleading the rollouts controller into treating this as actual progress.
Note: this is different from the previously reported issue of rollouts getting stuck due to "object modified" errors; nothing in the logs indicates that is happening here.

To Reproduce

Create a deployment whose pods run fine. Create a rollout with 100 replicas and a first canary step that updates 10% of the pods, with no analysis run. Add an analysis run for the second step and beyond, though I'm not sure it's necessary. Build a new pod version that crashes on startup and try to release it. You may need to repeat this a few times.
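
The steps above could be set up with a manifest along these lines. This is only a sketch, not the manifest used in the report: the names, the image, the inline pod template (instead of referencing a separate deployment), the second `setWeight` value, and the `success-rate` AnalysisTemplate are all placeholders and assumptions.

```yaml
# Hypothetical reproduction sketch; every name and image here is a placeholder.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: crashloop-repro
spec:
  replicas: 100
  selector:
    matchLabels:
      app: crashloop-repro
  template:
    metadata:
      labels:
        app: crashloop-repro
    spec:
      containers:
        - name: app
          # Deploy a working image first, then change this to an image
          # that exits on startup to trigger the crash loop.
          image: example.com/app:good
  strategy:
    canary:
      steps:
        - setWeight: 10            # first step: 10% of pods, no analysis
        - analysis:                # analysis only from the second step on
            templates:
              - templateName: success-rate   # assumed to exist in the namespace
        - setWeight: 50
```

Once the rollout is healthy on the working image, switching `image` to the crashing build starts the canary at the 10% step, which is where the report says it gets stuck.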

Expected behavior

The rollout should be aborted automatically, with pods rolling back to the previous version; the app sync should fail and the app should enter a degraded state.

Screenshots

Version

v1.7.1

Logs

From oldest to newest (partial logs)

Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Patched: {"status":{"availableReplicas":92,"conditions":[{"lastTransitionTime":"2024-08-02T00:25:25Z","lastUpdateTime":"2024-08-02T00:25:25Z","message":"Rollout is paused","reason":"RolloutPaused","status":"False","type":"Paused"},{"lastTransitionTime":"2024-08-02T23:50:30Z","lastUpdateTime":"2024-08-02T23:50:30Z","message":"Rollout is not healthy","reason":"RolloutHealthy","status":"False","type":"Healthy"},{"lastTransitionTime":"2024-08-02T23:50:30Z","lastUpdateTime":"2024-08-02T23:50:30Z","message":"Rollout does not have minimum availability","reason":"AvailableReason","status":"False","type":"Available"},{"lastTransitionTime":"2024-08-02T23:53:21Z","lastUpdateTime":"2024-08-02T23:53:21Z","message":"RolloutCompleted","reason":"RolloutCompleted","status":"False","type":"Completed"},{"lastTransitionTime":"2024-08-02T17:21:27Z","lastUpdateTime":"2024-08-03T01:30:07Z","message":"ReplicaSet \"<new-replica-set-name>\" is progressing.","reason":"ReplicaSetUpdated","status":"True","type":"Progressing"}],"readyReplicas":92}}
Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Patched: {"status":{"availableReplicas":91,"conditions":[{"lastTransitionTime":"2024-08-02T00:25:25Z","lastUpdateTime":"2024-08-02T00:25:25Z","message":"Rollout is paused","reason":"RolloutPaused","status":"False","type":"Paused"},{"lastTransitionTime":"2024-08-02T23:50:30Z","lastUpdateTime":"2024-08-02T23:50:30Z","message":"Rollout is not healthy","reason":"RolloutHealthy","status":"False","type":"Healthy"},{"lastTransitionTime":"2024-08-02T23:50:30Z","lastUpdateTime":"2024-08-02T23:50:30Z","message":"Rollout does not have minimum availability","reason":"AvailableReason","status":"False","type":"Available"},{"lastTransitionTime":"2024-08-02T23:53:21Z","lastUpdateTime":"2024-08-02T23:53:21Z","message":"RolloutCompleted","reason":"RolloutCompleted","status":"False","type":"Completed"},{"lastTransitionTime":"2024-08-02T17:21:27Z","lastUpdateTime":"2024-08-03T01:31:21Z","message":"ReplicaSet \"<new-replica-set-name>\" is progressing.","reason":"ReplicaSetUpdated","status":"True","type":"Progressing"}],"readyReplicas":91}}
Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Started syncing Analysis at (2024-08-03 01:32:10.919158096 +0000 UTC m=+220587.756264716)
No status changes. Skipping patch
Started syncing Analysis at (2024-08-03 01:32:10.919769572 +0000 UTC m=+220587.756876202)
No status changes. Skipping patch
Reconciliation completed
Reconciliation completed

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

andrii-korotkov-verkada · 2024-08-04T17:16:26Z

I also find log entries like "message":"Rollout is not healthy","reason":"RolloutHealthy" confusing, since the message and the reason seem to send different signals.

andrii-korotkov-verkada · 2024-08-16T15:13:47Z

For now, I've set the following on Rollouts:

  progressDeadlineAbort: true
  progressDeadlineSeconds: 900
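
For context, both fields sit directly under `spec` of the Rollout. A minimal sketch of the placement, with a placeholder name and the rest of the spec left out:

```yaml
# Placement sketch only; the metadata name is a placeholder.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout
spec:
  # Seconds without progress before the rollout is considered failed.
  progressDeadlineSeconds: 900
  # Abort the rollout once that deadline is exceeded, instead of only
  # marking it as not progressing.
  progressDeadlineAbort: true
  # replicas, selector, template, and strategy omitted
```

With these values, a rollout that makes no progress for 15 minutes should be aborted. Whether the deadline actually fires when the ready replica count keeps fluctuating, as in the logs above, is not confirmed in this thread.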

andrii-korotkov-verkada added the bug label on Aug 4, 2024
