Rollout can sometimes run indefinitely when new pods are in a crash loop #3765

Open
2 tasks done
andrii-korotkov-verkada opened this issue Aug 4, 2024 · 2 comments
Labels
bug Something isn't working

Comments

andrii-korotkov-verkada commented Aug 4, 2024

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug

Occasionally, an ArgoCD sync runs for more than a day waiting for a rollout to become healthy while the new version's pods are crash looping. The rollout is stuck at the first canary step: it neither progresses to the next step nor rolls back. The number of ready replicas keeps going up and down by a few, which might lead the rollouts controller to conclude that progress is still being made.
Note: this is a different issue from the previously reported rollouts stuck due to "object modified" errors; nothing in the logs indicates that is happening here.

To Reproduce

Create a deployment whose pods run fine. Create a rollout with 100 replicas and a first canary step that updates 10% of the pods, with no analysis run. Add an analysis run for the second step and beyond (this may not be necessary). Build a new pod version that crashes on startup and try to release it. The issue is intermittent, so this may need to be repeated a few times.
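
A manifest along the following lines might match the setup described above. It is only a sketch: the rollout name, labels, image, and analysis template name are hypothetical placeholders rather than values from the affected environment, and the pod template is inlined here for brevity instead of referencing a separate deployment.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout            # hypothetical name
spec:
  replicas: 100
  selector:
    matchLabels:
      app: example-app             # hypothetical label
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          # Hypothetical image; the "new version" would be one that crashes on startup.
          image: example/app:v2
  strategy:
    canary:
      steps:
        # First step: shift 10% of pods to the new version, no analysis.
        - setWeight: 10
        # Second step and beyond: run an analysis (may not be needed to reproduce).
        - analysis:
            templates:
              - templateName: example-analysis   # hypothetical AnalysisTemplate
        - setWeight: 50
```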

Expected behavior

The rollout should be aborted automatically, with pods rolling back to the previous version; the app sync should fail and the app should enter a degraded state.

Screenshots

Version

v1.7.1

Logs

From oldest to newest (partial logs)

Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Patched: {"status":{"availableReplicas":92,"conditions":[{"lastTransitionTime":"2024-08-02T00:25:25Z","lastUpdateTime":"2024-08-02T00:25:25Z","message":"Rollout is paused","reason":"RolloutPaused","status":"False","type":"Paused"},{"lastTransitionTime":"2024-08-02T23:50:30Z","lastUpdateTime":"2024-08-02T23:50:30Z","message":"Rollout is not healthy","reason":"RolloutHealthy","status":"False","type":"Healthy"},{"lastTransitionTime":"2024-08-02T23:50:30Z","lastUpdateTime":"2024-08-02T23:50:30Z","message":"Rollout does not have minimum availability","reason":"AvailableReason","status":"False","type":"Available"},{"lastTransitionTime":"2024-08-02T23:53:21Z","lastUpdateTime":"2024-08-02T23:53:21Z","message":"RolloutCompleted","reason":"RolloutCompleted","status":"False","type":"Completed"},{"lastTransitionTime":"2024-08-02T17:21:27Z","lastUpdateTime":"2024-08-03T01:30:07Z","message":"ReplicaSet \"<new-replica-set-name>\" is progressing.","reason":"ReplicaSetUpdated","status":"True","type":"Progressing"}],"readyReplicas":92}}
Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Patched: {"status":{"availableReplicas":91,"conditions":[{"lastTransitionTime":"2024-08-02T00:25:25Z","lastUpdateTime":"2024-08-02T00:25:25Z","message":"Rollout is paused","reason":"RolloutPaused","status":"False","type":"Paused"},{"lastTransitionTime":"2024-08-02T23:50:30Z","lastUpdateTime":"2024-08-02T23:50:30Z","message":"Rollout is not healthy","reason":"RolloutHealthy","status":"False","type":"Healthy"},{"lastTransitionTime":"2024-08-02T23:50:30Z","lastUpdateTime":"2024-08-02T23:50:30Z","message":"Rollout does not have minimum availability","reason":"AvailableReason","status":"False","type":"Available"},{"lastTransitionTime":"2024-08-02T23:53:21Z","lastUpdateTime":"2024-08-02T23:53:21Z","message":"RolloutCompleted","reason":"RolloutCompleted","status":"False","type":"Completed"},{"lastTransitionTime":"2024-08-02T17:21:27Z","lastUpdateTime":"2024-08-03T01:31:21Z","message":"ReplicaSet \"<new-replica-set-name>\" is progressing.","reason":"ReplicaSetUpdated","status":"True","type":"Progressing"}],"readyReplicas":91}}
Enqueueing parent of default/<new-replica-set-name>: Rollout default/<rollout-name>
Started syncing Analysis at (2024-08-03 01:32:10.919158096 +0000 UTC m=+220587.756264716)
No status changes. Skipping patch
Started syncing Analysis at (2024-08-03 01:32:10.919769572 +0000 UTC m=+220587.756876202)
No status changes. Skipping patch
Reconciliation completed
Reconciliation completed

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@andrii-korotkov-verkada andrii-korotkov-verkada added the bug Something isn't working label Aug 4, 2024
@andrii-korotkov-verkada
Author

I also find log entries like "message":"Rollout is not healthy","reason":"RolloutHealthy" confusing, since the message and the reason seem to send opposite signals.

@andrii-korotkov-verkada
Author

For now, I've set the following on Rollouts:

  progressDeadlineAbort: true
  progressDeadlineSeconds: 900
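
For context, both fields sit directly under the Rollout's spec. The fragment below sketches the placement; the rollout name is a hypothetical placeholder and the rest of the spec is omitted. Whether this reliably prevents the stuck state is not confirmed here, since the logs above show the Progressing condition's lastUpdateTime still being refreshed.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: example-rollout          # hypothetical name
spec:
  # Time in seconds the rollout may go without progress before it is
  # considered to have exceeded its deadline (900 s = 15 minutes).
  progressDeadlineSeconds: 900
  # Abort the rollout when the progress deadline is exceeded,
  # instead of only marking it as degraded.
  progressDeadlineAbort: true
  # replicas, selector, template, and strategy stay as they were.
```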
