Downtime On initial deployment using Istio destination rule #2507

Closed
2 tasks done
dearboll23 opened this issue Jan 6, 2023 · 8 comments · Fixed by #3602
Labels
bug (Something isn't working) · bug-reproduced · istio
Milestone

Comments

@dearboll23

dearboll23 commented Jan 6, 2023

Checklist:

  • I've included steps to reproduce the bug.
  • I've included the version of argo rollouts.

Describe the bug
When we initialized a Rollout for an existing Deployment with Istio traffic routing, we found that the rollout controller created a new ReplicaSet from the current RS template with a new label key, 'rollouts-pod-template-hash', and immediately updated the Istio DestinationRule with that label, without considering whether the new ReplicaSet was available. This leads to HTTP 503 "no healthy upstream" errors.

To Reproduce

  1. Create the VirtualService & DestinationRule resources as this.
  2. Create the Rollout resource following the migration doc.
  3. While the Rollout resource is being created, the rollout controller creates a ReplicaSet for it and immediately updates the Istio DestinationRule and VirtualService resources, which leads to a "no healthy upstream" error because the ReplicaSet is not yet ready for traffic.
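For context, the subset-level setup from step 1 looks roughly like this (a sketch only; the resource name, host, and labels are placeholders, not the exact manifest from the linked example):

```yaml
# Hypothetical DestinationRule for subset-level traffic splitting.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: istio-rollout-destrule
spec:
  host: istio-rollout
  subsets:
  - name: canary    # referenced by the Rollout's canarySubsetName
    labels:
      app: istio-rollout
  - name: stable    # referenced by the Rollout's stableSubsetName
    labels:
      app: istio-rollout
```

On the first reconcile the controller injects rollouts-pod-template-hash: &lt;new-RS-hash&gt; into both subset label sets, so Istio starts selecting pods of the brand-new ReplicaSet before any of them are Ready.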

Expected behavior
The rollout controller should only shift traffic to the Rollout's ReplicaSet once that ReplicaSet is available.

Version
argo-rollouts: v1.2.2

Could we add checks to ensure the new ReplicaSet is available before updating the Istio DestinationRule, similar to what ensureSVCTarget does?


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

@dearboll23 dearboll23 added the bug Something isn't working label Jan 6, 2023
@dearboll23
Author

@zachaller please take a look, thanks.

@zachaller
Collaborator

Another user reported the same issue: https://cloud-native.slack.com/archives/C01U781DW2E/p1674472008986509

@zachaller zachaller added this to the v1.5 milestone Jan 23, 2023
@zachaller zachaller modified the milestones: v1.5, v1.6 May 22, 2023
@lironrisk

+1, we are also experiencing this issue. Is there an estimate for when it will be fixed? @zachaller

@shaoye

shaoye commented Sep 5, 2023

We have a workaround to mitigate this downtime on the initial deployment:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: app-vs
spec:
  {{- if .Values.canary.intialDeployment }}
  # do not specify the host here, so the routing rules won't affect requests to the original deployment
  # this guarantees zero downtime when installing argo-rollouts for the first time
  hosts:
  - app.svc.cluster.local
  {{- end }}

After the initial deployment, we removed the intialDeployment flag and everything worked perfectly.
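For completeness, the corresponding toggle might look like this in values.yaml (a sketch; the value name simply matches the one used in the template above):

```yaml
# values.yaml on the very first install; remove (or set to false) afterwards
canary:
  intialDeployment: true  # spelling kept to match the template above
```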

@bsura-brex

We just saw this happen in our infrastructure. I think this happens on every deployment, not just the initial creation of the Rollout resource. I don't see any "wait for the canary RS to be available" step before updating the Istio VirtualService, which means traffic gets routed to the canary pods even before they have spun up. This causes "no healthy upstream" errors in our case (gRPC traffic). Can we please add something similar to ensureSVCTarget for folks who are using subset-level traffic splitting?

As a workaround, we have added a first step that pauses for a configured duration so that the canary ReplicaSet can come up before it starts serving traffic.

steps:
  # Step 1: Scale canary up to 25%, then wait 10 minutes to give the pods time to start up.
  # (The pause must be its own step: each canary step may contain only one action.)
  - setCanaryScale:
      weight: 25
  - pause:
      duration: 10m
  - setHeaderRoute:
      ...
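In context, this workaround might look roughly like the following Rollout fragment (a sketch; resource names, weights, and durations are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  strategy:
    canary:
      trafficRouting:
        istio:
          virtualService:
            name: app-vs
          destinationRule:
            name: app-destrule        # subset-level traffic splitting
            canarySubsetName: canary
            stableSubsetName: stable
      steps:
      # Scale the canary first, then pause so its pods can become Ready
      # before any traffic is shifted to the canary subset.
      - setCanaryScale:
          weight: 25
      - pause:
          duration: 10m
      - setWeight: 25
```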

@mateuszbanasiewicz

@zachaller We also ran into this bug while implementing the Argo Rollouts process for Istio. Is there any chance this problem will be fixed?

@wmuizelaar
Contributor

From my extensive testing while creating a fix for this, I noticed that on the final step of a canary release the DestinationRule is also updated to 100% before the canary ReplicaSet is at its full desired capacity, which can limit availability as well. My PR above fixes that issue too.

wmuizelaar added a commit to wmuizelaar/argo-rollouts that referenced this issue Jul 8, 2024
…estinationRule Subsets. Fixes argoproj#2507

Signed-off-by: Wietse Muizelaar <wmuizelaar@bol.com>
@jeanmorais

jeanmorais commented Aug 14, 2024

We are also experiencing this error. I noticed it happens even when the Rollout is created for the first time.

I created a new Rollout with paused: true, and it immediately changed the DestinationRule by adding the label rollouts-pod-template-hash: 59f7696587, which removed all traffic from the Deployment (we are migrating from Deployment to Rollout), causing downtime.

➜ argo kubectl -n rollouts-demo-istio  get rs   
NAME                       DESIRED   CURRENT   READY   AGE
istio-rollout-59f7696587   0         0         0       14m
istio-rollout-5b5d5f9776   3         3         3       17m
➜ kubectl -n rollouts-demo-istio  get pods --show-labels
➜ kubectl argo rollouts -n rollouts-demo-istio get rollout istio-rollout
Name:            istio-rollout
Namespace:       rollouts-demo-istio
Status:          ॥ Paused
Message:         manually paused
Strategy:        Canary
  Step:          3/3
  SetWeight:     100
  ActualWeight:  100
Replicas:
  Desired:       1
  Current:       0
  Updated:       0
  Ready:         0
  Available:     0

NAME                              KIND        STATUS        AGE  INFO
⟳ istio-rollout                   Rollout     ॥ Paused      14m  
└──# revision:1                                                  
   └──⧉ istio-rollout-59f7696587  ReplicaSet  • ScaledDown  14m  stable

The DestinationRule subsets now look like this:

  subsets:
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 59f7696587
    name: canary
  - labels:
      app: istio-rollout
      rollouts-pod-template-hash: 59f7696587
    name: stable
➜ kubectl -n rollouts-demo-istio  get rs   
NAME                       DESIRED   CURRENT   READY   AGE
istio-rollout-59f7696587   0         0         0       14m # all the traffic is pointed to this empty RS
istio-rollout-5b5d5f9776   3         3         3       17m

We're looking forward to this fix as well.

zachaller pushed a commit to wmuizelaar/argo-rollouts that referenced this issue Aug 27, 2024
…estinationRule Subsets. Fixes argoproj#2507

Signed-off-by: Wietse Muizelaar <wmuizelaar@bol.com>
zachaller pushed a commit that referenced this issue Aug 27, 2024
…estinationRule Subsets. Fixes #2507 (#3602)

Signed-off-by: Wietse Muizelaar <wmuizelaar@bol.com>