Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Controllers sometimes repeat the same Phase twice even after hitting t.next() and writing to APIserver successfully #1026

Open
djwhatle opened this issue Mar 25, 2021 · 1 comment
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@djwhatle
Copy link
Contributor

djwhatle commented Mar 25, 2021

Describe the bug

cc @shawn-hurley @alaypatel07 @pranavgaikwad I thought you guys might be interested in this phenomena I noticed.

Background
I was running some tests with new phase timing code and noticed a few phases were regularly showing up double or triple.

image

# Duplicated phase executions accounted for (.02 + 5 + 1.4 + 2.7 + 1.1) = 10.22 seconds of duplicate work

PhaseName ElapsedSeconds PercentOfTotal                                            
Started 0.186913 0.0006852284627244249                                             
StartRefresh 0.334802 0.0012273938130416982                                        
WaitForRefresh 16.499677 0.06048829298208017                                       
CleanStaleAnnotations 2.186253 0.00801486671508489                                 
CleanStaleResticCRs 3.266709 0.011975849653250676                                  
CleanStaleVeleroCRs 3.273203 0.011999656845029377                                  
RestartVelero 0.394479 0.0014461714206452651                                       
CleanStaleStagePods 1.942676 0.007121906389879971                                  
WaitForStaleStagePodsTerminated 2.438181 0.00893844204776501                       
WaitForVeleroReady 1.055978 0.0038712458823667315                                  
EnsureCloudSecretPropagated 5.568926 0.020415843745518403                          
PreBackupHooks 0.590061 0.0021631806880400874                                      
# (2x) .02 seconds wasted
CreateDirectImageMigration 0.720796 0.0026424589783370584                          
CreateDirectImageMigration 0.742097 0.0027205490602708617                          
EnsureInitialBackup 3.294488 0.012077688270500528                                  
Started 1.644732 0.00602963507061397                                               
Prepare 2.090334 0.007663224887517712                                              
CreateDestinationNamespaces 2.937383 0.010768531014551474                          
ListImageStreams 2.932012 0.010748840773245128                                     
CreateDirectImageStreamMigrations 3.067405 0.011245195426231534                    
WaitingForDirectImageStreamMigrationsToComplete 1.541926 0.00565274530190421       
InitialBackupCreated 15.645106 0.05735541098554235                                 
WaitForResticReady 0.737611 0.002704103254554931                                   
CreateDirectVolumeMigration 0.888114 0.0032558516044579027                         
Started 0.959208 0.0035164842641922725                                             
Prepare 1.498225 0.005492536172258224                                              
WaitForDirectImageMigrationToComplete 0.500482 0.0018347814837985888               
CleanStaleRsyncResources 2.655886 0.00973655487286236                              
WaitForStaleRsyncResourcesTerminated 2.693322 0.009873796331351345                 
CreateDestinationNamespaces 2.624833 0.009622713676942432                          
DestinationNamespacesCreated 1.155189 0.004234956277124468                         
CreateDestinationPVCs 2.142726 0.007855295474470243                                
DestinationPVCsCreated 0.662397 0.002428366555687785                               
CreateRsyncRoute 3.094858 0.011345838918054862                                     
EnsureRsyncRouteAdmitted 1.529884 0.005608598981701081                             
CreateRsyncConfig 2.589998 0.00949500755966324                                     
CreateStunnelConfig 8.81166 0.03230380035551463                                    
CreatePVProgressCRs 1.383327 0.0050713167818995505                                 
# (2x), 5 seconds wasted
CreateRsyncTransferPods 5.188735 0.01902205254602098                               
CreateRsyncTransferPods 10.094827 0.03700792768121543                              
# (2x), 1.4 seconds wasted
WaitForRsyncTransferPodsRunning 6.61158 0.024238243458612047                       
WaitForRsyncTransferPodsRunning 8.044323 0.029490720725410938                      
CreateStunnelClientPods 3.745158 0.013729857521949152                              
WaitForStunnelClientPodsRunning 3.144289 0.011527054393387937                      
# (2x), 2.7 seconds wasted
CreateRsyncClientPods 3.026426 0.011094965227294145                                
CreateRsyncClientPods 5.718631 0.020964666604346628                                
# (3x) 1.1 seconds wasted
WaitForRsyncClientPodsCompleted 3.802742 0.013940962131032112                      
WaitForRsyncClientPodsCompleted 4.330965 0.015877442922981756                      
WaitForRsyncClientPodsCompleted 4.855572 0.017800667354372152                      
DeleteRsyncResources 3.444769 0.012628622761984211                                 
WaitForRsyncResourcesTerminated 8.764045 0.03212924238869251                       
WaitForDirectVolumeMigrationToComplete 58.785843 0.2155106002731185                
EnsureInitialBackupReplicated 1.824562 0.006688897050528333                        
WaitForRsyncResourcesTerminated 10.87754 0.03987737617192727                       
PostBackupHooks 0.960631 0.0035217010233393456                                     
PreRestoreHooks 1.084106 0.003974363943708171                                      
EnsureFinalRestore 3.569899 0.013087352960208557                                   
FinalRestoreCreated 15.109297 0.05539111969823803                                  
UnQuiesceDestApplications 1.214452 0.004452216148756926                            
PostRestoreHooks 0.33661 0.0012340219933213244                                     
DeleteRegistries 2.144771 0.007862792503602894

So I added some extra logs to see the phase progression from start to end of reconcile in one place this was happening.

# [RECONCILE 1] 
# The phase "CreateRsyncClientPods" is started. This reconcile will go as planned 
# and we'll successfully write the DVM back to the APIserver at the end of the reconcile.

{"level":"info","ts":1616708632169,"logger":"directvolume|gd7rs","msg":"START Reconciling DVM",
"preReconcilePhase":"CreateRsyncClientPods","dvmStatus":{"conditions":[{"type":"Running","status":"True","reason":"CreateRsyncClientPods","category":"Advisory","message":"Step: 19/23","lastTransitionTime":"2021-03-25T21:43:52Z"},{"type":"Ready","status":"True","category":"Required","message":"Direct migration is ready","lastTransitionTime":"2021-03-25T21:43:10Z"}],"observedDigest":"8c3663504c0ba31ffed2cb1a75c80afb1072a7b7ecb10186a9d34dc72c05e640","startTimestamp":"2021-03-25T21:43:10Z","phaseDescription":"Creating Rsync client pods","phase":"CreateRsyncClientPods","itinerary":"VolumeMigration"}}

{"level":"info","ts":1616708632680,"logger":"directvolume|gd7rs","msg":"[RUN] (Step 19/23) Creating Rsync client pods","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708633268,"logger":"directvolume|gd7rs","msg":"Getting image for Rsync client Pods that will be created on source MigCluster","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708633855,"logger":"directvolume|gd7rs","msg":"Using image [quay.io/konveyor/rsync-transfer:latest] for Rsync client Pods","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708633855,"logger":"directvolume|gd7rs","msg":"Getting [NS => PVCWithSecurityContext] mappings for PVCs to be migrated","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708634348,"logger":"directvolume|gd7rs","msg":"Getting Rsync password for Rsync client Pods","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708634348,"logger":"directvolume|gd7rs","msg":"Getting Rsync Password from Secret [/] on host cluster","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708634348,"logger":"directvolume|gd7rs","msg":"Getting [PVC => NodeName] mapping for PVCs to be migrated","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708634956,"logger":"directvolume|gd7rs","msg":"Getting limits and requests for Rsync client Pods","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708634970,"logger":"directvolume|gd7rs","msg":"Rsync client Pods will be created with privileged=[false]","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708634988,"logger":"directvolume|gd7rs","msg":"Container [rsync-client] will use Rsync command [rsync --bwlimit=0 --info=COPY2,DEL2,REMOVE2,SKIP2,FLIST2,PROGRESS2,STATS2 --human-readable --port 2222 --log-file /dev/stdout /mnt/rocket-chat/rocketchat-data-claim/ rsync://root@172.30.245.182/rocketchat-data-claim]","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708634988,"logger":"directvolume|gd7rs","msg":"Creating Rsync client Pod [rocket-chat/directvolumemigration-rsync-transfer-rocketchat-data-claim] on source cluster.","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708635026,"logger":"directvolume|gd7rs","msg":"Rsync client pod created","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods","name":"directvolumemigration-rsync-transfer-rocketchat-data-claim","namespace":"rocket-chat"}

# We arrive at the end of the reconcile, the phase is complete and 
# a DVM with next phase=WaitForRsyncClientPodsCompleted is successfully posted back to the APIserver

{"level":"info","ts":1616708635026,"logger":"directvolume|gd7rs","msg":"Phase completed","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods","phaseElapsed":3.026426}
{"level":"info","ts":1616708635050,"logger":"directvolume|gd7rs","msg":"EXIT: SUCCESS - DVM got to end of reconcile","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs",
"postReconcilePhase":"WaitForRsyncClientPodsCompleted","dvmStatus":{"conditions":[{"type":"Running","status":"True","reason":"WaitForRsyncClientPodsCompleted","category":"Advisory","message":"Step: 20/23","lastTransitionTime":"2021-03-25T21:43:55Z"},{"type":"Ready","status":"True","category":"Required","message":"Direct migration is ready","lastTransitionTime":"2021-03-25T21:43:10Z"}],"observedDigest":"8c3663504c0ba31ffed2cb1a75c80afb1072a7b7ecb10186a9d34dc72c05e640","startTimestamp":"2021-03-25T21:43:10Z","phaseDescription":"Waiting for the Rsync client pods to be completed","phase":"WaitForRsyncClientPodsCompleted","itinerary":"VolumeMigration"}}

-----------------------

# [RECONCILE 2]
# Here's where things get weird. The next reconcile starts out with 
# phase=CreateRsyncClientPods, so we seem to be behind by one generation!

{"level":"info","ts":1616708635050,"logger":"directvolume|x9hjl","msg":"START Reconciling DVM",
"preReconcilePhase":"CreateRsyncClientPods","dvmStatus":{"conditions":[{"type":"Running","status":"True","reason":"CreateRsyncClientPods","category":"Advisory","message":"Step: 19/23","lastTransitionTime":"2021-03-25T21:43:52Z"},{"type":"Ready","status":"True","category":"Required","message":"Direct migration is ready","lastTransitionTime":"2021-03-25T21:43:10Z"}],"observedDigest":"8c3663504c0ba31ffed2cb1a75c80afb1072a7b7ecb10186a9d34dc72c05e640","startTimestamp":"2021-03-25T21:43:10Z","phaseDescription":"Creating Rsync client pods","phase":"CreateRsyncClientPods","itinerary":"VolumeMigration"}}
{"level":"info","ts":1616708635567,"logger":"directvolume|x9hjl","msg":"[RUN] (Step 19/23) Creating Rsync client pods","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708636050,"logger":"directvolume|x9hjl","msg":"Getting image for Rsync client Pods that will be created on source MigCluster","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708636577,"logger":"directvolume|x9hjl","msg":"Using image [quay.io/konveyor/rsync-transfer:latest] for Rsync client Pods","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708636577,"logger":"directvolume|x9hjl","msg":"Getting [NS => PVCWithSecurityContext] mappings for PVCs to be migrated","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708637157,"logger":"directvolume|x9hjl","msg":"Getting Rsync password for Rsync client Pods","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708637157,"logger":"directvolume|x9hjl","msg":"Getting Rsync Password from Secret [/] on host cluster","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708637157,"logger":"directvolume|x9hjl","msg":"Getting [PVC => NodeName] mapping for PVCs to be migrated","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708637679,"logger":"directvolume|x9hjl","msg":"Getting limits and requests for Rsync client Pods","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708637690,"logger":"directvolume|x9hjl","msg":"Rsync client Pods will be created with privileged=[false]","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708637700,"logger":"directvolume|x9hjl","msg":"Container [rsync-client] will use Rsync command [rsync --bwlimit=0 --info=COPY2,DEL2,REMOVE2,SKIP2,FLIST2,PROGRESS2,STATS2 --human-readable --port 2222 --log-file /dev/stdout /mnt/rocket-chat/rocketchat-data-claim/ rsync://root@172.30.245.182/rocketchat-data-claim]","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708637700,"logger":"directvolume|x9hjl","msg":"Creating Rsync client Pod [rocket-chat/directvolumemigration-rsync-transfer-rocketchat-data-claim] on source cluster.","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods"}
{"level":"info","ts":1616708637719,"logger":"directvolume|x9hjl","msg":"Rsync client pod already exists on source","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods","namespace":"rocket-chat"}
{"level":"info","ts":1616708637719,"logger":"directvolume|x9hjl","msg":"Rsync client pod created","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods","name":"directvolumemigration-rsync-transfer-rocketchat-data-claim","namespace":"rocket-chat"}

# We arrive at the end of the reconcile, have re-done everything in the phase needlessly
#, but unfortunately we hit a failure updating our DVM because the DVM we started the 
# reconcile with was a generation behind the APIserver (laggy cache?)

{"level":"info","ts":1616708637719,"logger":"directvolume|x9hjl","msg":"Phase completed","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"CreateRsyncClientPods","phaseElapsed":5.718631}
{"level":"info","ts":1616708637741,"logger":"directvolume|x9hjl","msg":"EXIT: DVM is failed to update","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs",
"postReconcilePhase":"WaitForRsyncClientPodsCompleted","dvmStatus":{"conditions":[{"type":"Running","status":"True","reason":"WaitForRsyncClientPodsCompleted","category":"Advisory","message":"Step: 20/23","lastTransitionTime":"2021-03-25T21:43:57Z"},{"type":"Ready","status":"True","category":"Required","message":"Direct migration is ready","lastTransitionTime":"2021-03-25T21:43:10Z"}],"observedDigest":"8c3663504c0ba31ffed2cb1a75c80afb1072a7b7ecb10186a9d34dc72c05e640","startTimestamp":"2021-03-25T21:43:10Z","phaseDescription":"Waiting for the Rsync client pods to be completed","phase":"WaitForRsyncClientPodsCompleted","itinerary":"VolumeMigration"}}

-----------------------

# [RECONCILE 3]
# Here the cached client finally gives us the DVM that was
# pushed back to the APIserver at the end of Reconcile #1, and we're on to the next phase.

{"level":"info","ts":1616708637741,"logger":"directvolume|p2w7j","msg":"START Reconciling DVM",
"preReconcilePhase":"WaitForRsyncClientPodsCompleted","dvmStatus":{"conditions":[{"type":"Running","status":"True","reason":"WaitForRsyncClientPodsCompleted","category":"Advisory","message":"Step: 20/23","lastTransitionTime":"2021-03-25T21:43:55Z"},{"type":"Ready","status":"True","category":"Required","message":"Direct migration is ready","lastTransitionTime":"2021-03-25T21:43:10Z"}],"observedDigest":"8c3663504c0ba31ffed2cb1a75c80afb1072a7b7ecb10186a9d34dc72c05e640","startTimestamp":"2021-03-25T21:43:10Z","phaseDescription":"Waiting for the Rsync client pods to be completed","phase":"WaitForRsyncClientPodsCompleted","itinerary":"VolumeMigration"}}
{"level":"info","ts":1616708638232,"logger":"directvolume|p2w7j","msg":"[RUN] (Step 20/23) Waiting for the Rsync client pods to be completed","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs","Phase":"WaitForRsyncClientPodsCompleted"}
{"level":"info","ts":1616708638262,"logger":"directvolume|p2w7j","msg":"EXIT: SUCCESS - DVM got to end of reconcile","DVM":"f6c20e60-8db2-11eb-aa03-d79309d81160-lkdvs",
"postReconcilePhase":"WaitForRsyncClientPodsCompleted","dvmStatus":{"conditions":[{"type":"Running","status":"True","reason":"WaitForRsyncClientPodsCompleted","category":"Advisory","message":"Step: 20/23","lastTransitionTime":"2021-03-25T21:43:55Z"},{"type":"Ready","status":"True","category":"Required","message":"Direct migration is ready","lastTransitionTime":"2021-03-25T21:43:10Z"}],"observedDigest":"8c3663504c0ba31ffed2cb1a75c80afb1072a7b7ecb10186a9d34dc72c05e640","startTimestamp":"2021-03-25T21:43:10Z","phaseDescription":"Waiting for the Rsync client pods to be completed","phase":"WaitForRsyncClientPodsCompleted","itinerary":"VolumeMigration"}}

I am wondering if this is related to FastReQ interval used in DVM possibly being shorter than in MigMigration controller. We haven't merged #1013 yet but I actually think it might fix it because it'll switch us from using immediate requeues to FastReQs.

@djwhatle djwhatle added the kind/bug Categorizes issue or PR as related to a bug. label Mar 25, 2021
@djwhatle djwhatle changed the title DVM and possibly other controllers are repeating some phases twice Controllers sometimes repeat the same Phase twice even after hitting t.next() and writing to APIserver successfully Mar 25, 2021
@djwhatle
Copy link
Contributor Author

djwhatle commented Mar 26, 2021

@pranavgaikwad @alaypatel07 @shawn-hurley
Ok, I've provably demonstrated that the cached client is giving us resources a generation behind the APIserver.
This makes sense because the cached client has updates pushed to it.

Generation goes from 8 -> 9 -> 8

# [Reconcile 1] Update is written and generation is bumped
{"level":"info","ts":1616770743729,"logger":"directvolume|92xjm","msg":">>> START >>>","DVM":"a6f41860-8e43-11eb-b81c-f9ed484af5e5-pbxqp",
"generation":8,"prePhase":"CreateDestinationPVCs"}

{"level":"info","ts":1616770745852,"logger":"directvolume|92xjm","msg":"<<< UPDATE [SUCCESS] <<<","DVM":"a6f41860-8e43-11eb-b81c-f9ed484af5e5-pbxqp",
"generation":9,"postPhase":"DestinationPVCsCreated"}

# [Reconcile 2] Cached client hasn't been pushed the update to DVM that was made at end of last reconcile
{"level":"info","ts":1616770745852,"logger":"directvolume|29grx","msg":">>> START >>>","DVM":"a6f41860-8e43-11eb-b81c-f9ed484af5e5-pbxqp",
"generation":8,"prePhase":"CreateDestinationPVCs"}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant