Test flaking with "Timed out waiting for 1 nodes to be created for MachineDeployment" #8786
Comments
/assign
/triage accepted Thanks for opening this @adilGhaffarDev!
@ykakarap - you might be able to help out on this one. Could it possibly be related to changes in the MachineSet controller? It appears to be recently introduced.
Please add a link to the concrete job instance that failed under "Which jobs are flaking?".
I will take a look.
Initial analysis for the failure:
So I would say the problem is not actually the preflight checks but something different; the preflight checks merely surfaced it. Of course the underlying problem still needs to be fixed. Line of code that fails: here. @sbueringer Any recommendations on how to address the rate-limiting issue?
I'm not sure if the problem is the rate limiter, or if the rate limiter just reports the error it hit a few seconds before, which seems to be: "error from a previous attempt: EOF". What is the requeue behavior in that error case? EDIT: I'll take a closer look now.
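For context on the rate-limiting question: by default, controller-runtime requeues a failing item with a per-item exponential backoff, so the first retries come quickly and only repeatedly failing items get delayed significantly. A minimal sketch of that default behavior (this is the generic client-go/controller-runtime default, not CAPI-specific configuration):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The default controller rate limiter combines an overall token
	// bucket with a per-item exponential backoff that starts at 5ms
	// and caps at 1000s.
	rl := workqueue.DefaultControllerRateLimiter()

	item := "default/my-cluster"
	for i := 0; i < 6; i++ {
		// Each consecutive failure of the same item doubles the delay:
		// 5ms, 10ms, 20ms, 40ms, ...
		fmt.Printf("retry %d after %v\n", i+1, rl.When(item))
	}
	// Forget resets the backoff once the item reconciles successfully.
	rl.Forget(item)
}
```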
In my opinion the issue is the following:
There are two possible root causes:
I'll open a PR to also dump kube-system Pods at the end of tests. This should help us figure out if it's 1. or 2. Independent of that, the ClusterCacheTracker log is not ideal, but I think it's not our issue here. Some data: the apiserver Pod starts up around 05:14:19 (link), so it could be that the Pod gets ready after KCP did its last reconcile before deletion at 05:14:27.
The PR is merged now. Let's take a closer look at the next occurrence.
PR to improve the CCT error: #8801
We had a new occurrence now. Following results:
It's unclear to me at this point why KCP didn't see that the apiserver Pod was ready. I think we can further triage this by collecting more data, e.g. by adding logs to KCP to surface the Pod object it is seeing. A potential solution would be to requeue more often in KCP if KCP is not entirely healthy, but it would be good to first figure out what exactly is going on. @adilGhaffarDev If you have time to work on this, feel free to go ahead. I unfortunately won't have time to work on this in the next few weeks.
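A hedged sketch of the kind of triage logging meant here (the helper name and field choices are illustrative, not the actual KCP code):

```go
package internal

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
	ctrl "sigs.k8s.io/controller-runtime"
)

// logObservedPod is an illustrative helper: it dumps the state of a
// control plane Pod as the KCP health check sees it, so a flaky run
// records whether e.g. the apiserver Pod was Ready at reconcile time.
func logObservedPod(ctx context.Context, pod *corev1.Pod) {
	log := ctrl.LoggerFrom(ctx)
	log.Info("Observed control plane Pod",
		"pod", klog.KObj(pod),
		"phase", pod.Status.Phase,
		"conditions", pod.Status.Conditions)
}
```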
Yes, I can take a look.
@adilGhaffarDev Did you have time to investigate the issue?
Sorry, I didn't get time to look into it, and I won't be able to work on it soon as I am on vacation right now. If someone has time, please check it.
/help
@killianmuldoon: Guidelines: Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met. If this request no longer meets these requirements, the label can be removed. In response to this:
Taking another look at the above:
From reading the objects again and some trial and error:
The only sources for reconciliations for the KCP object are:
For this edge case I could see some variants to get to a solution:
We also have a similar case where we already do a requeue: controlplane/kubeadm/internal/controllers/controller.go, lines 225 to 232 at ddbf9cf.
I'll go forward and create a PR which would requeue in this new case too.
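Roughly, that existing requeue pattern looks like the following (a paraphrase for illustration, not an exact quote of the pinned lines; the interval and guard fields are best-effort):

```go
// Paraphrased sketch of the existing requeue in the KCP Reconcile
// defer: if KCP is not yet Ready, requeue after a short interval
// instead of waiting for the next full resync (10 minutes by default).
// Only requeue if we aren't erroring, aren't already requeuing, and
// the object isn't being deleted.
if reterr == nil && !res.Requeue && res.RequeueAfter <= 0 && kcp.ObjectMeta.DeletionTimestamp.IsZero() {
	if !kcp.Status.Ready {
		res = ctrl.Result{RequeueAfter: 20 * time.Second}
	}
}
```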
The KCP controller also watches Machines AFAIU, through the watch it sets up on its owned Machines.
Not sure replicating this would be non-controversial; it sounds like it shouldn't be needed now that there is a Machine watch.
Ah right. The point is not a Machine watch, but a watch on the Pods inside the workload cluster instead. (KCP itself updates the conditions for the Pods on the Machine.)
I don't think this is a good idea - we'd end up caching too many pods. I'm fine with adding something like:
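A minimal sketch of what that could look like, assuming the requeue keys off the control plane machines' conditions (the helper name, log line, and 20s interval are illustrative, not actual CAPI code):

```go
// Sketch only: requeue while the control plane is not yet fully
// healthy, so KCP's status converges (and the MachineSet preflight
// checks unblock) without waiting for a full resync.
// allControlPlaneConditionsTrue is a hypothetical helper checking the
// conditions KCP sets for apiserver/etcd Pod health on its Machines.
if !allControlPlaneConditionsTrue(controlPlane) {
	log.Info("Control plane is not fully healthy yet, requeuing")
	return ctrl.Result{RequeueAfter: 20 * time.Second}, nil
}
```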
But we should comment clearly that it's to avoid a watch on the Pods, and maybe reassess the case in #8786 (comment) to see if it's actually needed anymore and update the comment to reflect the current state. Thanks for taking a look at this one!
I totally agree 👍. Created #9032 as a proposal to solve this case.
I think a requeue is probably the less risky option. We could have a metadata-only watch for Pods, so we wouldn't have a memory problem, but I'm more than a bit concerned about the amount of Pod events we would get (even if we filter down to only the most important Pods) and the corresponding reconciles.
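For reference, a metadata-only watch in controller-runtime uses PartialObjectMetadata, so the informer caches only object metadata rather than full Pods; it bounds memory but, as noted above, does nothing about event volume. A rough sketch (the reconciler type and mapper are placeholders, and a real implementation would have to watch Pods in the workload cluster, e.g. through the ClusterCacheTracker, rather than the management cluster):

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
)

// Reconciler is a stand-in for the real KCP reconciler type.
type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// SetupWithManager wires a metadata-only Pod watch: because the watched
// object is a PartialObjectMetadata, the informer caches only
// ObjectMeta, not full Pod specs and statuses.
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	pods := &metav1.PartialObjectMetadata{}
	pods.SetGroupVersionKind(corev1.SchemeGroupVersion.WithKind("Pod"))

	return ctrl.NewControllerManagedBy(mgr).
		For(&controlplanev1.KubeadmControlPlane{}).
		// podToKCP is a hypothetical mapper from a Pod event to the
		// KubeadmControlPlane that should be reconciled.
		Watches(pods, handler.EnqueueRequestsFromMapFunc(r.podToKCP)).
		Complete(r)
}

func (r *Reconciler) podToKCP(ctx context.Context, o client.Object) []reconcile.Request {
	// Map e.g. kube-system control plane Pods back to the owning
	// KubeadmControlPlane here; elided in this sketch.
	return nil
}
```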
/reopen to wait for confirmation via CI?
@sbueringer: Reopened this issue. In response to this:
More flexible triage link to confirm later on: |
This is fixed :-) /close
@chrischdi: Closing this issue. In response to this:
Awesome work!
Nice work on a tricky issue. I especially like the k8s-triage links :D !!
Which jobs are flaking?
e.g. https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1665614080297144320
Which tests are flaking?
periodic-cluster-api-e2e-main
periodic-cluster-api-e2e-ipv6-main
periodic-cluster-api-e2e-dualstack-ipv6-main
All failing at "Should create a workload cluster".
Since when has it been flaking?
01-06-2023
https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#295d4f44852a6339ba54
Testgrid link
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main
Reason for failure (if possible)
To be analyzed.
Anything else we need to know?
No response
Label(s) to be applied
/kind flake