Operator 1.7.0 -> 1.8.0: Reconciliation loop on the DaemonSet #1417

Open
diranged opened this issue Sep 13, 2024 · 2 comments

Comments

@diranged

Describe what happened:
We're testing the datadog-operator Helm chart upgrade from 1.8.6 to 2.0.0, and we're seeing a new behavior: the operator pods are stuck in a reconciliation loop on the Datadog Agent DaemonSet:

{"level":"INFO","ts":"2024-09-13T16:31:57.299Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":{"name":"datadog","namespace":"datadog-operator"}}
{"level":"INFO","ts":"2024-09-13T16:31:57.301Z","logger":"controllers.DatadogAgent","msg":"Updating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:57.668Z","logger":"controllers.DatadogAgent","msg":"Creating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:57.893Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":{"name":"datadog","namespace":"datadog-operator"}}
{"level":"INFO","ts":"2024-09-13T16:31:57.909Z","logger":"controllers.DatadogAgent","msg":"Updating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:58.280Z","logger":"controllers.DatadogAgent","msg":"Creating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:58.450Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":{"name":"datadog","namespace":"datadog-operator"}}
{"level":"INFO","ts":"2024-09-13T16:31:58.469Z","logger":"controllers.DatadogAgent","msg":"Updating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:58.821Z","logger":"controllers.DatadogAgent","msg":"Creating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:58.986Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":{"name":"datadog","namespace":"datadog-operator"}}
{"level":"INFO","ts":"2024-09-13T16:31:58.988Z","logger":"controllers.DatadogAgent","msg":"Updating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:59.344Z","logger":"controllers.DatadogAgent","msg":"Creating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:59.510Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":{"name":"datadog","namespace":"datadog-operator"}}
{"level":"INFO","ts":"2024-09-13T16:31:59.512Z","logger":"controllers.DatadogAgent","msg":"Updating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:31:59.869Z","logger":"controllers.DatadogAgent","msg":"Creating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:32:00.056Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":{"name":"datadog","namespace":"datadog-operator"}}
{"level":"INFO","ts":"2024-09-13T16:32:00.059Z","logger":"controllers.DatadogAgent","msg":"Updating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:32:00.419Z","logger":"controllers.DatadogAgent","msg":"Creating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:32:00.579Z","logger":"controllers.DatadogAgent","msg":"Reconciling DatadogAgent","datadogagent":{"name":"datadog","namespace":"datadog-operator"}}
{"level":"INFO","ts":"2024-09-13T16:32:00.582Z","logger":"controllers.DatadogAgent","msg":"Updating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
{"level":"INFO","ts":"2024-09-13T16:32:00.940Z","logger":"controllers.DatadogAgent","msg":"Creating Daemonset","datadogagent":{"name":"datadog","namespace":"datadog-operator"},"component":"nodeAgent","daemonset.Namespace":"datadog-operator","daemonset.Name":"datadog-agent"}
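
To put a number on the loop, here is a minimal sketch that counts each log message and measures the time between consecutive "Reconciling DatadogAgent" entries. It assumes the operator logs are JSON lines with the "ts" and "msg" fields shown above; the function name `summarize` is made up for illustration.

```python
import json
from collections import Counter
from datetime import datetime


def summarize(lines):
    """Count log messages and return the mean gap (seconds) between reconcile starts."""
    counts = Counter()
    starts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        msg = entry.get("msg", "")
        counts[msg] += 1
        if msg == "Reconciling DatadogAgent" and "ts" in entry:
            # Timestamps look like 2024-09-13T16:31:57.299Z
            starts.append(datetime.strptime(entry["ts"], "%Y-%m-%dT%H:%M:%S.%fZ"))
    gaps = [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else None
    return counts, mean_gap
```

In the excerpt above, the reconcile starts are roughly 0.5 to 0.6 seconds apart, and each one is followed by one "Updating Daemonset" and one "Creating Daemonset" message. Running `summarize(open("operator.log"))` on a saved log (the file name is an assumption) should report the same pattern over a longer window.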

I tried monitoring the datadog-agent DaemonSet using kubectl get daemonset datadog-agent -o json -w, and I see zero changes being made to the resource, so this appears to be an internal reconciliation loop that isn't actually making changes through the API.
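
For anyone repeating this check, a minimal sketch of how the watch output could be compared event by event. It assumes the output of the kubectl command above was saved to a file and that it consists of JSON documents printed one after another; the helper names `iter_json_documents` and `changed_paths` are hypothetical.

```python
import json


def iter_json_documents(text):
    """Yield each JSON document from a string of concatenated documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # Skip whitespace between documents.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def changed_paths(old, new, prefix=""):
    """Return dotted paths whose values differ between two JSON objects."""
    if isinstance(old, dict) and isinstance(new, dict):
        paths = []
        for key in sorted(set(old) | set(new)):
            child = f"{prefix}.{key}" if prefix else key
            if key not in old or key not in new:
                paths.append(child)  # key added or removed
            else:
                paths.extend(changed_paths(old[key], new[key], child))
        return paths
    # Lists and scalars are compared as a whole.
    return [] if old == new else [prefix]
```

Comparing consecutive documents with `changed_paths(previous, current)` shows exactly which fields moved. If the watch emits nothing at all, or metadata.resourceVersion never changes, that supports the reading that no write is reaching the DaemonSet.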

When we look at the Audit logs, we see a dramatic increase in the number of requests/second being made (though its absolute value isn't insane, the change is significant):

image

@diranged
Author

So ... reverting to the 1.7.0 operator and the 1.8.6 chart didn't resolve the issue... after over an hour of debugging, including fully deleting the DatadogAgent resource, it still didn't resolve... Then magically it resolved on its own:

image

I can't fathom what happened. I spent time digging through the code at https://github.com/DataDog/datadog-operator/blob/v1.7.0/controllers/datadogagent/controller_reconcile_v2_common.go#L155-L207, and my only guess is that something odd is happening when the needsUpdate variable is set at https://github.com/DataDog/datadog-operator/blob/v1.7.0/controllers/datadogagent/controller_reconcile_v2_common.go#L194. Unfortunately, the operator logs nothing to indicate what difference it was seeing.
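
To illustrate the kind of logging that would have helped here, below is a rough sketch of an update check that reports why it decided an update is needed. This is not the operator's code (which is Go) and it is not confirmed to match how needsUpdate is computed there. It assumes a generic pattern where a hash of the desired spec is compared against a hash stored in an annotation; the annotation key and function names are invented.

```python
import hashlib
import json
import logging

logger = logging.getLogger("example.reconcile")

# Hypothetical annotation key, for illustration only.
HASH_ANNOTATION = "example.com/spec-hash"


def spec_hash(spec):
    """Hash a spec deterministically by serializing it with sorted keys."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_update(desired_spec, current_annotations):
    """Return True when the stored hash differs from the desired one, and log why."""
    desired = spec_hash(desired_spec)
    stored = current_annotations.get(HASH_ANNOTATION)
    if stored == desired:
        return False
    # Logging both values makes a flapping comparison visible in the logs.
    logger.info("update needed: stored hash %r != desired hash %r", stored, desired)
    return True
```

With a log line like this on every reconcile, it would be possible to tell whether the desired hash itself keeps changing or whether the stored value is never being persisted.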

@diranged
Author

OK, so I checked our other clusters (which didn't have this upgrade done), and apparently this pattern just happens periodically throughout the day!

image
