
Move webhook registration behind feature gate flag #5099

Open · wants to merge 1 commit into base: main
Conversation

@bryan-cox (Contributor) commented Aug 28, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
Move webhook registration behind feature gate flags similar to controller registration.

Without this PR, from a self-managed / externally-managed infrastructure perspective, if you exclude the CRDs behind the MachinePool and ASOAPI feature flags, you'll get an error because the webhooks for them are still registered.

```
E0828 10:05:27.972237       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
```
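The gating this PR describes can be sketched roughly as follows. This is a stdlib-only illustration, not the actual CAPZ code: the gate map stands in for the feature package's `feature.Gates.Enabled` lookups, and the kind names merely illustrate which webhooks would be skipped when a gate is off.

```go
package main

import "fmt"

// featureGates stands in for CAPZ's feature-gate lookup
// (e.g. feature.Gates.Enabled); the entries here are illustrative.
var featureGates = map[string]bool{
	"MachinePool": false,
	"ASOAPI":      false,
}

// registeredWebhooks returns the kinds whose webhooks would be set up,
// mirroring how controller registration is already gated.
func registeredWebhooks(gates map[string]bool) []string {
	// Core webhooks always register.
	kinds := []string{"AzureCluster", "AzureMachine"}
	// Experimental webhooks register only when their gate is enabled, so a
	// deployment that excludes those CRDs never starts a watch for them.
	if gates["MachinePool"] {
		kinds = append(kinds, "AzureMachinePool")
	}
	if gates["ASOAPI"] {
		kinds = append(kinds, "AzureASOManagedCluster")
	}
	return kinds
}

func main() {
	fmt.Println(registeredWebhooks(featureGates))
}
```

With both gates disabled, only the core webhooks register, which is why the missing experimental CRDs no longer cause a failed restmapping at startup.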

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Moves webhook registration behind feature gate flags like controller registration already does.

Move webhook registration behind feature gate flags similar to
controller registration.

Signed-off-by: Bryan Cox <brcox@redhat.com>
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 28, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jont828 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 28, 2024
@k8s-ci-robot (Contributor)

Hi @bryan-cox. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 28, 2024
```go
setupLog.Error(err, "unable to create webhook", "webhook", "AzureManagedMachinePool")
os.Exit(1)
}
// NOTE: AzureManagedCluster is behind AKS feature gate flag; the webhook
```
@bryan-cox (Contributor, Author) commented on the diff:

Is this comment still valid, or can it be removed? It looks like it's from a few years back.

@muraee commented Aug 28, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 28, 2024
@nojnhuh (Contributor) commented Aug 29, 2024

We use the webhooks to forbid creating resources disabled by feature flags. That's also what CAPI does, so I think we should align with that: https://github.com/kubernetes-sigs/cluster-api/blob/be86b82e7e30a844bca141ff8bcdc450b0499549/exp/internal/webhooks/machinepool.go#L168. Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?

This seems fine as long as users do some extra work to ensure those CRDs are not installed at all when the feature flags are disabled, but that would force users to adapt to keep the existing behavior and clusterctl doesn't make that easy.

Are you seeing any adverse behavior besides the error message?
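The alternative nojnhuh describes keeps the webhook registered and has it reject creation while the gate is off. A minimal stdlib-only sketch of that admission check; the function name, gate parameter, and error wording are assumptions for illustration, not the real cluster-api webhook code:

```go
package main

import (
	"errors"
	"fmt"
)

// errFeatureDisabled mimics the Forbidden error a validating webhook
// would return; the wording is illustrative.
var errFeatureDisabled = errors.New("feature gate is disabled")

// validateCreate sketches the admission-time check: instead of skipping
// registration, the webhook stays installed and forbids creating a gated
// resource while its feature gate is off.
func validateCreate(kind string, gateEnabled bool) error {
	if !gateEnabled {
		return fmt.Errorf("can not create %s: %w", kind, errFeatureDisabled)
	}
	return nil
}

func main() {
	if err := validateCreate("AzureMachinePool", false); err != nil {
		fmt.Println(err)
	}
}
```

The trade-off discussed in this thread: this approach gives users an explicit error on create, but it requires the experimental CRDs to be installed so the webhook's watches can start, which is exactly what fails in the self-managed setup above.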

@bryan-cox (Contributor, Author)

> We use the webhooks to forbid creating resources disabled by feature flags. That's also what CAPI does, so I think we should align with that: https://github.com/kubernetes-sigs/cluster-api/blob/be86b82e7e30a844bca141ff8bcdc450b0499549/exp/internal/webhooks/machinepool.go#L168. Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?
>
> This seems fine as long as users do some extra work to ensure those CRDs are not installed at all when the feature flags are disabled, but that would force users to adapt to keep the existing behavior and clusterctl doesn't make that easy.
>
> Are you seeing any adverse behavior besides the error message?

We aren't using AzureMachinePool. Yeah, we are seeing more than just the log message; the CAPZ pod restarts constantly. Here are some additional logs before the pod restarts:

```
E0829 15:50:31.089094       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
I0829 15:50:38.588560       1 azuremachine_controller.go:243] "Reconciling AzureMachine" logger="controllers.AzureMachineReconciler.reconcileNormal" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine" AzureMachine="clusters-generic-hc/generic-hc-9npwz-8z465" namespace="clusters-generic-hc" name="generic-hc-9npwz-8z465" reconcileID="743788a0-e979-4c1e-9ca4-0c854d575fc0" x-ms-correlation-request-id="0951596f-73b2-4a57-801b-40faca63ef50"
I0829 15:50:38.809896       1 azuremachine_controller.go:243] "Reconciling AzureMachine" logger="controllers.AzureMachineReconciler.reconcileNormal" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine" AzureMachine="clusters-generic-hc/generic-hc-9npwz-7p4fb" namespace="clusters-generic-hc" name="generic-hc-9npwz-7p4fb" reconcileID="f9e21048-ea5d-44bf-9c2d-195d7ad86e74" x-ms-correlation-request-id="1e6492aa-fe1d-413c-9cac-292107e030f7"
E0829 15:50:41.091628       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
E0829 15:50:41.235638       1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for ASOSecret caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.AzureManagedControlPlane" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.235695       1 internal.go:516] "Stopping and waiting for non leader election runnables"
I0829 15:50:41.235829       1 internal.go:520] "Stopping and waiting for leader election runnables"
I0829 15:50:41.235949       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.236026       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate"
I0829 15:50:41.236232       1 controller.go:242] "All workers finished" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate"
I0829 15:50:41.236158       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.236386       1 controller.go:242] "All workers finished" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.236177       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.236823       1 controller.go:242] "All workers finished" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.237036       1 controller.go:242] "All workers finished" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.237121       1 internal.go:528] "Stopping and waiting for caches"
I0829 15:50:41.237583       1 internal.go:532] "Stopping and waiting for webhooks"
I0829 15:50:41.237981       1 server.go:249] "Shutting down webhook server with timeout of 1 minute" logger="controller-runtime.webhook"
I0829 15:50:41.238191       1 internal.go:535] "Stopping and waiting for HTTP servers"
I0829 15:50:41.238323       1 server.go:231] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0829 15:50:41.238458       1 server.go:43] "shutting down server" kind="health probe" addr="[::]:9440"
I0829 15:50:41.238568       1 internal.go:539] "Wait completed, proceeding to shutdown the manager"
E0829 15:50:41.238677       1 main.go:353] "problem running manager" err="failed to wait for ASOSecret caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.AzureManagedControlPlane" logger="setup"
```

We have the MachinePool feature turned off in our pod deployment:

```yaml
      containers:
      - args:
        - --namespace=$(MY_NAMESPACE)
        - --leader-elect=true
        - --feature-gates=MachinePool=false
...
        name: manager
```

@bryan-cox
Copy link
Contributor Author

FWIW the machines do get provisioned and join our cluster. The CAPZ pod just consistently restarts.
