
Move webhook registration behind feature gate flag #5099

Open · wants to merge 1 commit into base: main
Conversation

@bryan-cox (Contributor) commented Aug 28, 2024

What type of PR is this?
/kind bug

What this PR does / why we need it:
Move webhook registration behind feature gate flags similar to controller registration.

Without this PR, from a self-managed / externally-managed infrastructure perspective, if you exclude the CRDs behind the MachinePool and ASOAPI feature flags, you'll get an error because the webhooks for them are still registered.

```
E0828 10:05:27.972237       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
```
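The gating this PR describes can be sketched roughly as follows. This is a stdlib-only illustration, not the actual CAPZ code: the gate map stands in for the feature package's `feature.Gates.Enabled` lookups, and the kind names merely illustrate which webhooks would be skipped when a gate is off.

```go
package main

import "fmt"

// featureGates stands in for CAPZ's feature-gate lookup
// (e.g. feature.Gates.Enabled); the entries here are illustrative.
var featureGates = map[string]bool{
	"MachinePool": false,
	"ASOAPI":      false,
}

// registeredWebhooks returns the kinds whose webhooks would be set up,
// mirroring how controller registration is already gated.
func registeredWebhooks(gates map[string]bool) []string {
	// Core webhooks always register.
	kinds := []string{"AzureCluster", "AzureMachine"}
	// Experimental webhooks register only when their gate is enabled, so a
	// deployment that excludes those CRDs never starts a watch for them.
	if gates["MachinePool"] {
		kinds = append(kinds, "AzureMachinePool")
	}
	if gates["ASOAPI"] {
		kinds = append(kinds, "AzureASOManagedCluster")
	}
	return kinds
}

func main() {
	fmt.Println(registeredWebhooks(featureGates))
}
```

With both gates disabled, only the core webhooks register, which is why the missing experimental CRDs no longer cause a failed restmapping at startup.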

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

  • cherry-pick candidate

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Moves webhook registration behind feature gate flags like controller registration already does.

Move webhook registration behind feature gate flags similar to
controller registration.

Signed-off-by: Bryan Cox <brcox@redhat.com>
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 28, 2024
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jont828 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 28, 2024
@k8s-ci-robot (Contributor)

Hi @bryan-cox. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 28, 2024
```go
setupLog.Error(err, "unable to create webhook", "webhook", "AzureManagedMachinePool")
os.Exit(1)
}
// NOTE: AzureManagedCluster is behind AKS feature gate flag; the webhook
```
@bryan-cox (Contributor, Author) commented on the diff:

Is this comment still valid, or can it be removed? It looks like it's from a few years back.

@muraee commented Aug 28, 2024

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 28, 2024
@nojnhuh (Contributor) commented Aug 29, 2024

We use the webhooks to forbid creating resources disabled by feature flags. That's also what CAPI does, so I think we should align with that: https://github.com/kubernetes-sigs/cluster-api/blob/be86b82e7e30a844bca141ff8bcdc450b0499549/exp/internal/webhooks/machinepool.go#L168. Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?

This seems fine as long as users do some extra work to ensure those CRDs are not installed at all when the feature flags are disabled, but that would force users to adapt to keep the existing behavior and clusterctl doesn't make that easy.

Are you seeing any adverse behavior besides the error message?
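The alternative nojnhuh describes keeps the webhook registered and has it reject creation while the gate is off. A minimal stdlib-only sketch of that admission check; the function name, gate parameter, and error wording are assumptions for illustration, not the real cluster-api webhook code:

```go
package main

import (
	"errors"
	"fmt"
)

// errFeatureDisabled mimics the Forbidden error a validating webhook
// would return; the wording is illustrative.
var errFeatureDisabled = errors.New("feature gate is disabled")

// validateCreate sketches the admission-time check: instead of skipping
// registration, the webhook stays installed and forbids creating a gated
// resource while its feature gate is off.
func validateCreate(kind string, gateEnabled bool) error {
	if !gateEnabled {
		return fmt.Errorf("can not create %s: %w", kind, errFeatureDisabled)
	}
	return nil
}

func main() {
	if err := validateCreate("AzureMachinePool", false); err != nil {
		fmt.Println(err)
	}
}
```

The trade-off discussed in this thread: this approach gives users an explicit error on create, but it requires the experimental CRDs to be installed so the webhook's watches can start, which is exactly what fails in the self-managed setup above.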

@bryan-cox (Contributor, Author)

> We use the webhooks to forbid creating resources disabled by feature flags. That's also what CAPI does, so I think we should align with that: https://github.com/kubernetes-sigs/cluster-api/blob/be86b82e7e30a844bca141ff8bcdc450b0499549/exp/internal/webhooks/machinepool.go#L168. Does a user still get some kind of error here when they try to create an AzureMachinePool when the MachinePool flag is disabled?
>
> This seems fine as long as users do some extra work to ensure those CRDs are not installed at all when the feature flags are disabled, but that would force users to adapt to keep the existing behavior and clusterctl doesn't make that easy.
>
> Are you seeing any adverse behavior besides the error message?

We aren't using AzureMachinePool. Yeah, we are seeing more than just the log message; the CAPZ pod restarts constantly. Here are some additional logs before the pod restarts:

```
E0829 15:50:31.089094       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
I0829 15:50:38.588560       1 azuremachine_controller.go:243] "Reconciling AzureMachine" logger="controllers.AzureMachineReconciler.reconcileNormal" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine" AzureMachine="clusters-generic-hc/generic-hc-9npwz-8z465" namespace="clusters-generic-hc" name="generic-hc-9npwz-8z465" reconcileID="743788a0-e979-4c1e-9ca4-0c854d575fc0" x-ms-correlation-request-id="0951596f-73b2-4a57-801b-40faca63ef50"
I0829 15:50:38.809896       1 azuremachine_controller.go:243] "Reconciling AzureMachine" logger="controllers.AzureMachineReconciler.reconcileNormal" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine" AzureMachine="clusters-generic-hc/generic-hc-9npwz-7p4fb" namespace="clusters-generic-hc" name="generic-hc-9npwz-7p4fb" reconcileID="f9e21048-ea5d-44bf-9c2d-195d7ad86e74" x-ms-correlation-request-id="1e6492aa-fe1d-413c-9cac-292107e030f7"
E0829 15:50:41.091628       1 kind.go:63] "if kind is a CRD, it should be installed before calling Start" err="failed to get restmapping: no matches for kind \"AzureManagedControlPlane\" in group \"infrastructure.cluster.x-k8s.io\"" logger="controller-runtime.source.EventHandler" kind="AzureManagedControlPlane.infrastructure.cluster.x-k8s.io"
E0829 15:50:41.235638       1 controller.go:203] "Could not wait for Cache to sync" err="failed to wait for ASOSecret caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.AzureManagedControlPlane" controller="ASOSecret" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.235695       1 internal.go:516] "Stopping and waiting for non leader election runnables"
I0829 15:50:41.235829       1 internal.go:520] "Stopping and waiting for leader election runnables"
I0829 15:50:41.235949       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.236026       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate"
I0829 15:50:41.236232       1 controller.go:242] "All workers finished" controller="azuremachinetemplate" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachineTemplate"
I0829 15:50:41.236158       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.236386       1 controller.go:242] "All workers finished" controller="azurecluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureCluster"
I0829 15:50:41.236177       1 controller.go:240] "Shutdown signal received, waiting for all workers to finish" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.236823       1 controller.go:242] "All workers finished" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.237036       1 controller.go:242] "All workers finished" controller="azuremachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AzureMachine"
I0829 15:50:41.237121       1 internal.go:528] "Stopping and waiting for caches"
I0829 15:50:41.237583       1 internal.go:532] "Stopping and waiting for webhooks"
I0829 15:50:41.237981       1 server.go:249] "Shutting down webhook server with timeout of 1 minute" logger="controller-runtime.webhook"
I0829 15:50:41.238191       1 internal.go:535] "Stopping and waiting for HTTP servers"
I0829 15:50:41.238323       1 server.go:231] "Shutting down metrics server with timeout of 1 minute" logger="controller-runtime.metrics"
I0829 15:50:41.238458       1 server.go:43] "shutting down server" kind="health probe" addr="[::]:9440"
I0829 15:50:41.238568       1 internal.go:539] "Wait completed, proceeding to shutdown the manager"
E0829 15:50:41.238677       1 main.go:353] "problem running manager" err="failed to wait for ASOSecret caches to sync: timed out waiting for cache to be synced for Kind *v1beta1.AzureManagedControlPlane" logger="setup"
```

We have the MachinePool feature turned off in our pod deployment:

```yaml
      containers:
      - args:
        - --namespace=$(MY_NAMESPACE)
        - --leader-elect=true
        - --feature-gates=MachinePool=false
...
        name: manager
```

@bryan-cox
Copy link
Contributor Author

FWIW the machines do get provisioned and join our cluster. The CAPZ pod just consistently restarts.
