Allow provider id regex matching to support more provider formats #8485
Comments
From my current reading of code, the https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go has 2 contracts it has to follow.
The other method currently being used in lots of places is IndexKey, which again uses the whole string and not the split string. So from my limited understanding of the code, it's only the validation that is blocking us. /cc @CecileRobertMichon @jackfrancis @fabriziopandini @joekr - continuing from our discussion from yesterday. |
Please correct me if I'm wrong but my understanding is that the If we were to remove the regex (which we should, there is nothing that requires that specific format in k8s), what would be the user experience if the nodes' provider IDs don't match the format of the CAPI Infra Machine provider ID? Would there be an error anywhere in the controller indicating to the user that there is an issue? Or would the Machines stay without a nodeRef forever, stuck waiting for a node with the wrong provider ID? If it's the latter, can we think of any way to better enforce this contract of "provider id of the Kubernetes node matches that of the CAPI machine" without relying on the regex? |
To be clear, it's not just the regular expression that needs to be addressed: the regular expression is simply sanity checking input for the foo that formulates an I think we need to assume that those substrings are useful across the provider ecosystem, and so any changes to accommodate a novel I'm happy to take a stab at that @shyamradhakrishnan, would you like to see a possible back-compat solution? |
@CecileRobertMichon the nodes will be stuck and the machinepools will not be in running state, will be stuck in scaling up state, with the following logs
But that can also happen with the regex in place if there is a bug, right? For example, assume that in the infra controller the developer sets the provider id as clouprovider:/// and in the managed service it is set as cloudprovider://, or makes any such small error while constructing the provider id. Of course, if the user has set up machine health checks, the node will be health checked, deleted and replaced. And once we have machine pool machines, that can happen in machinepools as well (hopefully). Do you think that, apart from the fact that machine pools don't go to running state, which users will monitor, there needs to be another error which we should throw for this specific case? |
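To illustrate the point above, here is a minimal Go sketch (not CAPI code) showing that two provider IDs can both pass the regex quoted in this issue and still fail an exact whole-string lookup. The `findNodeName` helper, the map-based index, and the IDs are all hypothetical and exist only for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// legacyFormat is the pattern quoted in this issue.
var legacyFormat = regexp.MustCompile(`^[^:]+://.*[^/]$`)

// findNodeName is a hypothetical helper: it looks up a node by comparing the
// whole provider ID string, with no normalization of any kind.
func findNodeName(nodesByProviderID map[string]string, machineProviderID string) (string, bool) {
	name, ok := nodesByProviderID[machineProviderID]
	return name, ok
}

func main() {
	// Made-up IDs: the node side uses three slashes, the machine side two.
	nodeID := "cloudprovider:///instance-1"
	machineID := "cloudprovider://instance-1"

	nodes := map[string]string{nodeID: "node-a"}

	fmt.Println("node ID passes regex:   ", legacyFormat.MatchString(nodeID))
	fmt.Println("machine ID passes regex:", legacyFormat.MatchString(machineID))

	_, found := findNodeName(nodes, machineID)
	fmt.Println("machine matched a node: ", found)
}
```

In this sketch the format check accepts both strings, so the mismatch only shows up as a lookup that never succeeds.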
@jackfrancis sure, it would be great if you can help. When you say backward compatible, do you mean that the cloudProvider/id fields of the ProviderID struct are being used as a library outside core CAPI and we should not remove those fields? For example, we split the strings, and if the regex does not match, we set the cloud provider to "unknown" and set the id to the original string? |
I'm putting together some context here so it is easier to understand why/where we're going:
I'm good relaxing the providerID format (I'm curious though what's the reason for this provider to deviate from the existing non-official convention cc @shyamradhakrishnan), but whatever we allow needs to be done hand in hand with https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380 I also share @CecileRobertMichon's concerns above: as we do this I think we should also improve UX for error scenarios (not sure how well we're communicating right now in a condition that there's a permanent mismatch between providerIDs), but mainly how to prevent that from happening, e.g. one random idea might be for "clouds" to define their contractual format in a source of truth that both capi providers and kcm cloud providers consume... |
I think something like this will be needed to address the requirements outlined in this issue: Take a look folks and let's discuss pros/cons in the above PR! |
@enxebre the format is not validated by apiserver/kubelet, and OKE (the managed provider for OCI) is passing the Kubernetes conformance tests without this format. So it does seem a little strange that only CAPI tries to enforce the format, right? If Kubernetes in general enforced this format, it would have been a different question. From what I understand this is a soft rule and not a hard one, and a legacy one at that, but my understanding may be wrong. |
@jackfrancis I think your PR looks like a solution which will work for existing providers following the format and new ones also, so looks good to me. |
/triage accepted |
Thanks @enxebre for the great context. my 2 cents; given https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380 it seems to me that Kubernetes accepting values like ocid1.instance.oc1 is more a bug than a feature (it doesn't enforce what the API states). Given that IMO in CAPI we have two options:
I would prefer to stick to option 1, but if this can improve adoption and there is consensus, I can also accept 2. I consider the UX issue about surfacing when a Provider doesn't match a node a sort of orthogonal problem, because it is not related to the change we are discussing (also, today we only have log lines if this happens). That said, this is a good chance to get it properly fixed |
@fabriziopandini I am following up with #sig-node and #sig-cloud-provider on the format. If you look at the kubelet arguments here - https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
It does not specify a format, so it does feel to me like a suggestion rather than a standard. Note that even in https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380, terms like "should" are not present. So I generally feel CAPI should follow what Kubernetes in general is following. I will update this thread once I get an answer from the SIG groups. |
@fabriziopandini @enxebre this has been discussed in multiple GitHub issues in core Kubernetes, namely here kubernetes/kubernetes#96871 (comment). It is pretty clear that the provider id is not prescriptive and hence the validation doesn't exist in kubelet/apiserver etc. Do you think we should do more research into this? I will try to get answers from the SIG channels, but the GitHub issues and kubelet documentation do suggest to me that the format is not something which CAPI should fail on. |
Thanks for this research, really appreciate it! |
@enxebre @CecileRobertMichon @jackfrancis please comment with your opinion. |
@enxebre @CecileRobertMichon @jackfrancis gentle reminder for your opinion/comments. |
I made #8505, so I'm in favor. :) |
+1 for dropping the regex and thinking about a better solution to enforce a provider ID format contract between CAPI and providers |
+1 here as well, seems it would simplify a lot the current logic as well |
sgtm, no objections to proceed via #8505 |
Thanks @fabriziopandini @enxebre @jackfrancis @vincepri @CecileRobertMichon for the feedback |
+1 as well from my side. Just wanted to say at this point if someone has a fancy idea how we can get rid of depending on providerID to match Nodes to Machines, that would be great :) |
Not sure it's a fancy idea, but one way could be having the Machine unique name/namespace as a label or annotation on the node? This would be a new requirement on the bootstrap provider to spin up nodes with a well-defined label, which might not work for everyone, although we can always fall back to the providerID? |
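As a rough illustration of the label idea above, here is a minimal Go sketch that tries a label match first and falls back to an exact provider ID comparison. The label keys, the `node` and `machine` types, and `findNodeForMachine` are all hypothetical, and the sketch does not address how such labels would get onto the node in the first place.

```go
package main

import "fmt"

// Hypothetical label keys; nothing in Cluster API defines these.
const (
	machineNamespaceLabel = "example.test/machine-namespace"
	machineNameLabel      = "example.test/machine-name"
)

// node and machine are simplified stand-ins for the real API types.
type node struct {
	Name       string
	Labels     map[string]string
	ProviderID string
}

type machine struct {
	Namespace  string
	Name       string
	ProviderID string
}

// findNodeForMachine prefers a label match and falls back to comparing the
// whole provider ID string when no labelled node is found.
func findNodeForMachine(m machine, nodes []node) (string, bool) {
	if m.Name != "" {
		for _, n := range nodes {
			if n.Labels[machineNamespaceLabel] == m.Namespace &&
				n.Labels[machineNameLabel] == m.Name {
				return n.Name, true
			}
		}
	}
	if m.ProviderID == "" {
		return "", false
	}
	for _, n := range nodes {
		if n.ProviderID == m.ProviderID {
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	nodes := []node{
		{Name: "node-a", ProviderID: "foo://a"},
		{Name: "node-b", Labels: map[string]string{
			machineNamespaceLabel: "default",
			machineNameLabel:      "machine-b",
		}},
	}
	fmt.Println(findNodeForMachine(machine{Namespace: "default", Name: "machine-b"}, nodes))
	fmt.Println(findNodeForMachine(machine{Namespace: "default", Name: "machine-a", ProviderID: "foo://a"}, nodes))
}
```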
fwiw I kinda think the existing providerID matching is still the best way to create node/infra affinity all the way up and down the k8s stack In any event, I've been convinced that simply bypassing our current CAPI-specific |
Yup. Problem is that e.g. kubelet can only set labels in specific domains / prefixes. Was just thinking about how much time we already spent on this topic and even with the latest change we have problems like that CAPI doesn't work if the CCM has a totally unrelated problem. But let's keep this issue scoped to the current problem ;) |
Thanks to all those who provided input on the ticket, and special thanks to @jackfrancis for getting this into CAPI so quickly. We tested the branch and it works amazingly well. |
What would you like to be added (User Story)?
As a developer of a Cluster API Provider, I would like to support more formats for the provider id rather than the current restrictive format of "^[^:]+://.*[^/]$" defined here - https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go#L48
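To make the restriction concrete, the following Go sketch checks a few provider IDs against the pattern quoted above. The IDs are made up for illustration (the OCI-style one only mimics the general shape of the IDs described in this issue), and `matchesLegacyFormat` is a hypothetical helper, not a function from noderefutil.

```go
package main

import (
	"fmt"
	"regexp"
)

// legacyProviderIDFormat is the pattern quoted in this issue.
var legacyProviderIDFormat = regexp.MustCompile(`^[^:]+://.*[^/]$`)

// matchesLegacyFormat reports whether id passes the quoted pattern.
func matchesLegacyFormat(id string) bool {
	return legacyProviderIDFormat.MatchString(id)
}

func main() {
	// All IDs below are made-up examples, not real instances.
	ids := []string{
		"aws:///us-west-2a/i-0123456789abcdef0",  // scheme-style ID
		"ocid1.instance.oc1.phx.exampleuniqueid", // no "://" separator
		"cloudprovider://",                       // nothing after the separator
	}
	for _, id := range ids {
		fmt.Printf("%q matches=%v\n", id, matchesLegacyFormat(id))
	}
}
```

Only the first ID satisfies the pattern; an ID without a `<provider>://` prefix is rejected regardless of whether the Kubernetes API server accepts it.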
Detailed Description
While developing a provider for Oracle Cloud Infrastructure, we are facing a problem with the current restrictive validation being done in CAPI here - https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go#L48 . The OCI managed Kubernetes offering sets the provider id as ocid1.instance.oc1.. and ocid1.virtualnode.oc1... The Kubernetes API Server and Kubelet allow these formats, but since CAPI throws validation errors against these formats, we are currently unable to properly add support for managed providers for OCI. We have a workaround for the managed nodepool, but for the virtual nodepool we are unable to use the workaround.
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.