Allow provider id regex matching to support more provider formats #8485

shyamradhakrishnan · 2023-04-06T04:02:11Z

What would you like to be added (User Story)?

As a developer of a Cluster API Provider, I would like to support more formats for the provider id rather than the current restrictive format of "^[^:]+://.*[^/]$" defined here - https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go#L48

Detailed Description

While developing a provider for Oracle Cloud Infrastructure, we are facing a problem with the restrictive validation currently done in CAPI here - https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go#L48 . The OCI managed Kubernetes offering sets the provider id as ocid1.instance.oc1.. and ocid1.virtualnode.oc1... The Kubernetes API Server and Kubelet allow these formats, but since CAPI throws validation errors against these formats, we are currently unable to properly add support for managed providers for OCI. We have a workaround for the managed nodepool, but for the virtual nodepool we are unable to use the workaround.
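To make the rejection concrete, here is a minimal sketch that runs the regex quoted above against two provider IDs. Both IDs are hypothetical placeholders (the OCID suffix in particular is made up), and the helper name isLegacyFormat is an assumption for illustration, not the CAPI function.

```go
package main

import (
	"fmt"
	"regexp"
)

// legacyFormat is the pattern quoted in this issue.
var legacyFormat = regexp.MustCompile(`^[^:]+://.*[^/]$`)

// isLegacyFormat is a hypothetical helper: it reports whether a provider ID
// passes the regex check discussed above.
func isLegacyFormat(providerID string) bool {
	return legacyFormat.MatchString(providerID)
}

func main() {
	// Both IDs are illustrative placeholders, not real instance identifiers.
	ids := []string{
		"cloudprovider://instance-1",
		"ocid1.instance.oc1..exampleuniqueid",
	}
	for _, id := range ids {
		fmt.Printf("%-40s accepted=%v\n", id, isLegacyFormat(id))
	}
}
```

The OCID-style string contains no "://" separator, so the pattern cannot match it regardless of what follows the prefix.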

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

shyamradhakrishnan · 2023-04-06T04:12:22Z

From my current reading of the code, https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go has 2 contracts it has to follow.

1. It validates whether the provider id is of the format "^[^:]+://.*[^/]$", which is essentially cloudprovider://
2. The second and crucial contract is the Equals method. It is called from here - https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_noderef.go#L223 which checks whether the provider id of the Kubernetes node matches that of the CAPI machine. As of now, the Equals() method does not use the regex split strings; it compares the whole string.

The other method currently used in lots of places is IndexKey, which again uses the whole string and not the split string. So from my limited understanding of the code, it's only the validation that is blocking us.

/cc @CecileRobertMichon @jackfrancis @fabriziopandini @joekr - continuing from our discussion from yesterday.

CecileRobertMichon · 2023-04-06T15:55:00Z

> The second and crucial contract is the Equals method It is called from here

Please correct me if I'm wrong, but my understanding is that the Equals method itself is not a contract or a validation; it is what is used to match nodes with CAPI Machines and MachinePools. The contract "the provider ID of the node must match the provider ID of the infra machine" does exist implicitly, but it's not actually validated/enforced (what I call enforced in this context is an error being returned by the controllers), except right now by 1): because the code assumes that's how all node provider IDs look, it is enforcing that specific format.

If we were to remove the regex (which we should, there is nothing that requires that specific format in k8s), what would be the user experience if the nodes' provider IDs don't match the format of the CAPI Infra Machine provider ID? Would there be an error anywhere in the controller indicating to the user that there is an issue? Or would the Machines stay without a nodeRef forever, stuck waiting for a node with the wrong provider ID?

If it's the latter, can we think of any way to better enforce this contract of "provider id of the Kubernetes node matches that of the CAPI machine" without relying on the regex?

jackfrancis · 2023-04-06T17:39:15Z

To be clear, it's not just the regular expression that needs to be addressed: the regular expression is simply sanity checking input for the foo that formulates an ID and a CloudProvider substring from the source providerID string.

I think we need to assume that those substrings are useful across the provider ecosystem, and so any changes to accommodate a novel providerID string format will need to be backwards compatible with existing structures that have been built on top of the current format assumptions.

I'm happy to take a stab at that @shyamradhakrishnan, would you like to see a possible back-compat solution?

shyamradhakrishnan · 2023-04-07T02:51:49Z

@CecileRobertMichon the nodes will be stuck and the machinepools will not reach the running state; they will be stuck in the scaling up state, with the following logs

I0407 02:47:39.378987       1 machinepool_controller_noderef.go:83] "Cannot assign NodeRefs to MachinePool, no matching Nodes" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="default/iad-cluster-3-mp-0" namespace="default" name="iad-cluster-3-mp-0" reconcileID=e540d24c-f2ce-4923-9438-067c8089669b Cluster="default/iad-cluster-3"

But that can happen even with the regex in place if there is a bug, right? For example, assume that in the infra controller the developer sets the provider id as cloudprovider:/// and in the managed service it is set as cloudprovider://, or any such small error while constructing the provider id. Of course, if the user has set up machine health checks, the node will be health checked, deleted, and replaced. And once we have machine pool machines, that can happen in machinepools as well (hopefully).
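A small sketch of that scenario, assuming a made-up instance name: both strings pass the regex, yet a whole-string comparison still fails, so the format check alone does not prevent an unmatched node. The function sameProviderID is a hypothetical stand-in for the whole-string comparison described earlier in the thread, not the CAPI implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// legacyFormat is the pattern quoted in this issue.
var legacyFormat = regexp.MustCompile(`^[^:]+://.*[^/]$`)

// sameProviderID is a hypothetical stand-in for a whole-string equality check.
func sameProviderID(machineID, nodeID string) bool {
	return machineID == nodeID
}

func main() {
	// "instance-1" is an illustrative placeholder.
	machineID := "cloudprovider:///instance-1" // set by the infra controller (three slashes)
	nodeID := "cloudprovider://instance-1"     // set by the managed service (two slashes)

	fmt.Println("machine ID passes regex:", legacyFormat.MatchString(machineID))
	fmt.Println("node ID passes regex:   ", legacyFormat.MatchString(nodeID))
	fmt.Println("IDs are equal:          ", sameProviderID(machineID, nodeID))
}
```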

Do you think that, apart from the fact that machine pools don't go to the running state (which users will monitor), there needs to be another error which we should throw for this specific case?

shyamradhakrishnan · 2023-04-07T02:54:01Z

@jackfrancis sure, it would be great if you can help. When you say backward compatible, do you mean that the cloudProvider/id fields of the ProviderID struct are being used as a library outside core CAPI and we should not remove those fields? For example, we split the strings, and if the regex does not match, set the cloud provider to "unknown" and set the id to the original string?
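A rough sketch of that fallback idea, under stated assumptions: the type parsedID, the function parseProviderID, and the capture-group regex are all hypothetical, and treating everything after "://" as the id is a simplification rather than what CAPI does.

```go
package main

import (
	"fmt"
	"regexp"
)

// legacyParts is the regex from this issue with capture groups added
// (an assumption made for this sketch).
var legacyParts = regexp.MustCompile(`^([^:]+)://(.*[^/])$`)

// parsedID is a hypothetical type; it is not the CAPI ProviderID struct.
type parsedID struct {
	original      string
	cloudProvider string
	id            string
}

// parseProviderID never fails: strings that do not match the legacy format
// get cloudProvider "unknown" and keep the original string as the id.
func parseProviderID(s string) parsedID {
	if m := legacyParts.FindStringSubmatch(s); m != nil {
		return parsedID{original: s, cloudProvider: m[1], id: m[2]}
	}
	return parsedID{original: s, cloudProvider: "unknown", id: s}
}

// Equals compares the whole original string, independent of the split fields.
func (p parsedID) Equals(o parsedID) bool {
	return p.original == o.original
}

func main() {
	// Both inputs are illustrative placeholders.
	for _, s := range []string{"cloudprovider://instance-1", "ocid1.instance.oc1..exampleuniqueid"} {
		p := parseProviderID(s)
		fmt.Printf("provider=%q id=%q\n", p.cloudProvider, p.id)
	}
}
```

Existing consumers of the split fields keep working for scheme-style IDs, while other formats are accepted and still matched by whole-string equality.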

enxebre · 2023-04-10T09:42:40Z

I'm putting together some context here so it is easier to understand why/where we're going:

- providerID was first added to the Node API in 2015 (api: add the ProviderID attribute to NodeSpec kubernetes/kubernetes#7775 (comment)) as a semantic to have stronger Node/Cloud Provider identity. The format cloud://// was kind of openly agreed in that thread and it has been perpetuated until today.
- We added providerID to the CAPI Machine API in 2018/2019 (v1alpha) to have stronger Machine/Node/Cloud provider identity: Extend MachineStatus to add ProviderID #565
- We then added the abstractions to manage the providerID in the noderef controller: NodeRef controller #1011
- We wanted to ensure CAPI providers use the same format as kcm cloud providers, to reduce room for side effects from edge cases, e.g. a cloud provider using the same instanceID in two different regions/zones:
  - ProviderID more consistent with kube manager cloud provider cluster-api-provider-aws#1693
  - ✨Make machine's providerID consistent with node providerID cluster-api-provider-aws#1730
  - ProviderID set by capi infra providers should match the one set by the controller manager cloud-provider #4526
  - ⚠️ Machine ProviderID equality is now strictly enforced #6412
  - providerId in lowerCase does not match the providerId on the nodes cluster-api-provider-azure#2533
- The providerID is the contractual source of truth to match Machines to Nodes for both CAPI internal controllers (🌱 Add providerID index to get nodes #4521) and external consumers, e.g. autoscaler (CAPI: Do not normalize Node IDs outside of CAPI provider kubernetes/autoscaler#3057)

I'm good with relaxing the providerID format (I'm curious, though, what the reason is for this provider to deviate from the existing non-official convention cc @shyamradhakrishnan), but whatever we allow needs to be done hand in hand with https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380

I also share @CecileRobertMichon's concerns above. As we do this, I think we should also improve the UX for error scenarios (not sure how well we're communicating right now in a condition that there's a permanent mismatch between providerIDs), but mainly work out how to prevent that from happening, e.g. one random idea might be for "clouds" to define their contractual format in a source of truth that both capi providers and kcm cloud providers consume...

jackfrancis · 2023-04-10T23:46:26Z

I think something like this will be needed to address the requirements outlined in this issue:

#8505

Take a look folks and let's discuss pros/cons in the above PR!

shyamradhakrishnan · 2023-04-11T03:50:12Z

@enxebre the format is not validated by apiserver/kubelet, and OKE (the managed provider for OCI) is passing the Kubernetes conformance tests without this format. So it does seem a little strange that only CAPI tries to enforce the format, right? If Kubernetes in general enforced this format, it would have been a different question. From what I understand, this is a soft rule and not a hard one, and a legacy one at that, but my understanding may be wrong.

shyamradhakrishnan · 2023-04-11T03:52:05Z

@jackfrancis I think your PR looks like a solution which will work for existing providers following the format and new ones also, so looks good to me.

fabriziopandini · 2023-04-11T17:16:23Z

/triage accepted
catching up with this thread, I will comment as soon as I can do some research in this area, which predates me joining the project

fabriziopandini · 2023-04-12T14:02:23Z

Thanks @enxebre for the great context.

my 2 cents; given https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380 it seems to me that Kubernetes accepting values like ocid1.instance.oc1 is more a bug than a feature (it doesn't enforce what the API states).

Given that IMO in CAPI we have two options:

1. Keep things as they are, which is in line with the comment on the node API
2. Keep only the equality check, which is the part we care about, and drop the validation entirely (so we simplify our codebase, dropping a check that is not blocking)

I would prefer to stick to option 1, but if this can improve adoption and there is consensus, I can accept also 2.

I consider the UX issue about surfacing when a Provider doesn't match a node a sort of orthogonal problem, because it is not related to the change we are discussing (also, today we only have log lines if this happens). That said, this is a good chance to get it properly fixed

shyamradhakrishnan · 2023-04-13T02:33:30Z

@fabriziopandini I am following up with #sig-node and #sig-cloud-provider on the format. If you look at the kubelet arguments here - https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

> Unique identifier for identifying the node in a machine database, i.e cloud provider. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See [kubelet-config-file](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/) for more information.)

It does not specify a format, so it does feel to me like a suggestion rather than a standard. Note that even in https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380, terms such as "should" are not present. So I generally feel CAPI should follow what Kubernetes in general is following.

I will update this thread once I get an answer from the SIG groups.

shyamradhakrishnan · 2023-04-13T14:52:40Z

@fabriziopandini @enxebre this has been discussed in multiple github issues in core Kubernetes namely here

kubernetes/kubernetes#96871 (comment)
kubernetes/kubernetes#81859 (comment)

It is pretty clear that the provider id format is not prescriptive, and hence the validations don't exist in kubelet/API server etc. Do you think we should do more research into this? I will try to get answers from the SIG channels, but the GitHub issues and kubelet documentation do suggest to me that the format is not something which CAPI should fail on.

fabriziopandini · 2023-04-14T09:15:36Z

Thanks for this research, really appreciate it!
If there is consensus, I'm ok with dropping the check and preserving only the equality check, which is relevant for CAPI

shyamradhakrishnan · 2023-04-14T16:11:31Z

@enxebre @CecileRobertMichon @jackfrancis please comment with your opinion.

shyamradhakrishnan · 2023-04-24T10:51:27Z

@enxebre @CecileRobertMichon @jackfrancis gentle reminder for your opinion/comments.

jackfrancis · 2023-04-24T16:30:23Z

I made #8505, so I'm in favor. :)

CecileRobertMichon · 2023-04-24T17:02:56Z

+1 for dropping the regex and thinking about a better solution to enforce a provider ID format contract between CAPI and providers

vincepri · 2023-04-24T19:37:28Z

+1 here as well, seems it would also simplify the current logic a lot

enxebre · 2023-04-25T08:04:34Z

sgtm, no objections to proceed via #8505

shyamradhakrishnan · 2023-04-25T08:55:45Z

Thanks @fabriziopandini @enxebre @jackfrancis @vincepri @CecileRobertMichon for the feedback

sbueringer · 2023-04-26T18:21:06Z

+1 as well from my side.

Just wanted to say at this point if someone has a fancy idea how we can get rid of depending on providerID to match Nodes to Machines, that would be great :)

vincepri · 2023-04-27T17:30:07Z

Not sure it's a fancy idea, but one way could be to have the Machine's unique name/namespace as a label or annotation on the node? This would be a new requirement on the bootstrap provider to spin up nodes with a well-defined label, which might not work for everyone, although we can always fall back to the providerID?
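A minimal sketch of that lookup order, assuming an annotation rather than a label (a namespace/name value contains "/", which label values do not allow). The annotation key, the node and machine types, and findNode are all hypothetical and only illustrate "annotation first, providerID as fallback".

```go
package main

import "fmt"

// machineAnnotation is a hypothetical key; no such annotation is defined by CAPI.
const machineAnnotation = "example.test/machine"

// node and machine are simplified stand-ins for the real API types.
type node struct {
	name        string
	providerID  string
	annotations map[string]string
}

type machine struct {
	namespace  string
	name       string
	providerID string
}

// findNode prefers a node annotated with the Machine's namespace/name and
// falls back to a whole-string providerID comparison.
func findNode(m machine, nodes []node) (node, bool) {
	want := m.namespace + "/" + m.name
	for _, n := range nodes {
		if n.annotations[machineAnnotation] == want {
			return n, true
		}
	}
	for _, n := range nodes {
		if m.providerID != "" && n.providerID == m.providerID {
			return n, true
		}
	}
	return node{}, false
}

func main() {
	nodes := []node{
		{name: "node-a", providerID: "cloudprovider://instance-1"},
		{name: "node-b", annotations: map[string]string{machineAnnotation: "default/machine-b"}},
	}
	n, ok := findNode(machine{namespace: "default", name: "machine-b"}, nodes)
	fmt.Println(n.name, ok)
}
```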

jackfrancis · 2023-04-27T17:59:10Z

fwiw I kinda think the existing providerID matching is still the best way to create node/infra affinity all the way up and down the k8s stack

In any event, I've been convinced that simply bypassing our current CAPI-specific ProviderID foo is the cleaner approach to moving this effort forward; see #8577

sbueringer · 2023-04-27T18:28:28Z

> This would be a new requirement on the bootstrap provider to spin up nodes with a well-defined label, which might not work for everyone, although we can always fallback to the providerID?

Yup. Problem is that e.g. kubelet can only set labels in specific domains / prefixes.

Was just thinking about how much time we have already spent on this topic, and even with the latest change we have problems like CAPI not working if the CCM has a totally unrelated problem.

But let's keep this issue scoped to the current problem ;)

shyamradhakrishnan · 2023-05-04T06:06:25Z

Thanks to all those who provided input on the ticket, and special thanks to @jackfrancis for getting this into CAPI so quickly. We tested the branch and it works great.

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 6, 2023

jackfrancis mentioned this issue Apr 10, 2023

⚠️ enable arbitrary strings as ProviderID format #8505

Closed

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2023

jackfrancis mentioned this issue Apr 27, 2023

🌱 use providerID string as-is #8577

Merged

k8s-ci-robot closed this as completed in #8577 May 2, 2023
