Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow provider id regex matching to support more provider formats #8485

Closed
shyamradhakrishnan opened this issue Apr 6, 2023 · 26 comments · Fixed by #8577
Closed

Allow provider id regex matching to support more provider formats #8485

shyamradhakrishnan opened this issue Apr 6, 2023 · 26 comments · Fixed by #8577
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@shyamradhakrishnan
Copy link

What would you like to be added (User Story)?

As a developer of a Cluster API Provider, I would like to support more formats for the provider id rather than the current restrictive format of "^[^:]+://.*[^/]$" defined here - https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go#L48

Detailed Description

While developing a provider for Oracle Cloud Infrastructure, we are facing a problem with the current restrictive validation being done in CAPI here - https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go#L48 . OCI managed kubernets offering sets the provider id as ocid1.instance.oc1.. and ocid1.virtualnode.oc1... Kubernetes API Server and Kubelet is allowing these formats, but since CAPI is throwing validations errors against theses formats, we are currently unable to properly add support for managed providers for OCI. We have a workaround for the managed nodepool, bur for virtual nodepool we are unable to use the woraround.

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 6, 2023
@shyamradhakrishnan
Copy link
Author

From my current reading of code, the https://github.com/kubernetes-sigs/cluster-api/blob/main/controllers/noderefutil/providerid.go has 2 contracts it has to follow.

  1. It validates whether the provider id is of the format "^[^:]+://.*[^/]$" , which is essentially cloudprovider://
  2. The second and crucial contract is the Equals method It is called from here - https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_noderef.go#L223 which validates if the provider id of the Kubernetes node matches that of the CAPI machine. The Equals() method as of now does not use the regex split strings, it compares the whole string.

The other method currently being used in lots if places is IndexKey, which again uses the whole string and not the split string. So from the limited understanding of the code, its only the validation which is blocking us.

/cc @CecileRobertMichon @jackfrancis @fabriziopandini @joekr - continuing from our discussion from yesterday.

@CecileRobertMichon
Copy link
Contributor

The second and crucial contract is the Equals method It is called from here

Please correct me if I'm wrong but my understanding is that the Equal method itself is not a contract or a validation, it is what is used to match nodes with CAPI Machines and MachinePools. The contract "the provider ID of node must match the provider ID of the infra machine" does exist implicitly but it's not actually validated/enforced (what I call enforced in this context is an error being returned by the controllers), except right now by 1) because the code assumes that's how all node provider IDs look so it is enforcing that specific format.

If we were to remove the regex (which we should, there is nothing that requires that specific format in k8s), what would be the user experience if the nodes' provider ID don't match the format of the CAPI Infra Machine provider ID? Would there be an error anywhere in the controller indicating to the user that there is an issue? or would the Machines stay without a nodeRef forever, stuck waiting for a node with the wrong provider ID?

If it's the latter, can we think of any way to better enforce this contract of "provider id of the Kubernetes node matches that of the CAPI machine" without relying on the regex?

@jackfrancis
Copy link
Contributor

To be clear, it's not just the regular expression that needs to be addressed: the regular expression is simply sanity checking input for the foo that formulates an ID and a CloudProvider substring from the source providerID string.

I think we need to assume that those substrings are useful across the provider ecosystem, and so any changes to accommodate a novel providerID string format will need to be backwards compatible with existing structures that have been built on top of the current format assumptions.

I'm happy to take a stab at that @shyamradhakrishnan, would you like to see a possible back-compat solution?

@shyamradhakrishnan
Copy link
Author

@CecileRobertMichon the nodes will be stuck and the machinepools will not be in running state, will be stuck in scaling up state, with the following logs

I0407 02:47:39.378987       1 machinepool_controller_noderef.go:83] "Cannot assign NodeRefs to MachinePool, no matching Nodes" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="default/iad-cluster-3-mp-0" namespace="default" name="iad-cluster-3-mp-0" reconcileID=e540d24c-f2ce-4923-9438-067c8089669b Cluster="default/iad-cluster-3"

But that can happen even if the regex is present also if there is a bug right? For example assume, that in the infra controller, developer sets the provider id as clouprovider:/// and in the managed service, it is set as cloudprovider:// or any such small error while constructing the provider id. Of course, if user has setup machine health checks the node will be health checked and deleted and replaced. And once we have machine pool machines, that can happen in machinepools as well(hopefully).

Do you think apart from the fact that machine pools dont go to running state which users will monitor, there needs to be another error which we should throw for this specific case?

@shyamradhakrishnan
Copy link
Author

@jackfrancis sure would be great if you can help. When you say backward compatible, you mean that the cloudProvider/id fields of ProviderID struct is being used as a library outside core CAPI and we should not remove those fields? Like for example, we split the strings, and set the cloud provider to "unknown" and set the id as original if the regex does not match?

@enxebre
Copy link
Member

enxebre commented Apr 10, 2023

I'm putting together some context here so it is easier to understand why/where we're going:

I'm good relaxing the providerID format (I'm curious though what's the reason for this provider to deviate from the existing non-official convention cc @shyamradhakrishnan), but whatever we allow needs to be done in hand with https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380

I also share @CecileRobertMichon concerns above, as we do this I think we should also improve UX for error scenarios (not sure how well we're communicating right now in a condition that there's a permanent miss-match between providerIDs), but mainly how to prevent that from happening e.g. one random idea might be for "clouds" to define their contractual format in a source of truth that both capi providers and kcm cloud providers consume...

@jackfrancis
Copy link
Contributor

I think something like this will be needed to address the requirements outlined in this issue:

#8505

Take a look folks and let's discuss pros/cons in the above PR!

@shyamradhakrishnan
Copy link
Author

shyamradhakrishnan commented Apr 11, 2023

@enxebre the format is not validated by apiserver/kubelet, and OKE(the managed provider for OCI) is passing the Kubernetes conformance tests without this format. So it does seem little strange that only CAPI tries to enforce the format right? If Kubernetes in general forces this format, it should have been a different question. From what I understand this is a soft rule and not a hard one and that too a legacy one, but my understanding maybe wrong.

@shyamradhakrishnan
Copy link
Author

@jackfrancis I think your PR looks like a solution which will work for existing providers following the format and new ones also, so looks good to me.

@fabriziopandini
Copy link
Member

/triage accepted
catching up with this thread, I will comment as soon as I can do some research in this are that predates me joining the project

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 11, 2023
@fabriziopandini
Copy link
Member

Thanks @enxebre for the great context.

my 2 cents; given https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380 it seems to me that Kubernetes accepting values like ocid1.instance.oc1 is more a bug than a feature (it doesn't enforce what the API states).

Given that IMO in CAPI we have two options:

  1. Keep things as it is, which is in line with the comment on the node API
  2. Keep only the equality check, which is the part we care about and drop entirely the validation (so we simplify our codebase, dropping a check that is not blocking)

I would prefer to stick to option 1, but if this can improve adoption and there is consensus, I can accept also 2.

I consider the UX issue about surfacing when a Provider doesn't match a node a sort of orthogonal problem, because it is not related to the change we are discussing (also today we only have log lines if this happens). Said that, this is a good chance to get it properly fixed

@shyamradhakrishnan
Copy link
Author

@fabriziopandini I am following up with #sig-nod and #sig-cloud-provider on the format. If you look at the kubelet arguments here - https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/

Unique identifier for identifying the node in a machine database, i.e cloud provider. (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See [kubelet-config-file](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/) for more information.)

It does not specify a format. so it does feel to me like a suggestion rather than something which is a standard. Note that even the https://github.com/kubernetes/kubernetes/blob/master/pkg/apis/core/types.go#L4377-L4380, the terms "should" etc are not present. So I generally feel CAPI should follow what Kubernetes in general is following.

I will update this thread once I get an answer from the SIG groups.

@shyamradhakrishnan
Copy link
Author

@fabriziopandini @enxebre this has been discussed in multiple github issues in core Kubernetes namely here

kubernetes/kubernetes#96871 (comment)
kubernetes/kubernetes#81859 (comment)

It is pretty clear that the provider id is not prescriptive and hence the validations doesnt exist in kubelet/Apiserver etc. Do you think we should do more research into this, I will try to get answers from the SIG channels but the github issues and kubelet documentation does suggest to me that format is not something which CAPI should fail on.

@fabriziopandini
Copy link
Member

Thanks for this research, really appreaciated it!
if there is consensus, I'm ok with dropping the check and preserving only the equality check which is relevant for CAPI

@shyamradhakrishnan
Copy link
Author

@enxebre @CecileRobertMichon @jackfrancis please comment with your opinion.

@shyamradhakrishnan
Copy link
Author

@enxebre @CecileRobertMichon @jackfrancis gentle reminder for your opinion/comments.

@jackfrancis
Copy link
Contributor

I made #8505, so I'm in favor. :)

@CecileRobertMichon
Copy link
Contributor

+1 for dropping the regex and thinking about a better solution to enforce a provider ID format contract between CAPI and providers

@vincepri
Copy link
Member

+1 here as well, seems it would simplify a lot the current logic as well

@enxebre
Copy link
Member

enxebre commented Apr 25, 2023

sgtm, no objections to proceed via #8505

@shyamradhakrishnan
Copy link
Author

Thanks @fabriziopandini @enxebre @jackfrancis @vincepri @CecileRobertMichon for the feedback

@sbueringer
Copy link
Member

sbueringer commented Apr 26, 2023

+1 as well from my side.

Just wanted to say at this point if someone has a fancy idea how we can get rid of depending on providerID to match Nodes to Machines, that would be great :)

@vincepri
Copy link
Member

vincepri commented Apr 27, 2023

Not sure fancy idea, but one way could be having the Machine unique name/namespace to be a label or annotation on the node? This would be a new requirement on the bootstrap provider to spin up nodes with a well-defined label, which might not work for everyone, although we can always fallback to the providerID?

@jackfrancis
Copy link
Contributor

fwiw I kinda think the existing providerID matching is still the best way to create node/infra affinity all the way up and down the k8s stack

In any event, I've been convinced that simply bypassing our current CAPI-specific ProviderID foo is the cleaner approach to moving this effort forward see #8577

@sbueringer
Copy link
Member

This would be a new requirement on the bootstrap provider to spin up nodes with a well-defined label, which might not work for everyone, although we can always fallback to the providerID?

Yup. Problem is that e.g. kubelet can only set labels in specific domains / prefixes.

Was just thinking about how much time we already spent on this topic and even with the latest change we have problems like that CAPI doesn't work if the CCM has a totally unrelated problem.

But let's keep this issue scoped to the current problem ;)

@shyamradhakrishnan
Copy link
Author

Thanks to all those provided inputs on the ticket and special thanks to @jackfrancis for getting this in so quickly in CAPI. We tested the branch and works amazing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
8 participants