chore: defork cloud-provider-azure #6947

@comtalyst comtalyst commented Jun 20, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

In the past, the Azure provider borrowed some code from cloud-provider-azure, although some of it did not even exist in cloud-provider-azure at that time.
As a result, several pieces of code are duplicated or have diverged. This PR delegates that functionality back to cloud-provider-azure as much as possible, in an attempt to eliminate the duplicated logic.

Examples of instances where we miss cloud-provider-azure updates for the duplicated code:

  • The authentication logic using AAD certificates, not to mention the security concerns implied by outdated code in this category.
  • The config file format (i.e., azure.json) for the autoscaler has diverged from the format expected by other components using cloud-provider-azure. For example, the field VmssCacheTTL in the autoscaler's duplicated code is named VmssCacheTTLInSeconds in cloud-provider-azure. This causes confusion and defeats the purpose of the reusable Azure configuration file that cloud-provider-azure is trying to achieve.

On the other way around:

Furthermore, we see no need to have a customized version of cloud-provider-azure library for Cluster Autoscaler. Thus, delegating it back is justified.

This attempt, however, will not be perfect, given that we have yet to finish deprecating the existing behavior differences (e.g., config file format, CLI creds support).

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

azure_config.go and azure_client.go are the core changes of this PR. Changes in other files are mostly consequences of those.

Does this PR introduce a user-facing change?

- Azure: from now on, users should refer to https://cloud-provider-azure.sigs.k8s.io/install/configs/ for the configuration interface
- Azure: fixed an issue where environment variables were not being passed in when a config file exists
- Azure: fixed an issue where some cloud provider configurations were not being validated when UseManagedIdentityExtension is set to true
- Azure: renamed several config file fields; the old names are still accepted and take precedence: `useWorkloadIdentityExtension` to `useFederatedWorkloadIdentityExtension`, `vmssCacheTTL` to `vmssCacheTTLInSeconds`, `vmssVmsCacheTTL` to `vmssVirtualMachinesCacheTTLInSeconds`, `enableVmssFlex` to `enableVmssFlexNodes`
- Azure: renamed several environment variables; the old names are still accepted and take precedence: `ARM_USE_MANAGED_IDENTITY_EXTENSION` to `ARM_USE_FEDERATED_WORKLOAD_IDENTITY_EXTENSION`, `AZURE_VMSS_CACHE_TTL` to `AZURE_VMSS_CACHE_TTL_IN_SECONDS`, `AZURE_VMSS_VMS_CACHE_TTL` to `AZURE_VMSS_VMS_CACHE_TTL_IN_SECONDS`, `AZURE_ENABLE_VMSS_FLEX` to `AZURE_ENABLE_VMSS_FLEX_NODES`

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 20, 2024
@k8s-ci-robot
Contributor

Welcome @comtalyst!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 20, 2024
@k8s-ci-robot
Contributor

Hi @comtalyst. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 20, 2024
@k8s-ci-robot k8s-ci-robot requested a review from nilo19 June 20, 2024 00:46
@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Jun 20, 2024
)

// DeploymentsClient defines needed functions for azure network.DeploymentsClient.
Contributor Author

Delegated it back to the one provided by cloud-provider-azure. I don't see anything breaking so far, but I haven't tested it yet.

interfacesClient interfaceclient.Interface
disksClient diskclient.Interface
storageAccountsClient storageaccountclient.Interface
skuClient compute.ResourceSkusClient
}

// newServicePrincipalTokenFromCredentials creates a new ServicePrincipalToken using values of the
// passed credentials map.
func newServicePrincipalTokenFromCredentials(config *Config, env *azure.Environment) (*adal.ServicePrincipalToken, error) {
Contributor Author

Delegated back to cloud-provider-azure.


return nil, fmt.Errorf("no credentials provided for AAD application %s", config.AADClientID)
}

func newAuthorizer(config *Config, env *azure.Environment) (autorest.Authorizer, error) {
switch config.AuthMethod {
case authMethodCLI:
return auth.NewAuthorizerFromCLI()
Contributor Author

This CLI authentication is the only real blocker for a near-perfect de-forking of the cloud-provider-azure usage here. Its deprecation (or, alternatively, supporting it in cloud-provider-azure) could be something up for discussion.

Contributor

Let's propose supporting it in cloud-provider-azure?

"sigs.k8s.io/cloud-provider-azure/pkg/retry"
)

const (
vmTypeVMSS = "vmss"
Contributor Author

Delegated back to the same constants provided by cloud-provider-azure.

SubscriptionID string `json:"subscriptionId" yaml:"subscriptionId"`
ResourceGroup string `json:"resourceGroup" yaml:"resourceGroup"`
VMType string `json:"vmType" yaml:"vmType"`
providerazure.Config `json:",inline" yaml:",inline"`
Contributor Author

Delegated most of the config back to cloud-provider-azure.

However, this, along with other user-facing interfaces that got delegated back to cloud-provider-azure, will be more hideous from the user. For example, if the VM type string "vmss" changes to "vmscalesets" in a newer version of cloud-provider-azure, the change won't be visible to a user who might not be aware that cloud-provider-azure is used.

Good unit test coverage would definitely still help. But it might be difficult even for the maintainers to keep track of the true size of the interface exposed by the library (e.g., through this Config and the cloud-config-file, which will be unmarshalled into this struct).

Or maybe it is by design that the user is supposed to know about this module and how to configure it? If so, is the intent that maintainers do not have to worry about those interfaces and can treat them as black boxes?

Anyway, in this case (and likely in most cases in practice), although not ideal, I think it is worth delegating it back, given that changes are really rare, the interface is small enough to cover properly and to know what to care about, and there should be a proper deprecation procedure if something changes.
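
To illustrate the "superset" shape discussed in this thread, here is a minimal sketch of how embedding a shared config struct flattens its fields into the outer struct's JSON form. The types UpstreamConfig and AutoscalerConfig and the function parseConfig are hypothetical stand-ins, not the actual cloud-provider-azure or Cluster Autoscaler types; only the field names tenantId, vmssCacheTTLInSeconds and vmType are borrowed for flavor.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UpstreamConfig is a hypothetical stand-in for the shared library's config
// struct, trimmed to two fields for illustration.
type UpstreamConfig struct {
	TenantID              string `json:"tenantId"`
	VmssCacheTTLInSeconds int    `json:"vmssCacheTTLInSeconds"`
}

// AutoscalerConfig embeds UpstreamConfig. Because the embedded field has no
// JSON name, encoding/json promotes its fields to the top level, so the
// serialized form is a superset of the upstream format.
type AutoscalerConfig struct {
	UpstreamConfig `json:",inline"`
	VMType         string `json:"vmType"`
}

// parseConfig unmarshals a single JSON document into the combined struct.
func parseConfig(data []byte) (*AutoscalerConfig, error) {
	var cfg AutoscalerConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("failed to unmarshal config: %w", err)
	}
	return &cfg, nil
}

func main() {
	raw := []byte(`{"tenantId":"t-1","vmssCacheTTLInSeconds":60,"vmType":"vmss"}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.TenantID, cfg.VmssCacheTTLInSeconds, cfg.VMType)
}
```

The trade-off raised above is visible here: any field added to or renamed in the embedded struct silently becomes part of the outer struct's accepted input.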

Contributor

"more hideous" -> "more hidden / less visible / less discoverable"? (And I don't think this is a major issue; can be addressed by documentation.)

Contributor

It may also be worth capturing in some way whatever guarantees are intended regarding compatibility of this config with cloud-provider-azure config serialization. Though answer is probably just that this one is a superset?

Contributor Author

@comtalyst comtalyst Jun 22, 2024

"more hideous" -> "more hidden / less visible / less discoverable"?

Yes. And also less aware of changes.

can be addressed by documentation

What would you imagine the documentation looking like?
Something like A) "we are using cloud-provider-azure, go see their repo for how to use it", or B) "here are the configs", and then 'fork' the docs from cloud-provider-azure?
A) would be easier to maintain, but puts an additional burden on the user. B) would be like maintaining a fork for us.

If you ask me, I would prefer A). In practice it wouldn't hurt the experience much, given that it seems to always be only one layer deep.

(also discussed with cloud-provider-azure maintainers--we are allowed to import the config struct)

Contributor

A) makes sense to me, since cloud-provider-azure should be keeping their docs up to date (doesn't make sense for us to duplicate the work here).

@@ -42,27 +43,9 @@ const (

imdsServerURL = "http://169.254.169.254"

// backoff
backoffRetriesDefault = 6
Contributor Author

Delegated back to cloud-provider-azure

Contributor

Have we confirmed that the cloud-provider-azure defaults are the same as those listed here?

Contributor Author

Yes, I did.
I am also considering adding unit tests to guarantee them in the long run, but that may not be necessary if we rely on users to check cloud-provider-azure directly.

if _, err = assignFromEnvIfExists(&cfg.UserAssignedIdentityID, "ARM_USER_ASSIGNED_IDENTITY_ID"); err != nil {
return nil, err
}
if _, err = assignIntFromEnvIfExists(&cfg.VmssCacheTTLInSeconds, "AZURE_VMSS_CACHE_TTL"); err != nil {
Contributor Author

@comtalyst comtalyst Jun 20, 2024

This is one of the fields that differ in naming (the cloud-provider-azure version has "InSeconds"), although we keep the environment variables the same so as not to introduce breaking changes. Not that the environment variable name needs to be changed either, given the minimal change. Changed my mind: I introduced the new name alongside the old one.
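
As a rough sketch of what "introducing the new name alongside the old one" can look like, the helper below reads an integer from either of two environment variable names, with the legacy name winning when both are set (the precedence stated in the release notes). The function intFromEnv and the lookupFunc type are hypothetical and are not the assignIntFromEnvIfExists helper from the diff; treating an empty-but-set variable as "set" is also an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc matches the signature of os.LookupEnv so that callers (and tests)
// can inject a fake environment.
type lookupFunc func(key string) (string, bool)

// intFromEnv reads an integer setting that may be supplied under a legacy or
// a new variable name. The legacy name is consulted first, so it takes
// precedence when both are set. found is false when neither name is set.
func intFromEnv(lookup lookupFunc, legacyName, newName string) (value int64, found bool, err error) {
	for _, name := range []string{legacyName, newName} {
		raw, ok := lookup(name)
		if !ok {
			continue
		}
		value, err = strconv.ParseInt(raw, 10, 64)
		if err != nil {
			return 0, false, fmt.Errorf("failed to parse %s=%q as an integer: %w", name, raw, err)
		}
		return value, true, nil
	}
	return 0, false, nil
}

func main() {
	// Only the new name is set here, so its value is used unless the legacy
	// variable happens to be present in the real environment.
	os.Setenv("AZURE_VMSS_CACHE_TTL_IN_SECONDS", "120")
	ttl, found, err := intFromEnv(os.LookupEnv, "AZURE_VMSS_CACHE_TTL", "AZURE_VMSS_CACHE_TTL_IN_SECONDS")
	fmt.Println(ttl, found, err)
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the precedence rule testable without mutating the process environment.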

case authMethodCLI:
// Nothing to check at the moment.
default:
return fmt.Errorf("unsupported authorization method: %s", cfg.AuthMethod)
}

if cfg.CloudProviderBackoff && cfg.CloudProviderBackoffRetries == 0 {
Contributor Author

This used to be unintentionally(?) skipped when UseManagedIdentityExtension is true, which should be completely unrelated. The current changes should fix this bug.
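
The class of bug described here can be avoided by keeping unrelated checks in separate functions, so that an early return in one cannot skip another. The following is only a sketch under assumptions: the Config struct is a trimmed-down hypothetical, the error messages are invented, and treating "backoff enabled with zero retries" as an error is an assumption rather than the behavior of the actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Config is a hypothetical, trimmed-down stand-in for the autoscaler's Azure
// config; only the fields needed for this illustration are included.
type Config struct {
	UseManagedIdentityExtension bool
	AADClientID                 string
	CloudProviderBackoff        bool
	CloudProviderBackoffRetries int
}

// validateAuth checks only credential-related fields.
func validateAuth(cfg *Config) error {
	if cfg.UseManagedIdentityExtension {
		// Managed identity needs no client ID; returning early here only
		// ends the auth checks, nothing else.
		return nil
	}
	if cfg.AADClientID == "" {
		return errors.New("AAD client ID not set")
	}
	return nil
}

// validateBackoff checks only backoff-related fields.
func validateBackoff(cfg *Config) error {
	if cfg.CloudProviderBackoff && cfg.CloudProviderBackoffRetries == 0 {
		return errors.New("backoff is enabled but the retry count is not set")
	}
	return nil
}

// validate runs every group of checks regardless of the auth method.
func validate(cfg *Config) error {
	if err := validateAuth(cfg); err != nil {
		return fmt.Errorf("invalid auth config: %w", err)
	}
	if err := validateBackoff(cfg); err != nil {
		return fmt.Errorf("invalid backoff config: %w", err)
	}
	return nil
}

func main() {
	cfg := &Config{UseManagedIdentityExtension: true, CloudProviderBackoff: true}
	fmt.Println(validate(cfg))
}
```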

@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from d07dbf2 to 24d4667 Compare June 20, 2024 20:26
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 20, 2024
Contributor

@tallaxes tallaxes left a comment

I really like where this is going, don't see any blockers (though would need another round of close review). Would be good to find a way to prove there are no regressions - to what extent do you think the tests cover this?

@comtalyst
Contributor Author

I really like where this is going, don't see any blockers (though would need another round of close review). Would be good to find a way to prove there are no regressions - to what extent do you think the tests cover this?

For config changes, the existing unit tests could serve as proof that the delegations are not breaking interface-wise. More unit tests capturing more corner cases will be added as well. I have confidence in those so far.

For the service principal token, I think the E2Es are currently the best bet. Some manual long-running tests might help as well.
Looking at their code, I don't see any real differences from ours at the moment. Not that my eyes should be trusted that much, though.

@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from db0fb68 to 1b54748 Compare July 17, 2024 21:53
@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from 1b54748 to ecd0123 Compare August 3, 2024 00:17
@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from ecd0123 to 2b40ce5 Compare August 10, 2024 00:55
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2024
@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from 2b40ce5 to 7e9172d Compare August 10, 2024 00:57
@comtalyst
Contributor Author

/test pull-cluster-autoscaler-e2e-azure

@k8s-ci-robot
Contributor

@comtalyst: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test pull-cluster-autoscaler-e2e-azure


@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from 7e9172d to 4830b17 Compare August 16, 2024 17:53
@comtalyst comtalyst marked this pull request as ready for review August 16, 2024 17:57
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 16, 2024
@comtalyst
Contributor Author

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 16, 2024
@k8s-ci-robot k8s-ci-robot requested a review from x13n August 16, 2024 17:58
@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from 4830b17 to 811f93c Compare August 16, 2024 18:13
Contributor

@tallaxes tallaxes left a comment

Looks great, mostly minor quips/questions.

May also need to check if updates/renames to config/environment variables warrant any updates to readme and/or charts (could be a follow-up).

cluster-autoscaler/cloudprovider/azure/azure_agent_pool.go (outdated, resolved)
Comment on lines +386 to 388
if cfg.VMType != providerazureconsts.VMTypeStandard && cfg.VMType != providerazureconsts.VMTypeVMSS {
return fmt.Errorf("unsupported VM type: %s", cfg.VMType)
}
Contributor

@tallaxes tallaxes Aug 16, 2024

The comment on VMType still says vmssflex value is supported. Also need to think through migration, if that's indeed how it was intended to be set, and is changing now (if anybody is using it to begin with)

Contributor Author

func (m *AzureManager) buildNodeGroupFromSpec(spec string) (cloudprovider.NodeGroup, error) {
	scaleToZeroSupported := scaleToZeroSupportedStandard
	if strings.EqualFold(m.config.VMType, providerazureconsts.VMTypeVMSS) {
		scaleToZeroSupported = scaleToZeroSupportedVMSS
	}
	s, err := dynamic.SpecFromString(spec, scaleToZeroSupported)
	if err != nil {
		return nil, fmt.Errorf("failed to parse node group spec: %v", err)
	}
	vmsPoolSet := m.azureCache.getVMsPoolSet()
	if _, ok := vmsPoolSet[s.Name]; ok {
		return NewVMsPool(s, m), nil
	}

	switch m.config.VMType {
	case providerazureconsts.VMTypeStandard:
		return NewAgentPool(s, m)
	case providerazureconsts.VMTypeVMSS:
		return NewScaleSet(s, m, -1, false)
	default:
		return nil, fmt.Errorf("vmtype %s not supported", m.config.VMType)
	}
}

From this (which has always been here), VM types other than these two will effectively disable CAS already, given that this method is critical. I don't think anyone can use that right now.
I think it is fair to "notify users" through a failed initialization like this, given that the config is cluster-level.

As for our support of flex, I think we do have it (that EnableVMSSFlexNodes?), but not through this. This sounds like something to revisit.

Contributor

Any more tests we could/should add here (e.g., to test the changes in the override logic, such as legacy overrides)? Or are these covered in the azure_manager_tests?

Contributor Author

Those are already covered in azure_manager_test.go; I think the coverage is quite decent now.

Contributor

There is no suitable mock in the cloud provider?

Contributor Author

Haven't looked at that yet. But, given the low traffic in this area (the deprecated VMAS pool type), I don't think we should put effort into it now.

cluster-autoscaler/cloudprovider/azure/azure_util.go (outdated, resolved)
@comtalyst
Contributor Author

comtalyst commented Aug 16, 2024

May also need to check if updates/renames to config/environment variables warrant any updates to readme and/or charts (could be a follow-up).

As discussed, I believe we should delegate all of those to the cloud-provider-azure docs. We could sweep through the readmes if there are any left to begin with.
For now, I left a note in the release notes (in the PR description).

Also, I kept backward compatibility for those who want to use the old names for the config fields and environment variables. That is included in the release notes as well.

@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from 811f93c to 2d55b65 Compare August 17, 2024 01:43
@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch 2 times, most recently from 2849664 to dfac3bb Compare August 17, 2024 02:43
Contributor

@tallaxes tallaxes left a comment

LGTM but for error checking bug

@@ -258,9 +267,9 @@ func (as *AgentPool) deleteOutdatedDeployments() (err error) {
 errList := make([]error, 0)
 for _, deployment := range toBeDeleted {
 klog.V(4).Infof("deleteOutdatedDeployments: starts deleting outdated deployment (%s)", *deployment.Name)
-_, err := as.manager.azClient.deploymentsClient.Delete(ctx, as.manager.config.ResourceGroup, *deployment.Name)
+rerr := as.manager.azClient.deploymentClient.Delete(ctx, as.manager.config.ResourceGroup, *deployment.Name)
 if err != nil {
Contributor

Should be checking rerr

Contributor Author

Done, and it looks like this is one of the last issues of this kind.
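
The bug flagged in this thread is easy to reproduce in miniature: once the client call returns a wrapper value (rerr) instead of a plain error, a leftover check on an outer err silently stops detecting failures. Below is a minimal sketch of the corrected loop. The retryError type, its asError method, the deleteFunc signature and errConflict are all hypothetical stand-ins and do not reproduce the cloud-provider-azure client API.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a sample failure used by the demo below.
var errConflict = errors.New("conflict")

// retryError is a hypothetical client-specific error wrapper. It is not an
// `error` itself, so it has to be checked and converted explicitly.
type retryError struct {
	RawError error
}

// asError converts the wrapper into a standard error (nil-safe).
func (e *retryError) asError() error {
	if e == nil {
		return nil
	}
	return e.RawError
}

// deleteFunc is a hypothetical delete call that returns the wrapper type.
type deleteFunc func(name string) *retryError

// deleteAll attempts every deletion and collects the failures. The check is
// on rerr, the value returned by this iteration's call, not on any outer err.
func deleteAll(names []string, del deleteFunc) []error {
	errList := make([]error, 0)
	for _, name := range names {
		rerr := del(name)
		if rerr != nil {
			errList = append(errList, fmt.Errorf("failed to delete %s: %w", name, rerr.asError()))
		}
	}
	return errList
}

func main() {
	del := func(name string) *retryError {
		if name == "bad" {
			return &retryError{RawError: errConflict}
		}
		return nil
	}
	for _, err := range deleteAll([]string{"ok", "bad"}, del) {
		fmt.Println(err)
	}
}
```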

@comtalyst comtalyst force-pushed the comtalyst/cas-azure-defork-cloud-provider-azure branch from 52fe2da to 4e12429 Compare August 19, 2024 17:31
Contributor

@tallaxes tallaxes left a comment

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 20, 2024
@tallaxes
Contributor

/approve

@comtalyst
Contributor Author

comtalyst commented Aug 23, 2024

/assign @towca
There are some module bumps in go.mod from this, thanks.

@towca
Collaborator

towca commented Aug 26, 2024

CA go.mod changes LGTM

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: comtalyst, tallaxes, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 26, 2024
@k8s-ci-robot k8s-ci-robot merged commit 527de12 into kubernetes:master Aug 26, 2024
7 checks passed
@comtalyst
Contributor Author

/cherry-pick cluster-autoscaler-release-1.30

@k8s-infra-cherrypick-robot

@comtalyst: #6947 failed to apply on top of branch "cluster-autoscaler-release-1.30":

Applying: chore: defork cloud-provider-azure
Using index info to reconstruct a base tree...
M	cluster-autoscaler/cloudprovider/azure/azure_manager.go
M	cluster-autoscaler/cloudprovider/azure/azure_manager_test.go
M	cluster-autoscaler/go.mod
M	cluster-autoscaler/go.sum
Falling back to patching base and 3-way merge...
Auto-merging cluster-autoscaler/go.sum
CONFLICT (content): Merge conflict in cluster-autoscaler/go.sum
Auto-merging cluster-autoscaler/go.mod
CONFLICT (content): Merge conflict in cluster-autoscaler/go.mod
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_manager_test.go
Auto-merging cluster-autoscaler/cloudprovider/azure/azure_manager.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 chore: defork cloud-provider-azure
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick cluster-autoscaler-release-1.30


Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants