Fix create nodepool azure command #1118

alvaroaleman · 2022-03-08T17:58:20Z

This was never fully implemented. In order to fix it, the bootImageId
was moved from the nodepool to the cluster, because it is unique per
cluster and never changes. Otherwise if there is no nodepool, we are
unable to find it.


openshift-ci · 2022-03-08T17:59:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [alvaroaleman]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2022-03-08T17:59:43Z

✔️ Deploy Preview for hypershift-docs ready!

🔨 Explore the source changes: 37f6899

🔍 Inspect the deploy log: https://app.netlify.com/sites/hypershift-docs/deploys/6228ea59c1057d0008d0cab3

😎 Browse the preview: https://deploy-preview-1118--hypershift-docs.netlify.app/reference/api

enxebre · 2022-03-08T18:26:50Z

> the bootImageId
> was moved from the nodepool to the cluster, because it is unique per
> cluster and never changes.

The default boot image will vary depending on the OCP release chosen for each NodePool. Otherwise it might even break the OS upgrade compatibility eventually for a new NodePool that starts with a very old boot image.
It should also be a NodePool property, since it is the user's choice to override the default with something else.

We should keep it in NodePools and find it by default via release info if not set. Is there anything preventing us from doing the same as AWS, as in the links below?

hypershift/hypershift-operator/controllers/nodepool/nodepool_controller.go

Lines 907 to 913 in ca08c87

    
    func getAMI(nodePool *hyperv1.NodePool, region string, releaseImage *releaseinfo.ReleaseImage) (string, error) {
    	if nodePool.Spec.Platform.AWS.AMI != "" {
    		return nodePool.Spec.Platform.AWS.AMI, nil
    	}
    	return defaultNodePoolAMI(region, releaseImage)
    }

https://github.com/openshift/hypershift/blob/main/support/releaseinfo/fixtures/4.10-installer-coreos-bootimages.yaml#L399

alvaroaleman · 2022-03-08T18:29:30Z

> The default boot image will vary depending on the OCP release chosen for each NodePool. Otherwise it might even break the OS upgrade compatibility eventually for a new NodePool that starts with a very old boot image.

No it does not. It gets created once and never updated. The rhcos version the node gets then updated to is what changes. This matches what the installer does: https://github.com/openshift/installer/blob/8fca1ade5b096d9b2cd312c4599881d099439288/data/data/azure/vnet/main.tf#L94

> We should keep it in NodePools and find it by default via release info if not set. Is there anything preventing us from doing the same as AWS, as in the links below?

Yes: there is no public image, only a public blob which can be used to back an image (I believe even for that it needs to be copied first, but I haven't verified that).

enxebre · 2022-03-08T19:33:45Z

> No it does not. It gets created once and never updated. The rhcos version the node gets then updated to is what changes. This matches what the installer does: https://github.com/openshift/installer/blob/8fca1ade5b096d9b2cd312c4599881d099439288/data/data/azure/vnet/main.tf#L94

Ok, so the installer infers the boot image from
https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.json#L407
https://github.com/openshift/installer/blob/master/data/data/coreos/rhcos.json#L747

It does so via https://github.com/openshift/installer/blob/a7c5c03058db0e20c58a6f9627157d2116968290/pkg/asset/rhcos/image.go#L128-L140

Then it copies it to https://github.com/openshift/installer/blob/master/pkg/asset/machines/azure/machines.go#L118

> Yes: there is no public image, only a public blob which can be used to back an image (I believe even for that it needs to be copied first, but I haven't verified that).

That's the same problem whether you expose it in NodePool or in HostedCluster (say we copy the image once and always point to the same source from all NodePools), right?

I still think that ideally a brand new NodePool shouldn't come up with the original HostedCluster boot image, but should rather discover the default boot image belonging to the .releaseImage of that NodePool, though I agree we wouldn't want to manage copying each image.
Even ignoring the above, the boot images might be arch-specific for each NodePool, so the boot image would ultimately need to be dictated by the NodePool. I think HostedCluster as a single source of truth would not be enough from an API point of view. Thoughts?

@patrickdillon @cgwalters Can you refresh my memory: is there still any plan in installer/mco to manage the lifecycle of boot images somehow (openshift/enhancements#201)?

alvaroaleman · 2022-03-08T19:49:57Z

> Even ignoring the above, the boot images might be arch-specific for each NodePool, so the boot image would ultimately need to be dictated by the NodePool. I think HostedCluster as a single source of truth would not be enough from an API point of view. Thoughts?

A cluster might have zero pools, at which point we are stuck, because we cannot create a new nodepool without an existing nodepool. This is why I moved it to the HostedCluster.

If we end up needing arch-specific images, we will need to have the create infra command create all of them and extend the API to have an image per arch, rather than one image for everyone. But that is IMHO a problem we should solve once we actually add multi-arch support.

cgwalters · 2022-03-08T20:06:44Z

> @patrickdillon @cgwalters Can you refresh my memory: is there still any plan in installer/mco to manage the lifecycle of boot images somehow (openshift/enhancements#201)?

I think we still want to execute on that, and it may actually turn out to be a near-hard dependency of openshift/enhancements#1032

But we landed openshift/installer#4760 which was definitely intended to be used by hypershift - is that not sufficient?

enxebre · 2022-03-09T06:29:55Z

> A cluster might have zero pools, at which point we are stuck, because we cannot create a new nodepool without an existing nodepool. This is why I moved it to the HostedCluster.

I'm not sure what you mean by "we are stuck". The NodePool API lets you create a NodePool at any time by specifying its boot image as input. I think we should preserve the ability for consumers to specify a different image per NodePool and keep the decoupling (multi-arch, fresh boot image, testing alpha images, consumer custom images...). NodePools are by definition heterogeneous.

Now, how to automate discovering the image and copying it to a known location to be consumed as API input, and how to provide the best UX in the CLI, is a separate discussion that should neither impact nor drive our API design.
For creating a NodePool through the CLI after the initial creation command: since we know the rg and clusterID from the HostedCluster, can't we assume a well-known ID and feed it to the API, as in https://github.com/openshift/installer/blob/8fca1ade5b096d9b2cd312c4599881d099439288/pkg/asset/machines/azure/machines.go#L118?

We can also keep it pretty much as it is and let HostedCluster be the fallback if the boot image is not specified in the NodePool (now, or add this in the future), which makes sense from an API point of view and incidentally solves the CLI automation problem.

Alternatively, I'm curious how expensive it would be for the NodePool to own discovering the boot image from the payload and copying it?


alvaroaleman · 2022-03-09T17:58:28Z

@enxebre updated to calculate the bootImage if unset. That is a good idea; I hadn't thought about that.

cgwalters · 2022-03-09T18:33:36Z

cmd/install/assets/hypershift-operator/hypershift.openshift.io_nodepools.yaml

@@ -423,11 +423,13 @@ spec:
                        minimum: 16
                        type: integer
                      imageID:
+                        description: 'ImageID is the id of the image to boot from.
+                          If unset, the default image at the location below will be


Having a hardcoded default seems like a long term bad idea. I think we should error out if we can't find the right image instead.

alvaroaleman · 2022-03-09

There are no public Azure images, only public Azure image blobs, which need to be copied and referenced from an image in order to be usable. What we construct here is that reference. If we didn't do this, we would have to store the location somewhere even when we have no nodepools, which would only leave the cluster. That doesn't really seem appropriate and might cause issues if we need more images in the future, for example because of a different arch or because we have a Windows node.
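As a rough sketch of the behaviour described here (use the NodePool's imageID when set, otherwise construct a reference to an image in the cluster's resource group), something like the following could work. The function name, parameters, and the image name used in `main` are illustrative assumptions, not the code in this PR; only the general Azure Resource Manager ID layout for a managed image is standard.

```go
package main

import "fmt"

// resolveBootImageID returns the image a NodePool should boot from.
// An explicit per-NodePool image ID wins; otherwise a reference is built
// to an image assumed to have been created in the cluster's resource group.
// All names here are hypothetical and chosen for illustration.
func resolveBootImageID(nodePoolImageID, subscriptionID, resourceGroup, imageName string) string {
	if nodePoolImageID != "" {
		return nodePoolImageID
	}
	// Azure Resource Manager ID layout for a managed image.
	return fmt.Sprintf(
		"/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/images/%s",
		subscriptionID, resourceGroup, imageName,
	)
}

func main() {
	// No override on the NodePool: the default reference is constructed.
	fmt.Println(resolveBootImageID("", "sub-123", "my-cluster-rg", "example-rhcos-image"))

	// Override on the NodePool: it is returned unchanged.
	custom := "/subscriptions/sub-123/resourceGroups/other-rg/providers/Microsoft.Compute/images/custom"
	fmt.Println(resolveBootImageID(custom, "sub-123", "my-cluster-rg", "example-rhcos-image"))
}
```

Nothing in this sketch verifies that the referenced image exists; it only builds the ID.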

enxebre · 2022-03-09

> Having a hardcoded default seems like a long term bad idea. I think we should error out if we can't find the right image instead.

We default the backend for a zero-friction happy path while still enabling other required scenarios by exposing input at the API.
In any case, if the targeted image can't be found, the capi Machine will fail and the error will be bubbled up as a NodePool.status.condition (eventually; a PR is in flight in capi upstream for bubbling up individual Machine errors to MachineDeployments). Alternatively, we could explicitly check that the image exists from our controller and fail early, but we intend to keep our controller as slim and declarative as possible while delegating logic and implementation to capi, so letting the Machine fail and signalling the error back seems reasonable to me.
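To make the "let the Machine fail and signal the error back" idea concrete, here is a minimal sketch of summarising per-Machine failures into a single condition. The `machine` and `condition` types, the condition type, and the reason strings are simplified, hypothetical stand-ins and do not reproduce the capi or HyperShift APIs.

```go
package main

import (
	"fmt"
	"strings"
)

// machine is a hypothetical, simplified view of a capi Machine's status.
type machine struct {
	Name           string
	FailureMessage string // empty when the Machine has not failed
}

// condition is a hypothetical, simplified status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// machinesCondition aggregates Machine failures (for example, a boot image
// that cannot be found) into one condition that a NodePool could report.
func machinesCondition(machines []machine) condition {
	var failures []string
	for _, m := range machines {
		if m.FailureMessage != "" {
			failures = append(failures, fmt.Sprintf("%s: %s", m.Name, m.FailureMessage))
		}
	}
	if len(failures) == 0 {
		return condition{Type: "MachinesHealthy", Status: "True", Reason: "AsExpected"}
	}
	return condition{
		Type:    "MachinesHealthy",
		Status:  "False",
		Reason:  "MachineFailed",
		Message: strings.Join(failures, "; "),
	}
}

func main() {
	c := machinesCondition([]machine{
		{Name: "pool-a-1"},
		{Name: "pool-a-2", FailureMessage: "boot image not found"},
	})
	fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```

With this approach the controller stays declarative: it never queries Azure for the image, and the failure only becomes visible once a Machine has actually tried to boot.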

enxebre · 2022-03-09T19:04:58Z

Thanks!
/lgtm

Let's get it into the merge pool; we can always follow up if Colin or anyone else has more feedback.

openshift-bot · 2022-03-09T19:07:26Z

/retest-required

Please review the full test history for this PR and help us cut down flakes.

openshift-ci · 2022-03-09T20:56:45Z

@alvaroaleman: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci bot requested review from enxebre and ironcladlou March 8, 2022 17:59

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 8, 2022

alvaroaleman force-pushed the fix-nodepool-create branch 2 times, most recently from 5224235 to 8c6d4f6 Compare March 8, 2022 18:26

Fix create nodepool azure command

37f6899

This was never fully implemented. In order to fix it, the bootImageId was moved from the nodepool to the cluster, because it is unique per cluster and never changes. Otherwise if there is no nodepool, we are unable to find it.

alvaroaleman force-pushed the fix-nodepool-create branch from 8c6d4f6 to 37f6899 Compare March 9, 2022 17:56

cgwalters reviewed Mar 9, 2022


openshift-ci bot assigned enxebre Mar 9, 2022

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 9, 2022

openshift-merge-robot merged commit cad2fc1 into openshift:main Mar 9, 2022
