
Improve vGPU allocation #11399

Open · wants to merge 6 commits into master
Conversation

@torchiaf (Member) commented Jul 9, 2024

Summary

Related issue harvester/harvester#5774
Blocked by harvester#1154

Occurred changes and/or fixed issues

Technical notes summary

  • Improve tracking of device allocation: harvester/harvester#6096 introduced the new /v1/harvester/cluster endpoint to expose available Harvester resources; we now use that information to get the allocatable number for each vGPU device.

  • Fix the vGPU validation: we now count nodePools × machineCount to check whether there are enough vGPU devices to assign to each machine.

  • In legacy Harvester versions the /v1/harvester/cluster endpoint is not available, so the UI cannot calculate the allocatable label. We leave it empty (see screenshots) and add notes in the documentation telling users to upgrade Harvester 1.3.x to 1.3.2.

  • 965287d - Based on an offline discussion with @a110605 and @ibrokethecloud, we are removing the vGPU profile id from option labels, since it is meaningless from the user's perspective.

    • The vGPU option key in the 'vGPUs' dropdown element is now the vGPU type.
    • The request payload still requires the vGPU profile id (Name) along with the vGPU type (deviceName); the first profile id for each vGPU type is used when assigning vGPUs in the request.

    [screenshot]

    • This means Harvester clusters will always use the same vGPU profile when assigning new vGPUs.
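A minimal sketch of the profile-selection behavior described above, assuming an illustrative device-map shape (`vGpuDevices` and `buildVGpuRequests` here are hypothetical, not the component's actual state or method):

```javascript
// Sketch: the payload uses the *first* profile id (Name) found for each
// selected vGPU type, as described above. The device map is illustrative.
const vGpuDevices = {
  'vgpu-test-1-000008004': { id: 'vgpu-test-1-000008004', type: 'nvidia.com/NVIDIA_A2-2Q' },
  'vgpu-test-1-000008005': { id: 'vgpu-test-1-000008005', type: 'nvidia.com/NVIDIA_A2-2Q' },
};

function buildVGpuRequests(selectedTypes, devices) {
  return selectedTypes
    .filter((t) => t) // drop empty selections
    .map((deviceName) => ({
      // First device id found for this type; '' when none exists.
      name: Object.values(devices).find((d) => d.type === deviceName)?.id || '',
      deviceName,
    }));
}

const requests = buildVGpuRequests(['nvidia.com/NVIDIA_A2-2Q'], vGpuDevices);
```

Since both devices share the same type, the request always carries the first device's id as the profile name, which is why the cluster keeps reusing the same vGPU profile.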

How to test

Create a new Harvester cluster

  1. Go to Cluster Management
  2. Create a new Harvester cluster
  3. Click on 'Advanced'
  4. Add 1 or more vGPUs

Create another Harvester Cluster

  • Repeat steps 1-4 above.
  • Check that the available vGPUs have changed. The UI should exclude the vGPUs already allocated to the first cluster, based on the allocatable number (the backend should update this number after cluster creation).

Edit a Harvester Cluster

  • Click on Edit Config to edit an existing Harvester cluster.
  • Click on the vGPU dropdown. The UI should show only the available vGPUs plus the ones already assigned to the cluster, based on the allocatable number (the backend should update this number after saving).
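The edit-mode filtering can be sketched as follows (the helper name and type names other than NVIDIA A2-2Q are hypothetical, for illustration only):

```javascript
// Sketch: when editing a cluster, a vGPU type stays visible in the
// dropdown if it still has allocatable devices OR it is already
// assigned to this cluster.
function visibleVGpuTypes(allTypes, assignedTypes) {
  return allTypes.filter((t) => t.allocatable > 0 || assignedTypes.includes(t.type));
}

const types = [
  { type: 'nvidia.com/NVIDIA_A2-2Q', allocatable: 0 }, // exhausted, but assigned here
  { type: 'nvidia.com/NVIDIA_A2-1Q', allocatable: 2 }, // still available
  { type: 'nvidia.com/NVIDIA_A2-4Q', allocatable: 0 }, // exhausted, not assigned
];
const visible = visibleVGpuTypes(types, ['nvidia.com/NVIDIA_A2-2Q']);
```

The exhausted-but-assigned type remains selectable, matching the "available vGPUs plus the ones already assigned" behavior described above.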

Create multiple nodes Clusters

  • Repeat the steps above, assigning more than one node pool to the new cluster.
  • Repeat the steps above, assigning more than one machine to each node pool.
    • For each vGPU, a validation error should appear if there are not enough devices for nodePools × machineCount.
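The validation described above can be sketched as follows (illustrative names, assuming the "nodePools × machineCount" check amounts to summing machine counts across pools):

```javascript
// Sketch: a vGPU selection is valid only if the allocatable count for
// its type covers every machine, i.e. the sum of machineCount over all
// node pools.
function hasEnoughVGpus(allocatable, nodePools) {
  const machines = nodePools.reduce((sum, pool) => sum + pool.machineCount, 0);
  return allocatable >= machines;
}

// 2 node pools with 2 machines each need 4 devices of the selected type.
const pools = [{ machineCount: 2 }, { machineCount: 2 }];
```

With 4 allocatable devices the selection passes; with 3 it should trigger the validation error.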

Areas or cases that should be tested

Harvester clusters, create/edit page.

Areas which could experience regressions

Screenshot/Video

Harvester 1.3.2

image

Harvester legacy versions

image

Validation

image

Checklist

  • The PR is linked to an issue and the linked issue has a Milestone, or no issue is needed
  • The PR has a Milestone
  • The PR template has been filled out
  • The PR has been self reviewed
  • The PR has a reviewer assigned
  • The PR has automated tests or clear instructions for manual tests and the linked issue has appropriate QA labels, or tests are not needed
  • The PR has reviewed with UX and tested in light and dark mode, or there are no UX changes

@torchiaf torchiaf added this to the v2.10.0 milestone Jul 9, 2024
@torchiaf torchiaf changed the title from "Improve vGPU allocatable mechanism checks" to "Improve vGPU allocatable mechanism" Jul 9, 2024
@torchiaf torchiaf changed the title from "Improve vGPU allocatable mechanism" to "Improve vGPU allocation" Jul 9, 2024
@a110605 (Contributor) commented Aug 1, 2024

We need a deployment with /v1/harvester/cluster endpoint to test this PR change.

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
@a110605 (Contributor) commented Aug 6, 2024

After creating a 3-node Harvester cluster:

  • Create a new cluster: /v1/harvester/clusters/local?link=deviceCapacity returns "-1", so I can't select any available vGPU.
[Screenshot 2024-08-06 at 10:36:44 PM]
  • Click Edit Config: I can't choose another vGPU.
[Screenshot 2024-08-06 at 10:35:19 PM]

Are the above results expected?

@ibrokethecloud commented
The vGPU dropdown should only show the name of the vGPU type, not the individual vGPU name as well.

For example, the cluster in question has the following vGPU devices configured:

NAME                    ADDRESS        NODE NAME     ENABLED   UUID                                   VGPUTYPE       PARENTGPUDEVICE
vgpu-test-1-000008004   0000:08:00.4   vgpu-test-1   true      6465b93d-1256-4a3f-be8b-7b24121f1fc6   NVIDIA A2-2Q   0000:08:00.0
vgpu-test-1-000008005   0000:08:00.5   vgpu-test-1   true      00832125-d938-45cf-93a7-3f25754b8f01   NVIDIA A2-2Q   0000:08:00.0
vgpu-test-1-000008006   0000:08:00.6   vgpu-test-1   true      bcf86801-00db-4345-9f95-d9a7103c767c   NVIDIA A2-2Q   0000:08:00.0
vgpu-test-1-000008007   0000:08:00.7   vgpu-test-1   true      a29cb262-5fae-4511-a39a-b2be01a92d01   NVIDIA A2-2Q   0000:08:00.0

The UI screenshot seems to be including vGPU type + name in the dropdown and using the count of all vGPUs of a particular type, which is 4. This is an incorrect representation.

Ideally we should only use the names reported by the new API. In this case the dropdown should show:

{"cpu":"24","devices.kubevirt.io/kvm":"1k","devices.kubevirt.io/tun":"1k","devices.kubevirt.io/vhost-net":"1k","ephemeral-storage":"250935833813","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"131858080Ki","nvidia.com/NVIDIA_A2-2Q":"-1","pods":"200"}

nvidia.com/NVIDIA_A2-2Q and the count available for it.
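A minimal sketch of extracting vGPU types and allocatable counts from a capacity map shaped like the response above (the `nvidia.com/` prefix filter and the helper name are assumptions for illustration; the sample uses a positive count, unlike the "-1" in the response above):

```javascript
// Sketch: pick vGPU entries out of a deviceCapacity-style map.
// Non-positive values (e.g. "-1" for unknown) are skipped.
const capacity = {
  cpu: '24',
  'nvidia.com/NVIDIA_A2-2Q': '4',
  pods: '200',
  'devices.kubevirt.io/kvm': '1k',
};

function vGpuAllocatable(cap) {
  return Object.entries(cap)
    .filter(([key]) => key.startsWith('nvidia.com/'))
    .map(([key, value]) => ({ type: key, allocatable: parseInt(value, 10) }))
    .filter((entry) => entry.allocatable > 0); // drops "-1" / "0" entries
}

const available = vGpuAllocatable(capacity);
```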

@a110605 (Contributor) commented Aug 7, 2024

> the vGPU dropdown should only show name of vGPU Type, and not individual vGPU name as well. […]

Thanks for helping us clarify the test result.

So if a vGPU type's allocatable number is > 0, we would show the option name as:

nvidia.com/NVIDIA_A2-2Q (allocatable: {vGpu.allocatable})

@torchiaf (Member, Author) commented Aug 7, 2024

> Ideally we should only use the names reported from the new api. In this case the drop down should show

Doesn't that conflict with the vGPU modeling we have in the Harvester → vGPU Devices table?

[screenshot]

The key for vGPU elements in that table is Name, for instance vgpu-test-1-000008005, so I would expect to select the vGPU name when assigning vGPUs to clusters in the Rancher UI.

In fact, the vGPU Name is needed in the request payload on Rancher:

    updateVGpu() {
      // Build the request payload: each selected vGPU contributes its
      // profile Name and its type (deviceName).
      const vGPURequests = (this.vGpus || []).filter((name) => name).map((name) => ({
        name,
        deviceName: this.vGpuDevices[name]?.type,
      }));

      // Serialize only when something is selected, otherwise clear it.
      this.value.vgpuInfo = vGPURequests.length > 0 ? JSON.stringify({ vGPURequests }) : '';
    },

[screenshot]

Is there something I'm missing?

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
@torchiaf (Member, Author) commented Aug 8, 2024

I've updated the PR description with some notes about removing the profile from the label. Please take a look.

@a110605 (Contributor) previously approved these changes Aug 9, 2024 and left a comment:
Looks good in view mode and edit / create mode.

Screenshot 2024-08-09 at 11 59 43 AM Screenshot 2024-08-09 at 12 00 21 PM

@a110605 a110605 dismissed their stale review August 9, 2024 09:39

Needs changes to handle the legacy Harvester scenario

@torchiaf torchiaf requested a review from a110605 August 9, 2024 13:37
@torchiaf (Member, Author) commented Aug 9, 2024

Just pushed the changes to support legacy Harvester versions and re-enable the validation steps.

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
@gaktive (Member) commented Aug 14, 2024

As I understand from comments, including @nwmac's, once we finalize this we should backport it to 2.9.next2 (as of writing; this will become next1 once we rename versions). Harvester will also publish a release note.

@gaktive (Member) commented Aug 16, 2024

Along with 2.9.x, we should backport to 2.8.x.

    * This will not work if we will remove the limit of only one vGpu assignable to each cluster.
    */
    const vGPURequests = this.vGpus?.filter((f) => f).map((deviceName) => ({
      name: Object.values(this.vGpuDevices).filter((f) => f.type === deviceName)?.[0]?.id || '',
@a110605 (Contributor) commented Sep 16, 2024:
Hi @torchiaf, are we still assigning the first vGPU profile id as the payload name here?

I know we had a concern about it before.

Since harvester/pcidevices#91 introduces a new annotation, harvesterhci.io/deviceAllocationDetails, on the VM, we will need another PR in the Harvester dashboard to leverage that annotation.

@torchiaf (Member, Author) replied:

Correct. In Rancher we can pass a default name (ideally something that identifies the root device); in Harvester we will need to get the device id from the label rather than from the VM's spec.

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
@torchiaf (Member, Author) commented:
@a110605 this is now updated to support harvester#1154

4 participants