
Improve vGPU allocation #11399

Open · wants to merge 6 commits into master
Conversation

@torchiaf (Member) commented Jul 9, 2024

Summary

Related issue harvester/harvester#5774
Blocked by harvester#1154

Occurred changes and/or fixed issues

Technical notes summary

  • Improve tracking of device allocation: harvester/harvester#6096 introduced the new /v1/harvester/cluster endpoint to expose available Harvester resources; we now use that information to get the allocatable number for each vGPU device.

  • Fix the vGPU validation: we now count nodePools × machineCount to check whether there are enough vGPU devices to assign to each machine.

  • In legacy Harvester versions the /v1/harvester/cluster endpoint is not available, so the UI cannot calculate the allocatable label. We leave it empty (see screenshots) and add notes in the documentation telling users to upgrade Harvester 1.3.x to 1.3.2.

  • 965287d - Based on an offline discussion with @a110605 and @ibrokethecloud, we are removing the vGPU profile id from option labels, since it is meaningless from the user's perspective.

    • The vGPU option key in the 'vGPUs' dropdown element is now the vGPU type.
    • The request payload still requires the vGPU profile id (Name) along with the vGPU type (deviceName); the first profile id for each vGPU type is used when assigning vGPUs in the request.

    [screenshot]

    • This means Harvester clusters will always use the same vGPU profile when assigning new vGPUs.
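A minimal sketch of the profile-selection behavior described above, assuming an illustrative device-map shape (`vGpuDevices` and `buildVGpuRequests` here are hypothetical, not the component's actual state or method):

```javascript
// Sketch: the payload uses the *first* profile id (Name) found for each
// selected vGPU type, as described above. The device map is illustrative.
const vGpuDevices = {
  'vgpu-test-1-000008004': { id: 'vgpu-test-1-000008004', type: 'nvidia.com/NVIDIA_A2-2Q' },
  'vgpu-test-1-000008005': { id: 'vgpu-test-1-000008005', type: 'nvidia.com/NVIDIA_A2-2Q' },
};

function buildVGpuRequests(selectedTypes, devices) {
  return selectedTypes
    .filter((t) => t) // drop empty selections
    .map((deviceName) => ({
      // First device id found for this type; '' when none exists.
      name: Object.values(devices).find((d) => d.type === deviceName)?.id || '',
      deviceName,
    }));
}

const requests = buildVGpuRequests(['nvidia.com/NVIDIA_A2-2Q'], vGpuDevices);
```

Since both devices share the same type, the request always carries the first device's id as the profile name, which is why the cluster keeps reusing the same vGPU profile.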

How to test

Create a new Harvester cluster

  1. Go to Cluster Management
  2. Create a new Harvester cluster
  3. Click on 'Advanced'
  4. Add 1 or more vGPUs

Create another Harvester Cluster

  • Repeat steps 1-4 above.
  • Check that the available vGPUs have changed. The UI should exclude the vGPUs already allocated to the first cluster, based on the allocatable number (the backend should update this number after cluster creation).

Edit a Harvester Cluster

  • Click on Edit Config to edit an existing Harvester cluster.
  • Click on the vGPU dropdown. The UI should show only the available vGPUs plus the ones already assigned to the cluster, based on the allocatable number (the backend should update this number after saving).
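The edit-mode filtering can be sketched as follows (the helper name and type names other than NVIDIA A2-2Q are hypothetical, for illustration only):

```javascript
// Sketch: when editing a cluster, a vGPU type stays visible in the
// dropdown if it still has allocatable devices OR it is already
// assigned to this cluster.
function visibleVGpuTypes(allTypes, assignedTypes) {
  return allTypes.filter((t) => t.allocatable > 0 || assignedTypes.includes(t.type));
}

const types = [
  { type: 'nvidia.com/NVIDIA_A2-2Q', allocatable: 0 }, // exhausted, but assigned here
  { type: 'nvidia.com/NVIDIA_A2-1Q', allocatable: 2 }, // still available
  { type: 'nvidia.com/NVIDIA_A2-4Q', allocatable: 0 }, // exhausted, not assigned
];
const visible = visibleVGpuTypes(types, ['nvidia.com/NVIDIA_A2-2Q']);
```

The exhausted-but-assigned type remains selectable, matching the "available vGPUs plus the ones already assigned" behavior described above.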

Create multiple nodes Clusters

  • Repeat the steps above, assigning more than one node pool to the new cluster.
  • Repeat the steps above, assigning more than one machine to each node pool.
    • For each vGPU, a validation error should appear if there are not enough devices for nodePools × machineCount.
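The validation described above can be sketched as follows (illustrative names, assuming the "nodePools × machineCount" check amounts to summing machine counts across pools):

```javascript
// Sketch: a vGPU selection is valid only if the allocatable count for
// its type covers every machine, i.e. the sum of machineCount over all
// node pools.
function hasEnoughVGpus(allocatable, nodePools) {
  const machines = nodePools.reduce((sum, pool) => sum + pool.machineCount, 0);
  return allocatable >= machines;
}

// 2 node pools with 2 machines each need 4 devices of the selected type.
const pools = [{ machineCount: 2 }, { machineCount: 2 }];
```

With 4 allocatable devices the selection passes; with 3 it should trigger the validation error.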

Areas or cases that should be tested

Harvester clusters, create/edit page.

Areas which could experience regressions

Screenshot/Video

Harvester 1.3.2

image

Harvester legacy versions

image

Validation

image

Checklist

  • The PR is linked to an issue and the linked issue has a Milestone, or no issue is needed
  • The PR has a Milestone
  • The PR template has been filled out
  • The PR has been self reviewed
  • The PR has a reviewer assigned
  • The PR has automated tests or clear instructions for manual tests and the linked issue has appropriate QA labels, or tests are not needed
  • The PR has reviewed with UX and tested in light and dark mode, or there are no UX changes

@torchiaf torchiaf added this to the v2.10.0 milestone Jul 9, 2024
@torchiaf torchiaf changed the title from "Improve vGPU allocatable mechanism checks" to "Improve vGPU allocatable mechanism" Jul 9, 2024
@torchiaf torchiaf changed the title from "Improve vGPU allocatable mechanism" to "Improve vGPU allocation" Jul 9, 2024
@a110605 (Contributor) commented Aug 1, 2024

We need a deployment with /v1/harvester/cluster endpoint to test this PR change.

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
@a110605 (Contributor) commented Aug 6, 2024

After creating a 3-node Harvester cluster:

  • Create a new cluster: /v1/harvester/clusters/local?link=deviceCapacity returns "-1", so I can't select any available vGPU.
[Screenshot 2024-08-06 at 10:36:44 PM]
  • Click Edit Config: I can't choose another vGPU.
[Screenshot 2024-08-06 at 10:35:19 PM]

Are the above results expected?

@ibrokethecloud commented
The vGPU dropdown should only show the name of the vGPU type, not the individual vGPU name as well.

For example, the cluster in question has the following vGPU devices configured:

NAME                    ADDRESS        NODE NAME     ENABLED   UUID                                   VGPUTYPE       PARENTGPUDEVICE
vgpu-test-1-000008004   0000:08:00.4   vgpu-test-1   true      6465b93d-1256-4a3f-be8b-7b24121f1fc6   NVIDIA A2-2Q   0000:08:00.0
vgpu-test-1-000008005   0000:08:00.5   vgpu-test-1   true      00832125-d938-45cf-93a7-3f25754b8f01   NVIDIA A2-2Q   0000:08:00.0
vgpu-test-1-000008006   0000:08:00.6   vgpu-test-1   true      bcf86801-00db-4345-9f95-d9a7103c767c   NVIDIA A2-2Q   0000:08:00.0
vgpu-test-1-000008007   0000:08:00.7   vgpu-test-1   true      a29cb262-5fae-4511-a39a-b2be01a92d01   NVIDIA A2-2Q   0000:08:00.0

The UI screenshot seems to be including vGPU type + name in the dropdown and using the count of all vGPUs of a particular type, which is 4. This is an incorrect representation.

Ideally we should only use the names reported by the new API. In this case the dropdown should show:

{"cpu":"24","devices.kubevirt.io/kvm":"1k","devices.kubevirt.io/tun":"1k","devices.kubevirt.io/vhost-net":"1k","ephemeral-storage":"250935833813","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"131858080Ki","nvidia.com/NVIDIA_A2-2Q":"-1","pods":"200"}

nvidia.com/NVIDIA_A2-2Q and the count available for it.
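A minimal sketch of extracting vGPU types and allocatable counts from a capacity map shaped like the response above (the `nvidia.com/` prefix filter and the helper name are assumptions for illustration; the sample uses a positive count, unlike the "-1" in the response above):

```javascript
// Sketch: pick vGPU entries out of a deviceCapacity-style map.
// Non-positive values (e.g. "-1" for unknown) are skipped.
const capacity = {
  cpu: '24',
  'nvidia.com/NVIDIA_A2-2Q': '4',
  pods: '200',
  'devices.kubevirt.io/kvm': '1k',
};

function vGpuAllocatable(cap) {
  return Object.entries(cap)
    .filter(([key]) => key.startsWith('nvidia.com/'))
    .map(([key, value]) => ({ type: key, allocatable: parseInt(value, 10) }))
    .filter((entry) => entry.allocatable > 0); // drops "-1" / "0" entries
}

const available = vGpuAllocatable(capacity);
```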

@a110605 (Contributor) commented Aug 7, 2024

> the vGPU dropdown should only show name of vGPU Type, and not individual vGPU name as well. […]

Thanks for helping us clarify the test result.

So if a vGPU type's allocatable number is > 0, we would show the option name as:

nvidia.com/NVIDIA_A2-2Q (allocatable: {vGpu.allocatable})

@torchiaf (Member, Author) commented Aug 7, 2024

> Ideally we should only use the names reported from the new api. In this case the drop down should show

Doesn't that conflict with the vGPU modeling we have in the Harvester → vGPU Devices table?

[screenshot]

The key for vGPU elements in that table is Name, for instance vgpu-test-1-000008005, so I would expect to select the vGPU name when assigning vGPUs to clusters in the Rancher UI.

In fact, the vGPU Name is needed in the request payload on Rancher:

    updateVGpu() {
      // Build the request payload: each selected vGPU contributes its
      // profile Name and its type (deviceName).
      const vGPURequests = (this.vGpus || []).filter((name) => name).map((name) => ({
        name,
        deviceName: this.vGpuDevices[name]?.type,
      }));

      // Serialize only when something is selected, otherwise clear it.
      this.value.vgpuInfo = vGPURequests.length > 0 ? JSON.stringify({ vGPURequests }) : '';
    },

[screenshot]

Is there something I'm missing?

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
@torchiaf (Member, Author) commented Aug 8, 2024

I've updated the PR description with some notes about removing the profile from the label. Please take a look.

@a110605 (Contributor) previously approved these changes Aug 9, 2024 and left a comment:
Looks good in view mode and edit / create mode.

Screenshot 2024-08-09 at 11 59 43 AM Screenshot 2024-08-09 at 12 00 21 PM

@a110605 a110605 dismissed their stale review August 9, 2024 09:39

Needs changes to handle the legacy Harvester scenario

@torchiaf torchiaf requested a review from a110605 August 9, 2024 13:37
@torchiaf (Member, Author) commented Aug 9, 2024

Just pushed the changes to support legacy Harvester versions and re-enable the validation steps.

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
@gaktive (Member) commented Aug 14, 2024

As I understand from comments, including @nwmac's, once we finalize this we should backport it to 2.9.next2 (as of writing; this will become next1 once we rename versions). Harvester will also publish a release note.

@gaktive (Member) commented Aug 16, 2024

Along with 2.9.x, we should backport to 2.8.x.

    * This will not work if we will remove the limit of only one vGpu assignable to each cluster.
    */
    const vGPURequests = this.vGpus?.filter((f) => f).map((deviceName) => ({
      name: Object.values(this.vGpuDevices).filter((f) => f.type === deviceName)?.[0]?.id || '',
@a110605 (Contributor) commented Sep 16, 2024:
Hi @torchiaf, are we still assigning the first vGPU profile id as the payload name here?

I know we had a concern about it before.

Since harvester/pcidevices#91 introduces a new annotation, harvesterhci.io/deviceAllocationDetails, on the VM, we will need another PR in the Harvester dashboard to leverage that annotation.

@torchiaf (Member, Author) replied:

Correct. In Rancher we can pass a default name (ideally something that identifies the root device); in Harvester we will need to get the device id from the label rather than from the VM's spec.

Signed-off-by: Francesco Torchia <francesco.torchia@suse.com>
@torchiaf (Member, Author) commented:
@a110605 this is now updated to support harvester#1154

4 participants