Set affinity by device UUID. #5566

mzient · 2024-07-16T19:46:43Z

Category:

Bug fix (non-breaking change which fixes an issue)

Description:

NVML and the CUDA runtime use different device indices, so a CUDA device index cannot be passed to NVML directly. The device UUID is a reliable way of establishing device identity across the two APIs.

Additional information:

Affected modules and functionalities:

Key points relevant for the review:

Tests:

Checklist

Documentation

DALI team only

Requirements

Implements new requirements
Affects existing requirements
N/A

REQ IDs: N/A

JIRA TASK: N/A

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

dali-automaton · 2024-07-16T19:50:41Z

CI MESSAGE: [16649134]: BUILD STARTED

dali-automaton · 2024-07-16T22:58:58Z

CI MESSAGE: [16649134]: BUILD FAILED

dali-automaton · 2024-07-17T10:47:58Z

CI MESSAGE: [16669228]: BUILD STARTED

mzient · 2024-07-17T10:49:12Z

dali/util/nvml.cc

+  return dev;
+}
+
+void GetNVMLAffinityMask(cpu_set_t *mask, size_t num_cpus) {


This function was moved here from the header.

mzient · 2024-07-17T10:49:34Z

dali/util/nvml.cc

+  size_t cpu_set_size = (num_cpus + 63) / 64;
+  std::vector<unsigned long> nvml_mask_container(cpu_set_size);  // NOLINT(runtime/int)
+  auto * nvml_mask = nvml_mask_container.data();
+  nvmlDevice_t device = nvmlGetDeviceHandleForCUDA(device_idx);


This is the important line.
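For context, a minimal sketch of the string formatting that a UUID-based lookup relies on. The body of `nvmlGetDeviceHandleForCUDA` is not shown in this excerpt, so this is not the PR's implementation. It assumes the 16 raw UUID bytes come from the CUDA runtime (`cudaDeviceProp::uuid`, cast to `unsigned char`) and that the resulting string is handed to `nvmlDeviceGetHandleByUUID`. `FormatDeviceUUID` and its `prefix` parameter are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper, not DALI code: formats 16 raw UUID bytes as
// "<prefix>xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" (8-4-4-4-12 lowercase hex digits).
std::string FormatDeviceUUID(const unsigned char (&bytes)[16],
                             const std::string &prefix = "GPU-") {
  std::string out = prefix;
  char hex[3];  // two hex digits plus the terminating NUL
  for (std::size_t i = 0; i < 16; i++) {
    // Dashes separate the groups: before bytes 4, 6, 8 and 10.
    if (i == 4 || i == 6 || i == 8 || i == 10)
      out += '-';
    std::snprintf(hex, sizeof(hex), "%02x", static_cast<unsigned>(bytes[i]));
    out += hex;
  }
  return out;
}
```

The prefix is a parameter because the later "Add MIG support." commit suggests that more than one UUID form may be needed; how MIG UUIDs are actually built is not shown in this conversation.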

mzient · 2024-07-17T10:49:50Z

dali/util/nvml.cc

+  CPU_AND(mask, &nvml_set, &current_set);
+}
+
+void SetCPUAffinity(int core) {


Moved here from the header without any changes.
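The `GetNVMLAffinityMask` hunks above size an `unsigned long` word array from `num_cpus` and intersect the NVML set with the current set via `CPU_AND`. A minimal, Linux-only sketch of those two steps follows, assuming the word array is a plain bitmask with one bit per CPU (the layout `nvmlDeviceGetCpuAffinity` fills). `WordsForCPUs`, `WordsToCPUSet` and `IntersectAffinity` are hypothetical names, not functions from this PR.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for cpu_set_t and the CPU_* macros
#endif
#include <sched.h>
#include <climits>
#include <cstddef>
#include <vector>

// Bits per mask word; the diff hard-codes 64, which matches 64-bit Linux.
constexpr std::size_t kWordBits = sizeof(unsigned long) * CHAR_BIT;  // NOLINT(runtime/int)

// Number of words needed to hold one bit per CPU (rounded up).
inline std::size_t WordsForCPUs(std::size_t num_cpus) {
  return (num_cpus + kWordBits - 1) / kWordBits;
}

// Hypothetical helper: copies a word-array bitmask into a cpu_set_t.
inline void WordsToCPUSet(const std::vector<unsigned long> &words,  // NOLINT(runtime/int)
                          std::size_t num_cpus, cpu_set_t *out) {
  CPU_ZERO(out);
  const std::size_t limit = static_cast<std::size_t>(CPU_SETSIZE);
  for (std::size_t cpu = 0; cpu < num_cpus && cpu < limit; cpu++) {
    const std::size_t word = cpu / kWordBits;
    if (word < words.size() && ((words[word] >> (cpu % kWordBits)) & 1ul))
      CPU_SET(cpu, out);
  }
}

// Hypothetical helper: mask = (CPUs near the device) AND (CPUs currently allowed).
inline void IntersectAffinity(cpu_set_t *mask,
                              const std::vector<unsigned long> &device_words,  // NOLINT(runtime/int)
                              std::size_t num_cpus, const cpu_set_t &allowed) {
  cpu_set_t device_set;
  WordsToCPUSet(device_words, num_cpus, &device_set);
  CPU_AND(mask, &device_set, &allowed);
}
```

In real use, the `allowed` set would typically come from `sched_getaffinity` or `pthread_getaffinity_np`, so the result never includes CPUs the process is not permitted to run on.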

dali-automaton · 2024-07-17T10:59:51Z

CI MESSAGE: [16669418]: BUILD STARTED

dali-automaton · 2024-07-17T12:23:51Z

CI MESSAGE: [16671324]: BUILD STARTED

dali-automaton · 2024-07-17T19:27:47Z

CI MESSAGE: [16671324]: BUILD PASSED

mzient closed this Jul 16, 2024

mzient force-pushed the FixNVMLDeviceQueries branch from 47e1985 to 68417ad Compare July 16, 2024 19:47

mzient reopened this Jul 16, 2024

Set affinity by device UUID.

eb50e8d

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>

mzient force-pushed the FixNVMLDeviceQueries branch from 47e1985 to eb50e8d Compare July 16, 2024 19:49

dali-automaton assigned banasraf and klecki Jul 17, 2024

mzient marked this pull request as draft July 17, 2024 10:34

mzient marked this pull request as ready for review July 17, 2024 10:47

mzient commented Jul 17, 2024

Take 2.

eb1e0f7

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

mzient force-pushed the FixNVMLDeviceQueries branch from fa6dd16 to eb1e0f7 Compare July 17, 2024 10:59

Add MIG support.

6067f5b

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>

klecki approved these changes Jul 17, 2024

banasraf approved these changes Jul 18, 2024

mzient merged commit 127015f into NVIDIA:main Jul 18, 2024
6 checks passed

This pull request was closed.
