We recently added a second Radeon Pro VII to our simulation system. Unfortunately, though, it seems the GPUs do not want to talk to each other, although they are directly connected with an Infinity Fabric Link Bridge.
The system usually runs Arch Linux, where I also started a discussion about the issue, but testing with Ubuntu shows the same issue. Everything posted here was done on the Ubuntu system.
system
hardware setup
GPUs: 2 AMD Radeon Pro VII
CPU: AMD Ryzen Threadripper 2950X
mainboard: Asus X399-A
The GPUs are connected with an Infinity Fabric Link Bridge.
software
OS: Ubuntu 20.04.3
kernel: 5.11
ROCm: installed via sudo amdgpu-install --usecase=rocm, with amdgpu-install from here
other requirements
I verified that the critical requirements listed on the ROCm supported hardware page are met: the hardware (see above), and also the following.
IOMMU
$ sudo dmesg | grep -i iommu
[sudo] password for tinux:
[ 0.271162] iommu: Default domain type: Translated
[ 0.471020] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.471076] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.471120] pci 0000:00:01.0: Adding to iommu group 0
[ 0.471133] pci 0000:00:01.1: Adding to iommu group 1
[ 0.471146] pci 0000:00:01.2: Adding to iommu group 2
[ 0.471166] pci 0000:00:02.0: Adding to iommu group 3
[ 0.471183] pci 0000:00:03.0: Adding to iommu group 4
[ 0.471195] pci 0000:00:03.1: Adding to iommu group 5
[ 0.471213] pci 0000:00:04.0: Adding to iommu group 6
[ 0.471231] pci 0000:00:07.0: Adding to iommu group 7
[ 0.471243] pci 0000:00:07.1: Adding to iommu group 8
[ 0.471261] pci 0000:00:08.0: Adding to iommu group 9
[ 0.471273] pci 0000:00:08.1: Adding to iommu group 10
[ 0.471297] pci 0000:00:14.0: Adding to iommu group 11
[ 0.471308] pci 0000:00:14.3: Adding to iommu group 11
[ 0.471368] pci 0000:00:18.0: Adding to iommu group 12
[ 0.471379] pci 0000:00:18.1: Adding to iommu group 12
[ 0.471390] pci 0000:00:18.2: Adding to iommu group 12
[ 0.471401] pci 0000:00:18.3: Adding to iommu group 12
[ 0.471414] pci 0000:00:18.4: Adding to iommu group 12
[ 0.471425] pci 0000:00:18.5: Adding to iommu group 12
[ 0.471436] pci 0000:00:18.6: Adding to iommu group 12
[ 0.471447] pci 0000:00:18.7: Adding to iommu group 12
[ 0.471506] pci 0000:00:19.0: Adding to iommu group 13
[ 0.471517] pci 0000:00:19.1: Adding to iommu group 13
[ 0.471529] pci 0000:00:19.2: Adding to iommu group 13
[ 0.471542] pci 0000:00:19.3: Adding to iommu group 13
[ 0.471553] pci 0000:00:19.4: Adding to iommu group 13
[ 0.471565] pci 0000:00:19.5: Adding to iommu group 13
[ 0.471577] pci 0000:00:19.6: Adding to iommu group 13
[ 0.471588] pci 0000:00:19.7: Adding to iommu group 13
[ 0.471622] pci 0000:01:00.0: Adding to iommu group 14
[ 0.471635] pci 0000:01:00.1: Adding to iommu group 14
[ 0.471649] pci 0000:01:00.2: Adding to iommu group 14
[ 0.471654] pci 0000:02:00.0: Adding to iommu group 14
[ 0.471658] pci 0000:02:01.0: Adding to iommu group 14
[ 0.471662] pci 0000:02:02.0: Adding to iommu group 14
[ 0.471666] pci 0000:02:03.0: Adding to iommu group 14
[ 0.471670] pci 0000:02:04.0: Adding to iommu group 14
[ 0.471674] pci 0000:02:09.0: Adding to iommu group 14
[ 0.471678] pci 0000:05:00.0: Adding to iommu group 14
[ 0.471683] pci 0000:08:00.0: Adding to iommu group 14
[ 0.471695] pci 0000:09:00.0: Adding to iommu group 15
[ 0.471707] pci 0000:0a:00.0: Adding to iommu group 16
[ 0.471719] pci 0000:0b:00.0: Adding to iommu group 17
[ 0.471744] pci 0000:0c:00.0: Adding to iommu group 18
[ 0.471759] pci 0000:0c:00.1: Adding to iommu group 19
[ 0.471772] pci 0000:0d:00.0: Adding to iommu group 20
[ 0.471784] pci 0000:0d:00.2: Adding to iommu group 21
[ 0.471798] pci 0000:0d:00.3: Adding to iommu group 22
[ 0.471810] pci 0000:0e:00.0: Adding to iommu group 23
[ 0.471825] pci 0000:0e:00.2: Adding to iommu group 24
[ 0.471838] pci 0000:0e:00.3: Adding to iommu group 25
[ 0.471856] pci 0000:40:01.0: Adding to iommu group 26
[ 0.471872] pci 0000:40:02.0: Adding to iommu group 27
[ 0.471890] pci 0000:40:03.0: Adding to iommu group 28
[ 0.471902] pci 0000:40:03.1: Adding to iommu group 29
[ 0.471920] pci 0000:40:04.0: Adding to iommu group 30
[ 0.471937] pci 0000:40:07.0: Adding to iommu group 31
[ 0.471949] pci 0000:40:07.1: Adding to iommu group 32
[ 0.471968] pci 0000:40:08.0: Adding to iommu group 33
[ 0.471981] pci 0000:40:08.1: Adding to iommu group 34
[ 0.471994] pci 0000:41:00.0: Adding to iommu group 35
[ 0.472006] pci 0000:42:00.0: Adding to iommu group 36
[ 0.472031] pci 0000:43:00.0: Adding to iommu group 37
[ 0.472048] pci 0000:43:00.1: Adding to iommu group 38
[ 0.472061] pci 0000:44:00.0: Adding to iommu group 39
[ 0.472074] pci 0000:44:00.2: Adding to iommu group 40
[ 0.472086] pci 0000:44:00.3: Adding to iommu group 41
[ 0.472100] pci 0000:45:00.0: Adding to iommu group 42
[ 0.472113] pci 0000:45:00.2: Adding to iommu group 43
[ 0.502585] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 0.502595] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[ 0.503499] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[ 0.503517] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[ 1.017979] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
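Since the log above is long, the one line that summarizes the IOMMU mode can be pulled out on its own. The following is a minimal sketch, not part of ROCm or any other tool: iommu_domain_type is a hypothetical helper name, and it assumes the kernel prints the line in the same "iommu: Default domain type: ..." form seen above.

```shell
# Hypothetical helper (not part of any tool): read dmesg-style text on stdin
# and print the IOMMU default domain type reported by the kernel.
# Assumes the "iommu: Default domain type: <type>" wording shown in the log above.
iommu_domain_type() {
    # Keep only the word after "Default domain type:" and stop at the first match.
    sed -n 's/.*iommu: Default domain type: \([A-Za-z-]*\).*/\1/p' | head -n 1
}

# Usage on a live system (needs permission to read the kernel log):
#   sudo dmesg | iommu_domain_type
```

For the log posted here, this would print the value from the first line of the output, which makes it easier to compare the Ubuntu and Arch Linux installations.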
CRAT
$ sudo dmesg | grep -i crat
[ 0.000000] ACPI: CRAT 0x0000000077CDE878 001DF8 (v01 AMD AMD CRAT 00000001 AMD 00000001)
[ 0.000000] ACPI: Reserving CRAT table memory at [mem 0x77cde878-0x77ce066f]
[ 1.168518] amdgpu: Ignoring ACPI CRAT on non-APU system
[ 1.168521] amdgpu: Virtual CRAT table created for CPU
[ 2.265620] amdgpu: Virtual CRAT table created for GPU
[ 3.261272] amdgpu: Virtual CRAT table created for GPU
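As a quick sanity check, the number of "Virtual CRAT table created for GPU" lines should match the number of installed GPUs (two in this system). A minimal sketch, assuming the message wording stays exactly as in the output above; count_gpu_crat is a hypothetical helper name:

```shell
# Hypothetical helper (not part of any tool): read dmesg-style text on stdin
# and count how many virtual CRAT tables amdgpu reported creating for GPUs.
# Assumes the exact message wording shown in the log above.
count_gpu_crat() {
    grep -c 'Virtual CRAT table created for GPU'
}

# Usage on a live system (needs permission to read the kernel log):
#   sudo dmesg | count_gpu_crat
```

This only confirms that the driver registered both GPUs; it says nothing about whether the link between them is in use.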
Atomics
and
issues
The GPUs do not appear to be linked to each other, even though they are physically connected with an Infinity Fabric Link Bridge.
tests with
rocm-smi
I also ran a few other tests, but I cannot really make sense of the results, given the output of the command above.
and
other benchmarks
I also ran a benchmark from the RCCL repository, which is much slower on 2 GPUs than on a single GPU.
2 GPUs
1 GPU
Any help is highly appreciated.