
GPU tests are failing after moving to GCC12 #42997

Closed
smuzaffar opened this issue Oct 12, 2023 · 30 comments
@smuzaffar
Contributor

RelVals/unit tests are failing for GPU IBs after moving to GCC12. It looks like cuda-compatible-runtime does not find any supported CUDA device [a]; apparently the CUDA 12.1 driver and our CUDA 12.2 runtime are not compatible.

FYI @fwyzard

[a]

Singularity> nvidia-smi
Thu Oct 12 08:28:20 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB           Off| 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0               37W / 250W|      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Singularity> /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02806/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime -k -v
CUDA driver version 12.1
CUDA runtime version 12.2
None of the CUDA devices supports launching and running a CUDA kernel.
@cmsbuild
Contributor

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@antoniovilela, @sextonkennedy, @Dr15Jones, @makortel, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Oct 12, 2023

@smuzaffar where can I find a machine with CUDA 12.1 / 530.30.02?

@smuzaffar
Contributor Author

@fwyzard, CERN HTCondor GPU nodes have CUDA 12.1. I will send you instructions to access one of these nodes.

@makortel
Contributor

assign core, heterogeneous

@cmsbuild
Contributor

New categories assigned: core,heterogeneous

@Dr15Jones,@fwyzard,@makortel,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

Unfortunately, it looks like this is the expected behaviour, according to NVIDIA.

CUDA 12.x requires the 525.x (or later) drivers:
[screenshot: NVIDIA table of minimum driver versions for CUDA 12.x]

Or it requires the 450.x/470.x drivers, with the compatibility layer (that we include in CMSSW):
[screenshot: NVIDIA table of drivers supported via the forward-compatibility package]
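
For reference, a minimal sketch (not the actual cuda-compatible-runtime source, whose details may differ) of how the driver and runtime versions printed in [a] can be queried at run time:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch: query the installed driver's CUDA version and the runtime
// version the binary was built against, via the CUDA runtime API.
int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);    // e.g. 12010 for CUDA 12.1
  cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 12020 for CUDA 12.2
  std::printf("CUDA driver version %d.%d\n", driverVersion / 1000, (driverVersion % 1000) / 10);
  std::printf("CUDA runtime version %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
  // A runtime newer than the driver needs either a newer driver or a working
  // forward-compatibility package.
  return 0;
}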

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

However, this explains why the tests fail on machines with CUDA 11.8.

The tests should work on machines with CUDA 12.1.

@smuzaffar
Contributor Author

@fwyzard, the following runs on a node with CUDA 12.1

/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime  && echo OK
12.2
OK

but it fails when the -k option ("If there are any CUDA devices, check that at least one supports launching a CUDA kernel") is used:

Singularity> /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime  -k && echo OK
12.2
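
For context, a minimal sketch of the kind of check -k performs (an illustration only, not the actual cuda-compatible-runtime code): launch a trivial kernel on each device and report any launch or execution error.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}  // trivial kernel, used only to test that a launch works

int main() {
  int devices = 0;
  if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) {
    std::printf("No CUDA devices found\n");
    return 1;
  }
  for (int i = 0; i < devices; ++i) {
    cudaSetDevice(i);
    noop<<<1, 1>>>();
    cudaError_t err = cudaGetLastError();  // launch-time errors, e.g. unsupported PTX
    if (err == cudaSuccess)
      err = cudaDeviceSynchronize();       // execution-time errors
    if (err == cudaSuccess) {
      std::printf("Device %d can launch and run a CUDA kernel\n", i);
      return 0;
    }
    std::printf("Device %d: %s\n", i, cudaGetErrorString(err));
  }
  std::printf("None of the CUDA devices supports launching and running a CUDA kernel.\n");
  return 1;
}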

@smuzaffar
Contributor Author

smuzaffar commented Oct 18, 2023

OK, on the HTCondor nodes we have

NVIDIA-SMI 530.30.02
Driver Version: 530.30.02
CUDA Version: 12.1

but it looks like 530.30.02 is not compatible with the compatibility layer we have in CMSSW.

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

Mhm, with 530.x we should not need the compatibility layer; CUDA 12.2.x should work out of the box.

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

fwyzard@devfu-c2b03-44-01.cms:~$ nvidia-smi
Wed Oct 18 10:20:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:23:00.0 Off |                    0 |
| N/A   36C    P8               11W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:E2:00.0 Off |                    0 |
| N/A   39C    P8               11W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

fwyzard@devfu-c2b03-44-01.cms:~$ /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime -k && echo OK
12.2
OK

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

Maybe there is a problem with the container image?

@smuzaffar
Contributor Author

> Maybe there is a problem with the container image?

Can you then try running cmssw-el8 --nv and the tests again?

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

It also works inside the container:

fwyzard@devfu-c2b03-44-01.cms:~$ cmssw-el8 --nv
Singularity> nvidia-smi
Wed Oct 18 10:26:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:23:00.0 Off |                    0 |
| N/A   36C    P8               12W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:E2:00.0 Off |                    0 |
| N/A   39C    P8               12W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Singularity> /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime -k && echo OK
12.2
OK
Singularity> cd /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_X_2023-10-17-2300
Singularity> cmsenv
Singularity> cudaComputeCapabilities 
   0     7.5    Tesla T4
   1     7.5    Tesla T4
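
For reference, a utility like cudaComputeCapabilities can be sketched roughly as follows (an illustration; the actual CMSSW tool may differ):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch: print the compute capability and name of every visible device.
int main() {
  int devices = 0;
  cudaGetDeviceCount(&devices);
  for (int i = 0; i < devices; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    std::printf("%4d %7d.%d    %s\n", i, prop.major, prop.minor, prop.name);
  }
  return 0;
}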

@smuzaffar
Contributor Author

Then could it be a problem with the A100 GPU? On HTCondor I see we have an NVIDIA A100-PCIE-40GB, while you are testing on a Tesla T4.

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

I would need to check locally.
Can you remind me how to use a condor node?

@smuzaffar
Contributor Author

To use one of the HTCondor nodes, please do (from lxplus):

> ~cmsbuild/public/lxplus
> export _CONDOR_SCHEDD_HOST=bigbird21.cern.ch
> export _CONDOR_CREDD_HOST=bigbird21.cern.ch
> condor_ssh_to_job -auto-retry 230323.0

This node is available for the next 5 hours and then it will be deleted.

@smuzaffar
Contributor Author

Maybe for the A100 we need to build for sm_80/compute_80 (https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_13_3_X/master/cuda-flags.file#L7)?
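
For illustration only (the exact content and syntax of cuda-flags.file are not reproduced here, and the file names below are placeholders), adding sm_80 support with plain nvcc flags would look something like:

# hypothetical example: embed SASS for sm_75 (T4) and sm_80 (A100), plus PTX for
# compute_80 so newer GPUs can still JIT-compile the code
nvcc -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 \
     -o kernel_test kernel_test.cu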

@smuzaffar
Contributor Author

By the way, GPU tests for 13.2.X work.

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

OK, I can see the problem when comparing gcc 11 / CUDA 11.8 and gcc 12 / CUDA 12.2:

bash-4.2$ /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc11/external/cuda-compatible-runtime/1.0-ca249f5e31a49bcfe53fea771aa4a1dc/test/cuda-compatible-runtime -k && echo OK
11.8
OK

vs

bash-4.2$ /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime -k && echo OK
12.2

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

So, yes, it does look related to what architectures we build for.

Running cuda-compatible-runtime -k inside cuda-gdb shows that we hit a CUDA error:

warning: Cuda API error detected: cudaLaunchKernel returned (0xde)

0xde is 222, which is cudaErrorUnsupportedPtxVersion:

  • cudaErrorUnsupportedPtxVersion = 222
    This indicates that the provided PTX was compiled with an unsupported toolchain. The most common reason for this, is the PTX was generated by a compiler newer than what is supported by the CUDA driver and PTX JIT compiler.

So it looks like the driver-level forward compatibility applies only to compiled binaries, not to PTX.
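
As a cross-check, the raw code reported by cuda-gdb can be translated with the runtime API (a small hypothetical helper, not part of the actual test):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // 0xde = 222, the value reported by cuda-gdb for the failed cudaLaunchKernel call
  cudaError_t err = static_cast<cudaError_t>(0xde);
  std::printf("%d = %s: %s\n", int(err), cudaGetErrorName(err), cudaGetErrorString(err));
  // expected output (assumption): 222 = cudaErrorUnsupportedPtxVersion: the provided PTX
  // was compiled with an unsupported toolchain.
  return 0;
}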

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

If that is the reason, cms-sw/cmsdist#8767 might fix the problem.

@makortel
Contributor

The tests are now working, so I think we can close this issue

@makortel
Contributor

+1

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close

@iarspider
Contributor

@smuzaffar
Contributor Author

@iarspider, we should fix this test to not run, as TF for GCC12/CUDA12 is not built with CUDA support.

@fwyzard
Contributor

fwyzard commented Oct 24, 2023

@iarspider I think you were looking into it yourself with cms-sw/cmsdist#8732?
