
GPU tests are failing after moving to GCC12 #42997

Closed
smuzaffar opened this issue Oct 12, 2023 · 30 comments
@smuzaffar
Contributor

RelVals/unit tests are failing for GPU IBs after moving to GCC12. It looks like cuda-compatible-runtime does not find any supported CUDA device [a]; apparently the CUDA 12.1 driver and our CUDA 12.2 runtime are not compatible.

FYI @fwyzard

[a]

Singularity> nvidia-smi
Thu Oct 12 08:28:20 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB           Off| 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0               37W / 250W|      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Singularity> /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02806/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime -k -v
CUDA driver version 12.1
CUDA runtime version 12.2
None of the CUDA devices supports launching and running a CUDA kernel.
@cmsbuild
Contributor

A new Issue was created by @smuzaffar Malik Shahzad Muzaffar.

@antoniovilela, @sextonkennedy, @Dr15Jones, @makortel, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor

fwyzard commented Oct 12, 2023

@smuzaffar where can I find a machine with CUDA 12.1 / 530.30.02?

@smuzaffar
Contributor Author

@fwyzard, CERN HTCondor GPU nodes have CUDA 12.1. I will send you instructions to access one of these nodes.

@makortel
Contributor

assign core, heterogeneous

@cmsbuild
Contributor

New categories assigned: core,heterogeneous

@Dr15Jones,@fwyzard,@makortel,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

Unfortunately, it looks like this is the expected behaviour, according to NVIDIA.

CUDA 12.x requires the 525.x (or later) drivers:
[screenshot: NVIDIA table of minimum driver versions for CUDA 12.x]

Or it requires the 450.x/470.x drivers, with the compatibility layer (that we include in CMSSW):
[screenshot: NVIDIA table of drivers supported via the forward-compatibility package]
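
For reference, a minimal sketch (not the actual cuda-compatible-runtime source, whose details may differ) of how the driver and runtime versions printed in [a] can be queried at run time:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch: query the installed driver's CUDA version and the runtime
// version the binary was built against, via the CUDA runtime API.
int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);    // e.g. 12010 for CUDA 12.1
  cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 12020 for CUDA 12.2
  std::printf("CUDA driver version %d.%d\n", driverVersion / 1000, (driverVersion % 1000) / 10);
  std::printf("CUDA runtime version %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 1000) / 10);
  // A runtime newer than the driver needs either a newer driver or a working
  // forward-compatibility package.
  return 0;
}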

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

However, this explains why the tests fail on machines with CUDA 11.8.

The tests should work on machines with CUDA 12.1.

@smuzaffar
Contributor Author

@fwyzard, the following runs on a node with CUDA 12.1

/cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime  && echo OK
12.2
OK

but it fails when the -k option ("If there are any CUDA devices, check that at least one supports launching a CUDA kernel") is used:

Singularity> /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime  -k && echo OK
12.2
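
For context, a minimal sketch of the kind of check -k performs (an illustration only, not the actual cuda-compatible-runtime code): launch a trivial kernel on each device and report any launch or execution error.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}  // trivial kernel, used only to test that a launch works

int main() {
  int devices = 0;
  if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) {
    std::printf("No CUDA devices found\n");
    return 1;
  }
  for (int i = 0; i < devices; ++i) {
    cudaSetDevice(i);
    noop<<<1, 1>>>();
    cudaError_t err = cudaGetLastError();  // launch-time errors, e.g. unsupported PTX
    if (err == cudaSuccess)
      err = cudaDeviceSynchronize();       // execution-time errors
    if (err == cudaSuccess) {
      std::printf("Device %d can launch and run a CUDA kernel\n", i);
      return 0;
    }
    std::printf("Device %d: %s\n", i, cudaGetErrorString(err));
  }
  std::printf("None of the CUDA devices supports launching and running a CUDA kernel.\n");
  return 1;
}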

@smuzaffar
Contributor Author

smuzaffar commented Oct 18, 2023

OK, on the HTCondor nodes we have

NVIDIA-SMI 530.30.02
Driver Version: 530.30.02
CUDA Version: 12.1

but it looks like 530.30.02 is not compatible with the compatibility layer we have in CMSSW.

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

Mhm, with 530.x we should not need the compatibility layer; CUDA 12.2.x should work out of the box.

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

fwyzard@devfu-c2b03-44-01.cms:~$ nvidia-smi
Wed Oct 18 10:20:46 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:23:00.0 Off |                    0 |
| N/A   36C    P8               11W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:E2:00.0 Off |                    0 |
| N/A   39C    P8               11W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

fwyzard@devfu-c2b03-44-01.cms:~$ /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime -k && echo OK
12.2
OK

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

Maybe there is a problem with the container image?

@smuzaffar
Contributor Author

> Maybe there is a problem with the container image?

Can you then try running cmssw-el8 --nv and the tests again?

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

It also works inside the container:

fwyzard@devfu-c2b03-44-01.cms:~$ cmssw-el8 --nv
Singularity> nvidia-smi
Wed Oct 18 10:26:35 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:23:00.0 Off |                    0 |
| N/A   36C    P8               12W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:E2:00.0 Off |                    0 |
| N/A   39C    P8               12W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
Singularity> /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime -k && echo OK
12.2
OK
Singularity> cd /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/cms/cmssw/CMSSW_13_3_X_2023-10-17-2300
Singularity> cmsenv
Singularity> cudaComputeCapabilities 
   0     7.5    Tesla T4
   1     7.5    Tesla T4
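
For reference, a utility like cudaComputeCapabilities can be sketched roughly as follows (an illustration; the actual CMSSW tool may differ):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sketch: print the compute capability and name of every visible device.
int main() {
  int devices = 0;
  cudaGetDeviceCount(&devices);
  for (int i = 0; i < devices; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    std::printf("%4d %7d.%d    %s\n", i, prop.major, prop.minor, prop.name);
  }
  return 0;
}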

@smuzaffar
Contributor Author

Then could it be a problem with the A100 GPU? On HTCondor I see we have an NVIDIA A100-PCIE-40GB, while you are testing on a Tesla T4.

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

I would need to check locally.
Can you remind me how to use a condor node?

@smuzaffar
Contributor Author

To use one of the HTCondor nodes, please do (from lxplus):

> ~cmsbuild/public/lxplus
> export _CONDOR_SCHEDD_HOST=bigbird21.cern.ch
> export _CONDOR_CREDD_HOST=bigbird21.cern.ch
> condor_ssh_to_job -auto-retry 230323.0

This node is available for the next 5 hours and then it will be deleted.

@smuzaffar
Contributor Author

Maybe for the A100 we need to build for sm_80/compute_80 (https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_13_3_X/master/cuda-flags.file#L7)?
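
For illustration only (the exact content and syntax of cuda-flags.file are not reproduced here, and the file names below are placeholders), adding sm_80 support with plain nvcc flags would look something like:

# hypothetical example: embed SASS for sm_75 (T4) and sm_80 (A100), plus PTX for
# compute_80 so newer GPUs can still JIT-compile the code
nvcc -gencode arch=compute_75,code=sm_75 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 \
     -o kernel_test kernel_test.cu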

@smuzaffar
Contributor Author

By the way, GPU tests for 13.2.X work.

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

OK, I can see the problem when comparing gcc 11 / CUDA 11.8 and gcc 12 / CUDA 12.2:

bash-4.2$ /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc11/external/cuda-compatible-runtime/1.0-ca249f5e31a49bcfe53fea771aa4a1dc/test/cuda-compatible-runtime -k && echo OK
11.8
OK

vs

bash-4.2$ /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02807/el8_amd64_gcc12/external/cuda-compatible-runtime/1.0-e595d828442cc07b078b61d178cb6872/test/cuda-compatible-runtime -k && echo OK
12.2

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

So, yes, it does look related to what architectures we build for.

Running cuda-compatible-runtime -k inside cuda-gdb shows that we hit a CUDA error:

warning: Cuda API error detected: cudaLaunchKernel returned (0xde)

0xde is 222, which is cudaErrorUnsupportedPtxVersion:

  • cudaErrorUnsupportedPtxVersion = 222
    This indicates that the provided PTX was compiled with an unsupported toolchain. The most common reason for this, is the PTX was generated by a compiler newer than what is supported by the CUDA driver and PTX JIT compiler.

So it looks like the driver-level forward compatibility applies only to compiled binaries, not to PTX.
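
As a cross-check, the raw code reported by cuda-gdb can be translated with the runtime API (a small hypothetical helper, not part of the actual test):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // 0xde = 222, the value reported by cuda-gdb for the failed cudaLaunchKernel call
  cudaError_t err = static_cast<cudaError_t>(0xde);
  std::printf("%d = %s: %s\n", int(err), cudaGetErrorName(err), cudaGetErrorString(err));
  // expected output (assumption): 222 = cudaErrorUnsupportedPtxVersion: the provided PTX
  // was compiled with an unsupported toolchain.
  return 0;
}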

@fwyzard
Contributor

fwyzard commented Oct 18, 2023

If that is the reason, cms-sw/cmsdist#8767 might fix the problem.

@makortel
Contributor

The tests are now working, so I think we can close this issue

@makortel
Contributor

+1

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close

@iarspider
Contributor

@smuzaffar
Contributor Author

@iarspider, we should fix this test to not run, as TF for GCC12/CUDA12 is not built with CUDA support.

@fwyzard
Contributor

fwyzard commented Oct 24, 2023

@iarspider I think you were looking into it yourself with cms-sw/cmsdist#8732?
