GPU tests are failing after moving to GCC12 #42997
Comments
A new Issue was created by @smuzaffar Malik Shahzad Muzaffar. @antoniovilela, @sextonkennedy, @Dr15Jones, @makortel, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@smuzaffar where can I find a machine with CUDA 12.1 / 530.30.02 ? |
@fwyzard, the CERN HTCondor GPU nodes have CUDA 12.1. I will send you the instructions to access one of these nodes |
assign core, heterogeneous |
New categories assigned: core,heterogeneous @Dr15Jones,@fwyzard,@makortel,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
However, this explains why the tests fail on machines with CUDA 11.8 . The tests should work on machines with CUDA 12.1 . |
@fwyzard, the following runs on a node with CUDA 12.1
but it fails when
|
OK, on the HTCondor nodes we have
but it looks like |
Mhm, with 530.x we should not need the compatibility layer, CUDA 12.2.x should work out of the box. |
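For reference, the driver/runtime relationship discussed here can be sketched as a small check. This is only an illustrative sketch, not part of CMSSW or of the CUDA API: the function names `decode_cuda_version` and `compatibility_mode` are hypothetical, and the only facts assumed are that CUDA reports versions as the integer `1000 * major + 10 * minor` (so 12.2 is `12020`) and that, from CUDA 11 onwards, a runtime may run on an older driver of the same major release in a limited "minor version compatibility" mode.

```python
from typing import Tuple


def decode_cuda_version(encoded: int) -> Tuple[int, int]:
    """Split CUDA's integer encoding (1000*major + 10*minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def compatibility_mode(driver_cuda: int, runtime_cuda: int) -> str:
    """Classify how a runtime relates to the CUDA version supported by the driver.

    Hypothetical helper for illustration only; the labels are not CUDA terms
    returned by any real API.
    """
    driver = decode_cuda_version(driver_cuda)
    runtime = decode_cuda_version(runtime_cuda)
    if driver >= runtime:
        # The driver is at least as new as the runtime: everything is supported.
        return "native"
    if driver[0] == runtime[0] and runtime[0] >= 11:
        # Same major release, older driver: limited minor version compatibility.
        return "minor-version"
    # Older major release: needs the forward-compatibility package or a new driver.
    return "forward-compat-package"


if __name__ == "__main__":
    # Driver 530.x supports up to CUDA 12.1; the toolkit in the new IBs is 12.2.
    print(compatibility_mode(12010, 12020))
```

With the values from this issue (driver supporting CUDA 12.1, runtime 12.2) the sketch reports the minor-version case, which is expected to work without the compatibility layer but with some features restricted.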
|
Maybe there is a problem with the container image ? |
can you then try running |
It works also inside the container:
|
Then could it be a problem with the A100 GPU? On HTCondor I see we have |
I would need to check locally. |
To use one of the HTCondor nodes, please do (from lxplus)
This node is available for the next 5 hours and then it will be deleted. |
Maybe for the A100 we need to build for |
By the way, the GPU tests for 13.2.X work |
OK, I see the problem between gcc 11 / cuda 11.8 and gcc 12 / cuda 12.2:
vs
|
So, yes, it does look related to what architectures we build for. Running
So it looks like the driver-level forward compatibility applies only to compiled binaries, not to PTX. |
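The conclusion above can be illustrated with a rough model of how a kernel image is selected for a device. This is a sketch under stated assumptions, not the actual CUDA loader logic and not the real CMSSW build flags: the function `select_kernel_image` and the architecture lists in the usage example are hypothetical. It assumes that a compiled binary (SASS) for `sm_XY` runs on devices with the same major compute capability and an equal or higher minor one, that PTX can be JIT-compiled for any device of equal or higher capability, and that PTX JIT is unavailable when the driver is older than the toolkit.

```python
from typing import Iterable, Tuple

Capability = Tuple[int, int]  # (major, minor) compute capability, e.g. (8, 0) for A100


def select_kernel_image(
    device: Capability,
    sass_targets: Iterable[Capability],
    ptx_targets: Iterable[Capability],
    ptx_jit_available: bool,
) -> str:
    """Return which kind of embedded image a device could use (illustrative model)."""
    # Compiled binaries are compatible only within the same major architecture.
    for major, minor in sass_targets:
        if major == device[0] and minor <= device[1]:
            return "sass"
    # PTX is forward compatible, but only if the driver can JIT-compile it.
    if ptx_jit_available:
        for target in ptx_targets:
            if target <= device:
                return "ptx-jit"
    return "no-kernel-image"


if __name__ == "__main__":
    a100 = (8, 0)
    # Hypothetical build: binaries for sm_60/sm_70/sm_75, PTX for compute_75.
    sass = [(6, 0), (7, 0), (7, 5)]
    ptx = [(7, 5)]
    print(select_kernel_image(a100, sass, ptx, ptx_jit_available=True))
    print(select_kernel_image(a100, sass, ptx, ptx_jit_available=False))
    # Adding a binary for sm_80 removes the dependence on PTX JIT.
    print(select_kernel_image(a100, sass + [(8, 0)], ptx, ptx_jit_available=False))
```

In this model an A100 depends on PTX JIT unless a binary for its own architecture is included, which is why building for the missing architecture would avoid the failure when the driver is older than the runtime.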
If that is the reason, cms-sw/cmsdist#8767 might fix the problem. |
The tests are now working, so I think we can close this issue |
+1 |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
There is still a failing GPU test: https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc12/CMSSW_13_3_GPU_X_2023-10-23-2300/unitTestLogs/PhysicsTools/TensorFlow#/154-154 @fwyzard could you please take a look? |
@iarspider, we should fix this test so that it does not run, as TF for GCC12/CUDA12 is not built with CUDA support |
See #42883 . |
@iarspider I think you were looking into it yourself with cms-sw/cmsdist#8732 ? |
RelVals/unit tests are failing for GPU IBs after moving to GCC12. It looks like cuda-compatible-runtime does not find any supported CUDA device [a]. It looks like the CUDA driver version 12.1 and our CUDA 12.2 runtime are not compatible. FYI @fwyzard
[a]