alpaka-related updates #41340

Open · 2 tasks done
fwyzard opened this issue Apr 14, 2023 · 35 comments

fwyzard commented Apr 14, 2023

I'm collecting some "to do" items to improve the alpaka-related developments:

  • provide a simple way to gracefully skip tests when a given back-end does not have any devices (e.g. no AMD GPUs while testing the ROCm back-end); something similar to cms::cudatest::requireDevices(); possibly also some explicit support for Catch2 tests.
  • remove from the release the support for the "old" GPU framework: DataFormats/Portable/interface/Product.h, HeterogeneousCore/AlpakaCore/interface/ScopedContext.h, etc.
fwyzard commented Apr 14, 2023

assign heterogeneous

cmsbuild commented:

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign. Thanks.

cmsbuild commented Apr 14, 2023

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard commented Apr 14, 2023

#41341 should address the second bullet, at least in part.
To be checked if there are any other leftovers.

makortel commented:
On the first point, I wonder if we could (at least eventually) use scram itself to avoid running, in the first place, the tests that would fail because of missing hardware? @smuzaffar

fwyzard commented Apr 14, 2023

Interesting idea.

For a CUDA-only test we could do something like this:

<bin name="cudaTest" file="cudaTest.cu">
  <use name="cuda"/>
  <flags TEST_RUNNER_CMD="cudaIsEnabled &amp;&amp; ./cudaTest || echo 'Failed to initialise the CUDA runtime, the test will be skipped.'"/>
</bin>

For an Alpaka-based test there are two problems:

  • the name of the test binary is different for each back-end
  • the name of the check could be different for each back-end

The first problem could be solved if SCRAM could provide an environment variable that expands to the name of the binary (or to its full path) that can be used inside TEST_RUNNER_CMD, something like

  <flags TEST_RUNNER_CMD="cudaIsEnabled &amp;&amp; ./$TEST_BINARY_NAME || echo 'Failed to initialise the CUDA runtime, the test $TEST_BINARY_NAME will be skipped.'"/>

The second problem could be solved if SCRAM could provide an environment variable with the name of the Alpaka backend being tested, like SerialSync or ROCmAsync, something like

  <flags TEST_RUNNER_CMD="alpakaIsEnabled$TEST_ALPAKA_BACKEND &amp;&amp; ./$TEST_BINARY_NAME || echo 'Failed to initialise the $TEST_ALPAKA_BACKEND backend, the test $TEST_BINARY_NAME will be skipped.'"/>

Of course, it would be even nicer if SCRAM could automate this with a single flag, something like

  <flags TEST_REQUIRE_ALPAKA_BACKEND/>

and report these tests as Skip in the console, instead of Pass/Fail.

fwyzard commented Apr 14, 2023

And to clarify: alpakaIsEnabled$TEST_ALPAKA_BACKEND does not exist yet, but it's relatively trivial to write; I'll see if I can make a PR for it.

smuzaffar commented Apr 14, 2023

@fwyzard, I was also thinking along the lines of using cudaIsEnabled and rocmIsEnabled, or better alpakaIsEnabled$TEST_ALPAKA_BACKEND or alpakaIsEnabled rocm|cuda|serial|tbb|.... We can change the build rules so that a unit test is run only if it explicitly depends on

  • rocm and rocmIsEnabled or alpakaIsEnabledROCmAsync runs successfully
  • cuda and cudaIsEnabled or alpakaIsEnabledCudaAsync runs successfully
  • alpaka and alpakaIsEnabled$TEST_ALPAKA_BACKEND (for selected backend) runs successfully.
  • We should also have unit tests to actually make sure that alpakaIsEnabled$TEST_ALPAKA_BACKEND actually runs if hardware is available e.g. unit tests like
# For GPU IBs, or if cuda tests are explicitly enabled via env, i.e. SCRAM_ENABLE_TESTS=cuda
<iftest name="cuda">
  <test name="alpakaIsEnabledCudaAsync" command="alpakaIsEnabledCudaAsync"/>
</iftest>
# For ROCm IBs, or if rocm tests are explicitly enabled via env, e.g. SCRAM_ENABLE_TESTS=rocm
<iftest name="rocm">
  <test name="alpakaIsEnabledROCmAsync" command="alpakaIsEnabledROCmAsync"/>
</iftest>

SCRAM should run alpakaIsEnabled$TEST_ALPAKA_BACKEND for each supported backend once (before the start of the tests) and reuse the results.

@aandvalenzuela is interested in improving the GPU/Alpaka tests. If the above sounds good, she can already look into implementing it.

fwyzard commented Apr 14, 2023

Looks good to me.

I would suggest alpakaIsEnabled$TEST_ALPAKA_BACKEND rather than alpakaIsEnabled rocm|cuda|serial|tbb|..., because it is much simpler to make the back-end choice at compile time (and so end up with a different binary for each back-end) than at runtime (to support the choice based on a command-line argument).

We have actually done the latter for pixeltrack-standalone, but not in CMSSW where we are relying on the plug-in system.

smuzaffar commented:
@makortel @fwyzard, cms-sw/cmssw-config#95 implements the new unit test rules for the cuda/rocm/alpaka backends. See cms-sw/cmssw-config#95 (comment) and let me know whether this is sufficient.

fwyzard commented Apr 19, 2023

@smuzaffar from the description it looks good.

Since the change impacts both "cuda" and "alpaka-cuda" tests, maybe it would actually be better to check cudaIsEnabled (and rocmIsEnabled for ROCm) instead of introducing an alpaka-specific check?

makortel commented:
@smuzaffar From the descriptions it looks good to me too. My only question is about tests that run cmsRun (e.g. directly or via a shell script). From cms-sw/cmssw-config#95 I understand the behavior is driven by the dependencies of the test. Is the <iftest name="cuda"> (from #41340 (comment)) the way to control the execution of those tests?

smuzaffar commented Apr 19, 2023

@makortel, <iftest name="cuda"> is not needed any more. Execution of the test is driven by the test dependency itself. All tests with a direct cuda dependency (i.e. they have either <use name="cuda"> or the alpaka-cuda backend enabled) will be skipped if cudaIsEnabled does not run successfully. The same holds for rocm-based tests.

For dedicated GPU IBs, we can set USER_UNIT_TESTS=cuda, which will force running only the unit tests with a cuda or alpaka-cuda dependency (all other tests will be skipped). This will also allow us to run GPU unit tests for PRs :-)

makortel commented:
@smuzaffar Thanks, but I'm still confused :) Let's take a concrete example

<ifrelease name="_GPU_">
<test name="testHeterogeneousCoreAlpakaTestModules" command="testAlpakaModules.sh cuda"/>
<else/>
<test name="testHeterogeneousCoreAlpakaTestModules" command="testAlpakaModules.sh cpu"/>
</ifrelease>

Here the goal is to run the test script in one way when cuda tests should be run (i.e. either the machine has a GPU or always in the GPU IBs), and in other way when there are no GPUs. The script internally calls cudaIsEnabled for the cpu option to enable the CUDA code path outside of GPU IBs if the machine has a GPU.

Should this be expressed along

<test name="testHeterogeneousCoreAlpakaTestModulesCUDA" command="testAlpakaModules.sh cuda">
  <use name="cuda"/>
</test>
<test name="testHeterogeneousCoreAlpakaTestModulesCPU" command="testAlpakaModules.sh cpu"/>

then? (and probably making the cpu option do nothing if the machine has a GPU)

smuzaffar commented Apr 19, 2023

Yes @makortel, that fragment of the BuildFile should be changed as you suggested.

  • With explicit USER_UNIT_TESTS=cuda (e.g. for GPU IBs), scram will only run testHeterogeneousCoreAlpakaTestModulesCUDA and skip testHeterogeneousCoreAlpakaTestModulesCPU (as the latter does not depend on cuda)
  • For non-GPU IBs (i.e. no explicit USER_UNIT_TESTS), scram will run testHeterogeneousCoreAlpakaTestModulesCUDA if cudaIsEnabled is successful, e.g. on our ppc64le nodes (where we have GPUs). scram will always run testHeterogeneousCoreAlpakaTestModulesCPU.

The testAlpakaModules.sh script can be updated to not check cudaIsEnabled itself and always do what is passed via the command line.

fwyzard commented Apr 19, 2023

> testAlpakaModules.sh script can be updated to not check for cudaIsEnabled and always do what is passed via command-line.

... and adding a ROCm version.

iarspider added a commit to iarspider-cmssw/cmssw that referenced this issue Apr 20, 2023
smuzaffar commented:
By the way, the new build rules that make use of USER_UNIT_TESTS=cuda to run cuda-specific tests are in the 13.0.X and 13.1.X IBs.

makortel commented:
@fwyzard This issue has been completed, right?

fwyzard commented Feb 14, 2024

I don't remember what it was about... I'll have to re-read the thread to remind myself of the details.

fwyzard commented Feb 19, 2024

Looking at the tests, I think we should still introduce a way to make them fail gracefully, instead of crashing, when no GPU is available.

For example:

$ ./deviceVertexFinderByDensity_tROCmAsync

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
   0x7f06884fb731 <waitpid+33>: ja     0x7f06884fb788 <waitpid+120>
   0x7f06884fb733 <waitpid+35>: ret    
   0x7f06884fb734 <waitpid+36>: nop    DWORD PTR [rax+0x0]
   0x7f06884fb738 <waitpid+40>: push   r12
   0x7f06884fb73a <waitpid+42>: mov    r12d,edx
   0x7f06884fb73d <waitpid+45>: push   rbp
   0x7f06884fb73e <waitpid+46>: mov    rbp,rsi
#0  0x00007f06884fb72b in waitpid () from /lib64/libc.so.6
#1  0x00007f068845ccf7 in do_system () from /lib64/libc.so.6
#2  0x00007f068fd1881d in TUnixSystem::StackTrace() () from /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el8_amd64_gcc12/lib/libCore.so
#3  0x00007f068fd181d4 in TUnixSystem::DispatchSignals(ESignals) () from /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el8_amd64_gcc12/lib/libCore.so
#4  <signal handler called>
#5  std::__shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#6  std::shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> > >::shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/bits/shared_ptr.h:204
#7  alpaka::DevUniformCudaHipRt<alpaka::ApiHipRt>::DevUniformCudaHipRt (this=0x7ffca4511f00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#8  main () at src/RecoTracker/PixelVertexFinding/test/alpaka/VertexFinder_t.cc:28
===========================================================


The lines below might hint at the cause of the crash. If you see question
marks as part of the stack trace, try to recompile with debugging information
enabled and export CLING_DEBUG=1 environment variable before running.
You may get help by asking at the ROOT forum https://root.cern/forum
preferably using the command (.forum bug) in the ROOT prompt.
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs or (preferably) using the command (.gh bug) in
the ROOT prompt. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  std::__shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#6  std::shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> > >::shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/bits/shared_ptr.h:204
#7  alpaka::DevUniformCudaHipRt<alpaka::ApiHipRt>::DevUniformCudaHipRt (this=0x7ffca4511f00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#8  main () at src/RecoTracker/PixelVertexFinding/test/alpaka/VertexFinder_t.cc:28
===========================================================

makortel commented:
Playing around with scram b runtests_<test_name>, I think having scram b runtests_deviceVertexFinderByDensity_t run (or skip) the tests of all the backends of the test could be useful as well, rather than having to explicitly call scram b runtests_deviceVertexFinderByDensity_tROCmAsync.

fwyzard commented Feb 20, 2024

> Looking at the tests, I think we should still introduce a way to make them fail gracefully instead of crashing, when no GPU is available.

This is actually as simple as

  if (cms::alpakatools::devices<Platform>().empty()) {
    std::cout << "No devices found, the test will be skipped.\n";
    exit(EXIT_SUCCESS);
  }

@makortel, do you think we need to wrap it in a function like cms::alpakatools::requireDevices<Platform>(), or is it simple enough that people can use those four lines directly?

fwyzard commented Feb 20, 2024

I've updated an example in #44036.

makortel commented:
> This is actually as simple as
>
>   if (cms::alpakatools::devices<Platform>().empty()) {
>     std::cout << "No devices found, the test will be skipped.\n";
>     exit(EXIT_SUCCESS);
>   }

This is similar to what we had at one point (and still have with direct CUDA). My concern with this specific approach is that an infrastructure problem, e.g. in the GPU IBs, would go unnoticed because the tests report success (by default scram uses the exit code of cudaIsEnabled to decide whether to run or skip cuda-dependent tests, but in GPU IB tests USER_UNIT_TESTS=cuda is set to require all the GPU tests to be run).

If the test returned EXIT_FAILURE in this case (i.e. the test still fails, but with a meaningful message), I'd be fine with it.

I don't have strong feelings about whether to abstract or to copy-paste. Four lines isn't much, but it's still something. On the other hand, e.g. in Catch2-based tests one could do

  REQUIRE(not cms::alpakatools::devices<Platform>().empty());

which I wouldn't consider worth abstracting.

fwyzard commented Feb 20, 2024

Sure, I'm OK with returning EXIT_FAILURE.

In fact I coded it that way at first, then I switched to EXIT_SUCCESS because that's what we used in CUDA.

fwyzard commented Feb 20, 2024

(will fix tomorrow)

fwyzard commented Feb 21, 2024

For Catch2, we currently have a message at the beginning of the output, coming from a cout:

[image: original]

I'd like to move it to use a Catch2 macro like FAIL or INFO.
Which one do you like better:

[image: option 1]

or

[image: option 2]

or the original one?

fwyzard commented Feb 21, 2024

Another option, using WARN:

[image: option 3]

fwyzard commented Feb 21, 2024

#44036 / #44042 should change all alpaka-based tests to fail with a warning message when there are no devices.

makortel commented Feb 21, 2024

I think

[quoted image]

would be sufficient, and the message is most concise.

After that, I'd prefer

[quoted image]

over

[quoted image]

fwyzard commented Feb 21, 2024

Mhm... I don't see the quoted images :-(

Note that the message would be the same in all cases; it's just a matter of which macro to use.

rbhattacharya04 pushed a commit to rbhattacharya04/cmssw that referenced this issue Feb 21, 2024
makortel commented:
Ok, third attempt to quote pictures...

makortel commented:
Going back to text then. My (not very strong) order of preferences would be

  • option 2
  • option 1
  • option 3

fwyzard commented Feb 21, 2024

OK, so directly FAIL with the message, instead of issuing an INFO/WARN message and failing on the REQUIRE.

fwyzard commented Feb 21, 2024

I'll update the PRs accordingly.

rbhattacharya04 pushed a commit to rbhattacharya04/cmssw that referenced this issue Mar 7, 2024