Framework: Switch CUDA AT2 build to be non-UVM and enable tests #13439

Open: sebrowne wants to merge 6 commits into develop

sebrowne
Contributor

@sebrowne sebrowne commented Sep 10, 2024

@trilinos/framework

Motivation

We want to align the CUDA AT2 build with the old AutoTester build.

Related Issues

https://sems-atlassian-son.sandia.gov/jira/browse/TRILFRAME-673

@sebrowne sebrowne added AT2-SpecialApprove (Beta) Special approval label for AT2. AT: WIP Causes the PR autotester to not test the PR. (Remove to allow testing to occur.) labels Sep 10, 2024
Which will also cause it to start running all of the appropriate tests.
If I remember correctly, we had this disabled because the containers
were running out of disk space, but we want this enabled for the "real"
PR configuration.

Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
@sebrowne sebrowne added AT2-SpecialApprove (Beta) Special approval label for AT2. and removed AT2-SpecialApprove (Beta) Special approval label for AT2. labels Sep 10, 2024
Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
@sebrowne sebrowne requested a review from a team as a code owner September 10, 2024 03:01
@sebrowne sebrowne added AT2-SpecialApprove (Beta) Special approval label for AT2. and removed AT2-SpecialApprove (Beta) Special approval label for AT2. labels Sep 10, 2024
Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
We disable X11 everywhere else, so be consistent here.  In the future,
we probably want to enable this, since we DO have X11 in the containers,
but getting that hooked up and working is for another day.

Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
@sebrowne sebrowne added AT2-SpecialApprove (Beta) Special approval label for AT2. and removed AT2-SpecialApprove (Beta) Special approval label for AT2. labels Sep 11, 2024
@sebrowne
Contributor Author

The CUDA tests look good, with four exceptions, detailed here: https://sems-cdash-son.sandia.gov/cdash/viewTest.php?onlyfailed&buildid=211376

@trilinos/intrepid2 I see that the failing test was set to RUN_SERIAL for CUDA builds; I can do that here as well if that's still what we want to do.
@trilinos/panzer that test is enabled in all other configs, and I see no obvious framework-side issues.
@trilinos/rol that test is enabled in all other configs, and I see no obvious framework-side issues.
@trilinos/stratimikos we had that test disabled for our non-CUDA container as well, but again, nothing really obvious from our side.

If any developers from the tagged teams can provide any insight for the four failing tests (and they do fail reliably), it would be much appreciated! I can turn them off, but I wanted to at least do SOME due diligence and see what the community thinks.

@CamelliaDPG
Contributor

@trilinos/intrepid2 I see that the failing test was set to RUN_SERIAL for CUDA builds; I can do that here as well if that's still what we want to do.

Yes, please. The MonolithicExecutable test has a lot of test cases, and some of them are intensive, so sharing compute resources with other tests can lead to timeouts. We use RUN_SERIAL to mitigate this.

@rppawlo
Contributor

rppawlo commented Sep 12, 2024

@cgcgcg - would you mind taking a look at the panzer/mini-em failure here? Looks to be a linear solver issue similar to what you have fixed in the past.

@cgcgcg
Contributor

cgcgcg commented Sep 12, 2024

I see this message in the output of the failing Stratimikos and Panzer tests:

--------------------------------------------------------------------------
The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
cannot be used.
  cuIpcGetMemHandle return value:   1
  address: 0x42b363c80
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
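
The message reports the raw return value rather than its symbolic name. As a reading aid, the sketch below maps a few CUresult codes to their names as defined in cuda.h. The helper name `cu_result_name` and the table are illustrative and deliberately incomplete; they are not part of the PR or of any CUDA tooling.

```python
# Partial table of CUresult values from cuda.h (CUDA driver API).
# Only a handful of codes are listed; this is not the full enum.
CU_RESULT_NAMES = {
    0: "CUDA_SUCCESS",
    1: "CUDA_ERROR_INVALID_VALUE",
    2: "CUDA_ERROR_OUT_OF_MEMORY",
    3: "CUDA_ERROR_NOT_INITIALIZED",
}


def cu_result_name(code: int) -> str:
    """Hypothetical helper: return the symbolic name for a CUresult code."""
    return CU_RESULT_NAMES.get(code, f"unknown CUresult ({code})")


if __name__ == "__main__":
    # The log above reports "cuIpcGetMemHandle return value: 1".
    print(cu_result_name(1))
```

Under this reading, the return value 1 in the log corresponds to CUDA_ERROR_INVALID_VALUE, meaning the driver rejected an argument to the call. That alone does not say why the handle could not be created.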

Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
For some reason, there are a couple of tests that are failing when RDMA
support is initialized.  I debugged it to the point of disabling the
smcuda BTL in OpenMPI. My guess is that something is wrong with our
container build of OpenMPI, OR there is something different
hardware-wise about our new Ampere80 machines (I checked the PCI bus
addresses because that was something that a brief Google investigation
indicated, but they didn't look any worse than the Volta70 machines).

Signed-off-by: Samuel E. Browne <sebrown@sandia.gov>
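
For readers unfamiliar with the workaround described in the commit message above: one way to disable a BTL in OpenMPI is to exclude it through the `btl` MCA parameter, either on the command line (`--mca btl ^smcuda`) or through the `OMPI_MCA_btl` environment variable. The sketch below shows the environment-variable form. Whether this PR uses that mechanism or a different one is not stated in the commit message, and the helper names are hypothetical.

```python
import os
import subprocess


def env_without_smcuda(base_env=None):
    """Hypothetical helper: copy an environment and exclude the smcuda BTL.

    Assumption: the launcher honors OMPI_MCA_<param> environment variables,
    which is OpenMPI's documented way to set MCA parameters.
    """
    env = dict(os.environ if base_env is None else base_env)
    # A leading "^" excludes the listed components and keeps all others.
    # Note: this overwrites any OMPI_MCA_btl value that was already set.
    env["OMPI_MCA_btl"] = "^smcuda"
    return env


def run_mpi_test(cmd, base_env=None):
    """Run a test command (a list of strings) with the smcuda BTL excluded."""
    return subprocess.run(cmd, env=env_without_smcuda(base_env), check=False)
```

Excluding the component this way leaves the other transports available, so it is a narrower change than disabling CUDA-aware MPI support altogether.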
@sebrowne sebrowne added AT2-SpecialApprove (Beta) Special approval label for AT2. and removed AT2-SpecialApprove (Beta) Special approval label for AT2. labels Sep 16, 2024