tpetra: broken unit tests with cuda 12.4 + h100 gpus #13399

Open
vasylivy opened this issue Aug 27, 2024 · 4 comments
Labels
pkg: Tpetra type: bug The primary issue is a bug in Trilinos code or tests

Comments

@vasylivy
Hi,

Reporting broken unit tests with CUDA 12.4 + H100 GPUs. See configuration 1 reported in #13397.

334:TpetraCore_CrsMatrix_MatvecFence_MPI_4

CrsMatrix_int_longlong_double_Tpetra_KokkosCompat_KokkosCudaWrapperNode_MatvecFence_UnitTest

FenceCounter::get_count_global(exec_space.name()) = 40 == expectedGlobalCount = 60

370:TpetraCore_AsyncTransfer_UnitTests_MPI_4

p=3 | The following tests FAILED:
p=3 |     13. AsyncReverseExport_double_int_longlong_LowerTriangularCrsMatrix_UnitTest ... 
p=3 |     21. TransferArrived_double_int_longlong_CrsMatrix_forwardImportTrue_UnitTest ... 
p=3 |     23. TransferArrived_double_int_longlong_CrsMatrix_forwardExportTrue_UnitTest ... 

390:TpetraCore_MatrixMarket_Tpetra_CrsMatrix_Dist_BinaryPerProcess_simple_MPI_3
Throw number = 1

Throw test that evaluated to true: npRows * npCols != np

nProcessorCols 3 * nProcessorRows 2 = 6 must equal nProcessors 3 for 2D distribution
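
For readers unfamiliar with the condition in that message, here is a minimal sketch of the arithmetic it describes. The helper names `gridMatchesRanks` and `validGrids` are hypothetical and are not part of Tpetra; the sketch only illustrates that a 2 x 3 process grid needs 6 ranks, while this test runs on 3, and it makes no claim about why that grid was requested.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper (not Tpetra API): true when an npRows x npCols
// process grid uses exactly np ranks, i.e. the throw condition
// "npRows * npCols != np" would be false.
bool gridMatchesRanks(int npRows, int npCols, int np) {
  return npRows * npCols == np;
}

// Hypothetical helper: list every (rows, cols) grid that is valid for np ranks.
std::vector<std::pair<int, int>> validGrids(int np) {
  std::vector<std::pair<int, int>> grids;
  for (int rows = 1; rows <= np; ++rows) {
    if (np % rows == 0) {
      grids.emplace_back(rows, np / rows);
    }
  }
  return grids;
}
```

With the values from the message, `gridMatchesRanks(2, 3, 3)` is false, and the only grids `validGrids(3)` returns are 1 x 3 and 3 x 1.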

The following tests time out at 300 s but were fine with the non-UVM config. I'll have to retry these later. If you have a recommended timeout, let me know.

367:TpetraCore_ImportExport2_UnitTests_Send_MPI_4
369:TpetraCore_ImportExport2_UnitTests_Alltoall_MPI_4
427:TpetraCore_MatrixMatrix_UnitTests_MPI_4
428:TpetraCore_FECrs_MatrixMatrix_UnitTests_MPI_4

Thanks,

Yaro

@vasylivy vasylivy added the type: bug The primary issue is a bug in Trilinos code or tests label Aug 27, 2024
@csiefer2
Member

@vasylivy Relevant machine is down for upgrades. We will compare against our configuration and try to reproduce when it comes back up.

@vasylivy
Author

Tested config 1 with the following turned off:

-DKokkos_ENABLE_CUDA_UVM=OFF
-DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=OFF
-DTpetra_ALLOCATE_IN_SHARED_SPACE=OFF

With these off, the unit tests pass, so it would appear to be UVM-related.

Yaro

@csiefer2
Member

csiefer2 commented Aug 29, 2024

@vasylivy I built all the unit tests the way the perf tests build on Hops and they all pass.

The RDC build failed because evidently you need cuSPARSE enabled to build with RDC (why?). Will fix and report back when that finishes.

I can try a UVM one as well w/o RDC.

As an aside, I just got new MPI settings from @jjellio that I need to try.

@csiefer2
Member

csiefer2 commented Sep 3, 2024

@vasylivy Yeah, it appears to be UVM, because RDC by itself has exactly 1 failing test.
