Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

muelu: broken unit tests with cuda 12.4 + h100 gpus #13397

Open
vasylivy opened this issue Aug 27, 2024 · 6 comments
Open

muelu: broken unit tests with cuda 12.4 + h100 gpus #13397

vasylivy opened this issue Aug 27, 2024 · 6 comments
Labels
pkg: MueLu type: bug The primary issue is a bug in Trilinos code or tests

Comments

@vasylivy
Copy link

Hi,

I'm seeing various errors in muelu unit tests on nvidia h100 gpu using cuda 12.4 w/ kokkos uvm flag enabled. I'm not sure which approach is preferred for reporting here but I've tested using two different approaches to build Trilinos for h100s. Can someone with access to h100s try to reproduce the failures?

Approach 1. Use cmake directly and build Trilinos master SHA bf922e75428. The config file is shown below

export TRILINOS_DIR=/gpfs/yvvasyl/Trilinos
export TRILINOS_INSTALL=/gpfs/yvvasyl/trilinos_install

module load aue/cmake aue/binutils/2.4.1 cudatoolkit/12.4 aue/gcc/10.3.0 \
            aue/openmpi/4.1.6-gcc-10.3.0 aue/netlib-lapack/3.11.0-gcc-10.3.0 \
            aue/ninja/1.11.1 aue/python

module list

export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper
export TPETRA_ASSUME_GPU_AWARE_MPI=0
export CUDA_LAUNCH_BLOCKING=0

# CMake configuration
cmake \
-G"Ninja" \
-DCMAKE_INSTALL_PREFIX=$TRILINOS_INSTALL \
-DCMAKE_CXX_STANDARD="17" \
-DCMAKE_CXX_COMPILER="`which mpicxx`" \
-DCMAKE_C_COMPILER="`which mpicc`" \
-DCMAKE_FORTRAN_COMPILER="`which mpifort`" \
-DCMAKE_BUILD_TYPE="RELEASE" \
-DBUILD_SHARED_LIBS=OFF \
\
-DTrilinos_ENABLE_ALL_PACKAGES=OFF \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON \
-DTrilinos_ASSERT_MISSING_PACKAGES=OFF \
-DTrilinos_ALLOW_NO_PACKAGES=OFF \
-DTrilinos_ENABLE_OpenMP=OFF \
-DTrilinos_ENABLE_TESTS=ON \
\
-DTrilinos_ENABLE_Amesos2=ON \
 -DAmesos2_ENABLE_SuperLU=OFF \
 -DAmesos2_ENABLE_KLU2=ON \
-DTrilinos_ENABLE_Belos=ON \
-DTrilinos_ENABLE_Ifpack2=ON \
-DTrilinos_ENABLE_Teko=ON \
-DTrilinos_ENABLE_Kokkos=ON \
 -DKokkos_ARCH_HOPPER90=ON \
 -DKokkos_ENABLE_CUDA=ON \
 -DKokkos_ENABLE_HIP=OFF \
 -DKokkos_ENABLE_OPENMP=OFF \
 -DKokkos_ENABLE_CUDA_UVM=ON \
 -DKokkos_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE=ON \
 -DKokkos_ENABLE_CUDA_LAMBDA=ON \
 -DKokkos_ENABLE_CUDA_CONSTEXPR=ON \
-DTrilinos_ENABLE_KokkosKernels=ON \
 -DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=ON \
-DTrilinos_ENABLE_MueLu=ON \
-DTrilinos_ENABLE_Tpetra=ON \
 -DTpetra_ENABLE_CUDA=ON \
 -DTpetra_INST_HIP=OFF \
 -DTpetra_INST_SERIAL=OFF \
 -DTpetra_INST_OPENMP=OFF \
 -DTpetra_INST_DOUBLE=ON \
 -DTpetra_ALLOCATE_IN_SHARED_SPACE=ON \
-DTrilinos_ENABLE_Gtest=ON \
-DTrilinos_ENABLE_Teuchos=ON \
-DTrilinos_ENABLE_Xpetra=ON \
-DTrilinos_ENABLE_Zoltan2=ON \
-DTrilinos_ENABLE_Panzer=OFF \
-DTPL_ENABLE_CUSPARSE:BOOL=ON \
-DTPL_ENABLE_BLAS=ON \
  -D BLAS_LIBRARY_DIRS:FILEPATH="${LAPACK_ROOT}/lib64" \
  -D BLAS_LIBRARY_NAMES:STRING="blas" \
-DTPL_ENABLE_LAPACK=ON \
  -D LAPACK_INCLUDE_DIRS:FILEPATH="${LAPACK_ROOT}/include" \
  -D LAPACK_LIBRARY_DIRS:FILEPATH="${LAPACK_ROOT}/lib64" \
  -D LAPACK_LIBRARY_NAMES:STRING="lapack" \
-DTPL_ENABLE_Netcdf=OFF \
-DTPL_ENABLE_MPI=ON \
-DMPI_USE_COMPILER_WRAPPERS=ON \
-DMPI_EXEC="mpirun" \
-DMPI_EXEC_NUMPROCS_FLAG="-np" \
-DMPI_EXEC_POST_NUMPROCS_FLAGS:STRING="-bind-to;none" \
\
$TRILINOS_DIR

ninja -j96 -k 0

ctest -j16 --timeout 300

The following set of muelu unit tests fails, note that some of these are timeouts. Doubling the timeout didn't help with other testing so I've tried to be consistent and set it to 300 across all tests.

676:MueLu_StandardReuse-Tpetra_MPI_4
687:MueLu_UnitTestsTpetra_MPI_4
689:MueLu_UnitTestsBlockedTpetra_MPI_4
695:MueLu_UnitTestsTpetra_kokkos_MPI_4
699:MueLu_ImportPerformance_Tpetra_MPI_4
700:MueLu_ComboPTest_MPI_4
728:MueLu_CreateOperatorTpetra_MPI_4
730:MueLu_ParameterListInterpreterTpetra_MPI_4
733:MueLu_BlockedTransfer_Tpetra_MPI_4
735:MueLu_ReitzingerPFactory_MPI_4
737:MueLu_Maxwell3D-Tpetra_1_MPI_4
738:MueLu_Maxwell3D-Tpetra_2_MPI_4
739:MueLu_Maxwell3D-Tpetra_3_MPI_4
740:MueLu_Maxwell3D-Tpetra2_MPI_4
751:MueLu_MortarSurfaceCoupling_DofBased_Blocked_SimpleSmoother_2dof_medium_MPI_4
774:MueLu_Driver_TogglePFactory_semi_sa_line_easy_Tpetra_MPI_4
778:MueLu_Driver_TogglePFactory_lin_const_restrict_Tpetra_MPI_4
789:MueLu_Structured_Laplace2D_Tpetra_MPI_4
793:MueLu_Structured_Elasticity3D_Tpetra_MPI_4
801:MueLu_Structured_Interp_Laplace2D_kokkos_MPI_4
803:MueLu_Structured_Interp_SA_Laplace2D_kokkos_MPI_4

I see various errors across these tests including cuda errors e.g. triaging some of these failures

MueLu_Driver_TogglePFactory_semi_sa_line_easy_Tpetra_MPI_4
Transpose P (MueLu::TransPFactory)
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorIllegalAddress): an illegal memory access
MueLu_StandardReuse-Tpetra_MPI_4
MueLu_ImportPerformance_Tpetra_MPI_4
MueLu_ComboPTest_MPI_4

 Throw number = 1
 
 Throw test that evaluated to true: lclSuccess == 0
 
 Tpetra::createRemoteOnlyImport: this->getRemoteLIDs() has 2indices "bad" indices on this process.
MueLu_UnitTestsBlockedTpetra_MPI_4

munmap_chunk(): invalid pointer
MueLu_CreateOperatorTpetra_MPI_4
 Throw number = 1
 
 Throw test that evaluated to true: errCode != 0
 
 computeNumPacketsAndOffsets: parallel_scan error code 3 != 0.

If on the other I hand I build using the following modified trilinos config, then ALL tests pass within the same 300s timeout and on the same machine.

export TRILINOS_DIR=/gpfs/yvvasyl/Trilinos
export TRILINOS_INSTALL=/gpfs/yvvasyl/trilinos_install

module load aue/cmake aue/binutils/2.4.1 cudatoolkit/12.4 aue/gcc/10.3.0 \
            aue/openmpi/4.1.6-gcc-10.3.0 aue/netlib-lapack/3.11.0-gcc-10.3.0 \
            aue/ninja/1.11.1 aue/python

module list

export OMPI_CXX=${TRILINOS_DIR}/packages/kokkos/bin/nvcc_wrapper
export TPETRA_ASSUME_GPU_AWARE_MPI=0
export CUDA_LAUNCH_BLOCKING=0

# CMake configuration
cmake \
-G"Ninja" \
-DCMAKE_INSTALL_PREFIX=$TRILINOS_INSTALL \
-DCMAKE_CXX_STANDARD="17" \
-DCMAKE_CXX_COMPILER="`which mpicxx`" \
-DCMAKE_C_COMPILER="`which mpicc`" \
-DCMAKE_FORTRAN_COMPILER="`which mpifort`" \
-DCMAKE_BUILD_TYPE="RELEASE" \
-DBUILD_SHARED_LIBS=OFF \
\
-DTrilinos_ENABLE_ALL_PACKAGES=OFF \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION=ON \
-DTrilinos_ASSERT_MISSING_PACKAGES=OFF \
-DTrilinos_ALLOW_NO_PACKAGES=OFF \
-DTrilinos_ENABLE_OpenMP=OFF \
-DTrilinos_ENABLE_TESTS=ON \
\
-DTrilinos_ENABLE_Amesos2=ON \
 -DAmesos2_ENABLE_SuperLU=OFF \
 -DAmesos2_ENABLE_KLU2=ON \
-DTrilinos_ENABLE_Belos=ON \
-DTrilinos_ENABLE_Ifpack2=ON \
-DTrilinos_ENABLE_Teko=ON \
-DTrilinos_ENABLE_Kokkos=ON \
 -DKokkos_ARCH_HOPPER90=ON \
 -DKokkos_ENABLE_CUDA=ON \
 -DKokkos_ENABLE_HIP=OFF \
 -DKokkos_ENABLE_OPENMP=OFF \
-DTrilinos_ENABLE_KokkosKernels=ON \
-DTrilinos_ENABLE_MueLu=ON \
-DTrilinos_ENABLE_Tpetra=ON \
 -DTpetra_ENABLE_CUDA=ON \
 -DTpetra_INST_HIP=OFF \
 -DTpetra_INST_SERIAL=OFF \
 -DTpetra_INST_OPENMP=OFF \
 -DTpetra_INST_DOUBLE=ON \
-DTrilinos_ENABLE_Gtest=ON \
-DTrilinos_ENABLE_Teuchos=ON \
-DTrilinos_ENABLE_Xpetra=ON \
-DTrilinos_ENABLE_Zoltan2=ON \
-DTrilinos_ENABLE_Panzer=OFF \
-DTPL_ENABLE_BLAS=ON \
  -D BLAS_LIBRARY_DIRS:FILEPATH="${LAPACK_ROOT}/lib64" \
  -D BLAS_LIBRARY_NAMES:STRING="blas" \
-DTPL_ENABLE_LAPACK=ON \
  -D LAPACK_INCLUDE_DIRS:FILEPATH="${LAPACK_ROOT}/include" \
  -D LAPACK_LIBRARY_DIRS:FILEPATH="${LAPACK_ROOT}/lib64" \
  -D LAPACK_LIBRARY_NAMES:STRING="lapack" \
-DTPL_ENABLE_Netcdf=OFF \
-DTPL_ENABLE_MPI=ON \
-DMPI_USE_COMPILER_WRAPPERS=ON \
-DMPI_EXEC="mpirun" \
-DMPI_EXEC_NUMPROCS_FLAG="-np" \
-DMPI_EXEC_POST_NUMPROCS_FLAGS:STRING="-bind-to;none" \
\
$TRILINOS_DIR

ninja -j96 -k 0

ctest -j16 --timeout 300

Approach 2. Instead of using cmake directly, use spack to install trilinos master with following spack env activated. You may need to tweak the yaml file depending on the machines / modules available.

spack:
  view:
    default:
      root: .spack_env/view
      link: all
  concretizer:
    unify: true
  specs:
  - trilinos@master +test +wrapper +uvm ~shared cxxstd=17
  - gcc@10.3.1
  - binutils@2.4.1
  - openmpi@4.1
  config:
    build_stage:
      - /gpfs/yvvasyl/spack-stage-trilinos-master/
  packages:
    openssh:
      externals:
      - prefix: /usr
        spec: openssh@default
    gcc:
      externals:
      - spec: gcc@10.3.1
        flags:
          ldlibs: -lm -lz -ldl
        modules: [gnu/10.3.1]
        extra_attributes:
          compilers:
            c: gcc
            cxx: g++
            fortran: gfortran
      buildable: false
    binutils:
      externals:
      - spec: binutils@2.4.1
        modules: [aue/binutils]
      buildable: false
    cuda:
      externals:
      - spec: cuda@12.4.0
        modules: [cudatoolkit/12.4]
      buildable: false
    openmpi:
      externals:
      - spec: openmpi@4.1
        modules: [openmpi-gnu/4.1]
      buildable: false
    all:
      require:
      - target=x86_64
      - '%gcc@10.3.1'
      variants:
      - +cuda
      - cuda_arch=90
      - generator=ninja
      providers:
        'mpi:': [openmpi]
  repos: []

Note that no one has updated spack develop branch yet, so you need to update kokkos to 4.4.00 by making the following modifications to built in package.py found under /path/to/spack/var/spack/repos/builtin/packages/package-name

  • kokkos
version("4.4.00", sha256="c638980cb62c34969b8c85b73e68327a2cb64f763dd33e5241f5fd437170205a")
  • kokkos-kernels
version("4.4.00", sha256="0fc5cc03f1888b1dcf1cd659fa68b1c34477d7cc4b90f21c220a3198df813840")
  • kokkos-nvcc-wrapper
version("4.4.00", sha256="c638980cb62c34969b8c85b73e68327a2cb64f763dd33e5241f5fd437170205a")
  • trilinos
depends_on("kokkos@4.4.00", when="@master: +kokkos")

When installing use --keep-stage to keep the build directories and run ctest from there after finishing, note Amesos tests don't build so probably turn that off. When I tested things using this approach, the Trilinos master SHA that was checked out was 2ad26029 and the following set of MueLU tests failed

1146:MueLu_StandardReuse-Tpetra_MPI_4
1154:MueLu_UnitTestsTpetra_MPI_1
1155:MueLu_UnitTestsTpetra_MPI_4
1157:MueLu_UnitTestsBlockedTpetra_MPI_4
1163:MueLu_UnitTestsTpetra_kokkos_MPI_4
1167:MueLu_ImportPerformance_Tpetra_MPI_4
1184:MueLu_CreateOperatorTpetra_MPI_1
1185:MueLu_CreateOperatorTpetra_MPI_4
1186:MueLu_BlockedTransfer_Tpetra_MPI_4
1188:MueLu_ReitzingerPFactory_MPI_4
1190:MueLu_Maxwell3D-Tpetra_1_MPI_4
1191:MueLu_Maxwell3D-Tpetra_2_MPI_4
1192:MueLu_Maxwell3D-Tpetra_3_MPI_4

I did not see the same set of errors compared to using the direct cmake approach (as the options are probably slightly different) but the tests are still hit with various errors e.g.

MueLu_UnitTestsTpetra_MPI_1
384. StructuredAggregation_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_LocalLexiTentative2D_UnitTest ... [Passed] (0.0125 sec)
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorInvalidAddressSpace): operation not supported on global/shared address space /tmp/yvvasyl/spack-stage/spack-stage-trilinos-master-isn3pbsgssa7x366i7uqbsxrlnizef2m/spack-src/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165
Backtrace:
MueLu_ImportPerformance_Tpetra_MPI_4
 Throw test that evaluated to true: (userRemotePIDs.size () > 0 && remoteGIDs.size () != userRemotePIDs.size ())
 
 Tpetra::Import<int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Cuda, Kokkos::CudaUVMSpace> >::setupExport: remotePIDs must either be of size zero or match the size of remoteGIDs.

Thanks,

Yaro

@vasylivy vasylivy added the type: bug The primary issue is a bug in Trilinos code or tests label Aug 27, 2024
Copy link

Automatic mention of the @trilinos/muelu team

@cwpearson
Copy link
Contributor

cwpearson commented Aug 27, 2024

Am I reading this right that "approach 1", the build with passing tests has cuSPARSE, UVM, and CUDA RDC disabled, but is otherwise the same as the failing build? Or did I miss some other change?

Were those three changes together necessary to make all the tests pass?

@vasylivy
Copy link
Author

The machine is down so I haven't isolated which of the options was the culprit but yes approach 1 with those options enabled had those failing tests. The second snippet (also using cmake directly but w/ those options disabled) had all tests pass. The last config using spack env / install w/ the shown env / variants also reproduced some of the same issues as in approach 1.

Yaro

@csiefer2
Copy link
Member

I'm not surprised stuff doesn't work with RDC enabled. I don't think RDC enabled is tested on any platform, because it takes sooooooo long. @sebrowne Correct me if I'm wrong.

@vasylivy
Copy link
Author

Tested config 1 w/ the following turned off

-DKokkos_ENABLE_CUDA_UVM=OFF
-DKokkosKernels_INST_MEMSPACE_CUDAUVMSPACE=OFF
-DTpetra_ALLOCATE_IN_SHARED_SPACE=OFF

the unit tests pass, so it would appear to be UVM related.

Yaro

@vasylivy
Copy link
Author

vasylivy commented Aug 28, 2024

Triaging tests for config 1, they have the following errors

MueLu_UnitTestsTpetra_MPI_1
151. PgPFactory_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_ReUseOmegasTransP_UnitTest ... 
cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access
MueLu_UnitTestsTpetra_kokkos_MPI_1
Regression_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_H2D_UnitTest
Tpetra::Details::DeepCopyCounter::get_count_different_space() = 17 == targetNumDeepCopies = 19 : FAILED
MueLu_UnitTestsTpetra_kokkos_MPI_4
// on test 42
cudaDeviceSynchronize() error( cudaErrorInvalidAddressSpace): operation not supported on global/shared address space
// test times out and reports error
MueLu_StandardReuse-Tpetra_MPI_4
MueLu_ImportPerformance_Tpetra_MPI_4
MueLu_ComboPTest_MPI_4
MueLu_CreateOperatorTpetra_MPI_4
MueLu_Maxwell3D-Tpetra_2_MPI_4
MueLu_MortarSurfaceCoupling_DofBased_Blocked_SimpleSmoother_2dof_medium_MPI_4
MueLu_Structured_Laplace2D_Tpetra_MPI_4
Tpetra::createRemoteOnlyImport: this->getRemoteLIDs() has 1index "bad" indices on this process.

MueLu_ParameterListInterpreterTpetra_MPI_4 
 // segfaults w/ following error
Tpetra::Map constructor (noncontiguous): Minimum global ID = -1 over all process(es) is less than the given indexBase = 0.
MueLu_BlockedTransfer_Tpetra_MPI_4
logic error. full map and sub maps are inconsistently distributed over the processors
MueLu_ReitzingerPFactory_MPI_4 // reports a time out but also an error
 Error, relErr(norm[0],norm0[0]) = relErr(5.65685,0) = 1 <= tol = 2.22045e-14: failed!
 [FAILED]  (2.44 sec) ReitzingerPFactory_double_int_longlong_Tpetra_KokkosCompat_KokkosCudaWrapperNode_Setup2Level_Unsmoothed_UnitTest
MueLu_Maxwell3D-Tpetra_1_MPI_4
 ** On entry to cusparseCreateCsr(): dimension mismatch, nnz (1119) > matrix size (1104)

p=1: *** Caught standard std::exception of type 'std::runtime_error' :

 cusparseCreateCsr(&h->descr_B, n, k, entriesB.extent(0), (void *)row_mapB.data(), (void *)entriesB.data(), (void *)entriesB.data() , CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, h->scalarType) error( CUSPARSE_STATUS_INVALID_VALUE): invalid value Trilinos/packages/kokkos-kernels/sparse/tpls/KokkosSparse_spgemm_symbolic_tpl_spec_decl.hpp:80
MueLu_Maxwell3D-Tpetra_3_MPI_4
NodeMatrix (Level 0): MxM: A x P
cudaDeviceSynchronize() error( cudaErrorIllegalAddress): an illegal memory access 
MueLu_Driver_TogglePFactory_semi_sa_line_easy_Tpetra_MPI_4
Transpose P (MueLu::TransPFactory)
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorIllegalAddress): an illegal memory access was encountered Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165
Backtrace:
(ptr->cuda_stream_synchronize_wrapper(stream)) error( cudaErrorInvalidAddressSpace): operation not supported on global/shared address space Trilinos/packages/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:165
Backtrace:

Various 300s timeouts that do not occur otherwise and do not report any errors

MueLu_UnitTestsTpetra_MPI_4 // last test was 58
MueLu_UnitTestsBlockedTpetra_MPI_4 // last test was 23
MueLu_Structured_Interp_Laplace2D_kokkos_MPI_4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg: MueLu type: bug The primary issue is a bug in Trilinos code or tests
Projects
None yet
Development

No branches or pull requests

3 participants