kokkos: snapshot from commit a5eb4d4e is causing our application to hang toward the end of the simulation #13351

Open
glhenni opened this issue Aug 13, 2024 · 20 comments
Labels
type: bug The primary issue is a bug in Trilinos code or tests

Comments

@glhenni
Contributor

glhenni commented Aug 13, 2024

Bug Report

@crtrott it seems that commit a5eb4d4 is causing our application, GEMMA, to hang in the latter portions of the simulation. All we have to offer for diagnosing the problem so far is the stack trace below, obtained from interrupting the code while in the debugger and run through c++filt. Any suggestions on how to find the problem?

#0  0x00007fffe2b6e82d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffe2b67ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007ffff7f6a577 in __gthread_mutex_lock (__mutex=0x769ee0) at /projects/aue/cee/builds/x86_64/rhel8/ae2aa8f4/gcc-12.1.0/install/linux-rhel8-x86_64/gcc-10.3.0/gcc-12.1.0-xpyjv5s/lib/gcc/x86_64-pc-linux-gnu/12.1.0/../../../../include/c++/12.1.0/x86_64-pc-linux-gnu/bits/gthr-default.h:749
#3  0x00007ffff7f6b140 in std::mutex::lock (this=0x769ee0) at /projects/aue/cee/builds/x86_64/rhel8/ae2aa8f4/gcc-12.1.0/install/linux-rhel8-x86_64/gcc-10.3.0/gcc-12.1.0-xpyjv5s/lib/gcc/x86_64-pc-linux-gnu/12.1.0/../../../../include/c++/12.1.0/bits/std_mutex.h:100
#4  0x00007ffff7f6f31a in std::lock_guard<std::mutex>::lock_guard (this=0x7ffffffe99e0, __m=...) at /projects/aue/cee/builds/x86_64/rhel8/ae2aa8f4/gcc-12.1.0/install/linux-rhel8-x86_64/gcc-10.3.0/gcc-12.1.0-xpyjv5s/lib/gcc/x86_64-pc-linux-gnu/12.1.0/../../../../include/c++/12.1.0/bits/std_mutex.h:229
#5  0x00007fffe39c35e5 in operator() (__closure=0x7ffffffe9abf) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/OpenMP/Kokkos_OpenMP.cpp:85
#6  0x00007fffe39c393f in Kokkos::Tools::Experimental::Impl::profile_fence_event<Kokkos::OpenMP, Kokkos::OpenMP::impl_static_fence(const std::string&)::<lambda()> >(const std::string &, Kokkos::Tools::Experimental::SpecialSynchronizationCases, const struct {...} &) (name=..., reason=Kokkos::Tools::Experimental::GlobalDeviceSynchronization, func=...) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_Profiling.hpp:208
#7  0x00007fffe39c3663 in Kokkos::OpenMP::impl_static_fence (name=...) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/OpenMP/Kokkos_OpenMP.cpp:76
#8  0x00007fffe39c5486 in Kokkos::Impl::ExecSpaceDerived<Kokkos::OpenMP>::static_fence (this=0x447d10, label=...) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_ExecSpaceManager.hpp:131
#9  0x00007fffe39a8152 in Kokkos::Impl::ExecSpaceManager::static_fence (this=0x7fffe3a07a40 <Kokkos::Impl::ExecSpaceManager::get_instance()::space_initializer>, name=...) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_Core.cpp:243
#10 0x00007fffe39ab81c in (anonymous namespace)::fence_internal (name=...) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_Core.cpp:813
#11 0x00007fffe39aca75 in Kokkos::fence (name=...) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_Core.cpp:1099
#12 0x00007fffe39b65dd in Kokkos::HostSpace::deallocate (this=0x781c60, arg_label=0x7ffffffe9c50 "an AhA", arg_alloc_ptr=0x7b1d00, arg_alloc_size=10496, arg_logical_size=10368) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_HostSpace.cpp:100
#13 0x00007fffe39b69c8 in Kokkos::Impl::SharedAllocationRecordCommon<Kokkos::HostSpace>::~SharedAllocationRecordCommon (this=0x781c00, __in_chrg=<optimized out>) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_SharedAlloc_timpl.hpp:39
#14 0x00007ffff7f9741c in Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, void>::~SharedAllocationRecord (this=0x781c00, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/Kokkos_HostSpace.hpp:178
#15 0x00007ffff7fa0aa6 in Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, double, true> >::~SharedAllocationRecord (this=0x781c00, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/impl/Kokkos_SharedAlloc.hpp:400
#16 0x00007ffff7fa0ac2 in Kokkos::Impl::SharedAllocationRecord<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, double, true> >::~SharedAllocationRecord (this=0x781c00, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/impl/Kokkos_SharedAlloc.hpp:400
#17 0x00007ffff7fa0b17 in Kokkos::Impl::deallocate<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, double, true> > (record_ptr=0x781c00) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/impl/Kokkos_SharedAlloc.hpp:384
#18 0x00007fffe39c0e31 in Kokkos::Impl::SharedAllocationRecord<void, void>::decrement (arg_record=0x781c00) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.cpp:268
#19 0x00007ffff7f6cdc4 in Kokkos::Impl::SharedAllocationTracker::~SharedAllocationTracker (this=0x79d640, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/impl/Kokkos_SharedAlloc.hpp:544
#20 Kokkos::Impl::ViewTracker<Kokkos::View<double**, Kokkos::HostSpace> >::~ViewTracker (this=0x79d640, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/impl/Kokkos_ViewTracker.hpp:39
#21 0x00007ffff7f6cde0 in Kokkos::View<double**, Kokkos::HostSpace>::~View (this=0x79d640, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/Kokkos_View.hpp:1279
#22 0x00007ffff771b97b in Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::operator() (this=0x7ffffffea168, i=0) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/View/Kokkos_ViewAlloc.hpp:80
#23 0x00007ffff771b689 in Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>, Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag>, Kokkos::OpenMP>::exec_work<Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag> (functor=..., iwork=0) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/OpenMP/Kokkos_OpenMP_Parallel_For.hpp:73
#24 0x00007ffff771bdb0 in std::enable_if<!std::is_same<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag>::schedule_type::type, Kokkos::Dynamic>::value, void>::type Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>, Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag>, Kokkos::OpenMP>::execute_parallel<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag> >() const [clone ._omp_fn.0](void) () at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/OpenMP/Kokkos_OpenMP_Parallel_For.hpp:105
#25 0x00007fffe2db46e6 in GOMP_parallel (fn=0x7ffff771bd09 <std::enable_if<!std::is_same<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag>::schedule_type::type, Kokkos::Dynamic>::value, void>::type Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>, Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag>, Kokkos::OpenMP>::execute_parallel<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag> >() const [clone ._omp_fn.0](void)>, data=0x7ffffffea0d8, num_threads=4, flags=0) at /projects/aue/cee/builds/x86_64/rhel8/ae2aa8f4/gcc-12.1.0/spack/var/spack/stage/smbaxle/spack-stage-gcc-12.1.0-xpyjv5s3zpojwfrlm2c37v7nz4t6jp37/spack-src/libgomp/parallel.c:178
#26 0x00007ffff771b3b2 in Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>, Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag>, Kokkos::OpenMP>::execute_parallel<Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag> > (this=0x7ffffffea160) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/OpenMP/Kokkos_OpenMP_Parallel_For.hpp:97
#27 0x00007ffff7719a2d in Kokkos::Impl::ParallelFor<Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>, Kokkos::RangePolicy<Kokkos::OpenMP, Kokkos::IndexType<long>, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag>, Kokkos::OpenMP>::execute (this=0x7ffffffea160) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/OpenMP/Kokkos_OpenMP_Parallel_For.hpp:119
#28 0x00007ffff771721b in Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::parallel_for_implementation<Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::DestroyTag> (this=0x79fe68) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/View/Kokkos_ViewAlloc.hpp:174
#29 0x00007ffff77137f2 in Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false>::destroy_shared_allocation (this=0x79fe68) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/View/Kokkos_ViewAlloc.hpp:192
#30 0x00007ffff771014d in Kokkos::Impl::deallocate<Kokkos::HostSpace, Kokkos::Impl::ViewValueFunctor<Kokkos::Device<Kokkos::OpenMP, Kokkos::HostSpace>, Kokkos::View<double**, Kokkos::HostSpace>, false> > (record_ptr=0x79fe00) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/impl/Kokkos_SharedAlloc.hpp:382
#31 0x00007fffe39c0e31 in Kokkos::Impl::SharedAllocationRecord<void, void>::decrement (arg_record=0x79fe00) at /ascldap/users/glhenni/Projects/Trilinos.github/packages/kokkos/core/src/impl/Kokkos_SharedAlloc.cpp:268
#32 0x00007ffff7464646 in Kokkos::Impl::SharedAllocationTracker::~SharedAllocationTracker (this=0x7ffffffea858, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/impl/Kokkos_SharedAlloc.hpp:544
#33 Kokkos::Impl::ViewTracker<Kokkos::View<Kokkos::View<double**, Kokkos::HostSpace>*, Kokkos::HostSpace> >::~ViewTracker (this=0x7ffffffea858, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/impl/Kokkos_ViewTracker.hpp:39
#34 0x00007ffff746468e in Kokkos::View<Kokkos::View<double**, Kokkos::HostSpace>*, Kokkos::HostSpace>::~View (this=0x7ffffffea858, __in_chrg=<optimized out>) at /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/include/Kokkos_View.hpp:1279
#35 0x00007ffff7464ab6 in gemma::linearAlgebra::MatrixBuilder::~MatrixBuilder (this=0x7ffffffea6a0, __in_chrg=<optimized out>) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/linearAlgebra/MatrixBuilder.hpp:87
#36 0x00007ffff745fd49 in gemma::assembly::solveSystemWithBelos (num_excitations=1, fi_pd=..., solution=..., d_right_hand_side=..., frequency_index=0) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/assembly/matrixSolve.cpp:79
#37 0x00007ffff745ff56 in gemma::assembly::solveSystem (num_excitations=1, d_right_hand_side=..., solution=..., frequency_index=0, options=..., fi_pd=..., solve_type=gemma::linearAlgebra::ONE_TIME_FACTOR_SOLVE) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/assembly/matrixSolve.cpp:91
#38 0x00007ffff7764144 in gemma::MoMLoop::fillAndSolveForFrequency (pd=..., frequency=..., d_right_hand_side=..., solution=..., solution_derivative=..., options=..., compute_solution_frequency_derivative=false, spca_info=..., save_factorized_matrix=false) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/MoMLoop/fillAndSolve.cpp:139
#39 0x00007ffff773d994 in gemma::MoMLoop::FrequencyIterator::computeOrReadSolution (this=0x7881c0, frequency_index_range=...) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/MoMLoop/FrequencyIterator.cpp:163
#40 0x00007ffff774465e in gemma::MoMLoop::FrequencyListIterator::solveForAllFrequenciesAndExcitations (this=0x7881c0) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/MoMLoop/FrequencyListIterator.cpp:53
#41 0x00007ffff7932f48 in gemma::solveMomentMethodProblem (timer=..., comm=..., run_options=...) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/solvers/momentMethodSolver.cpp:96
#42 0x00007ffff78efe57 in gemma::selectSolverAndRun (timer=..., comm=..., run_options=...) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/solvers/SolverInterface.cpp:109
#43 0x0000000000418e75 in main (narg=4, arg=0x7ffffffed228) at /ascldap/users/glhenni/Projects/Gemma/gemma/src/gemma.cpp:156
@glhenni glhenni added the type: bug The primary issue is a bug in Trilinos code or tests label Aug 13, 2024
@ndellingwood
Contributor

@glhenni @vqd8a the 4.4 release included thread-safety fixes that exposed some incorrect uses of Views in a couple of places in Trilinos, resulting in deadlocked/hanging tests. The most common cases involved View creation or destruction within parallel regions, often with Views of Views whose creation and/or destruction was not handled properly. Based on your report and the hanging tests, I suspect something similar might be occurring?
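
A minimal, Kokkos-free sketch of the failure mode described above, for readers following the stack trace in the report. It assumes (this is not confirmed in the thread) that a kernel launch holds a per-instance lock for its whole duration and that a fence needs the same lock. `ModelExecSpace`, `ModelView`, and `count_blocked_fences` are hypothetical names, and an atomic flag replaces the real mutex so the sketch can report the re-entry instead of hanging.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Hypothetical stand-in for an execution space instance guarded by a
// non-recursive lock. This is a model, not Kokkos code.
struct ModelExecSpace {
  std::atomic<bool> busy{false};
  int blocked_fences = 0;

  // Models a fence: it needs the instance lock. Returns false at the point
  // where a real blocking lock would wait forever.
  bool fence() {
    if (busy.exchange(true)) {  // lock is already held by the running kernel
      ++blocked_fences;
      return false;
    }
    busy.store(false);  // acquired and released immediately
    return true;
  }

  // Models a parallel_for: holds the instance lock while the body runs.
  void parallel_for(int n, const std::function<void(int)>& body) {
    busy.store(true);
    for (int i = 0; i < n; ++i) body(i);
    busy.store(false);
  }
};

// Models a View whose release triggers a fence before memory is freed
// (assumption based on frames #11-#12 of the trace).
struct ModelView {
  ModelExecSpace* space;
  bool owns = true;
  explicit ModelView(ModelExecSpace* s) : space(s) {}
  bool release() {
    if (!owns) return true;
    owns = false;
    return space->fence();
  }
};

// Counts fences that would have blocked, with the release happening either
// on the host or inside the modeled kernel.
int count_blocked_fences(bool release_inside_kernel) {
  ModelExecSpace space;
  ModelView view(&space);
  if (release_inside_kernel) {
    space.parallel_for(1, [&](int) { view.release(); });
  } else {
    view.release();
  }
  return space.blocked_fences;
}
```

With these definitions, `count_blocked_fences(false)` returns 0 and `count_blocked_fences(true)` returns 1. In the real library the second case would presumably be the point where `std::mutex::lock` never returns (frames #0 to #4 of the trace).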

@ndellingwood
Contributor

@glhenni a new tool is in progress that was very helpful in finding the View usage issues in Trilinos: kokkos/kokkos-tools#267. I suggest running the hanging test with this tool to see if any culprit usage is flagged.

@spdomin
Contributor

spdomin commented Aug 13, 2024

I think I am seeing this in Nalu.... However, my final bisect iteration does not actually build:

commit f8ff2ad (HEAD)
Author: Nathan Ellingwood <ndellin@sandia.gov>
Date: Wed Aug 7 16:39:21 2024 -0600

stk: modify test to prevent allocation in parallel region

modify NgpMeshTest.volatileFastSharedCommMap to prevent allocation in a parallel region, which can result in deadlock with kokkos version 4.4
address issue #13328

Co-authored-by: Christian Trott <crtrott@sandia.gov>
Signed-off-by: Nathan Ellingwood <ndellin@sandia.gov>

[ 45%] Building CXX object packages/kokkos/containers/src/CMakeFiles/kokkoscontainers.dir/impl/Kokkos_UnorderedMap_impl.cpp.o
In file included from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/View/MDSpan/Kokkos_MDSpan_Extents.hpp:25,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_View.hpp:40,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_Parallel.hpp:31,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_MemoryPool.hpp:26,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_TaskScheduler.hpp:34,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Serial/Kokkos_Serial.hpp:37,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/decl/Kokkos_Declare_SERIAL.hpp:21,
from /fgs/spdomin/nightly/Trilinos/build_nightly_release_10.3.0/packages/kokkos/KokkosCore_Config_DeclareBackend.hpp:22,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/Kokkos_Core.hpp:45,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/containers/src/Kokkos_UnorderedMap.hpp:30,
from /fgs/spdomin/nightly/Trilinos/packages/kokkos/containers/src/impl/Kokkos_UnorderedMap_impl.cpp:21:
/fgs/spdomin/nightly/Trilinos/packages/kokkos/core/src/View/MDSpan/Kokkos_MDSpan_Header.hpp:47:10: fatal error: mdspan/mdspan.hpp: No such file or directory
47 | #include <mdspan/mdspan.hpp>

@ndellingwood
Contributor

@spdomin did the build failure occur with a clean build? If needed you can disable mdspan with the -D Kokkos_ENABLE_IMPL_MDSPAN=OFF option to get past the error above

@spdomin
Contributor

spdomin commented Aug 13, 2024

This build is part of a bisect to figure out the hang I am seeing. I configure Trilinos each step. So, yes, I think this is a clean build.

@spdomin
Contributor

spdomin commented Aug 13, 2024

I added:

-DKokkos_ENABLE_ATOMICS_BYPASS=ON \
-DKokkos_ENABLE_IMPL_MDSPAN=OFF \

Sorry, I am somewhat taking over this support ticket... I will post back if our new hang points to this commit, while using some of the advice given above.

@ndellingwood
Contributor

@spdomin can you post your Trilinos configuration reproducer? We saw build failures in Trilinos similar to the one you posted above, and they were resolved by kokkos/kokkos#7103 (included in the 4.4 snapshot). We'll need to reproduce this and open an issue to figure out why that fix does not help in your configuration.

@spdomin
Contributor

spdomin commented Aug 13, 2024

I use this:
https://github.com/NaluCFD/Nalu/blob/master/build/do-configTrilinos_release

with:

  1) binutils/2.41   2) gcc/10.3.0   3) openmpi/4.1.6-gcc-10.3.0   4) cmake/3.27.7   5) anaconda3/2023.09

The current build with the new MDSPAN=OFF is proceeding.

@spdomin
Contributor

spdomin commented Aug 13, 2024

There you go:)

a5eb4d4e1436e5594ce73ffe62e1cb0f460c99b0 is the first bad commit
commit a5eb4d4
Author: Nathan Ellingwood <ndellin@sandia.gov>
Date: Thu Aug 8 15:37:54 2024 -0600

Snapshot of kokkos.git from commit 948c1346301ff9b42b136a8c72eed91c839e3105

From repository at git@github.com:kokkos/kokkos.git

At commit:
commit 948c1346301ff9b42b136a8c72eed91c839e3105
Author: Nathan Ellingwood <ndellin@sandia.gov>
Date:   Thu Aug 8 14:54:40 2024 -0600

I will review the notes above. Offhand, I do not know about this view-of-views pattern in Nalu...

@ndellingwood
Contributor

@spdomin the tool can be used more generally, beyond Views of Views, to detect allocations, deallocations, and fences within parallel regions (the naming was inspired by the first cases that showed up with this issue). If the hang is caused by something along these lines, the tool will help by listing the potential culprit View(s).
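
A rough sketch of how such a check can work in principle, assuming a detector that is notified when a kernel begins and ends and when memory is allocated or released. `RegionChecker` and its methods are hypothetical and are not the Kokkos Tools entry points. The report text mirrors the message quoted later in this thread, and `"system_matrix"` is a made-up label. Unlike the real tool, which aborts on the first finding, this model only records it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical detector driven by profiling-style notifications.
class RegionChecker {
 public:
  // Called when a kernel starts and finishes.
  void begin_parallel(const std::string& kernel) { active_.push_back(kernel); }
  void end_parallel() {
    if (!active_.empty()) active_.pop_back();
  }

  // Called for every allocation and deallocation.
  void on_allocate(const std::string& label) { check("allocating", label); }
  void on_deallocate(const std::string& label) { check("deallocating", label); }

  const std::vector<std::string>& reports() const { return reports_; }

 private:
  // Records a finding only while at least one kernel is active.
  void check(const std::string& what, const std::string& label) {
    if (active_.empty()) return;
    reports_.push_back(what + " \"" + label + "\" within parallel region \"" +
                       active_.back() + "\"");
  }

  std::vector<std::string> active_;   // stack of running kernels
  std::vector<std::string> reports_;  // collected findings
};

// Replays a small event sequence and returns the findings.
std::vector<std::string> demo_reports() {
  RegionChecker checker;
  checker.on_allocate("system_matrix");  // host side, not flagged
  checker.begin_parallel("interaction_block_fill");
  checker.on_deallocate("[unlabeled]");  // inside a kernel, flagged
  checker.end_parallel();
  checker.on_deallocate("system_matrix");  // host side, not flagged
  return checker.reports();
}
```

In this replay only the deallocation that happens between `begin_parallel` and `end_parallel` is reported, so `demo_reports()` returns a single entry.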

@ndellingwood
Contributor

@spdomin so far I have not been able to reproduce the compilation error you saw. I tested on solo, which had the closest match I could find to the modules you listed, and pared back some of the configuration script you pointed to. The error occurs in kokkos, so enabling netcdf and the packages that use it (seacas etc.) was not necessary to try to reproduce. The error should occur when just building the kokkos library; I also enabled the kokkos tests for added coverage, but had no luck.

Here is what I tried on solo with sha 1eb0af7 (which includes the snapshot sha listed above):

# environment
module load gnu/10.3.1 openmpi-gnu/4.1 cmake
export blas_install_lib=/usr/lib64/libblas.so.3
export lapack_install_lib=/usr/lib64/liblapack.so.3

# build dir
mkdir -p Build
cd Build

# configuration
export TRILINOS_DIR=<path-to-Trilinos>
cmake \
-DCMAKE_INSTALL_PREFIX=$PWD/install \
-DTrilinos_ENABLE_CXX11=ON \
-DCMAKE_BUILD_TYPE=RELEASE \
-DTrilinos_ENABLE_EXPLICIT_INSTANTIATION:BOOL=ON \
-DTpetra_INST_DOUBLE:BOOL=ON \
-DTpetra_INST_INT_LONG:BOOL=ON \
-DTpetra_INST_INT_LONG_LONG:BOOL=OFF \
-DTpetra_INST_COMPLEX_DOUBLE=OFF \
-DTrilinos_ENABLE_TESTS:BOOL=OFF \
-DTrilinos_ENABLE_ALL_OPTIONAL_PACKAGES=OFF \
-DTrilinos_ALLOW_NO_PACKAGES:BOOL=OFF \
-DTPL_ENABLE_MPI=ON \
-DTPL_ENABLE_SuperLU=OFF \
-DTPL_ENABLE_Boost:BOOL=OFF \
-DTrilinos_ENABLE_Epetra:BOOL=OFF \
-DTrilinos_ENABLE_Kokkos:BOOL=ON \
 -DKokkos_ENABLE_TESTS:BOOL=ON \
-DTrilinos_ENABLE_Tpetra:BOOL=ON \
-DTrilinos_ENABLE_ML:BOOL=OFF \
-DTrilinos_ENABLE_MueLu:BOOL=ON \
-DTrilinos_ENABLE_Stratimikos:BOOL=OFF \
-DTrilinos_ENABLE_Thyra:BOOL=OFF \
-DTrilinos_ENABLE_EpetraExt:BOOL=OFF \
-DTrilinos_ENABLE_AztecOO:BOOL=OFF \
-DTrilinos_ENABLE_Belos:BOOL=ON \
-DTrilinos_ENABLE_Ifpack2:BOOL=ON \
-DTrilinos_ENABLE_Amesos2:BOOL=ON \
-DTrilinos_ENABLE_Zoltan2:BOOL=ON \
-DTrilinos_ENABLE_Ifpack:BOOL=OFF \
-DTrilinos_ENABLE_Amesos:BOOL=OFF \
-DTrilinos_ENABLE_Zoltan:BOOL=ON \
-DTrilinos_ENABLE_STKMesh:BOOL=ON \
-DTrilinos_ENABLE_STKSimd:BOOL=ON \
-DTrilinos_ENABLE_STKIO:BOOL=OFF \
-DTrilinos_ENABLE_STKTransfer:BOOL=ON \
-DTrilinos_ENABLE_STKSearch:BOOL=ON \
-DTrilinos_ENABLE_STKUtil:BOOL=ON \
-DTrilinos_ENABLE_STKTopology:BOOL=ON \
-DTrilinos_ENABLE_STKBalance:BOOL=OFF \
-DTrilinos_ENABLE_STKUnit_tests:BOOL=OFF \
-DTrilinos_ENABLE_STKUnit_test_utils:BOOL=OFF \
-DTrilinos_ENABLE_Gtest:BOOL=ON \
-DKokkos_ENABLE_ATOMICS_BYPASS=ON \
-DTPL_ENABLE_Netcdf:BOOL=OFF \
-DTPL_BLAS_LIBRARIES=${blas_install_lib} \
-DTPL_LAPACK_LIBRARIES=${lapack_install_lib} \
$EXTRA_ARGS \
$TRILINOS_DIR

# build kokkos and tests
cd packages/kokkos
make -j16

Would you be able to test the configuration above in a manual build on the machine where you see the issue?

@spdomin
Contributor

spdomin commented Aug 14, 2024

@ndellingwood, let's take the build error from my bisect offline, or open a new ticket, so that this ticket can focus on apps using "views of views". It turns out I was able to locate the offending code in the failing unit tests; @alanw0 may have more insight. It does not look like our core Nalu assembly has this issue. The first hang occurs at: rhs_ = Kokkos::View<double*>("rhs_",rhs.extent(0)); I suppose that is the views of views:) Let me know what is conceptually wrong with this and what the easiest fix is.

virtual void sumInto(
      unsigned numEntities,
      const stk::mesh::Entity* entities,
      const sierra::nalu::SharedMemView<const double*> & rhs,
      const sierra::nalu::SharedMemView<const double**> & lhs,
      const sierra::nalu::SharedMemView<int*> & localIds,
      const sierra::nalu::SharedMemView<int*> & sortPermutation,
      const char * trace_tag
      )
  {
    if (numSumIntoCalls_ == 0) {
      rhs_ = Kokkos::View<double*>("rhs_",rhs.extent(0));
      for(size_t i=0; i<rhs.extent(0); ++i) {
        rhs_(i) = rhs(i);
      }
      lhs_ = Kokkos::View<double**>("lhs_",lhs.extent(0), lhs.extent(1));
      for(size_t i=0; i<lhs.extent(0); ++i) {
        for(size_t j=0; j<lhs.extent(1); ++j) {
          lhs_(i,j) = lhs(i,j);
        }
      }
    }
    Kokkos::atomic_add(&numSumIntoCalls_, 1u);
  }

@alanw0
Contributor

alanw0 commented Aug 14, 2024

Hmm, that's not a view-of-views, but it is a view allocation which is probably happening within a Kokkos::parallel_for. It turns out that is not legal even in Kokkos::Serial. I can probably help fix this.
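
One possible shape of a fix, sketched without Kokkos: size the buffers on the host before the kernel is launched, so that the call made from inside the kernel only copies into existing storage. This is an illustration under assumptions, not the change that was made in Nalu. `CapturingSink`, `prepare`, and `demo_first_rhs` are hypothetical names, `std::vector` stands in for the `Kokkos::View` members, and the call counter is a plain integer here where the original uses `Kokkos::atomic_add`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical capture object modeled on the sumInto snippet above.
class CapturingSink {
 public:
  // Host-side setup: allocate once, before any kernel runs.
  void prepare(std::size_t n_rhs, std::size_t n_lhs) {
    rhs_.assign(n_rhs, 0.0);
    lhs_.assign(n_lhs, 0.0);
  }

  // Kernel-side call: never allocates. Returns false if the buffers were
  // not prepared with matching sizes.
  bool sumInto(const std::vector<double>& rhs, const std::vector<double>& lhs) {
    if (rhs.size() != rhs_.size() || lhs.size() != lhs_.size()) return false;
    if (num_calls_ == 0) {  // capture only the first contribution
      for (std::size_t i = 0; i < rhs.size(); ++i) rhs_[i] = rhs[i];
      for (std::size_t i = 0; i < lhs.size(); ++i) lhs_[i] = lhs[i];
    }
    ++num_calls_;
    return true;
  }

  double rhs(std::size_t i) const { return rhs_[i]; }
  unsigned num_calls() const { return num_calls_; }

 private:
  std::vector<double> rhs_;
  std::vector<double> lhs_;  // flattened rows x cols
  unsigned num_calls_ = 0;
};

// Prepares the sink on the "host", then makes two kernel-style calls.
double demo_first_rhs() {
  CapturingSink sink;
  sink.prepare(2, 4);
  sink.sumInto({1.5, 2.5}, {1.0, 2.0, 3.0, 4.0});
  sink.sumInto({9.0, 9.0}, {9.0, 9.0, 9.0, 9.0});  // counted, not captured
  return sink.rhs(0);
}
```

Here `demo_first_rhs()` returns 1.5, the first entry of the first contribution. The design choice is that the sizes must be known before launch; when they are not, a separate sizing pass on the host would be needed first.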

@spdomin
Contributor

spdomin commented Aug 14, 2024

> Hmm, that's not a view-of-views, but it is a view allocation which is probably happening within a Kokkos::parallel_for. It turns out that is not legal even in Kokkos::Serial. I can probably help fix this.

Are you sure about this not being a view of a view? rhs_ is a Kokkos::View<double*>. Why do we not simply use this view itself?

@glhenni
Contributor Author

glhenni commented Aug 14, 2024

I did manage to build and use the vov debugger library, but it's throwing an error at a location prior to the one causing the hang. I'm assuming that will have to be fixed as well. I'm behind the curve on this one because I'm not a kokkos programmer; I'm acting as the intermediary, since the person with actual knowledge of kokkos and gemma isn't on github. Anyway, with KOKKOS_TOOLS_LIBS=<root dir>/libkp_view_of_views_bug_finder.so set, this is what I see:

Total number of MPI threads: 3
Total number of Tpetra processes: 1
Tpetra in Trilinos 16.1.0-dev

Gemma: Version 2023.0.0
Parsing command line inputs.
Finished parsing command line inputs.
Kokkos execution space N6Kokkos6OpenMPE

Moment method selected
Reading input file ...
Number of Unknowns initialized: 36
Constructing and solving the matrix equation ...
dbg( lvl: 2 ): Allocated system matrix
  Constructing matrix and right-hand side for 3.00000e+07 Hz

deallocating "[unlabeled]" within parallel region "interaction_block_fill"
[cee-build032:1412318] *** Process received signal ***
[cee-build032:1412318] Signal: Aborted (6)
[cee-build032:1412318] Signal code:  (-6)
[cee-build032:1412318] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x7f992ae29cf0]
[cee-build032:1412318] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7f992aaa0acf]
[cee-build032:1412318] [ 2] /lib64/libc.so.6(abort+0x127)[0x7f992aa73ea5]
[cee-build032:1412318] [ 3] /ascldap/users/glhenni/Projects/Kokkos/kokkos-tools.dalg24/build/debugging/vov-bug-finder/libkp_view_of_views_bug_finder.so(+0xb7ab)[0x7f99280d27ab]
[cee-build032:1412318] [ 4] /ascldap/users/glhenni/Projects/Kokkos/kokkos-tools.dalg24/build/debugging/vov-bug-finder/libkp_view_of_views_bug_finder.so(kokkosp_deallocate_data+0x6c)[0x7f99280d294b]
[cee-build032:1412318] [ 5] /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/lib64/libkokkoscore.so.16(void Kokkos::Tools::Experimental::invoke_kokkosp_callback<void (*)(Kokkos_Profiling_SpaceHandle, char const*, void const*, unsigned long), Kokkos_Profiling_SpaceHandle const&, char const*, void const*&, unsigned long const&>(Kokkos::Tools::Experimental::MayRequireGlobalFencing, void (* const&)(Kokkos_Profiling_SpaceHandle, char const*, void const*, unsigned long), Kokkos_Profiling_SpaceHandle const&, char const*&&, void const*&, unsigned long const&)+0x132)[0x7f992bc78e83]
[cee-build032:1412318] [ 6] /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/lib64/libkokkoscore.so.16(Kokkos::Tools::deallocateData(Kokkos_Profiling_SpaceHandle, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, void const*, unsigned long)+0x51)[0x7f992bc7673e]
[cee-build032:1412318] [ 7] /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/lib64/libkokkoscore.so.16(Kokkos::HostSpace::impl_deallocate(char const*, void*, unsigned long, unsigned long, Kokkos_Profiling_SpaceHandle) const+0xc4)[0x7f992bc70778]
[cee-build032:1412318] [ 8] /scratch/glhenni/gemma/install/trilinos/aue.gnu.opt/lib64/libkokkoscore.so.16(Kokkos::Impl::OpenMPInternal::resize_thread_data(unsigned long, unsigned long, unsigned long, unsigned long)+0x2af)[0x7f992bc7f831]
[cee-build032:1412318] [ 9] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(Kokkos::Impl::ParallelFor<gemma::assembly::LBUComputeMatrixFunctor<gemma::assembly::LBUTestIntegrand<gemma::assembly::LBUTestArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::LBUSourceIntegrand<gemma::assembly::LBUSourceIntegrandArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> > >, (gemma::codeStructures::DefinedTopology)0> >, true>, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >, false>, Kokkos::TeamPolicy<Kokkos::OpenMP>, Kokkos::OpenMP>::execute() const+0x88)[0x7f993f698ad4]
[cee-build032:1412318] [10] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void Kokkos::parallel_for<Kokkos::TeamPolicy<Kokkos::OpenMP>, gemma::assembly::LBUComputeMatrixFunctor<gemma::assembly::LBUTestIntegrand<gemma::assembly::LBUTestArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::LBUSourceIntegrand<gemma::assembly::LBUSourceIntegrandArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> > >, (gemma::codeStructures::DefinedTopology)0> >, true>, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >, false>, void>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, Kokkos::TeamPolicy<Kokkos::OpenMP> const&, gemma::assembly::LBUComputeMatrixFunctor<gemma::assembly::LBUTestIntegrand<gemma::assembly::LBUTestArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::LBUSourceIntegrand<gemma::assembly::LBUSourceIntegrandArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, 
gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> > >, (gemma::codeStructures::DefinedTopology)0> >, true>, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >, false> const&)+0x89)[0x7f993f69881c]
[cee-build032:1412318] [11] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void gemma::assembly::fillMatrixInteractionBlock<gemma::assembly::LBUTestIntegrand<gemma::assembly::LBUTestArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::LBUSourceIntegrand<gemma::assembly::LBUSourceIntegrandArgs<gemma::assembly::LBUTopoDef<2>, gemma::assembly::LBUTopoDef<2>, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> >, gemma::assembly::ScratchElementVertexViewSingleElement<gemma::codeStructures::ElementTraits<(gemma::codeStructures::DefinedTopology)0, 1> > >, (gemma::codeStructures::DefinedTopology)0> >, true>, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> > >(gemma::FrequencyIndependentProblemData const&, Kokkos::View<gemma::misc::PropertyConstants*, Kokkos::HostSpace> const&, gemma::assembly::MatrixFillComputation const&, Kokkos::pair<long long, long long>, Kokkos::pair<long long, long long>, bool, gemma::assembly::SystemMatrixIncrementer<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&, Kokkos::OpenMP)+0x252)[0x7f993f698076]
[cee-build032:1412318] [12] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(auto gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}::operator()<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >(Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&) const+0xd6)[0x7f993f697b4a]
[cee-build032:1412318] [13] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void std::__invoke_impl<void, gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&>(std::__invoke_other, gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&)+0x37)[0x7f993f698c25]
[cee-build032:1412318] [14] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(std::__invoke_result<gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&>::type std::__invoke<gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&>(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace>&)+0x37)[0x7f993f6988de]
[cee-build032:1412318] [15] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<void> (*)(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&)>, std::integer_sequence<unsigned long, 0ul> >::__visit_invoke(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&)+0x3f)[0x7f993f698240]
[cee-build032:1412318] [16] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(decltype(auto) std::__do_visit<std::__detail::__variant::__deduce_visit_result<void>, gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&>(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&)+0x74)[0x7f993f6982bb]
[cee-build032:1412318] [17] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(std::invoke_result<gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, std::__conditional<is_lvalue_reference_v<std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&> >::type<std::variant_alternative<0ul, std::remove_reference<decltype (__as((declval<std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&>)()))>::type>::type&, std::variant_alternative<0ul, std::remove_reference<decltype (__as((declval<std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&>)()))>::type>::type&&> >::type std::visit<gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&>(gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)::{lambda(auto:1&)#1}&&, std::variant<Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutLeft, Kokkos::HostSpace> >&)+0x59)[0x7f993f69831c]
[cee-build032:1412318] [18] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void gemma::assembly::fillMatrixSharedMemByUnknowns<2, 2>(gemma::FrequencyIndependentProblemData const&, double const&, gemma::codeStructures::RunOptions const&)+0x2c7)[0x7f993f69790e]
[cee-build032:1412318] [19] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void gemma::assembly::computeSourceInteractions<2, (gemma::assembly::FILL_TOPO_TYPE)0, (gemma::assembly::FILL_TOPO_TYPE)1, (gemma::assembly::FILL_TOPO_TYPE)2, (gemma::assembly::FILL_TOPO_TYPE)3>(gemma::FrequencyIndependentProblemData const&, double, gemma::codeStructures::RunOptions const&, std::integer_sequence<gemma::assembly::FILL_TOPO_TYPE, (gemma::assembly::FILL_TOPO_TYPE)0, (gemma::assembly::FILL_TOPO_TYPE)1, (gemma::assembly::FILL_TOPO_TYPE)2, (gemma::assembly::FILL_TOPO_TYPE)3>)+0x6a)[0x7f993f7136c2]
[cee-build032:1412318] [20] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(void gemma::assembly::computeMatrixInteractions<(gemma::assembly::FILL_TOPO_TYPE)0, (gemma::assembly::FILL_TOPO_TYPE)1, (gemma::assembly::FILL_TOPO_TYPE)2, (gemma::assembly::FILL_TOPO_TYPE)3, (gemma::assembly::FILL_TOPO_TYPE)4>(gemma::FrequencyIndependentProblemData const&, double, gemma::codeStructures::RunOptions const&, std::integer_sequence<gemma::assembly::FILL_TOPO_TYPE, (gemma::assembly::FILL_TOPO_TYPE)0, (gemma::assembly::FILL_TOPO_TYPE)1, (gemma::assembly::FILL_TOPO_TYPE)2, (gemma::assembly::FILL_TOPO_TYPE)3, (gemma::assembly::FILL_TOPO_TYPE)4>)+0x69)[0x7f993f71339d]
[cee-build032:1412318] [21] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::assembly::fillSystemMatrixAndRightHandSide(gemma::FrequencyIndependentProblemData const&, double const&, Kokkos::View<gemma::source::FieldExcitation const*, Kokkos::HostSpace> const&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::HostSpace> const&, gemma::codeStructures::RunOptions const&)+0x3b)[0x7f993f70f90a]
[cee-build032:1412318] [22] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::MoMLoop::fillAndSolveForFrequency(gemma::ProblemData const&, gemma::source::Frequency const&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::HostSpace> const&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::HostSpace> const&, Kokkos::View<Kokkos::complex<double>**, Kokkos::LayoutRight, Kokkos::HostSpace> const&, gemma::codeStructures::RunOptions const&, bool, std::optional<gemma::linearAlgebra::SPCASolverInfo>&, bool)+0x3a4)[0x7f993fa17dfc]
[cee-build032:1412318] [23] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::MoMLoop::FrequencyIterator::computeOrReadSolution(Kokkos::pair<int, int> const&)+0x9dc)[0x7f993f9f1994]
[cee-build032:1412318] [24] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::MoMLoop::FrequencyListIterator::solveForAllFrequenciesAndExcitations()+0x100)[0x7f993f9f865e]
[cee-build032:1412318] [25] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::solveMomentMethodProblem(Teuchos::RCP<Teuchos::StackedTimer>&, Teuchos::RCP<Teuchos::Comm<int> const>, gemma::codeStructures::RunOptions)+0x511)[0x7f993fbe6f48]
[cee-build032:1412318] [26] /scratch/glhenni/gemma/build/gemma.gnu.opt/src/libgemmalibrary.so(gemma::selectSolverAndRun(Teuchos::RCP<Teuchos::StackedTimer>&, Teuchos::RCP<Teuchos::Comm<int> const>, gemma::codeStructures::RunOptions)+0x35a)[0x7f993fba3e57]
[cee-build032:1412318] [27] /scratch/glhenni/gemma/build/gemma.gnu.opt/Debug/gemma[0x418e75]
[cee-build032:1412318] [28] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f992aa8cd85]
[cee-build032:1412318] [29] /scratch/glhenni/gemma/build/gemma.gnu.opt/Debug/gemma[0x4188ce]
[cee-build032:1412318] *** End of error message ***

@ndellingwood
Contributor

@spdomin echoing @alanw0, the culprit may be a View constructed inside a parallel_for (not a view-of-views): the construction triggers an allocation in a parallel region, which can deadlock. If the function in the code snippet above is called within a parallel_*, that could be the issue.
The snippet shows a newly constructed View being assigned to an existing View. (A View of Views would look something like Kokkos::View< Kokkos::View<T*> > v_of_v("v_of_v", N);)

It looks like rhs_ and lhs_ must already have been constructed. Is it possible, when each is initially allocated, to give it a large enough anticipated size that you can then create subviews to assign to rhs and lhs (of sizes e.g. rhs.extent(0) and lhs.extent(0), lhs.extent(1), respectively), rather than assigning a newly constructed View?
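The preallocate-then-subview idea can be sketched without Kokkos. The block below is only a plain C++ stand-in, not Kokkos code: `Scratch`, `Slice`, `first`, and `sum` are hypothetical names, and a `std::vector` plays the role of the View that is allocated once at the largest anticipated size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a subview: a pointer and a length that refer to
// storage owned elsewhere, so creating one never allocates or frees memory.
struct Slice {
  double* data;
  std::size_t size;
};

// Hypothetical scratch holder. The buffer is sized once, on the host,
// to the largest extent any caller is expected to need.
class Scratch {
 public:
  explicit Scratch(std::size_t maxExtent) : storage_(maxExtent, 0.0) {}

  // Hands out the first n entries of the preallocated buffer.
  // Nothing is allocated or deallocated here.
  Slice first(std::size_t n) {
    assert(n <= storage_.size());
    return Slice{storage_.data(), n};
  }

  std::size_t capacity() const { return storage_.size(); }

 private:
  std::vector<double> storage_;
};

// Sums a slice; stands in for the work done on the subview in the kernel.
double sum(const Slice& s) {
  double total = 0.0;
  for (std::size_t i = 0; i < s.size; ++i) total += s.data[i];
  return total;
}
```

Every call to `first` returns a window onto the same storage, so the size can change from one element to the next without any allocation. In Kokkos the corresponding step would presumably be a `Kokkos::subview` of the preallocated View; whether that fits the Nalu code depends on details not shown in this thread.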

@spdomin
Contributor

spdomin commented Aug 14, 2024

@ndellingwood, @alanw0 and I will look into the fix... I think we were being lax within the unit test matrix assembly procedure and should be able to resolve this quickly. Thank you for the v_of_v example - it helped my understanding.

Again, apologies for doubling up on this ticket with the Nalu-specific issue. Best of luck with the GEMMA fix. I will certainly keep track of this to learn more about how others are using Kokkos in apps.

@ndellingwood
Contributor

@spdomin let me know how it goes, either on the ticket or offline. In case it is useful, another thought: could you decouple the rhs_ and lhs_ View allocation steps from sumInto by moving them into a separate routine that is called before sumInto? For example, in pseudo-code:

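// Pseudo-code: resize the member Views rhs_ and lhs_ to match the inputs.
void sizeCheck(const sierra::nalu::SharedMemView<const double*>& rhs,
               const sierra::nalu::SharedMemView<const double**>& lhs) {
  // Allocate a new View only when the extent actually changed.
  if (rhs_.extent(0) != rhs.extent(0))
    rhs_ = Kokkos::View<double*>("rhs_", rhs.extent(0));
  // similar for lhs_
}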

Call sizeCheck from the host prior to the call of sumInto.
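The split between a host-side sizing step and an accumulate-only step can be sketched in plain C++. This is a Kokkos-free illustration under stated assumptions, not the Nalu implementation: `Assembler`, `allocations`, and the `std::vector` member are hypothetical, and `rhs_` merely mirrors the member name used in the pseudo-code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical assembler; rhs_ stands in for the member View.
class Assembler {
 public:
  // Host-side step: the only place where storage is (re)allocated.
  // Intended to run before any parallel region starts.
  void sizeCheck(std::size_t n) {
    if (rhs_.size() != n) {
      rhs_.assign(n, 0.0);
      ++allocations_;
    }
  }

  // Kernel-side step: accumulates only. If the sizes do not match it
  // reports failure instead of reallocating.
  bool sumInto(const std::vector<double>& rhs) {
    if (rhs.size() != rhs_.size()) return false;
    for (std::size_t i = 0; i < rhs.size(); ++i) rhs_[i] += rhs[i];
    return true;
  }

  const std::vector<double>& rhs() const { return rhs_; }
  int allocations() const { return allocations_; }

 private:
  std::vector<double> rhs_;
  int allocations_ = 0;  // counts how often sizeCheck had to allocate
};
```

Having `sumInto` refuse a mismatched size rather than resize keeps every allocation and deallocation in `sizeCheck`, which is the property the suggestion is after; a missed `sizeCheck` call then shows up as an explicit failure rather than a hang.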

@ndellingwood
Contributor

@glhenni excellent, thanks for posting the output. This line:

deallocating "[unlabeled]" within parallel region "interaction_block_fill"

points to the parallel_* call where deallocation of a View is attempted. The View isn't labeled, though, so tracking it down will take a bit of checking. I'll contact you offline to see how best I can help further.
