Kokkos/Phalanx: Allocation error in 3D on Summit GPUs (OLCF) #13364

Open
kincaidkc opened this issue Aug 19, 2024 · 2 comments

kincaidkc commented Aug 19, 2024

Question

Hello,

I am part of the dev team for a Trilinos-based CFD application. Recently we have begun testing on GPUs on Summit at OLCF. We are able to run 2D cases with tens of millions of elements without issue. However, when moving to 3D, we can only run cases with a few thousand elements before running into an "allocation failed" error message (full message and back trace below) during the linear solve. The issue seems to be related to the boundary conditions, as we are able to run large cases in 3D as long as all boundaries are periodic. Adding even a single set of non-periodic boundaries in three dimensions results in this error:

:0: : block: [163,0,0], thread: [0,123,0] Assertion `Allocation failed.` failed.
:0: : block: [9,0,0], thread: [0,22,0] Assertion `Allocation failed.` failed.
:0: : block: [124,0,0], thread: [0,27,0] Assertion `Allocation failed.` failed.
:0: : block: [213,0,0], thread: [0,91,0] Assertion `Allocation failed.` failed.
cudaEventRecord(CudaInternal::constantMemReusable, cudaStream_t(cuda_instance->m_stream)) error( cudaErrorAssert): device-side assert triggered Trilinos-dir/include/Cuda/Kokkos_Cuda_KernelLaunch.hpp:592
Backtrace:

[0x20007fb98114] Kokkos::Impl::traceback_callstack(std::ostream&)
[0x20007fb98188] Kokkos::Impl::host_abort(char const*)
[0x20007fbafc10] Kokkos::Impl::cuda_internal_error_abort(cudaError, char const*, char const*, int)
[0x10950c44] 
[0x10961c4c] 
[0x1096cb14] 
[0x200049a86f48] PHX::DagManager<panzer::Traits>::evaluateFields(panzer::Workset const&)
[0x200049a870fc] PHX::EvaluationContainer<panzer::Traits::Jacobian, panzer::Traits>::evaluateFields(panzer::Workset const&)
[0x2000491e15c0] panzer::AssemblyEngine<panzer::Traits::Jacobian>::evaluateBCs(panzer::BCType, panzer::AssemblyEngineInArgs const&, Teuchos::RCP<panzer::LinearObjContainer>)
[0x2000491e2124] panzer::AssemblyEngine<panzer::Traits::Jacobian>::evaluateDirichletBCs(panzer::AssemblyEngineInArgs const&)
[0x2000491e2f38] panzer::AssemblyEngine<panzer::Traits::Jacobian>::evaluate(panzer::AssemblyEngineInArgs const&, panzer::AssemblyEngine<panzer::Traits::Jacobian>::EvaluationFlags)
[0x200049ed654c] panzer::ModelEvaluator<double>::evalModelImpl_basic(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x200049e8cea0] panzer::ModelEvaluator<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x200045184600] Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x200053fd174c] Tempus::WrapperModelEvaluatorBasic<double>::evalModelImpl(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x200045184600] [0x20007fba5c94] Kokkos::Impl::save_stacktrace()Thyra::ModelEvaluatorDefaultBase<double>::evalModel(Thyra::ModelEvaluatorBase::InArgs<double> const&, Thyra::ModelEvaluatorBase::OutArgs<double> const&) const
[0x20005b90e620] NOX::Thyra::Group::computeJacobian()
[0x20005b836ac8] NOX::Direction::Newton::compute(NOX::Abstract::Vector&, NOX::Abstract::Group&, NOX::Solver::Generic const&)
[0x20005b82fde8] NOX::Direction::Generic::compute(NOX::Abstract::Vector&, NOX::Abstract::Group&, NOX::Solver::LineSearchBased const&)
[0x20005b836488] NOX::Direction::Newton::compute(NOX::Abstract::Vector&, NOX::Abstract::Group&, NOX::Solver::LineSearchBased const&)
[0x20005b85d208] NOX::Solver::LineSearchBased::step()
[0x20005b8600f4] NOX::Solver::LineSearchBased::solve()

We do not see this issue with the CPU build of the code. I am not sure where to begin debugging, so any help would be appreciated. I am happy to provide whatever other information is needed to help diagnose the problem.

Thanks,
Kellis

kincaidkc changed the title from "Kokkos/Phalanx: Allocation error in 3D on Summit (OLCF)" to "Kokkos/Phalanx: Allocation error in 3D on Summit GPUs (OLCF)" on Aug 19, 2024

cgcgcg commented Aug 19, 2024

@rppawlo


rppawlo commented Aug 19, 2024

@kincaidkc - that looks like an allocation failure on device. The only place we allocate on device is for DFad objects when evaluating the Jacobian. It could be a bug in an evaluator, or you could be running out of memory due to AD requirements. Unfortunately, the stack trace doesn't show which evaluator is failing. I would start by using Kokkos Tools to look at the memory high-water mark on the GPUs. Using more nodes to reduce the per-node memory requirement could be a quick way to check that as well. To figure out which evaluator the code is failing in, you could export the flag TEUCHOS_ENABLE_VERBOSE_TIMERS=1. This dumps every timer at runtime. That is a ton of data, and the output will have to be separated for each MPI process, but the last timer printed should identify the last evaluator called.
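
A minimal sketch of how the two checks above might be wired into a per-rank wrapper script. Several things here are assumptions, not something confirmed in this thread: that the MPI launcher exports `OMPI_COMM_WORLD_RANK` (or `PMIX_RANK`) to each process, that your Kokkos version loads tools through the `KOKKOS_TOOLS_LIBS` environment variable, and that you have built a memory high-water-mark tool from Kokkos Tools yourself. The function names and the `KOKKOS_HWM_TOOL` variable are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build a per-rank log file name.
# Assumption: the launcher exports OMPI_COMM_WORLD_RANK or PMIX_RANK;
# if neither is set, rank 0 is used.
rank_log_name() {
  local prefix="${1:-run}"
  local rank="${OMPI_COMM_WORLD_RANK:-${PMIX_RANK:-0}}"
  printf '%s.rank%s.log\n' "$prefix" "$rank"
}

# Hypothetical wrapper: run one MPI process with diagnostics enabled and
# send its stdout and stderr to that process's own file.
run_rank_with_diagnostics() {
  # Flag named in the comment above; dumps every timer at runtime.
  export TEUCHOS_ENABLE_VERBOSE_TIMERS=1

  # Assumption: KOKKOS_HWM_TOOL (hypothetical name) holds the path to a
  # locally built Kokkos Tools memory high-water-mark library, and this
  # Kokkos version reads KOKKOS_TOOLS_LIBS to load it.
  if [ -n "${KOKKOS_HWM_TOOL:-}" ]; then
    export KOKKOS_TOOLS_LIBS="$KOKKOS_HWM_TOOL"
  fi

  "$@" > "$(rank_log_name diag)" 2>&1
}

# When invoked as a script with arguments, treat them as the command to run.
if [ "$#" -gt 0 ]; then
  run_rank_with_diagnostics "$@"
fi
```

Saved as an executable script, this would be placed between your usual launcher (`jsrun` on Summit) and the application command, so each rank writes its own `diag.rank<N>.log`. The tail of the log belonging to the failing rank is then the place to look for the last timer printed.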
