
[GraphBolt][CUDA] hetero_rgcn example crashes #7296

Closed
mfbalin opened this issue Apr 10, 2024 · 4 comments · Fixed by #7295
Assignees
Labels
Work Item Work items tracked in project tracker

Comments


mfbalin commented Apr 10, 2024

🔨Work Item

IMPORTANT:

  • This template is only for dev team to track project progress. For feature request or bug report, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

The crash is probably due to a bug introduced in #7239; investigating. So far it reproduces with CUDA 12.3 and CUDA 12.4. CUDA 12.3 is the version shipped in the NVIDIA torch container.

mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ CUDA_LAUNCH_BLOCKING=1 compute-sanitizer --tool memcheck python ../rgcn/hetero_rgcn.py
========= COMPUTE-SANITIZER
The dataset is already preprocessed.
Loaded dataset: node_classification
node_num for rel_graph_embed: {'author': tensor(1134649, device='cuda:0', dtype=torch.int32), 'field_of_study': tensor(59965, device='cuda:0', dtype=torch.int32), 'institution': tensor(8740, device='cuda:0', dtype=torch.int32)}
Number of embedding parameters: 154029312
Number of model parameters: 337460
Start to train...
Training~Epoch 01: 375it [04:27,  1.36it/s]========= Invalid __shared__ read of size 4 bytes
=========     at void cusparse::csr2csc_rows_expansion_kernel<(int)128, (int)8, int>(const T3 *, int, T3, const T3 *, T3 *)+0x20e0
=========     by thread (99,0,0) in block (78,0,0)
=========     Address 0x1400 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x252ff7]
=========                in /usr/lib/wsl/drivers/nv_dispi.inf_amd64_268e85175aa9e991/libcuda.so.1.1
=========     Host Frame: [0x93c3fa]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x99859a]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x79b7db]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x79965f]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x799bec]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame:cusparseCsr2cscEx2 [0xf0b22]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame:dgl::aten::CSRMatrix dgl::aten::impl::CSRTranspose<(DGLDeviceType)2, int>(dgl::aten::CSRMatrix) [0x9c93a0]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::CSRTranspose(dgl::aten::CSRMatrix) [0x32a0b5]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::UnitGraph::GetInCSR(bool) const [0x9937d2]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::UnitGraph::GetCSCMatrix(unsigned long) const [0x993d79]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::HeteroGraph::GetCSCMatrix(unsigned long) const [0x8a9b46]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::SpMM(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >) [0x7ea870]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::__mk_DGL0::{lambda(dgl::runtime::DGLArgs, dgl::aten::__mk_DGL0::DGLRetValue*)#1}::operator()(dgl::runtime, dgl::aten::__mk_DGL0::DGLRetValue) const [clone .constprop.0] [0x80c3ee]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:std::_Function_handler<void (dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*), dgl::aten::__mk_DGL0::{lambda(dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*)#1}>::_M_invoke(std::_Any_data const&, dgl::runtime::DGLArgs&&, dgl::runtime::DGLRetValue*&&) [0x80ca7d]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:DGLFuncCall [0x84a168]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:__pyx_f_3dgl_4_ffi_4_cy3_4core_FuncCall(void*, _object*, DGLValue*, int*) in dgl/_ffi/_cython/core.cpp:6805 [0x18ef7]
=========                in /home/mfbalin/dgl-1/python/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:__pyx_pw_3dgl_4_ffi_4_cy3_4core_12FunctionBase_5__call__(_object*, _object*, _object*) in dgl/_ffi/_cython/core.cpp:7629 [0x197cf]
=========                in /home/mfbalin/dgl-1/python/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:_PyObject_MakeTpCall [0x150a7a]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x149095]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x14326c]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:THPFunction_apply(_object*, _object*) [0x7e7d30]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame: [0x15a137]
=========                in /usr/bin/python
=========     Host Frame:PyObject_Call [0x16942a]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame: [0x16893d]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x14326c]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame: [0x16893d]
=========                in /usr/bin/python
=========     Host Frame:torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) [0x7e0002]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame:torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) [0x4ea4d3a]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) [0x4e9e815]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) [0x4e9f467]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) [0x4e96b75]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) [0x7dc04b]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame: [0xe62b2]
=========                in /lib/x86_64-linux-gnu/libstdc++.so.6
=========     Host Frame:start_thread in ./nptl/pthread_create.c:442 [0x94ac2]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x12684f]
=========                in /lib/x86_64-linux-gnu/libc.so.6
========= 
========= Program hit cudaErrorUnknown (error 999) due to "unknown error" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x34d0a2]
=========                in /usr/lib/wsl/drivers/nv_dispi.inf_amd64_268e85175aa9e991/libcuda.so.1.1
=========     Host Frame: [0x9985da]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x79b7db]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x79965f]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x799bec]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame:cusparseCsr2cscEx2 [0xf0b22]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame:dgl::aten::CSRMatrix dgl::aten::impl::CSRTranspose<(DGLDeviceType)2, int>(dgl::aten::CSRMatrix) [0x9c93a0]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::CSRTranspose(dgl::aten::CSRMatrix) [0x32a0b5]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::UnitGraph::GetInCSR(bool) const [0x9937d2]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::UnitGraph::GetCSCMatrix(unsigned long) const [0x993d79]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::HeteroGraph::GetCSCMatrix(unsigned long) const [0x8a9b46]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::SpMM(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >) [0x7ea870]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::__mk_DGL0::{lambda(dgl::runtime::DGLArgs, dgl::aten::__mk_DGL0::DGLRetValue*)#1}::operator()(dgl::runtime, dgl::aten::__mk_DGL0::DGLRetValue) const [clone .constprop.0] [0x80c3ee]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:std::_Function_handler<void (dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*), dgl::aten::__mk_DGL0::{lambda(dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*)#1}>::_M_invoke(std::_Any_data const&, dgl::runtime::DGLArgs&&, dgl::runtime::DGLRetValue*&&) [0x80ca7d]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:DGLFuncCall [0x84a168]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:__pyx_f_3dgl_4_ffi_4_cy3_4core_FuncCall(void*, _object*, DGLValue*, int*) in dgl/_ffi/_cython/core.cpp:6805 [0x18ef7]
=========                in /home/mfbalin/dgl-1/python/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:__pyx_pw_3dgl_4_ffi_4_cy3_4core_12FunctionBase_5__call__(_object*, _object*, _object*) in dgl/_ffi/_cython/core.cpp:7629 [0x197cf]
=========                in /home/mfbalin/dgl-1/python/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:_PyObject_MakeTpCall [0x150a7a]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x149095]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x14326c]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:THPFunction_apply(_object*, _object*) [0x7e7d30]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame: [0x15a137]
=========                in /usr/bin/python
=========     Host Frame:PyObject_Call [0x16942a]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame: [0x16893d]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x14326c]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame: [0x16893d]
=========                in /usr/bin/python
=========     Host Frame:torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) [0x7e0002]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame:torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) [0x4ea4d3a]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) [0x4e9e815]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) [0x4e9f467]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) [0x4e96b75]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) [0x7dc04b]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame: [0xe62b2]
=========                in /lib/x86_64-linux-gnu/libstdc++.so.6
=========     Host Frame:start_thread in ./nptl/pthread_create.c:442 [0x94ac2]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x12684f]
=========                in /lib/x86_64-linux-gnu/libc.so.6
========= 
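The backtrace shows the invalid shared-memory read inside cuSPARSE's csr2csc expansion kernel, reached through dgl::aten::CSRTranspose via cusparseCsr2cscEx2. For reference, the conversion that routine performs is equivalent to the following CPU-side sketch (pure Python; function and variable names are illustrative, not the DGL or cuSPARSE API):

```python
# Hypothetical CPU-side sketch of the CSR -> CSC conversion that
# cusparseCsr2cscEx2 performs on the GPU. A CSR matrix is represented by
# (indptr, indices, data); the CSC result indexes the same entries by column.

def csr_to_csc(n_rows, n_cols, indptr, indices, data):
    nnz = len(indices)
    # Count the number of entries in each column.
    col_counts = [0] * n_cols
    for c in indices:
        col_counts[c] += 1
    # Prefix-sum the counts to build the CSC column pointer array.
    col_ptr = [0] * (n_cols + 1)
    for c in range(n_cols):
        col_ptr[c + 1] = col_ptr[c] + col_counts[c]
    # Scatter row indices and values into their column buckets.
    row_idx = [0] * nnz
    out_data = [0] * nnz
    fill = list(col_ptr[:n_cols])  # next free slot per column
    for r in range(n_rows):
        for k in range(indptr[r], indptr[r + 1]):
            dst = fill[indices[k]]
            row_idx[dst] = r
            out_data[dst] = data[k]
            fill[indices[k]] += 1
    return col_ptr, row_idx, out_data
```

The GPU kernel parallelizes the scatter phase across thread blocks with shared-memory staging, which is where the sanitizer reports the out-of-bounds read.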

Depending work items or issues

mfbalin added the Work Item label Apr 10, 2024
mfbalin self-assigned this Apr 10, 2024

mfbalin commented Apr 10, 2024

I suspect cuSPARSE itself might be buggy, since they changed the relevant code in CUDA 12.2 and #7295 fixes the issue.
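A common way to work around a suspected toolkit regression is to gate the affected code path on the runtime CUDA version. The sketch below is purely illustrative of that pattern (the threshold and function names are hypothetical, not what #7295 actually does):

```python
# Hypothetical sketch: pick between the cuSPARSE transpose path and a
# custom fallback based on the runtime CUDA version string.

def parse_cuda_version(version_str):
    """Parse e.g. '12.3' into a comparable (major, minor) tuple."""
    major, minor = version_str.split(".")[:2]
    return int(major), int(minor)

def use_cusparse_transpose(cuda_version_str):
    # Suppose the regression appeared in the CUDA 12.x line: keep using
    # cuSPARSE on older toolkits and fall back to a hand-written kernel
    # otherwise. The (12, 0) cutoff is an assumption for illustration.
    return parse_cuda_version(cuda_version_str) < (12, 0)
```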


mfbalin commented Apr 10, 2024

Also, none of the asserts added in #7297 catch any error.
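Asserts of this kind typically validate the CSR invariants before the transpose is invoked; since they all pass, the input matrix itself appears well-formed. A hypothetical sketch of such checks (names illustrative, not the actual code in #7297):

```python
# Hypothetical sketch of CSR-invariant assertions that a debugging PR
# might add before calling a CSR -> CSC transpose routine.

def check_csr(n_rows, n_cols, indptr, indices):
    assert len(indptr) == n_rows + 1, "indptr must have n_rows + 1 entries"
    assert indptr[0] == 0, "indptr must start at 0"
    for i in range(n_rows):
        assert indptr[i] <= indptr[i + 1], "indptr must be non-decreasing"
    assert indptr[-1] == len(indices), "indptr must end at nnz"
    for c in indices:
        assert 0 <= c < n_cols, "column index out of bounds"
    return True
```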

mfbalin linked a pull request Apr 10, 2024 that will close this issue

mfbalin commented Apr 10, 2024

CUDA 11.8 does not crash.


mfbalin commented Apr 10, 2024

The CUDA 12.0 release notes include the following update:
"CUDA 12.0: Improved cusparseCsr2cscEx2() performance."

They probably introduced a bug into cuSPARSE in CUDA 12.
