IMPORTANT:
Project tracker: https://github.com/orgs/dmlc/projects/2
The crash is probably due to a bug in #7239; investigating. So far it reproduces with CUDA 12.3 and CUDA 12.4. CUDA 12.3 is the version shipped in the NVIDIA torch container.
mfbalin@BALIN-PC:~/dgl-1/examples/sampling/graphbolt/lightning$ CUDA_LAUNCH_BLOCKING=1 compute-sanitizer --tool memcheck python ../rgcn/hetero_rgcn.py
========= COMPUTE-SANITIZER
The dataset is already preprocessed.
Loaded dataset: node_classification
node_num for rel_graph_embed: {'author': tensor(1134649, device='cuda:0', dtype=torch.int32), 'field_of_study': tensor(59965, device='cuda:0', dtype=torch.int32), 'institution': tensor(8740, device='cuda:0', dtype=torch.int32)}
Number of embedding parameters: 154029312
Number of model parameters: 337460
Start to train...
Training~Epoch 01: 375it [04:27, 1.36it/s]
========= Invalid __shared__ read of size 4 bytes
=========     at void cusparse::csr2csc_rows_expansion_kernel<(int)128, (int)8, int>(const T3 *, int, T3, const T3 *, T3 *)+0x20e0
=========     by thread (99,0,0) in block (78,0,0)
=========     Address 0x1400 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x252ff7]
=========                in /usr/lib/wsl/drivers/nv_dispi.inf_amd64_268e85175aa9e991/libcuda.so.1.1
=========     Host Frame: [0x93c3fa]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x99859a]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x79b7db]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x79965f]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x799bec]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame:cusparseCsr2cscEx2 [0xf0b22]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame:dgl::aten::CSRMatrix dgl::aten::impl::CSRTranspose<(DGLDeviceType)2, int>(dgl::aten::CSRMatrix) [0x9c93a0]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::CSRTranspose(dgl::aten::CSRMatrix) [0x32a0b5]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::UnitGraph::GetInCSR(bool) const [0x9937d2]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::UnitGraph::GetCSCMatrix(unsigned long) const [0x993d79]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::HeteroGraph::GetCSCMatrix(unsigned long) const [0x8a9b46]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::SpMM(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >) [0x7ea870]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::__mk_DGL0::{lambda(dgl::runtime::DGLArgs, dgl::aten::__mk_DGL0::DGLRetValue*)#1}::operator()(dgl::runtime, dgl::aten::__mk_DGL0::DGLRetValue) const [clone .constprop.0] [0x80c3ee]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:std::_Function_handler<void (dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*), dgl::aten::__mk_DGL0::{lambda(dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*)#1}>::_M_invoke(std::_Any_data const&, dgl::runtime::DGLArgs&&, dgl::runtime::DGLRetValue*&&) [0x80ca7d]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:DGLFuncCall [0x84a168]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:__pyx_f_3dgl_4_ffi_4_cy3_4core_FuncCall(void*, _object*, DGLValue*, int*) in dgl/_ffi/_cython/core.cpp:6805 [0x18ef7]
=========                in /home/mfbalin/dgl-1/python/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:__pyx_pw_3dgl_4_ffi_4_cy3_4core_12FunctionBase_5__call__(_object*, _object*, _object*) in dgl/_ffi/_cython/core.cpp:7629 [0x197cf]
=========                in /home/mfbalin/dgl-1/python/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:_PyObject_MakeTpCall [0x150a7a]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x149095]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x14326c]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:THPFunction_apply(_object*, _object*) [0x7e7d30]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame: [0x15a137]
=========                in /usr/bin/python
=========     Host Frame:PyObject_Call [0x16942a]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame: [0x16893d]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x14326c]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame: [0x16893d]
=========                in /usr/bin/python
=========     Host Frame:torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) [0x7e0002]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame:torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) [0x4ea4d3a]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) [0x4e9e815]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) [0x4e9f467]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) [0x4e96b75]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) [0x7dc04b]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame: [0xe62b2]
=========                in /lib/x86_64-linux-gnu/libstdc++.so.6
=========     Host Frame:start_thread in ./nptl/pthread_create.c:442 [0x94ac2]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x12684f]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========
========= Program hit cudaErrorUnknown (error 999) due to "unknown error" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x34d0a2]
=========                in /usr/lib/wsl/drivers/nv_dispi.inf_amd64_268e85175aa9e991/libcuda.so.1.1
=========     Host Frame: [0x9985da]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x79b7db]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x79965f]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame: [0x799bec]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame:cusparseCsr2cscEx2 [0xf0b22]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12
=========     Host Frame:dgl::aten::CSRMatrix dgl::aten::impl::CSRTranspose<(DGLDeviceType)2, int>(dgl::aten::CSRMatrix) [0x9c93a0]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::CSRTranspose(dgl::aten::CSRMatrix) [0x32a0b5]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::UnitGraph::GetInCSR(bool) const [0x9937d2]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::UnitGraph::GetCSCMatrix(unsigned long) const [0x993d79]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::HeteroGraph::GetCSCMatrix(unsigned long) const [0x8a9b46]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::SpMM(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<dgl::BaseHeteroGraph>, dgl::runtime::NDArray, dgl::runtime::NDArray, dgl::runtime::NDArray, std::vector<dgl::runtime::NDArray, std::allocator<dgl::runtime::NDArray> >) [0x7ea870]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:dgl::aten::__mk_DGL0::{lambda(dgl::runtime::DGLArgs, dgl::aten::__mk_DGL0::DGLRetValue*)#1}::operator()(dgl::runtime, dgl::aten::__mk_DGL0::DGLRetValue) const [clone .constprop.0] [0x80c3ee]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:std::_Function_handler<void (dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*), dgl::aten::__mk_DGL0::{lambda(dgl::runtime::DGLArgs, dgl::runtime::DGLRetValue*)#1}>::_M_invoke(std::_Any_data const&, dgl::runtime::DGLArgs&&, dgl::runtime::DGLRetValue*&&) [0x80ca7d]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:DGLFuncCall [0x84a168]
=========                in /home/mfbalin/dgl-1/build/libdgl.so
=========     Host Frame:__pyx_f_3dgl_4_ffi_4_cy3_4core_FuncCall(void*, _object*, DGLValue*, int*) in dgl/_ffi/_cython/core.cpp:6805 [0x18ef7]
=========                in /home/mfbalin/dgl-1/python/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:__pyx_pw_3dgl_4_ffi_4_cy3_4core_12FunctionBase_5__call__(_object*, _object*, _object*) in dgl/_ffi/_cython/core.cpp:7629 [0x197cf]
=========                in /home/mfbalin/dgl-1/python/dgl/_ffi/_cy3/core.cpython-310-x86_64-linux-gnu.so
=========     Host Frame:_PyObject_MakeTpCall [0x150a7a]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x149095]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x14326c]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:THPFunction_apply(_object*, _object*) [0x7e7d30]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame: [0x15a137]
=========                in /usr/bin/python
=========     Host Frame:PyObject_Call [0x16942a]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame: [0x16893d]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x14326c]
=========                in /usr/bin/python
=========     Host Frame:_PyFunction_Vectorcall [0x15a9fb]
=========                in /usr/bin/python
=========     Host Frame:_PyEval_EvalFrameDefault [0x1455d6]
=========                in /usr/bin/python
=========     Host Frame: [0x16893d]
=========                in /usr/bin/python
=========     Host Frame:torch::autograd::PyNode::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) [0x7e0002]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame:torch::autograd::Node::operator()(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) [0x4ea4d3a]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) [0x4e9e815]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&) [0x4e9f467]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) [0x4e96b75]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
=========     Host Frame:torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool) [0x7dc04b]
=========                in /home/mfbalin/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so
=========     Host Frame: [0xe62b2]
=========                in /lib/x86_64-linux-gnu/libstdc++.so.6
=========     Host Frame:start_thread in ./nptl/pthread_create.c:442 [0x94ac2]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x12684f]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========
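Backtraces like the one above are long, and most frames are interpreter or autograd plumbing. As a triage aid, here is a minimal sketch that pulls the resolved symbol names out of compute-sanitizer "Host Frame" entries so the DGL and cuSPARSE frames stand out. The helper names `named_host_frames` and `first_frame_containing` are hypothetical, and the regular expression assumes the `Host Frame:<symbol> [0xADDR]` layout seen in this log; other compute-sanitizer versions may format frames differently.

```python
import re

# Matches "Host Frame:<optional symbol> [0xADDR]". DOTALL lets a symbol span a
# wrapped line; the symbol is empty for frames the tool could not resolve.
FRAME_RE = re.compile(r"Host\s+Frame:\s*(.*?)\s*\[(0x[0-9a-fA-F]+)\]", re.DOTALL)


def named_host_frames(log_text):
    """Return the resolved symbol names from Host Frame entries, in order."""
    names = []
    for match in FRAME_RE.finditer(log_text):
        # Collapse any wrapped whitespace inside the symbol to single spaces.
        symbol = " ".join(match.group(1).split())
        if symbol:  # skip unresolved frames such as "Host Frame: [0x252ff7]"
            names.append(symbol)
    return names


def first_frame_containing(log_text, needle):
    """Return the first resolved frame whose symbol contains `needle`, or None."""
    for name in named_host_frames(log_text):
        if needle in name:
            return name
    return None
```

Calling `first_frame_containing(log, "dgl::")` on this log would point at the `CSRTranspose` frame, which is the first DGL function above `cusparseCsr2cscEx2` in the stack.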
I suspect cuSPARSE might be buggy: they changed the relevant code in CUDA 12.2, and #7295 fixes the issue.
And #7297 does not catch any error in any of the asserts.
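For readers unfamiliar with what input checks on a CSR matrix look like, here is a minimal host-side sketch of structural sanity checks one could run before a transpose. This is an illustration only: it is not the set of asserts in #7297, whose contents are not shown in this thread, and the function name `check_csr` is hypothetical.

```python
def check_csr(num_rows, num_cols, indptr, indices):
    """Raise ValueError if the CSR arrays are structurally inconsistent."""
    if len(indptr) != num_rows + 1:
        raise ValueError("indptr must have num_rows + 1 entries")
    if indptr[0] != 0:
        raise ValueError("indptr must start at 0")
    if indptr[-1] != len(indices):
        raise ValueError("indptr[-1] must equal the number of nonzeros")
    # Row offsets must never decrease.
    for row in range(num_rows):
        if indptr[row] > indptr[row + 1]:
            raise ValueError(f"indptr decreases at row {row}")
    # Every column index must address a valid column.
    for position, col in enumerate(indices):
        if not 0 <= col < num_cols:
            raise ValueError(
                f"column index {col} at position {position} is out of range"
            )
```

If checks of this kind pass on the input and the kernel still reads out of bounds, that is consistent with the fault lying in the library rather than in the caller's data.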
CUDA 11.8 does not crash.
CUDA 12.0 had the following cuSPARSE update: "Improved cusparseCsr2cscEx2() performance."
They probably introduced a bug in cuSPARSE in CUDA 12.
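Since the failing call is a CSR-to-CSC conversion, a slow host-side reference can help cross-check results on small graphs when bisecting across CUDA versions. The following is a minimal pure-Python sketch using a counting sort. It is not DGL's implementation and not necessarily what #7295 does; the function name `csr_to_csc` and its return layout are assumptions made for illustration.

```python
def csr_to_csc(num_rows, num_cols, indptr, indices):
    """Transpose a CSR sparsity pattern into CSC with a counting sort.

    Returns (col_ptr, row_idx, perm), where perm[k] is the position in the
    CSR `indices` array of the k-th CSC entry (usable to permute edge data).
    """
    # Count the entries in each column.
    col_ptr = [0] * (num_cols + 1)
    for col in indices:
        col_ptr[col + 1] += 1
    # A prefix sum turns the counts into column start offsets.
    for col in range(num_cols):
        col_ptr[col + 1] += col_ptr[col]
    # Scatter each entry into its column; visiting rows in order keeps the
    # row indices within each column ascending.
    next_slot = col_ptr[:-1].copy()
    row_idx = [0] * len(indices)
    perm = [0] * len(indices)
    for row in range(num_rows):
        for pos in range(indptr[row], indptr[row + 1]):
            slot = next_slot[indices[pos]]
            row_idx[slot] = row
            perm[slot] = pos
            next_slot[indices[pos]] += 1
    return col_ptr, row_idx, perm
```

For example, a 2x3 matrix with nonzeros at (0,0), (0,2), and (1,1) has `indptr = [0, 2, 3]` and `indices = [0, 2, 1]`; the sketch returns `col_ptr = [0, 1, 2, 3]`, `row_idx = [0, 1, 0]`, and `perm = [0, 2, 1]`.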
mfbalin