
[GraphBolt] Add offline sampling support #7679

Draft · wants to merge 2 commits into base: master
Conversation

@Liu-rj (Contributor) commented Aug 10, 2024

Description

Sample minibatches in advance and use MinibatchLoader to load during online training.
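The offline pattern described here — sample once up front, then replay the stored minibatches during online training — can be sketched in plain Python, independent of GraphBolt. The names `save_minibatches` and `MinibatchLoader` below are illustrative stand-ins, not the PR's actual API:

```python
import os
import pickle
import tempfile

def save_minibatches(minibatches, out_dir):
    """Offline phase: serialize each pre-sampled minibatch to its own file."""
    os.makedirs(out_dir, exist_ok=True)
    for i, mb in enumerate(minibatches):
        with open(os.path.join(out_dir, f"minibatch_{i:06d}.pkl"), "wb") as f:
            pickle.dump(mb, f)

class MinibatchLoader:
    """Online phase: iterate the stored minibatches in sampling order."""
    def __init__(self, out_dir):
        self.paths = sorted(
            os.path.join(out_dir, name)
            for name in os.listdir(out_dir)
            if name.endswith(".pkl")
        )

    def __iter__(self):
        for path in self.paths:
            with open(path, "rb") as f:
                yield pickle.load(f)

# Usage: sample offline once, then replay during training epochs.
tmp = tempfile.mkdtemp()
save_minibatches([{"seeds": [0, 1]}, {"seeds": [2, 3]}], tmp)
loaded = list(MinibatchLoader(tmp))
```

The split matters because sampling cost is paid once, while the training loop only pays sequential disk reads.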

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot (Collaborator) commented Aug 10, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot (Collaborator) commented Aug 10, 2024

Commit ID: 91de8e76f513c184bbe11a6425d58b31765c8eaf

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

minibatch.seeds.cpu(),
minibatch.input_nodes.cpu(),
minibatch.labels.cpu(),
[block.cpu() for block in minibatch.blocks],
Collaborator:

This will only work for DGL. I would recommend changing your code so that it works with any GNN framework. Otherwise, your methods should be called DGLMinibatchProvider or something similar.

Contributor Author:

For now this is just for alignment with DiskGNN; if I directly save all attributes of a minibatch, the saved size will be large and loading the minibatches will be time-consuming.

Collaborator:

You don't have to save all attributes of the minibatch; you can save only minibatch.sampled_subgraphs, as it is the counterpart to DGL's blocks.
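The reviewer's point — serialize only the fields you actually need rather than the whole minibatch — can be sketched generically. The class and field names below are hypothetical, not GraphBolt's real attributes:

```python
import pickle

def dump_selected(obj, fields):
    """Serialize only the named attributes of an object, dropping the rest
    to keep the on-disk minibatch small."""
    payload = {name: getattr(obj, name) for name in fields if hasattr(obj, name)}
    return pickle.dumps(payload)

class FakeMinibatch:
    """Illustrative minibatch: one small field worth persisting, one large
    field that can be recomputed or re-read at training time."""
    def __init__(self):
        self.sampled_subgraphs = [[0, 1], [1, 2]]
        self.huge_cached_tensor = list(range(100000))

mb = FakeMinibatch()
small = dump_selected(mb, ["sampled_subgraphs"])
full = pickle.dumps(
    {"sampled_subgraphs": mb.sampled_subgraphs,
     "huge_cached_tensor": mb.huge_cached_tensor}
)
# `small` omits the large field, so it is far smaller than `full`.
```

The same trade-off drives the discussion above: anything recoverable at load time (features, labels) need not be written to disk at all.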

if args.cpu_cache_size_in_gigabytes > 0 and isinstance(
features[("node", None, "feat")], gb.DiskBasedFeature
):
features[("node", None, "feat")] = gb.CPUCachedFeature(
Collaborator:

Suggested change:
- features[("node", None, "feat")] = gb.CPUCachedFeature(
+ features[("node", None, "feat")] = features[("node", None, "feat")].read_into_memory()
+ features[("node", None, "feat")] = gb.CPUCachedFeature(
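The idea behind read_into_memory() — materializing a disk-backed array fully in RAM before wrapping it in a cache — can be sketched with NumPy. This is an illustrative stand-in under assumed names, not GraphBolt's implementation:

```python
import os
import tempfile
import numpy as np

class DiskArray:
    """Minimal stand-in for a disk-backed feature: data stays on disk,
    accessed through a read-only memory map."""
    def __init__(self, path, shape, dtype=np.float32):
        self.data = np.memmap(path, mode="r", dtype=dtype, shape=shape)

    def read_into_memory(self):
        """Return a fully in-memory copy, taking disk I/O (and any bugs in
        the disk-read path) out of the picture."""
        return np.array(self.data)  # copies the memmap contents into RAM

# Build a small on-disk array, then hoist it into memory.
path = os.path.join(tempfile.mkdtemp(), "feat.bin")
np.arange(12, dtype=np.float32).reshape(4, 3).tofile(path)
feat = DiskArray(path, (4, 3))
in_mem = feat.read_into_memory()
```

This mirrors the debugging strategy in the review: once the feature is a plain in-memory array, any remaining error must come from the rest of the pipeline.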

Collaborator:

You can keep this line while debugging; that way you won't be affected by any potential bugs inside DiskBasedFeature, which might not be handling exceptions correctly.

Collaborator:

That way, the error will be inside TorchBasedFeature if the rest of your code has a bug.

@dgl-bot (Collaborator) commented Aug 11, 2024

Commit ID: 3df7de9

Build ID: 2

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@Liu-rj (Contributor Author) commented Aug 12, 2024

I still encounter similar issues with the updated master. It indicates that a worker thread did not exit normally, and the error disappears when I read the whole feature into main memory. Here is the error message:

(graphbolt) ➜  disk_based_feature git:(gb_offline) ✗ python node_classification_offline.py --gpu-cache-size-in-gigabytes=0 --cpu-cache-size-in-gigabytes=15 --dataset=ogbn-papers100M --epochs=3 --root=/nvme2n1/graphbolt_dataset
/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/__init__.py:109: GBWarning: 
An experimental feature for CUDA allocations is turned on for better allocation
pattern resulting in better memory usage for minibatch GNN training workloads.
See https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf,
and set the environment variable `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False`
if you want to disable it and set it True to acknowledge and disable the warning.

  gb_warning(WARNING_STR_TO_BE_SHOWN)
Namespace(epochs=3, lr=0.001, num_hidden=256, dropout=0.2, batch_size=1024, num_workers=0, dataset='ogbn-papers100M', root='/nvme2n1/graphbolt_dataset', fanout='10,10,10', mode='pinned-pinned-cuda', layer_dependency=False, batch_dependency=1, cpu_feature_cache_policy=None, cpu_cache_size_in_gigabytes=15.0, gpu_cache_size_in_gigabytes=0.0, early_stopping_patience=25, sample_mode='sample_neighbor', precision='high', enable_inference=False, subgraph_dir='/nvme2n1/graphbolt_dataset/ogbn-papers100M-1024-10,10,10')
Training in pinned-pinned-cuda mode.
Loading data...
The dataset is already preprocessed.
/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/impl/torch_based_feature_store.py:524: GBWarning: `DiskBasedFeature.pin_memory_()` is not supported. Leaving unmodified.
  gb_warning(
/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/dataloader.py:259: GBWarning: Multiple CopyTo operations were found in the datapipe graph. This case is not officially supported.
  gb_warning(
Prepare time: 23.05s
Sampling time: 0.00s
Training: 0it [00:00, ?it/s]/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/itemset.py:181: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(index, dtype=dtype)
Training: 1179it [09:33,  2.05it/s, num_nodes=674826, gpu_cache_miss=1, cpu_cache_miss=0.0475]
Evaluating: 123it [00:35,  3.43it/s, num_nodes=626865, gpu_cache_miss=1, cpu_cache_miss=0.0458]
Epoch 00, Loss: 1.4201, Approx. Train: 0.5782, Approx. Val: 0.6183, Time: 573.9403512477875s
Training: 5it [00:00,  7.24it/s, num_nodes=677453, gpu_cache_miss=1, cpu_cache_miss=0.0456]terminate called after throwing an instance of 'c10::Error'
  what():  An io_uring worker thread didn't not exit.
Exception raised from ~QueueAndBufferAcquirer at /home/ubuntu/dgl/graphbolt/src/./cnumpy.h:186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7674ef0cf897 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7674ef07fbee in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: graphbolt::storage::OnDiskNpyArray::QueueAndBufferAcquirer::~QueueAndBufferAcquirer() + 0x9e (0x767390033a66 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #3: graphbolt::storage::OnDiskNpyArray::IndexSelectIOUringImpl(at::Tensor) + 0x4f8 (0x76739002016a in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #4: <unknown function> + 0x62020b (0x76739002020b in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #5: <unknown function> + 0x626f12 (0x767390026f12 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #6: <unknown function> + 0x626cc9 (0x767390026cc9 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #7: <unknown function> + 0x6269c1 (0x7673900269c1 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #8: <unknown function> + 0x6276dd (0x7673900276dd in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #9: <unknown function> + 0x627553 (0x767390027553 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #10: <unknown function> + 0x627234 (0x767390027234 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #11: <unknown function> + 0x626fbb (0x767390026fbb in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #12: std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const + 0x50 (0x7673900356f4 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #13: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 0x3a (0x767390029aa2 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #14: void std::__invoke_impl<void, void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::__invoke_memfun_deref, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) + 0x9c (0x767390045de7 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #15: std::__invoke_result<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>::type std::__invoke<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) + 0x6b (0x76739003dd0b in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #16: std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}::operator()() const + 0x6e (0x7673900351be in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #17: std::once_flag::_Prepare_execution::_Prepare_execution<std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}>(void (std::__future_base::_State_baseV2::*&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*))::{lambda()#1}::operator()() const + 0x2b (0x76739003dd43 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #18: std::once_flag::_Prepare_execution::_Prepare_execution<std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}>(void (std::__future_base::_State_baseV2::*&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*))::{lambda()#1}::_FUN() + 0x12 (0x76739003dd58 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #19: <unknown function> + 0x99ee8 (0x7674f0499ee8 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x61d754 (0x76739001d754 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #21: void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) + 0x79 (0x767390035243 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #22: std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) + 0xa8 (0x767390029656 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #23: <unknown function> + 0x626a4e (0x767390026a4e in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #24: std::packaged_task<at::Tensor ()>::operator()() + 0x37 (0x76739003cc43 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #25: <unknown function> + 0x62028e (0x76739002028e in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #26: <unknown function> + 0x6258d2 (0x7673900258d2 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #27: <unknown function> + 0x625688 (0x767390025688 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #28: <unknown function> + 0x6254a6 (0x7673900254a6 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #29: std::function<void ()>::operator()() const + 0x36 (0x767390034cf8 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #30: tf::Executor::_invoke_async_task(tf::Worker&, tf::Node*) + 0x82 (0x767390032508 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #31: tf::Executor::_invoke(tf::Worker&, tf::Node*) + 0x2e6 (0x76739003102e in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #32: tf::Executor::_exploit_task(tf::Worker&, tf::Node*&) + 0x34 (0x76739002faca in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #33: tf::Executor::_spawn(unsigned long)::{lambda()#1}::operator()() const + 0x115 (0x76739002f5e5 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #34: void std::__invoke_impl<void, tf::Executor::_spawn(unsigned long)::{lambda()#1}>(std::__invoke_other, tf::Executor::_spawn(unsigned long)::{lambda()#1}&&) + 0x24 (0x76739005f985 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #35: std::__invoke_result<tf::Executor::_spawn(unsigned long)::{lambda()#1}>::type std::__invoke<tf::Executor::_spawn(unsigned long)::{lambda()#1}>(tf::Executor::_spawn(unsigned long)::{lambda()#1}&&) + 0x24 (0x76739005f940 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #36: void std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) + 0x2c (0x76739005f8ce in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #37: std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> >::operator()() + 0x1c (0x76739005f666 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #38: std::thread::_State_impl<std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> > >::_M_run() + 0x20 (0x76739005f3d8 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #39: <unknown function> + 0xd3b55 (0x7674ef4f0b55 in /home/ubuntu/miniconda3/envs/graphbolt/bin/../lib/libstdc++.so.6)
frame #40: <unknown function> + 0x94ac3 (0x7674f0494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: <unknown function> + 0x126850 (0x7674f0526850 in /lib/x86_64-linux-gnu/libc.so.6)

[1]    1437649 IOT instruction (core dumped)  python node_classification_offline.py --gpu-cache-size-in-gigabytes=0

@mfbalin (Collaborator) commented Aug 12, 2024

I still encounter similar issues with the updated master. It indicates that a worker thread did not exit normally, and the error disappears when I read the whole feature into main memory. Here is the error message:

I couldn't reproduce the issue on my local machine. Trying on another machine.

@mfbalin (Collaborator) commented Aug 12, 2024

@Liu-rj Can you provide more information on the machine you are getting this error on?

@Liu-rj (Contributor Author) commented Aug 12, 2024

The machine is a g5.8xlarge with 32 cores, 128 GB RAM, and one 24 GB A10G GPU. The error occurs on an EBS io2 SSD; the same command runs normally on instance NVMe storage, and other runs adjusting the CPU and GPU cache sizes also complete normally on the EBS io2 SSD. I don't know whether it's an io_uring issue (maybe related to the hardware) or a bug in the code.

@mfbalin (Collaborator) commented Aug 12, 2024

The machine is a g5.8xlarge with 32 cores, 128 GB RAM, and one 24 GB A10G GPU. The error occurs on an EBS io2 SSD; the same command runs normally on instance NVMe storage, and other runs adjusting the CPU and GPU cache sizes also complete normally on the EBS io2 SSD. I don't know whether it's an io_uring issue (maybe related to the hardware) or a bug in the code.

Since I can't reproduce the issue "yet", I might ask you to run the code with a modification to see if it fixes the problem. I am currently running with thread sanitizer to see if it catches anything.

@dgl-bot (Collaborator) commented Aug 13, 2024

Commit ID: 332c034

Build ID: 3

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link



def main():
start = time.time()
Collaborator:

Suggested change:
- start = time.time()
+ torch.ops.graphbolt.set_num_io_uring_threads(4)
+ start = time.time()

@mfbalin (Collaborator) commented Aug 14, 2024

I still encounter similar issues with the updated master. It indicates that a worker thread did not exit normally, and the error disappears when I read the whole feature into main memory. Here is the error message:

#7698 fixes the issue.

3 participants