
[GraphBolt] Add offline sampling support #7679

Draft · wants to merge 2 commits into base: master
Conversation

@Liu-rj (Contributor) commented Aug 10, 2024

Description

Sample minibatches in advance and use MinibatchLoader to load during online training.
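The offline pattern described here — sample once up front, then replay the stored minibatches during online training — can be sketched in plain Python, independent of GraphBolt. The names `save_minibatches` and `MinibatchLoader` below are illustrative stand-ins, not the PR's actual API:

```python
import os
import pickle
import tempfile

def save_minibatches(minibatches, out_dir):
    """Offline phase: serialize each pre-sampled minibatch to its own file."""
    os.makedirs(out_dir, exist_ok=True)
    for i, mb in enumerate(minibatches):
        with open(os.path.join(out_dir, f"minibatch_{i:06d}.pkl"), "wb") as f:
            pickle.dump(mb, f)

class MinibatchLoader:
    """Online phase: iterate the stored minibatches in sampling order."""
    def __init__(self, out_dir):
        self.paths = sorted(
            os.path.join(out_dir, name)
            for name in os.listdir(out_dir)
            if name.endswith(".pkl")
        )

    def __iter__(self):
        for path in self.paths:
            with open(path, "rb") as f:
                yield pickle.load(f)

# Usage: sample offline once, then replay during training epochs.
tmp = tempfile.mkdtemp()
save_minibatches([{"seeds": [0, 1]}, {"seeds": [2, 3]}], tmp)
loaded = list(MinibatchLoader(tmp))
```

The split matters because sampling cost is paid once, while the training loop only pays sequential disk reads.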

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot (Collaborator) commented Aug 10, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot (Collaborator) commented Aug 10, 2024

Commit ID: 91de8e76f513c184bbe11a6425d58b31765c8eaf

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

minibatch.seeds.cpu(),
minibatch.input_nodes.cpu(),
minibatch.labels.cpu(),
[block.cpu() for block in minibatch.blocks],
Collaborator:

This will only work for DGL. I would recommend changing your code so that it works with any GNN framework. Otherwise, your methods should be called DGLMinibatchProvider or something similar.

Contributor Author:

For now this is just for alignment with DiskGNN; if I directly save all attributes of a minibatch, the saved size will be large and loading the minibatches will be time-consuming.

Collaborator:

You don't have to save all attributes of the minibatch; you can save only minibatch.sampled_subgraphs, as it is the counterpart to DGL's blocks.
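The reviewer's point — serialize only the fields you actually need rather than the whole minibatch — can be sketched generically. The class and field names below are hypothetical, not GraphBolt's real attributes:

```python
import pickle

def dump_selected(obj, fields):
    """Serialize only the named attributes of an object, dropping the rest
    to keep the on-disk minibatch small."""
    payload = {name: getattr(obj, name) for name in fields if hasattr(obj, name)}
    return pickle.dumps(payload)

class FakeMinibatch:
    """Illustrative minibatch: one small field worth persisting, one large
    field that can be recomputed or re-read at training time."""
    def __init__(self):
        self.sampled_subgraphs = [[0, 1], [1, 2]]
        self.huge_cached_tensor = list(range(100000))

mb = FakeMinibatch()
small = dump_selected(mb, ["sampled_subgraphs"])
full = pickle.dumps(
    {"sampled_subgraphs": mb.sampled_subgraphs,
     "huge_cached_tensor": mb.huge_cached_tensor}
)
# `small` omits the large field, so it is far smaller than `full`.
```

The same trade-off drives the discussion above: anything recoverable at load time (features, labels) need not be written to disk at all.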

if args.cpu_cache_size_in_gigabytes > 0 and isinstance(
features[("node", None, "feat")], gb.DiskBasedFeature
):
features[("node", None, "feat")] = gb.CPUCachedFeature(
Collaborator:

Suggested change:
- features[("node", None, "feat")] = gb.CPUCachedFeature(
+ features[("node", None, "feat")] = features[("node", None, "feat")].read_into_memory()
+ features[("node", None, "feat")] = gb.CPUCachedFeature(
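The idea behind read_into_memory() — materializing a disk-backed array fully in RAM before wrapping it in a cache — can be sketched with NumPy. This is an illustrative stand-in under assumed names, not GraphBolt's implementation:

```python
import os
import tempfile
import numpy as np

class DiskArray:
    """Minimal stand-in for a disk-backed feature: data stays on disk,
    accessed through a read-only memory map."""
    def __init__(self, path, shape, dtype=np.float32):
        self.data = np.memmap(path, mode="r", dtype=dtype, shape=shape)

    def read_into_memory(self):
        """Return a fully in-memory copy, taking disk I/O (and any bugs in
        the disk-read path) out of the picture."""
        return np.array(self.data)  # copies the memmap contents into RAM

# Build a small on-disk array, then hoist it into memory.
path = os.path.join(tempfile.mkdtemp(), "feat.bin")
np.arange(12, dtype=np.float32).reshape(4, 3).tofile(path)
feat = DiskArray(path, (4, 3))
in_mem = feat.read_into_memory()
```

This mirrors the debugging strategy in the review: once the feature is a plain in-memory array, any remaining error must come from the rest of the pipeline.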

Collaborator:

You can keep this line while debugging; that way you won't be affected by any potential bugs inside DiskBasedFeature, which might not be handling exceptions correctly.

Collaborator:

That way, the error will be inside TorchBasedFeature if the rest of your code has a bug.

@dgl-bot (Collaborator) commented Aug 11, 2024

Commit ID: 3df7de9

Build ID: 2

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@Liu-rj (Contributor Author) commented Aug 12, 2024

I still encounter similar issues with the updated master. It indicates that a worker thread did not exit normally, and the error disappears when I read the whole feature into main memory. Here is the error message:

(graphbolt) ➜  disk_based_feature git:(gb_offline) ✗ python node_classification_offline.py --gpu-cache-size-in-gigabytes=0 --cpu-cache-size-in-gigabytes=15 --dataset=ogbn-papers100M --epochs=3 --root=/nvme2n1/graphbolt_dataset
/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/__init__.py:109: GBWarning: 
An experimental feature for CUDA allocations is turned on for better allocation
pattern resulting in better memory usage for minibatch GNN training workloads.
See https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf,
and set the environment variable `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False`
if you want to disable it and set it True to acknowledge and disable the warning.

  gb_warning(WARNING_STR_TO_BE_SHOWN)
Namespace(epochs=3, lr=0.001, num_hidden=256, dropout=0.2, batch_size=1024, num_workers=0, dataset='ogbn-papers100M', root='/nvme2n1/graphbolt_dataset', fanout='10,10,10', mode='pinned-pinned-cuda', layer_dependency=False, batch_dependency=1, cpu_feature_cache_policy=None, cpu_cache_size_in_gigabytes=15.0, gpu_cache_size_in_gigabytes=0.0, early_stopping_patience=25, sample_mode='sample_neighbor', precision='high', enable_inference=False, subgraph_dir='/nvme2n1/graphbolt_dataset/ogbn-papers100M-1024-10,10,10')
Training in pinned-pinned-cuda mode.
Loading data...
The dataset is already preprocessed.
/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/impl/torch_based_feature_store.py:524: GBWarning: `DiskBasedFeature.pin_memory_()` is not supported. Leaving unmodified.
  gb_warning(
/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/dataloader.py:259: GBWarning: Multiple CopyTo operations were found in the datapipe graph. This case is not officially supported.
  gb_warning(
Prepare time: 23.05s
Sampling time: 0.00s
Training: 0it [00:00, ?it/s]/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/itemset.py:181: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  return torch.tensor(index, dtype=dtype)
Training: 1179it [09:33,  2.05it/s, num_nodes=674826, gpu_cache_miss=1, cpu_cache_miss=0.0475]
Evaluating: 123it [00:35,  3.43it/s, num_nodes=626865, gpu_cache_miss=1, cpu_cache_miss=0.0458]
Epoch 00, Loss: 1.4201, Approx. Train: 0.5782, Approx. Val: 0.6183, Time: 573.9403512477875s
Training: 5it [00:00,  7.24it/s, num_nodes=677453, gpu_cache_miss=1, cpu_cache_miss=0.0456]terminate called after throwing an instance of 'c10::Error'
  what():  An io_uring worker thread didn't not exit.
Exception raised from ~QueueAndBufferAcquirer at /home/ubuntu/dgl/graphbolt/src/./cnumpy.h:186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7674ef0cf897 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7674ef07fbee in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: graphbolt::storage::OnDiskNpyArray::QueueAndBufferAcquirer::~QueueAndBufferAcquirer() + 0x9e (0x767390033a66 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #3: graphbolt::storage::OnDiskNpyArray::IndexSelectIOUringImpl(at::Tensor) + 0x4f8 (0x76739002016a in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #4: <unknown function> + 0x62020b (0x76739002020b in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #5: <unknown function> + 0x626f12 (0x767390026f12 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #6: <unknown function> + 0x626cc9 (0x767390026cc9 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #7: <unknown function> + 0x6269c1 (0x7673900269c1 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #8: <unknown function> + 0x6276dd (0x7673900276dd in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #9: <unknown function> + 0x627553 (0x767390027553 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #10: <unknown function> + 0x627234 (0x767390027234 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #11: <unknown function> + 0x626fbb (0x767390026fbb in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #12: std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>::operator()() const + 0x50 (0x7673900356f4 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #13: std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 0x3a (0x767390029aa2 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #14: void std::__invoke_impl<void, void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::__invoke_memfun_deref, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) + 0x9c (0x767390045de7 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #15: std::__invoke_result<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>::type std::__invoke<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) + 0x6b (0x76739003dd0b in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #16: std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}::operator()() const + 0x6e (0x7673900351be in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #17: std::once_flag::_Prepare_execution::_Prepare_execution<std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}>(void (std::__future_base::_State_baseV2::*&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*))::{lambda()#1}::operator()() const + 0x2b (0x76739003dd43 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #18: std::once_flag::_Prepare_execution::_Prepare_execution<std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&)::{lambda()#1}>(void (std::__future_base::_State_baseV2::*&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*))::{lambda()#1}::_FUN() + 0x12 (0x76739003dd58 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #19: <unknown function> + 0x99ee8 (0x7674f0499ee8 in /lib/x86_64-linux-gnu/libc.so.6)
frame #20: <unknown function> + 0x61d754 (0x76739001d754 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #21: void std::call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*>(std::once_flag&, void (std::__future_base::_State_baseV2::*&&)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*), std::__future_base::_State_baseV2*&&, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*&&, bool*&&) + 0x79 (0x767390035243 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #22: std::__future_base::_State_baseV2::_M_set_result(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>, bool) + 0xa8 (0x767390029656 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #23: <unknown function> + 0x626a4e (0x767390026a4e in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #24: std::packaged_task<at::Tensor ()>::operator()() + 0x37 (0x76739003cc43 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #25: <unknown function> + 0x62028e (0x76739002028e in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #26: <unknown function> + 0x6258d2 (0x7673900258d2 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #27: <unknown function> + 0x625688 (0x767390025688 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #28: <unknown function> + 0x6254a6 (0x7673900254a6 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #29: std::function<void ()>::operator()() const + 0x36 (0x767390034cf8 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #30: tf::Executor::_invoke_async_task(tf::Worker&, tf::Node*) + 0x82 (0x767390032508 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #31: tf::Executor::_invoke(tf::Worker&, tf::Node*) + 0x2e6 (0x76739003102e in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #32: tf::Executor::_exploit_task(tf::Worker&, tf::Node*&) + 0x34 (0x76739002faca in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #33: tf::Executor::_spawn(unsigned long)::{lambda()#1}::operator()() const + 0x115 (0x76739002f5e5 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #34: void std::__invoke_impl<void, tf::Executor::_spawn(unsigned long)::{lambda()#1}>(std::__invoke_other, tf::Executor::_spawn(unsigned long)::{lambda()#1}&&) + 0x24 (0x76739005f985 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #35: std::__invoke_result<tf::Executor::_spawn(unsigned long)::{lambda()#1}>::type std::__invoke<tf::Executor::_spawn(unsigned long)::{lambda()#1}>(tf::Executor::_spawn(unsigned long)::{lambda()#1}&&) + 0x24 (0x76739005f940 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #36: void std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) + 0x2c (0x76739005f8ce in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #37: std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> >::operator()() + 0x1c (0x76739005f666 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #38: std::thread::_State_impl<std::thread::_Invoker<std::tuple<tf::Executor::_spawn(unsigned long)::{lambda()#1}> > >::_M_run() + 0x20 (0x76739005f3d8 in /home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/libgraphbolt_pytorch_2.3.1.so)
frame #39: <unknown function> + 0xd3b55 (0x7674ef4f0b55 in /home/ubuntu/miniconda3/envs/graphbolt/bin/../lib/libstdc++.so.6)
frame #40: <unknown function> + 0x94ac3 (0x7674f0494ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #41: <unknown function> + 0x126850 (0x7674f0526850 in /lib/x86_64-linux-gnu/libc.so.6)

[1]    1437649 IOT instruction (core dumped)  python node_classification_offline.py --gpu-cache-size-in-gigabytes=0

@mfbalin (Collaborator) commented Aug 12, 2024

I still encounter similar issues with the updated master. It indicates that a worker thread did not exit normally, and the error disappears when I read the whole feature into main memory. Here is the error message:

I couldn't reproduce the issue on my local machine. Trying on another machine.

@mfbalin (Collaborator) commented Aug 12, 2024

@Liu-rj Can you provide more information on the machine you are getting this error on?

@Liu-rj (Contributor Author) commented Aug 12, 2024

The machine is a g5.8xlarge with 32 cores, 128 GB RAM, and one 24 GB A10G GPU. The error occurs on an EBS io2 SSD; the same command runs normally on instance NVMe storage, and other runs adjusting the CPU and GPU cache sizes also complete normally on the EBS io2 SSD. I don't know whether it's an io_uring issue (maybe related to the hardware) or a bug in the code.

@mfbalin (Collaborator) commented Aug 12, 2024

The machine is a g5.8xlarge with 32 cores, 128 GB RAM, and one 24 GB A10G GPU. The error occurs on an EBS io2 SSD; the same command runs normally on instance NVMe storage, and other runs adjusting the CPU and GPU cache sizes also complete normally on the EBS io2 SSD. I don't know whether it's an io_uring issue (maybe related to the hardware) or a bug in the code.

Since I can't reproduce the issue "yet", I might ask you to run the code with a modification to see if it fixes the problem. I am currently running with thread sanitizer to see if it catches anything.

@dgl-bot (Collaborator) commented Aug 13, 2024

Commit ID: 332c034

Build ID: 3

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link



def main():
start = time.time()
Collaborator:

Suggested change:
- start = time.time()
+ torch.ops.graphbolt.set_num_io_uring_threads(4)
+ start = time.time()

@mfbalin (Collaborator) commented Aug 14, 2024

I still encounter similar issues with the updated master. It indicates that a worker thread did not exit normally, and the error disappears when I read the whole feature into main memory. Here is the error message:

#7698 fixes the issue.

3 participants