
[GraphBolt] Check data alignment before copying the file #7641

Merged 1 commit into dmlc:master on Aug 2, 2024

Conversation

Liu-rj
Contributor

@Liu-rj Liu-rj commented Aug 2, 2024

Description

The node-feat.npy of the products dataset is not saved as C_CONTIGUOUS (and possibly the same for other small datasets). Currently there is no check on the flags before directly copying the file, which leads to errors here if the data file is not C_CONTIGUOUS.

A simple solution is to check whether the data file is C_CONTIGUOUS. If it is, we can copy the file directly; otherwise, we need to go through the regular save process.
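The idea can be sketched as follows with NumPy: load the array memory-mapped so only the header and layout flags are inspected, then either copy the file byte-for-byte or re-save it in C order. This is a minimal illustration, not the actual GraphBolt code; the helper name `copy_or_resave` is hypothetical.

```python
import shutil

import numpy as np


def copy_or_resave(in_path, out_path):
    """Copy a .npy file directly if its data is C-contiguous,
    otherwise re-save it so the on-disk layout is C order."""
    # mmap_mode="r" reads the header and maps the data lazily,
    # so checking the flags does not load the whole array.
    arr = np.load(in_path, mmap_mode="r")
    if arr.flags["C_CONTIGUOUS"]:
        # Safe to copy the file byte-for-byte.
        shutil.copyfile(in_path, out_path)
    else:
        # E.g. a Fortran-ordered array: convert to C order and save.
        np.save(out_path, np.ascontiguousarray(arr))
```

Either branch yields an output file whose array is C-contiguous, which is what a downstream reader that assumes C order (as in the error this PR fixes) requires.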

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Collaborator

dgl-bot commented Aug 2, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Aug 2, 2024

Commit ID: 6993e91

Build ID: 1

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin
Collaborator

mfbalin commented Aug 2, 2024

Does this fix running our examples with products?

@mfbalin mfbalin requested a review from Rhett-Ying August 2, 2024 10:54
@Liu-rj
Contributor Author

Liu-rj commented Aug 2, 2024

It fixes loading the feature file for DiskBasedFeature.

But I found another issue. Training gets stuck at the first iteration of the first epoch when running on products and arxiv. I don't think it is related to this PR, as I can reproduce it on ogbn-arxiv without applying the changes in this PR, but I will still post it here FYI.

I run this:

python examples/graphbolt/pyg/labor/node_classification.py --num-gpu-cached-features=0 --num-cpu-cached-features=1000 --dataset=ogbn-arxiv --sample-mode=sample_neighbor

When I interrupt the process with ctrl+c, the full output is:

(graphbolt) ➜  disk_based_feature git:(DiskBasedFeature_dglexample) ✗ python node_classification.py --gpu-cache-size-in-gigabytes=0 --cpu-cache-size-in-gigabytes=0.1 --dataset=ogbn-arxiv --epochs=3
/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/__init__.py:17: GBWarning: 
An experimental feature for CUDA allocations is turned on for better allocation
pattern resulting in better memory usage for minibatch GNN training workloads.
See https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf,
and set the environment variable `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False`
if you want to disable it.

  gb_warning(CUDA_ALLOCATOR_ENV_WARNING_STR)
Namespace(epochs=3, lr=0.001, num_hidden=256, dropout=0.2, batch_size=1024, num_workers=0, dataset='ogbn-arxiv', root='datasets', fanout='10,10,10', mode='pinned-pinned-cuda', layer_dependency=False, batch_dependency=1, cpu_feature_cache_policy=None, cpu_cache_size_in_gigabytes=0.1, gpu_cache_size_in_gigabytes=0.0, early_stopping_patience=25, sample_mode='sample_neighbor', precision='high', enable_inference=False)
Training in pinned-pinned-cuda mode.
Loading data...
The dataset is already preprocessed.
/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/impl/torch_based_feature_store.py:524: GBWarning: `DiskBasedFeature.pin_memory_()` is not supported. Leaving unmodified.
  gb_warning(
Training: 0it [01:20, ?it/s]^C^C
Traceback (most recent call last):
  File "/home/ubuntu/dgl/examples/graphbolt/disk_based_feature/node_classification.py", line 518, in <module>
    main()
  File "/home/ubuntu/dgl/examples/graphbolt/disk_based_feature/node_classification.py", line 489, in main
    best_model = train(
                 ^^^^^^
  File "/home/ubuntu/dgl/examples/graphbolt/disk_based_feature/node_classification.py", line 202, in train
    train_loss, train_acc, duration = train_helper(
                                      ^^^^^^^^^^^^^
  File "/home/ubuntu/dgl/examples/graphbolt/disk_based_feature/node_classification.py", line 163, in train_helper
    for step, minibatch in enumerate(dataloader):
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/tqdm-4.66.4-py3.11.egg/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 152, in __next__
    return self._get_next()
           ^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 140, in _get_next
    result = next(self.iterator)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 224, in wrap_next
    result = next_func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/datapipe.py", line 383, in __next__
    return next(self._datapipe_iter)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/base.py", line 385, in __iter__
    yield from self.datapipe
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 181, in wrap_generator
    response = gen.send(None)
               ^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/base.py", line 411, in __iter__
    for data in self.datapipe:
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 193, in wrap_generator
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/iter/callable.py", line 124, in __iter__
    for data in self.datapipe:
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 193, in wrap_generator
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/base.py", line 411, in __iter__
    for data in self.datapipe:
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 193, in wrap_generator
    response = gen.send(request)
               ^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/iter/callable.py", line 125, in __iter__
    yield self._apply_fn(data)
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/torch/utils/data/datapipes/iter/callable.py", line 90, in _apply_fn
    return self.fn(data)
           ^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/minibatch_transformer.py", line 38, in _transformer
    minibatch = self.transformer(minibatch)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/feature_fetcher.py", line 132, in _execute_stage
    value = next(handle)
            ^^^^^^^^^^^^
  File "/home/ubuntu/miniconda3/envs/graphbolt/lib/python3.11/site-packages/dgl-2.4-py3.11-linux-x86_64.egg/dgl/graphbolt/impl/cpu_cached_feature.py", line 165, in read_async
    missing_values = missing_values_future.wait()
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

@mfbalin
Collaborator

mfbalin commented Aug 2, 2024

It fixes loading the feature file for DiskBasedFeature.

But I found another issue. Training gets stuck at the first iteration of the first epoch when running on products and arxiv. I don't think it is related to this PR, as I can reproduce it on ogbn-arxiv without applying the changes in this PR, but I will still post it here FYI.

This issue is known; it is the same issue as batch dependency=4096 taking longer than batch dependency=1 and 64. It will be fixed today.

@mfbalin
Collaborator

mfbalin commented Aug 2, 2024

It fixes loading the feature file for DiskBasedFeature.

But I found another issue. Training gets stuck at the first iteration of the first epoch when running on products and arxiv. I don't think it is related to this PR, as I can reproduce it on ogbn-arxiv without applying the changes in this PR, but I will still post it here FYI.

@Liu-rj When the cache capacity is not at least a few multiples larger than the requests being made to the cache (the number of sampled nodes in a minibatch), this is expected behavior. I will update the documentation to indicate that the cache size should be larger than the largest request being made. The infinite-loop issue is hard to fix; if the user is using the cache like this, the user is doing it wrong.
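The sizing rule described above can be sketched as a simple guard: reject a cache configuration whose capacity is less than a few multiples of the largest single request. This is a hypothetical illustration, not GraphBolt's actual API; the helper name, parameters, and the safety factor of 4 are assumptions.

```python
def check_cache_capacity(
    cache_bytes, bytes_per_node, max_request_nodes, safety_factor=4
):
    """Raise if the cache cannot hold several multiples of the
    largest request, which can cause thrashing or stalls."""
    # Size of the single largest request, e.g. all sampled nodes
    # in one minibatch times the per-node feature size.
    request_bytes = max_request_nodes * bytes_per_node
    if cache_bytes < safety_factor * request_bytes:
        raise ValueError(
            f"cache capacity of {cache_bytes} bytes is too small for "
            f"requests of up to {request_bytes} bytes; use at least "
            f"{safety_factor}x the largest request, or disable the cache"
        )
```

For example, a 0.1 GB CPU cache serving minibatches that touch hundreds of thousands of nodes with 100-dimensional float32 features would fail this check, matching the stuck-training symptom reported above.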

@mfbalin mfbalin merged commit ea33b40 into dmlc:master Aug 2, 2024
1 check passed
@Liu-rj Liu-rj deleted the fix_dataset branch August 3, 2024 03:57