
[GraphBolt][CUDA] Use better memory allocation algorithm to avoid OOM. #7618

Merged — mfbalin merged 8 commits into dmlc:master from the gb_cuda_memory_alloc_env branch on Aug 2, 2024

Conversation

@mfbalin (Collaborator) commented Jul 30, 2024

Description

I have a 24GB GPU. Without this change, some examples go OOM; with it, they don't even use half of the GPU memory. This is due to the irregular minibatch sizes in GNN training: each minibatch has a differently sized node_ids(), for example. The default allocator does not handle this use case well, and the added option is much better suited for GNN minibatch training with GraphBolt.

See https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf.
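
For reference, the mechanism is PyTorch's documented allocator configuration; a rough sketch of setting it manually (standard PyTorch names, nothing specific to this PR's code) looks like this:

```python
import os

# Must be configured before the first CUDA allocation in the process,
# otherwise the caching allocator has already been initialized.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # imported after the allocator is configured

x = torch.zeros(1024, device="cuda")  # allocations now use expandable segments
```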

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@mfbalin requested a review from frozenbugs on July 30, 2024 at 14:21
@dgl-bot commented Jul 30, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot commented Jul 30, 2024

Commit ID: bb73447 · Build ID: 1 · Status: ❌ CI test failed in Stage [Lint Check].

@dgl-bot commented Jul 30, 2024

Commit ID: da2f788 · Build ID: 2 · Status: ❌ CI test failed in Stage [Lint Check].

@dgl-bot commented Jul 30, 2024

Commit ID: 5cb555d · Build ID: 3 · Status: ✅ CI test succeeded.

@TristonC self-assigned this on Jul 30, 2024
@TristonC (Collaborator)

Docs need to be added to let users know about this option. And it might be good to let users turn this allocation method on and off.

@mfbalin (Collaborator, Author) commented Jul 30, 2024

Docs need to be added to let users know about this option. And it might be good to let users turn this allocation method on and off.

@TristonC The user can already turn it off by setting the PyTorch environment variable; this PR sets it only if the user has not provided a value for it.

Where in the documentation should this be documented? I feel 99% of users won't even touch this option.

@TristonC (Collaborator) commented Jul 30, 2024

Here is my concern, although I really want it on by default: PyTorch marks this option as experimental and its default value is off.
It would be perfect if this worked out of the box. When it goes wrong, we need a way of letting the user know that it could be a root cause. The doc can be made viewable at https://docs.dgl.ai/api/python/dgl.graphbolt.html#

  • expandable_segments (experimental, default: False)

@mfbalin (Collaborator, Author) commented Jul 30, 2024

Here is my concern, although I really want it on by default: PyTorch marks this option as experimental and its default value is off. It would be perfect if this worked out of the box. When it goes wrong, we need a way of letting the user know that it could be a root cause. The doc can be made viewable at https://docs.dgl.ai/api/python/dgl.graphbolt.html#

  • expandable_segments (experimental, default: False)

@TristonC Do we just let users go OOM then? The new feature-fetching pipeline creates many more feature tensors because the pipeline is longer now (Disk -> CPU Cache -> GPU Cache), and there can be a 2x difference in memory usage between enabling this option and not. Do we let users run our new features and conclude that they do not work well?

I think a better way is to enable it and give the user a warning saying that we enabled it. If they want to disable it, they can set the corresponding environment variable themselves. That way, we don't need documentation, as the warning will provide the links to the user. Nobody will see the documentation anyway.
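
A minimal sketch of this warn-and-enable behavior, assuming it runs once at graphbolt import time (the exact wording and placement in the merged PR may differ):

```python
import os
import warnings

_ALLOC_CONF = "PYTORCH_CUDA_ALLOC_CONF"

# Only touch the allocator configuration when the user has not set it already,
# and tell them we did so plus how to override it.
if _ALLOC_CONF not in os.environ:
    os.environ[_ALLOC_CONF] = "expandable_segments:True"
    warnings.warn(
        "Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce "
        "GPU memory fragmentation for GraphBolt minibatch training. Set the "
        "environment variable yourself to override this behavior. See "
        "https://pytorch.org/docs/stable/notes/cuda.html"
        "#optimizing-memory-usage-with-pytorch-cuda-alloc-conf"
    )
```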

@TristonC (Collaborator)

I like the idea of the warning.

@mfbalin (Collaborator, Author) commented Jul 30, 2024

I like the idea of the warning.

Will change the PR accordingly; I appreciate the review.

@mfbalin requested a review from TristonC on July 31, 2024 at 02:57
@dgl-bot commented Jul 31, 2024

Commit ID: 51c6e16220a46b464fe9e9cd5d5e387c4130abbd · Build ID: 4 · Status: ❌ CI test failed in Stage [Lint Check].

@dgl-bot commented Jul 31, 2024

Commit ID: 892b5984e7ee242accf61e7177a36e06d70a7c85 · Build ID: 5 · Status: ❌ CI test failed in Stage [Lint Check].

@dgl-bot commented Jul 31, 2024

Commit ID: 9856159c3cab16d93e0de87ac10753fecfff717c · Build ID: 6 · Status: ❌ CI test failed in Stage [Lint Check].

@dgl-bot commented Jul 31, 2024

Commit ID: 90e41bd6d95f3fac86e47fc334d283d80dbc9a14 · Build ID: 7 · Status: ✅ CI test succeeded.

@dgl-bot commented Jul 31, 2024

Commit ID: ecef989 · Build ID: 8 · Status: ✅ CI test succeeded.

@TristonC (Collaborator)

LGTM

@mfbalin (Collaborator, Author) commented Aug 1, 2024

@TristonC @frozenbugs See pytorch/pytorch#130330 for the relevant discussion.

This PyTorch PR proposes to enable it by default: pytorch/pytorch#130959, and it has been approved.

Since it looks like it is almost a production-level feature, I hope there is no harm in us enabling it by default; if a user ever has a problem, they can consider turning it off, since we display a warning.

@frozenbugs (Collaborator)

Sounds good. Does it also mean that we can remove the flag-changing code in the future?

@mfbalin (Collaborator, Author) commented Aug 2, 2024

Sounds good. Does it also mean that we can remove the flag-changing code in the future?

If it is turned on by default, we can keep turning it on like this only for older torch versions in the future and not have to display a warning.
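
A rough sketch of what that future version gate could look like; the cutoff version below is a placeholder assumption, since no release enabling it by default had been decided at the time of this conversation:

```python
import os

import torch
from packaging import version

# Hypothetical torch release that ships expandable_segments enabled by default;
# this value is a placeholder, not something decided in this PR.
_DEFAULT_ON_SINCE = version.parse("2.5.0")

if (
    version.parse(torch.__version__) < _DEFAULT_ON_SINCE
    and "PYTORCH_CUDA_ALLOC_CONF" not in os.environ
):
    # Older torch: opt in explicitly; once newer releases enable this out of
    # the box, no warning (or override) would be needed on this code path.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```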

@mfbalin merged commit 56a1e64 into dmlc:master on Aug 2, 2024
2 checks passed
@mfbalin deleted the gb_cuda_memory_alloc_env branch on August 2, 2024 at 11:25