
Reduce memory consumption when creating node/edge split for distributed training #3107

Closed
JingchengYu94 opened this issue Jul 6, 2021 · 3 comments

Comments

@JingchengYu94
Contributor

🚀 Feature

Make the node_split and edge_split methods in python/dgl/distributed/dist_graph.py take less memory on each worker as the number of workers increases.

Motivation

DGL provides node_split() and edge_split() to split the training, validation, and test sets at runtime for distributed training. However, the _split_even_to_part method called by node_split and edge_split in python/dgl/distributed/dist_graph.py creates tensors whose length equals the total number of nodes/edges.
In my case, I am training a large graph with 13 billion edges (training set) on a cluster of 200 workers. The edge_split method takes over 100GB of memory.

To be more specific, this part creates a boolean tensor and an int64 tensor, each with 13 billion elements, so the total memory consumption is 13 billion × (1 byte + 8 bytes) ≈ 109 GiB.
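
As a back-of-the-envelope check of that figure (a plain Python sketch; the 13 billion count is the training-set size mentioned above):

```python
n_edges = 13 * 10**9                      # training edges
mask_bytes = n_edges * 1                  # boolean mask: 1 byte per element
ids_bytes = n_edges * 8                   # int64 nonzero-id tensor: 8 bytes per element
print((mask_bytes + ids_bytes) / 2**30)   # ≈ 108.9 GiB
```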

This is a scalability bottleneck because adding more workers does not reduce the per-worker memory consumption.

Alternatives

Pitch

  1. A more scalable implementation of _split_even_to_part. We can compute each partition's interval beforehand and then build the nonzero_1d tensor only for that interval (see the sketch after this list). There is some extra bookkeeping, but it is worthwhile if 100GB of memory can be saved.
  2. [Optional] A compression method for the created node/edge split. For example, if the generated split consists of consecutive ids (1, 2, 3, …, 1 million), we can represent it with a single interval and save even more memory.
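
A minimal sketch of pitch 1, assuming the split is driven by a 1-D boolean mask. The function name split_even_scalable, the block_size parameter, and the two-pass block scan are all illustrative, not DGL's actual _split_even_to_part API:

```python
import numpy as np

def split_even_scalable(mask, num_parts, part_id, block_size=1 << 24):
    """Return part_id's even share of mask.nonzero() without materializing
    the full-length nonzero id tensor.

    Hypothetical helper: `mask` is any 1-D boolean array-like (e.g. a
    memory-mapped np.memmap), so peak extra memory is O(block_size) plus
    this worker's own share of ids.
    """
    n = len(mask)
    num_blocks = (n + block_size - 1) // block_size

    # Pass 1: count set bits per block; only O(n / block_size) int64s live here.
    counts = np.empty(num_blocks, dtype=np.int64)
    for b in range(num_blocks):
        counts[b] = np.count_nonzero(mask[b * block_size:(b + 1) * block_size])
    offsets = np.concatenate(([0], np.cumsum(counts)))  # nonzeros before each block
    total = int(offsets[-1])

    # Even split of the *nonzero* ids: this worker owns ranks [lo, hi).
    lo = total * part_id // num_parts
    hi = total * (part_id + 1) // num_parts

    # Pass 2: materialize ids only for the blocks overlapping [lo, hi).
    first = int(np.searchsorted(offsets, lo, side="right")) - 1
    last = int(np.searchsorted(offsets, hi, side="left"))
    parts = []
    for b in range(first, last):
        nz = np.flatnonzero(mask[b * block_size:(b + 1) * block_size]) + b * block_size
        start = max(lo - offsets[b], 0)
        stop = min(hi - offsets[b], counts[b])
        parts.append(nz[start:stop].astype(np.int64))
    return np.concatenate(parts) if parts else np.empty(0, dtype=np.int64)
```

Peak extra memory is O(block_size) for the scan plus this worker's own share of ids, instead of O(total nodes/edges). A toy version of pitch 2's interval compression, likewise hypothetical:

```python
import numpy as np

class IdRange:
    """Hypothetical compressed split: a consecutive id range [start, stop)
    stored as two integers instead of an explicit (stop - start)-element
    int64 tensor."""

    __slots__ = ("start", "stop")

    def __init__(self, start, stop):
        self.start, self.stop = start, stop

    def __len__(self):
        return self.stop - self.start

    def materialize(self):
        # Expand lazily, only when a consumer really needs the ids.
        return np.arange(self.start, self.stop, dtype=np.int64)
```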

Additional context

@JingchengYu94
Contributor Author

@zheng-da Do you think this feature request is reasonable? I am working on this to unblock my experiments. Maybe I can contribute my solution if you are interested.

@zheng-da
Collaborator

zheng-da commented Jul 7, 2021

If you can contribute, that would be great. Our solution isn't very good; we definitely need to find a better way to split nodes/edges.

@JingchengYu94
Contributor Author

PR merged. Closing this issue.
