🚀 Feature
Make the node_split and edge_split methods in python/dgl/distributed/dist_graph.py take less memory on each worker as the number of workers increases.
Motivation
DGL provides node_split() and edge_split() to split the training, validation, and test sets at runtime for distributed training. However, the _split_even_to_part method, called by node_split and edge_split in python/dgl/distributed/dist_graph.py, creates tensors whose length equals the total number of nodes/edges.
In my case, I am training a large graph with 13 billion edges (training set) on a cluster of 200 workers, and the edge_split method takes over 100 GB of memory on each worker.
To be more specific, this part creates a boolean tensor and an int64 tensor whose lengths are both 13 billion, so the total memory consumption is 13 billion × (1 byte + 8 bytes) ≈ 109 GiB.
This is a scalability bottleneck because adding more workers cannot decrease the per-worker memory consumption.
Alternatives
Pitch
A more scalable implementation of _split_even_to_part: we can compute the partition interval beforehand, and then create the nonzero_1d tensor based only on that interval. There may be some extra bookkeeping, but it is worth it if 100 GB of memory can be saved; see the sketch below.
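A minimal sketch of the idea, assuming a PyTorch boolean mask as input; split_even_interval, part_id, and chunk_size are hypothetical names for illustration, not DGL's actual API. The point is that each worker materializes int64 ids only for its own interval of selected elements, scanning the mask in fixed-size chunks:

```python
import torch

def split_even_interval(mask, num_parts, part_id, chunk_size=1 << 24):
    """Hypothetical sketch: return the ids assigned to worker `part_id`
    by scanning `mask` in chunks, instead of materializing a
    full-length int64 nonzero tensor over all nodes/edges."""
    total = int(mask.sum())            # number of selected elements overall
    per_part = total // num_parts
    start = per_part * part_id         # this worker's interval over the
    end = total if part_id == num_parts - 1 else start + per_part  # selected ids

    pieces = []
    seen = 0                           # selected elements before current chunk
    for offset in range(0, mask.numel(), chunk_size):
        chunk = mask[offset:offset + chunk_size]
        n = int(chunk.sum())
        if seen + n > start and seen < end:
            # Only chunks overlapping [start, end) pay for a nonzero() call.
            ids = chunk.nonzero(as_tuple=True)[0] + offset
            pieces.append(ids[max(start - seen, 0):min(end - seen, n)])
        seen += n
        if seen >= end:
            break
    return torch.cat(pieces) if pieces else torch.empty(0, dtype=torch.long)
```

With 13 billion elements and the default chunk size of 2^24, the transient int64 tensor per chunk is roughly 128 MiB instead of ~104 GB, and each worker's output holds only total/num_parts ids.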
[Optional] A compression method for the created node/edge split. For example, if the generated split consists of consecutive ids (1, 2, 3, …, 1 million), we can simply represent it as an interval and save even more memory; a sketch follows.
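A sketch of such a compression, assuming the split is a sorted id tensor; compress_to_intervals is a hypothetical helper, not part of DGL:

```python
import torch

def compress_to_intervals(ids):
    """Hypothetical sketch: encode a sorted id tensor as a list of
    half-open (start, stop) intervals. A fully consecutive split
    collapses to a single pair, so memory drops from O(n) ids to
    O(number of runs)."""
    if ids.numel() == 0:
        return []
    # Positions where the run of consecutive ids breaks.
    breaks = (ids[1:] != ids[:-1] + 1).nonzero(as_tuple=True)[0] + 1
    starts = torch.cat([ids[:1], ids[breaks]])
    stops = torch.cat([ids[breaks - 1] + 1, ids[-1:] + 1])
    return list(zip(starts.tolist(), stops.tolist()))
```

For example, compress_to_intervals(torch.tensor([1, 2, 3, 7, 8])) returns [(1, 4), (7, 9)], and a perfectly consecutive million-id split costs two integers instead of a million.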
Additional context
@zheng-da Do you think this feature request is reasonable? I am working on this to unblock my experiments, and perhaps I can contribute my solution if you are interested.