🚀 Feature
Make the node_split and edge_split methods in python/dgl/distributed/dist_graph.py take less memory on each worker as the number of workers increases.
Motivation
DGL provides node_split() and edge_split() to split the training, validation, and test sets at runtime for distributed training. However, the _split_even_to_part method, called by node_split and edge_split in python/dgl/distributed/dist_graph.py, creates tensors whose length equals the total number of nodes/edges.
In my case, I am training a large graph with 13 billion edges (training set) on a cluster of 200 workers, and the edge_split method takes over 100 GB of memory on each worker.
To be more specific, this part creates a boolean tensor and an int64 tensor whose lengths are both 13 billion, so the total memory consumption is 13 billion × (1 byte + 8 bytes) ≈ 109 GiB.
This is a scalability bottleneck because adding more workers cannot decrease the per-worker memory consumption.
Alternatives
Pitch
A more scalable implementation of _split_even_to_part: we can compute the partition interval beforehand, and then create the nonzero_1d tensor based only on that interval. There may be some extra bookkeeping, but it is worth it if 100 GB of memory can be saved; see the sketch below.
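A minimal sketch of the idea, assuming a PyTorch boolean mask as input; split_even_interval, part_id, and chunk_size are hypothetical names for illustration, not DGL's actual API. The point is that each worker materializes int64 ids only for its own interval of selected elements, scanning the mask in fixed-size chunks:

```python
import torch

def split_even_interval(mask, num_parts, part_id, chunk_size=1 << 24):
    """Hypothetical sketch: return the ids assigned to worker `part_id`
    by scanning `mask` in chunks, instead of materializing a
    full-length int64 nonzero tensor over all nodes/edges."""
    total = int(mask.sum())            # number of selected elements overall
    per_part = total // num_parts
    start = per_part * part_id         # this worker's interval over the
    end = total if part_id == num_parts - 1 else start + per_part  # selected ids

    pieces = []
    seen = 0                           # selected elements before current chunk
    for offset in range(0, mask.numel(), chunk_size):
        chunk = mask[offset:offset + chunk_size]
        n = int(chunk.sum())
        if seen + n > start and seen < end:
            # Only chunks overlapping [start, end) pay for a nonzero() call.
            ids = chunk.nonzero(as_tuple=True)[0] + offset
            pieces.append(ids[max(start - seen, 0):min(end - seen, n)])
        seen += n
        if seen >= end:
            break
    return torch.cat(pieces) if pieces else torch.empty(0, dtype=torch.long)
```

With 13 billion elements and the default chunk size of 2^24, the transient int64 tensor per chunk is roughly 128 MiB instead of ~104 GB, and each worker's output holds only total/num_parts ids.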
[Optional] A compression method for the created node/edge split. For example, if the generated split consists of consecutive ids (1, 2, 3, …, 1 million), we can simply represent it as an interval and save even more memory; a sketch follows.
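A sketch of such a compression, assuming the split is a sorted id tensor; compress_to_intervals is a hypothetical helper, not part of DGL:

```python
import torch

def compress_to_intervals(ids):
    """Hypothetical sketch: encode a sorted id tensor as a list of
    half-open (start, stop) intervals. A fully consecutive split
    collapses to a single pair, so memory drops from O(n) ids to
    O(number of runs)."""
    if ids.numel() == 0:
        return []
    # Positions where the run of consecutive ids breaks.
    breaks = (ids[1:] != ids[:-1] + 1).nonzero(as_tuple=True)[0] + 1
    starts = torch.cat([ids[:1], ids[breaks]])
    stops = torch.cat([ids[breaks - 1] + 1, ids[-1:] + 1])
    return list(zip(starts.tolist(), stops.tolist()))
```

For example, compress_to_intervals(torch.tensor([1, 2, 3, 7, 8])) returns [(1, 4), (7, 9)], and a perfectly consecutive million-id split costs two integers instead of a million.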
Additional context
@zheng-da Do you think this feature request is reasonable? I am working on this to unblock my experiments, and perhaps I can contribute my solution if you are interested.