
[Feature] Optimize dist_graph/_split_even_to_part memory usage #3132

Conversation

JingchengYu94
Contributor

Description

As described in this issue, this PR makes the memory consumption of edge_split and node_split decrease as the number of workers increases. In my case (a graph with 13 billion edges), the memory consumption of edge_split drops from ~108 GB to ~15 GB.

@zheng-da Could you please take a look at this PR?

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred to in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

  • Add API count_nonzero in the dgl backend, to count the number of nonzero elements in a tensor
  • Add API add in the dgl backend, to add a scalar to a tensor (element-wise)
  • Change the process of split_even
    • Old: compute the nonzero tensor over all elements, then compute the offsets and take the slice
    • New: count the number of nonzero elements, compute the offsets, iterate over all elements block by block, compute the nonzero tensor of each block, and concatenate the results into the final answer (see the sketch below). The block size is set to #elements / #partitions, so it decreases as the number of workers increases.
  • Note: If elements is a dist_tensor, all elements are still pulled as before. We need to implement count_nonzero for the kvstore before applying this change to the dist_tensor case. We can discuss this further.
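
A minimal sketch of the block-wise idea in plain numpy, assuming a local boolean mask. The helper name and the exact offset computation below are illustrative assumptions, not the code in this PR:

    import numpy as np

    def split_even_blockwise(mask, num_parts, partid):
        """Return the indices of the nonzero entries of `mask` assigned to
        partition `partid`, without materializing the full nonzero tensor."""
        # 1. Count the nonzero elements once and split them evenly across partitions.
        total = np.count_nonzero(mask)
        offsets = np.linspace(0, total, num_parts + 1, dtype=np.int64)
        start, end = offsets[partid], offsets[partid + 1]

        # 2. Scan the mask block by block; the block size shrinks as the
        #    number of partitions (workers) grows.
        block_size = max(1, len(mask) // num_parts)
        picked, seen = [], 0                # `seen` = nonzeros encountered so far
        for blk_start in range(0, len(mask), block_size):
            idx = np.nonzero(mask[blk_start:blk_start + block_size])[0] + blk_start
            lo = max(start - seen, 0)       # clip to this partition's share
            hi = min(end - seen, len(idx))
            if lo < hi:
                picked.append(idx[lo:hi])
            seen += len(idx)
            if seen >= end:
                break
        return np.concatenate(picked) if picked else np.empty(0, dtype=np.int64)

With this shape, the extra memory held at any point is roughly one block of indices rather than the full nonzero tensor, which is why the peak usage shrinks as the number of workers grows.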

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@JingchengYu94 JingchengYu94 force-pushed the feature_optimize_dist_graph_split_even_to_part_memory_usage branch 2 times, most recently from 9fb8771 to 86dcfac on July 13, 2021 03:11
@BarclayII
Collaborator

We support PyTorch as old as 1.5.0, so I would recommend using numpy's count_nonzero for PyTorch as well.

@JingchengYu94 JingchengYu94 force-pushed the feature_optimize_dist_graph_split_even_to_part_memory_usage branch from 86dcfac to 297c9aa on July 13, 2021 04:46
@JingchengYu94 JingchengYu94 force-pushed the feature_optimize_dist_graph_split_even_to_part_memory_usage branch from c645210 to 9420a3d on July 14, 2021 02:46
@JingchengYu94
Contributor Author

@zheng-da Could you please take a look at this PR?

@BarclayII BarclayII merged commit b379dbd into dmlc:master Jul 15, 2021
@@ -290,6 +291,10 @@ def clamp(data, min_val, max_val):
def replace_inf_with_zero(x):
    return th.masked_fill(x, th.isinf(x), 0)

def count_nonzero(input):
    # TODO: fallback to numpy for backward compatibility
    return np.count_nonzero(input)
Collaborator
Don't we need to convert it into a numpy array first?

Contributor Author
It will convert implicitly.
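
For reference, a quick check of that implicit conversion (assuming a CPU tensor; a CUDA tensor would have to be moved to CPU first):

    import numpy as np
    import torch as th

    mask = th.tensor([True, False, True, True])
    print(np.count_nonzero(mask))  # 3 -- numpy converts the CPU tensor on the fly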

# Get the elements that belong to the partition.
partid = partition_book.partid
part_eles = eles[offsets[partid] : offsets[partid + 1]]
elements = F.tensor(elements)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here elements is a vector stored on the local machine?
If we can store the entire elements array on the local machine, why do we still have a memory issue?

Contributor Author
@JingchengYu94 JingchengYu94 Jul 15, 2021
elements is a boolean mask used to split the train, validation, and test sets (I learned how to use it from this example). Since it is boolean, it costs 1/8 of the memory of the full nonzero_1d tensor (which is int64), so at least my machines have enough memory for it.
Do you have any plans to redesign the data-split part to further reduce the memory cost?
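
For context, the 1/8 ratio is just the dtype sizes; a rough back-of-the-envelope with the edge count from the description (illustrative numbers only):

    num_edges = 13_000_000_000            # edge count from the PR description
    bool_mask_gb = num_edges * 1 / 1e9    # bool mask: 1 byte/element   -> ~13 GB
    int64_idx_gb = num_edges * 8 / 1e9    # int64 indices: 8 bytes/elem -> ~104 GB
    print(bool_mask_gb, int64_idx_gb)     # 13.0 104.0

The int64 figure is roughly in line with the ~108 GB reported in the description.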

Collaborator

I see. Then my general comment is that we can unify the code for DistTensor and local tensors.

Contributor Author

Yes, but we need to implement a distributed count_nonzero method first. I can do it next week; what do you think?
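
Not part of this PR, but a rough sketch of what a distributed count could look like, written here with torch.distributed purely for illustration; the actual implementation would presumably go through DGL's kvstore as discussed above:

    import torch as th
    import torch.distributed as dist

    def dist_count_nonzero(local_mask):
        """Count nonzero entries of a mask sharded across workers.
        Assumes the default process group is already initialized and each
        worker holds only its own shard of the mask."""
        # (mask != 0).sum() avoids th.count_nonzero, which needs PyTorch >= 1.7
        local_count = (local_mask != 0).sum().to(th.int64).reshape(1)
        dist.all_reduce(local_count, op=dist.ReduceOp.SUM)  # sum over all workers
        return int(local_count.item())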
