
[Feature] Optimize dist_graph/_split_even_to_part memory usage #3132

Conversation

JingchengYu94
Contributor

Description

As described in this issue, this PR makes the memory consumption of edge_split and node_split decrease as the number of workers increases. In my case (a graph with 13 billion edges), the memory consumption of edge_split drops from ~108 GB to ~15 GB.

@zheng-da Could you please take a look at this PR?

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referred to in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

  • Add API count_nonzero in the dgl backend, to count the number of nonzero elements in a tensor
  • Add API add in the dgl backend, to add a scalar to a tensor (element-wise)
  • Change the process of split_even
    • Old: compute the nonzero tensor over all elements, then compute the offsets and take the slice
    • New: count the number of nonzero elements, compute the offsets, iterate over all elements block by block, compute the nonzero tensor of each block, and concatenate the results into the final answer (see the sketch below). The block size is set to #elements / #partitions, so it decreases as the number of workers increases.
  • Note: If elements is a dist_tensor, all elements are still pulled as before. We need to implement count_nonzero for the kvstore before applying this change to the dist_tensor case. We can discuss this further.
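
A minimal sketch of the block-wise idea in plain numpy, assuming a local boolean mask. The helper name and the exact offset computation below are illustrative assumptions, not the code in this PR:

    import numpy as np

    def split_even_blockwise(mask, num_parts, partid):
        """Return the indices of the nonzero entries of `mask` assigned to
        partition `partid`, without materializing the full nonzero tensor."""
        # 1. Count the nonzero elements once and split them evenly across partitions.
        total = np.count_nonzero(mask)
        offsets = np.linspace(0, total, num_parts + 1, dtype=np.int64)
        start, end = offsets[partid], offsets[partid + 1]

        # 2. Scan the mask block by block; the block size shrinks as the
        #    number of partitions (workers) grows.
        block_size = max(1, len(mask) // num_parts)
        picked, seen = [], 0                # `seen` = nonzeros encountered so far
        for blk_start in range(0, len(mask), block_size):
            idx = np.nonzero(mask[blk_start:blk_start + block_size])[0] + blk_start
            lo = max(start - seen, 0)       # clip to this partition's share
            hi = min(end - seen, len(idx))
            if lo < hi:
                picked.append(idx[lo:hi])
            seen += len(idx)
            if seen >= end:
                break
        return np.concatenate(picked) if picked else np.empty(0, dtype=np.int64)

With this shape, the extra memory held at any point is roughly one block of indices rather than the full nonzero tensor, which is why the peak usage shrinks as the number of workers grows.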

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@JingchengYu94 JingchengYu94 force-pushed the feature_optimize_dist_graph_split_even_to_part_memory_usage branch 2 times, most recently from 9fb8771 to 86dcfac on July 13, 2021 03:11
@BarclayII
Collaborator

We support PyTorch as old as 1.5.0, so I would recommend using numpy's count_nonzero for PyTorch as well.

@JingchengYu94 JingchengYu94 force-pushed the feature_optimize_dist_graph_split_even_to_part_memory_usage branch from 86dcfac to 297c9aa on July 13, 2021 04:46
@JingchengYu94 JingchengYu94 force-pushed the feature_optimize_dist_graph_split_even_to_part_memory_usage branch from c645210 to 9420a3d on July 14, 2021 02:46
@JingchengYu94
Contributor Author

@zheng-da Could you please take a look at this PR?

@BarclayII BarclayII merged commit b379dbd into dmlc:master Jul 15, 2021
@@ -290,6 +291,10 @@ def clamp(data, min_val, max_val):
def replace_inf_with_zero(x):
    return th.masked_fill(x, th.isinf(x), 0)

def count_nonzero(input):
    # TODO: fallback to numpy for backward compatibility
    return np.count_nonzero(input)
Collaborator
Don't we need to convert it into a numpy array first?

Contributor Author
It will convert implicitly.
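
For reference, a quick check of that implicit conversion (assuming a CPU tensor; a CUDA tensor would have to be moved to CPU first):

    import numpy as np
    import torch as th

    mask = th.tensor([True, False, True, True])
    print(np.count_nonzero(mask))  # 3 -- numpy converts the CPU tensor on the fly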

# Get the elements that belong to the partition.
partid = partition_book.partid
part_eles = eles[offsets[partid] : offsets[partid + 1]]
elements = F.tensor(elements)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here elements is a vector stored on the local machine?
If we can store the entire elements array on the local machine, why do we still have a memory issue?

Contributor Author
@JingchengYu94 JingchengYu94 Jul 15, 2021
elements is a boolean mask used to split the train, validation, and test sets (I learned how to use it from this example). Since it is boolean, it costs 1/8 of the memory of the full nonzero_1d tensor (which is int64), so at least my machines have enough memory for it.
Do you have any plans to redesign the data-split part to further reduce the memory cost?
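
For context, the 1/8 ratio is just the dtype sizes; a rough back-of-the-envelope with the edge count from the description (illustrative numbers only):

    num_edges = 13_000_000_000            # edge count from the PR description
    bool_mask_gb = num_edges * 1 / 1e9    # bool mask: 1 byte/element   -> ~13 GB
    int64_idx_gb = num_edges * 8 / 1e9    # int64 indices: 8 bytes/elem -> ~104 GB
    print(bool_mask_gb, int64_idx_gb)     # 13.0 104.0

The int64 figure is roughly in line with the ~108 GB reported in the description.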

Collaborator

I see. Then my general comment is that we can unify the code for DistTensor and local tensors.

Contributor Author

Yes, but we need to implement a distributed count_nonzero method first. I can do it next week; what do you think?
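
Not part of this PR, but a rough sketch of what a distributed count could look like, written here with torch.distributed purely for illustration; the actual implementation would presumably go through DGL's kvstore as discussed above:

    import torch as th
    import torch.distributed as dist

    def dist_count_nonzero(local_mask):
        """Count nonzero entries of a mask sharded across workers.
        Assumes the default process group is already initialized and each
        worker holds only its own shard of the mask."""
        # (mask != 0).sum() avoids th.count_nonzero, which needs PyTorch >= 1.7
        local_count = (local_mask != 0).sum().to(th.int64).reshape(1)
        dist.all_reduce(local_count, op=dist.ReduceOp.SUM)  # sum over all workers
        return int(local_count.item())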
