
[HybridParallel]Support 1f1b for PipelineParallel #34483

Merged: 14 commits merged into PaddlePaddle:develop on Aug 2, 2021

Conversation

@ForFishes (Member) commented Jul 29, 2021

PR types

New features

PR changes

Others

Describe

[HybridParallel]Support 1f1b for PipelineParallel

This PR changes the current pipeline-parallel scheduling to the more memory-efficient 1F1B schedule, similar to Megatron's https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/schedules.py.
The scheduling diagram is as follows:
[scheduling diagram image]

GPT-117M model, V100-32G, PP=8, microbatch=2

| global batch | GPU memory before | GPU memory after |
| --- | --- | --- |
| 128 | OOM | 5876 |
| 512 | OOM | 5882 |
| 1024 | OOM | 5886 |
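
For reference, a schematic sketch of the 1F1B schedule adopted here; `run_1f1b`, `forward_step`, and `backward_step` are hypothetical placeholders, and peer-to-peer send/recv between stages is omitted (the real logic lives in pipeline_parallel.py):

```python
def run_1f1b(stage_id, num_stages, num_microbatches, forward_step, backward_step):
    # Warm-up: earlier stages run more forward micro-batches before their
    # first backward so that downstream stages can be fed.
    num_warmup = min(num_stages - stage_id - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup

    cached = []  # at most num_warmup + 1 activations are alive at any time
    for i in range(num_warmup):
        cached.append(forward_step(i))

    # Steady state: one forward immediately followed by one backward (1F1B).
    for i in range(num_steady):
        cached.append(forward_step(num_warmup + i))
        backward_step(cached.pop(0))

    # Cool-down: drain the remaining backward passes.
    for _ in range(num_warmup):
        backward_step(cached.pop(0))
```

Because each stage keeps at most num_warmup + 1 cached activations, instead of one per micro-batch as in the previous all-forward-then-all-backward schedule, peak GPU memory stays roughly flat as the global batch grows, which is consistent with the table above.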

@paddle-bot-old commented

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@wangxicoding self-requested a review on August 2, 2021, 05:21
@wangxicoding (Contributor) left a comment

LGTM

paddle.autograd.backward(
    self.scaler.scale(self.caches['outputs'][cache_id]))
input_tensor_grad = self._backward_step(input_tensor, output_tensor,
                                        output_tensor_grad)
Contributor:

output_tensor and output_tensor_grad are no longer needed after this point; it seems they could be released manually here.

Member (Author):

I don't think we can manually set them to None to release them. If the host side releases them early, the device may not have started computing yet.

Contributor:

That should be fine: the GPU kernel launch has already captured the addresses, so as long as the memory is not overwritten, by us or by anything else, while the kernels run, it works. Worth a try 🌚
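
For illustration only, a minimal, self-contained sketch of the release pattern being discussed; plain Paddle ops stand in for one micro-batch of the pipeline, and the assumption from this thread is that the already-launched kernels hold the device addresses they need:

```python
import paddle

# Hypothetical toy stand-in for one micro-batch (not code from this PR).
x = paddle.randn([1024, 1024])
x.stop_gradient = False
output_tensor = (x * 2.0).sum()

# Backward kernels are launched asynchronously on the device.
output_tensor.backward()

# Drop the host-side Python references once they are no longer needed so the
# allocator can reclaim the memory for later micro-batches; per the discussion
# above, the launched kernels already hold the device addresses.
output_tensor = None
x = None
```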

paddle.distributed.send(dtype, dst=1, group=group)

def send_meta(self, tensor, group):
    if isinstance(tensor, paddle.Tensor):
Contributor:

A suggestion: pipeline_parallel.py also contains a lot of isinstance(tensor, tuple) logic. It would be cleaner to wrap a single paddle.Tensor into a tuple and handle everything through the tuple code path.

Member (Author):

Indeed! This can be made more elegant when the code is rewritten later.
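
A minimal sketch of the normalization the reviewer suggests; `as_tuple` is a hypothetical helper name, not code from this PR:

```python
import paddle

def as_tuple(tensors):
    # Hypothetical helper: normalize a single paddle.Tensor to a one-element
    # tuple so downstream send/recv logic only has to handle the tuple case.
    if isinstance(tensors, paddle.Tensor):
        return (tensors,)
    return tuple(tensors)
```

send_meta and the matching receive logic could then iterate over `as_tuple(tensor)` unconditionally instead of branching on isinstance in several places.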

@sandyhouse left a comment

LGTM

@ForFishes merged commit 9e0bb91 into PaddlePaddle:develop on Aug 2, 2021
@ForFishes deleted the support_1f1b branch on August 2, 2021, 14:03