[hybrid] out data parallel as optimizer sharding parallel #35593
Conversation
Thanks for your contribution!
force-pushed from 7c0c5d8 to 35f0a4b
force-pushed from bbac81d to f5a5597
@@ -43,6 +43,8 @@ message ShardingConfig {
optional bool pp_allreduce_in_optimize = 10 [ default = false ];
optional int32 pp_degree = 11 [ default = 1 ];
optional bool optimize_cast = 12 [ default = false ];
// Optimizer sharding. Temporary plans and may be deprecated
optional bool _dp_as_optimizer_sharding = 13 [ default = false ];
Why not add a new config called stage and allow two values for now: stage=1 and stage=3?
update later
LGTM
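For context, a minimal usage sketch of how this flag could be enabled from the Python side, assuming it is exposed through fleet's sharding_configs dict under the same key as the proto field above (the degree values are illustrative, not part of this diff):

import paddle.distributed.fleet as fleet

# Sketch only: the "_dp_as_optimizer_sharding" key is assumed to mirror the
# ShardingConfig proto field added above; the parallel degrees are examples.
strategy = fleet.DistributedStrategy()
strategy.sharding = True
strategy.sharding_configs = {
    "mp_degree": 2,                      # tensor model parallel
    "pp_degree": 2,                      # pipeline parallel
    "dp_degree": 2,                      # outer data parallel group to reuse
    "_dp_as_optimizer_sharding": True,   # shard optimizer states over the dp group
}
fleet.init(is_collective=True, strategy=strategy)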
ctx->Inputs("X").size(), ctx->Outputs("Out").size()));
auto x_dims = ctx->GetInputsDim("X");
ctx->SetOutputsDim("Out", x_dims);
if (ctx->HasInputs("X") || ctx->HasOutputs("Out")) {
Why does this support an op without inputs/outputs?
LGTM
PR types
New features
PR changes
Others
Describe
Add a _dp_as_optimizer_sharding setting to ShardingConfig, which treats the outermost data parallel group as an optimizer sharding parallel group. The optimizer is sharded across that group, so each sharding rank stores only its own optimizer states, which reduces the optimizer-related persistable vars that must be kept and lowers GPU memory usage. For gradient communication, c_reduce_sum is used to reduce each gradient to its owning rank, and after the parameters are updated they are broadcast back with c_broadcast; the communication volume is of the same order as c_allreduce_sum.
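Purely as an illustration of this reduce-then-broadcast pattern (not the actual pass, which rewrites the static program with c_reduce_sum / c_broadcast ops), a dynamic-graph sketch using paddle.distributed collectives, with a hypothetical round-robin placement of optimizer states:

import paddle
import paddle.distributed as dist

def optimizer_sharding_step(params_and_grads, lr, cur_rank, num_ranks):
    for idx, (param, grad) in enumerate(params_and_grads):
        owner = idx % num_ranks  # hypothetical placement of this param's optimizer state
        # c_reduce_sum: gradient is summed onto the owning rank only.
        dist.reduce(grad, dst=owner)
        # Only the owner applies the update (plain SGD here, just for illustration).
        if cur_rank == owner:
            paddle.assign(param - lr * grad, output=param)
    for idx, (param, _) in enumerate(params_and_grads):
        # c_broadcast: updated parameter is sent back to every rank.
        dist.broadcast(param, src=idx % num_ranks)
    # Per gradient this costs one reduce plus one broadcast, the same order of
    # traffic as a single c_allreduce_sum.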
The current _dp_as_optimizer_sharding design is not final; it is only a temporary plan. When AMP and GlobalGradientClip are present, the inputs of the check_finite_and_unscale_op, update_loss_scaling_op, and sum_op they use may be pruned to empty. For AMP, we modify the logic of check_finite_and_unscale_op and update_loss_scaling_op so that they still execute correctly even with empty inputs. For GlobalGradientClip, we replace the sum_op with fill_constant(0.0), which does not affect the numerical correctness of the program. A test program for this boundary condition is available at https://gist.github.com/wangxicoding/d3b27289a545f62bec5130fc2952a542
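A rough sketch of why the sum_op to fill_constant(0.0) substitution is numerically safe; the helper name below is hypothetical, and the real change is made at the program-pass level rather than in user code:

import paddle

def local_squared_norm_sum(local_grad_squares):
    # Under optimizer sharding, a rank's input list to the sum_op can be pruned
    # to empty. Emitting fill_constant(0.0) in that case contributes nothing to
    # the global gradient-norm accumulation, so the clipping result is unchanged.
    if not local_grad_squares:
        return paddle.full(shape=[1], fill_value=0.0, dtype='float32')
    return paddle.add_n(local_grad_squares)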
_dp_as_optimizer_sharding already supports fuse_allreduce and fuse_grad_merge. Remaining TODO: support optimize_cast, which uses fp16 gradients in the forward and backward passes, reducing the number of cast ops, the storage of fp32 parameters, and the communication volume of the parameter broadcast.

Precision test
Ernie3.0, base model, single machine with 8 GPUs
baseline=2mp+2pp+2dp, optimizer_sharding=2mp+2pp+2opt_sharding
Memory test
Ernie3.0, single machine with 8 GPUs