
[hybrid performance] Grad fuse for gradient merge under pipeline mode #35004

Merged
merged 32 commits into PaddlePaddle:develop on Aug 20, 2021

Conversation

@FeixLiu (Contributor) commented Aug 19, 2021

PR types

Performance optimization

PR changes

Others

Describe

Fused gradient merge under pipeline mode
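To make the idea concrete, below is a minimal, framework-agnostic sketch (plain NumPy, not the actual Paddle implementation) of what fusing the gradient-merge buffers means: the per-parameter accumulation buffers are laid out in one contiguous tensor, so each micro-batch needs a single fused accumulation instead of one add per parameter, and the later communication can also run on one large buffer. All names in the sketch are illustrative.

```python
import numpy as np

params = {"w1": np.zeros((4, 4), dtype=np.float32),
          "b1": np.zeros((4,), dtype=np.float32)}

# Lay every parameter's accumulation buffer out in one contiguous tensor and
# remember the slice that belongs to each parameter.
offsets, total = {}, 0
for name, p in params.items():
    offsets[name] = (total, total + p.size)
    total += p.size
fused_grad_merge = np.zeros((total,), dtype=np.float32)  # single fused buffer

def accumulate_micro_batch(grads):
    """One fused accumulation per micro-batch instead of one add per param."""
    flat = np.concatenate([grads[name].ravel() for name in params])
    fused_grad_merge[:] += flat  # in a framework this is a single sum op

def merged_grad(name):
    """Read a parameter's accumulated gradient back out of the fused buffer."""
    lo, hi = offsets[name]
    return fused_grad_merge[lo:hi].reshape(params[name].shape)

# Example: two micro-batches of gradients accumulate into the fused buffer.
for _ in range(2):
    accumulate_micro_batch({n: np.ones_like(p) for n, p in params.items()})
assert np.all(merged_grad("w1") == 2.0)
```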

The following tests use the Ernie 3.0 model on 8 V100 GPUs, with PP=2, MP=2 and DP=2.

Throughput comparison in tokens/s (increase relative to baseline):

| Baseline | fp16 allreduce | grad fuse | fp16 allreduce + grad fuse | fp16 allreduce + optimize cast + grad fuse |
|---|---|---|---|---|
| 34285 | 35228 (+2.8%) | 36144 (+5.5%) | 35596 (+3.9%) | 39145 (+14.2%) |

Loss compared between baseline and fp16 allreduce

(images: base_fp16_loss, base_fp16_diff)

Loss compared between baseline and grad fuse

(images: base_grad_fuse_loss, base_grad_fuse_diff)

Loss compared between baseline and fp16 allreduce with grad fuse

(images: base_fp16_grad_fuse_loss, base_fp16_grad_fuse_diff)

Loss compared between baseline and fp16 allreduce + optimizer cast + grad fuse

(screenshots: loss and diff curves)

NPU Loss diff (By Peng Liu)


@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

(6 resolved review threads on python/paddle/fluid/optimizer.py)
@wangxicoding (Contributor) left a comment

Please also run one more test with optimize_cast + fp16_allreduce + fuse_grad_merge.

@FeixLiu (Contributor, Author) commented Aug 19, 2021

> Please also run one more test with optimize_cast + fp16_allreduce + fuse_grad_merge.

It's running now~~

@FeixLiu changed the title from "Gard fuse for gradient merge under pipeline mode" to "Grad fuse for gradient merge under pipeline mode" on Aug 19, 2021
@FeixLiu (Contributor, Author) commented Aug 20, 2021

There is still room for optimization: grouping the grads and params by dtype before fusing would reduce the number of coalesce ops and the number of fused vars. This can be done in the next PR.
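As a rough illustration of that follow-up idea only (not the actual change in this or the next PR), grouping by dtype before coalescing could look like the sketch below; the function and variable names are hypothetical.

```python
from collections import defaultdict
import numpy as np

def fuse_by_dtype(grads):
    """grads: dict of name -> ndarray. Build one fused 1-D buffer per dtype,
    so mixed fp16/fp32 gradients need one coalesce per dtype rather than
    many small fused vars."""
    groups = defaultdict(list)
    for name, g in grads.items():
        groups[g.dtype].append(g)
    # One "coalesce" (here: concatenate) per dtype group.
    return {dtype: np.concatenate([g.ravel() for g in gs])
            for dtype, gs in groups.items()}

grads = {"w1": np.ones((4, 4), np.float16),
         "b1": np.ones((4,), np.float16),
         "w2": np.ones((2, 2), np.float32)}
fused = fuse_by_dtype(grads)  # 2 fused buffers instead of 3 separate grads
```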

@wangxicoding (Contributor) left a comment

LGTM

@wangxicoding changed the title from "Grad fuse for gradient merge under pipeline mode" to "[hybrid performance] Grad fuse for gradient merge under pipeline mode" on Aug 20, 2021
@wangxicoding wangxicoding merged commit 4d9b2d6 into PaddlePaddle:develop Aug 20, 2021
@FeixLiu FeixLiu deleted the gard_fuse_for_sum branch August 20, 2021 09:13
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request Aug 31, 2021
wangxicoding pushed a commit that referenced this pull request Aug 31, 2021
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request Sep 2, 2021
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request Sep 3, 2021
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request Sep 3, 2021
PaddlePaddle#35116) (PaddlePaddle#35301)"

This reverts commit 2931df5.

Revert "[cherry-pick][hybrid performance] optim npu coalesce set constant (PaddlePaddle#35105) (PaddlePaddle#35302)"

This reverts commit 12260bd.

Revert "[cherry-pick][hybrid performance] optim the grad fuse for pipeline mode by sorting the grad by dtype (PaddlePaddle#35070) (PaddlePaddle#35300)"

This reverts commit e69cc21.

Revert "[cherry-pick][hybrid performance] Grad fuse for gradient merge under pipeline mode (PaddlePaddle#35004) (PaddlePaddle#35299)"

This reverts commit e931cd1.

Revert "Add flags to control whether to check Nan value of hccl_allreduce_sum. (PaddlePaddle#35093) (PaddlePaddle#35298)"

This reverts commit d4948bc.

Revert "[hybrid] Fix row parallel linear bias (PaddlePaddle#35186) (PaddlePaddle#35297)"

This reverts commit b36fb03.

Revert "[hybrid][npu] fix npu clear float status in pipeline (PaddlePaddle#35165) (PaddlePaddle#35295)"

This reverts commit 167685e.

Revert "[hybrid npu] fix npu found_finite in hybrid (PaddlePaddle#35134) (PaddlePaddle#35291)"

This reverts commit e64105f.

Revert "[cherry-pick][Hybrid Performance] Move the cast op of AMP which cast fp32 param to fp16 param to the optimizer (PaddlePaddle#34965) (PaddlePaddle#35296)"

This reverts commit 6fb58ae.

Revert "[cherry-pick] NPU use squared_l2_norm in GradientClipByGlobalNorm (PaddlePaddle#34836) (PaddlePaddle#35289)"

This reverts commit 38c27d5.
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request Sep 3, 2021
@FeixLiu FeixLiu mentioned this pull request Sep 3, 2021