fix gpt2 train loss NaN problem by adding a line __syncthreads in BlockR… #33659
PR types
Bug fixes
PR changes
OPs
Describe
Background:
During GPT-2 training, the loss became unstable, failed to converge, and eventually turned into NaN.
Investigation found:
1) Training was normal on P40 but abnormal on V100.
2) With one extra line of log printing, training was normal; without the log printing, training was abnormal.
3) With the original serial-addition reduction, training was normal; with BlockReduceSum, training was abnormal.
Adding one __syncthreads() finally fixed the abnormal training.
The same synchronization fix was also applied to the other two BlockReduceSum variants.
The shared-memory size of 32 used by shared[32] comes from:
int wid = threadIdx.x / warpSize;
On NVIDIA GPUs the maximum blockDim is 1024 and warpSize is 32, so this size never exceeds maxBlockDim / warpSize = 32.
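For context, below is a minimal CUDA sketch (not the actual Paddle source) of the common shuffle-based BlockReduceSum pattern this PR touches. The names WarpReduceSum and SumKernel are illustrative, and the exact line at which the patch inserts __syncthreads() may differ from this sketch; the comments only mark the shared[] accesses that need synchronization and why 32 entries are enough.

```cuda
#include <cuda_runtime.h>

// Warp-level sum via shuffle-down: after the loop, lane 0 holds the warp's total.
template <typename T>
__device__ T WarpReduceSum(T val) {
  for (int offset = warpSize / 2; offset > 0; offset /= 2) {
    val += __shfl_down_sync(0xFFFFFFFFu, val, offset);
  }
  return val;
}

// Block-level sum: each warp reduces its own values, lane 0 of every warp
// publishes its partial sum into shared[], then warp 0 reduces those partials.
// shared[32] suffices because blockDim.x <= 1024 and warpSize == 32,
// so a block contains at most 1024 / 32 = 32 warps.
template <typename T>
__device__ T BlockReduceSum(T val) {
  __shared__ T shared[32];           // one slot per warp
  int lane = threadIdx.x % warpSize;
  int wid = threadIdx.x / warpSize;

  val = WarpReduceSum(val);          // each warp reduces its own 32 values

  // Barrier so a fast warp cannot overwrite shared[] while another warp is
  // still reading partial sums from a previous call to BlockReduceSum.
  __syncthreads();

  if (lane == 0) shared[wid] = val;  // lane 0 of each warp writes its partial

  // Barrier so warp 0 does not read shared[] before every warp has written
  // its partial sum; a missing barrier around shared[] is exactly the kind
  // of race that showed up as NaN loss on V100.
  __syncthreads();

  // Only the first blockDim.x / warpSize entries are valid; the rest read 0.
  val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : T(0);
  if (wid == 0) val = WarpReduceSum(val);  // final reduction within warp 0
  return val;
}

// Hypothetical usage: each block sums one chunk of the input array.
__global__ void SumKernel(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  float v = (i < n) ? in[i] : 0.0f;
  v = BlockReduceSum(v);
  if (threadIdx.x == 0) out[blockIdx.x] = v;
}
```

The sketch assumes blockDim.x is a multiple of warpSize (e.g. a 256-thread launch), so every lane participates in the shuffle with the full 0xFFFFFFFF mask.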
cherry-pick from:
#33658