Modify the reduce op according to the kernel primitive api #35282
Conversation
Thanks for your contribution!
paddle/fluid/operators/kernel_primitives/datamover_primitives.h
LGTM
LGTM for modifications in attn_bias_add.cu.h.
LGTM
LGTM. A few code-level optimization suggestions can be addressed in a follow-up PR.
PR types
Performance optimization
PR changes
OPs
Describe
1. Modify the reduce OP according to the kernel primitive API
Adapt to the kernel primitives API while keeping performance and functionality consistent with the previous implementation.
Performance change of ReduceAny before and after the replacement:
Performance change of ReduceHigherDim before and after the replacement:
2. Add ReduceHigherDimKernel and ReduceAnyKernel for higher performance in reduce_op.cu.h
Background: when adapting adaptive_avg_pool to fp16, reduce showed a performance regression for the case 4 2048 64 128: latency increased from 153.44us to 174us, a regression of more than 10%.
Cause: after the ReduceLastDim and ReduceAny code paths were merged, a reduce over only the last dimension performs extra index computation compared with before, which causes the regression.
Solution: on the CPU side, dispatch to a dedicated ReduceKernel according to reduce_type. This 1. removes the reduce_type branching from the GPU kernel, and 2. selects the index computation rule depending on whether the reduce is over the last dimension.
The performance comparison after the change is as follows:
Benchmark: adaptive_avg_pool performance change. The fp16 case computes in fp32, so a performance drop is expected there.
The main changes are as follows:
1. Standardize the naming of variables dedicated to reduce operations, renaming to kReduceMaxThread.
2. Uniformly use block_offset to denote the starting data position of the current block.
3. Uniformly use thread_offset to denote the starting data position of the current thread.
4. Add documentation for ReduceMode. kGlobalMode denotes a reduction across the threads within a block; it requires shared memory and thread synchronization, and each output depends on the data of all threads in the block. kLocalMode denotes a reduction within a single thread; there is no data dependency between threads, and each thread produces one result when it finishes.