Modify the reduce op according to the kernel primitive api #35282

Merged
merged 30 commits into from
Sep 8, 2021

@AnnaTrainingG AnnaTrainingG commented Aug 31, 2021

PR types

Performance optimization

PR changes

OPs

Describe

  1. Modify the reduce OP according to the kernel primitive api
  2. Add ReduceHigherDimKernel and ReduceAnyKernel for higher performance in reduce_op.cu.h
  3. Add API comments and specify variable names

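1. Modify the reduce OP according to the kernel primitive api
Adapt the reduce OP to the kernel primitives API while keeping functionality and performance consistent with the previous implementation.

ReduceAny performance before and after the replacement: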

| axis | case | old (us) | api (us) | speed up |
|---|---|---|---|---|
| [2, 3] | [16, 2048, 33, 33] | 175.75 | 176.65 | 0.99 |
| [0, 3] | [32, 12, 128, 128] | 35.268 | 35.559 | 0.99 |
| [1, 3] | [16, 32, 32, 32] | 4.936 | 4.287 | 1.15 |
| [1, 3] | [16, 64, 512, 64] | 155.58 | 155.06 | 1.00 |
| [1, 3] | [16, 2048, 32, 32] | 172.312 | 172.162 | 1.00 |
| [1, 3] | [16, 32, 2048, 32] | 157.6 | 157.53 | 1.00 |
| [0, 2] | [16, 2048, 32, 32] | 160.95 | 160.53 | 1.00 |
| [0, 2] | [16, 32, 2048, 32] | 159.832 | 160.614 | 1.00 |
| [0, 2] | [16, 2048, 33, 33] | 179.86 | 178.95 | 1.01 |
| [0, 2] | [16, 33, 2048, 33] | 236.8 | 231.78 | 1.02 |

ReduceHigher performance before and after the replacement:

|  | axis | case | pytorch (us) | paddle_old (us) | api (us) | speed up |
|---|---|---|---|---|---|---|
| 0 | axis=1 | [16, 8, 128] | 3.48 | 1.571 | 1.577 | 1.00 |
| 1 | axis=0 | [512, 2048] | 12.32 | 11.65 | 11.662 | 1.00 |
| 2 | axis=0 | [30522, 1024] | 160.66 | 152.68 | 153.41 | 1.00 |
| 3 | axis=0 | [32768, 1280] | 205.95 | 196.35 | 197.328 | 1.00 |
| 4 | axis=0 | [30522, 10240] | 1414.6 | 1409.20 | 1407.32 | 1.00 |
| 5 | axis=0 | [1024, 1280] | 8.265 | 9.37 | 9.39 | 1.00 |
| 6 | axis=0 | [30522, 10240] | 1415.5 | 1409.22 | 1407.18 | 1.00 |
| 7 | axis=0 | [2560, 10240] | 127.21 | 126.91 | 126.672 | 1.00 |
| 8 | axis=0 | [10240, 1280] | 77.276 | 69.44 | 69.418 | 1.00 |
| 9 | axis=0 | [32768, 2560] | 389.59 | 384.98 | 386.04 | 1.00 |
| 10 | axis=0 | [30522, 1024] | 161.01 | 152.43 | 152.966 | 1.00 |
| 11 | axis=0 | [32768, 1280] | 207.58 | 196.70 | 197.694 | 0.99 |
| 12 | axis=0 | [1024, 1280] | 7.949 | 9.06 | 9.39 | 0.97 |
| 13 | axis=0 | [256, 12800] | 18.259 | 20.65 | 21.592 | 0.96 |
| 14 | axis=0 | [256, 10240] | 15.742 | 19.10 | 20.039 | 0.95 |
| 15 | axis=0 | [128, 1024] | 5.535 | 4.88 | 5.23 | 0.93 |
| 16 | axis=0 | [16, 16, 1, 1] | 3.117 | 1.882 | 2.262 | 0.83 |
| 17 | axis=0 | [1024, 16] | 4.656 | 4.07 | 5.36 | 0.76 |
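
The speed up column in the tables above appears to be the old time divided by the new (api) time, rounded to two decimals. That reading is inferred from the rows rather than stated in the PR. A minimal sketch of the arithmetic, with hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>

// Ratio of old to new latency; values above 1.0 mean the new code is faster.
// Assumption: this matches how the "speed up" column was computed.
double Speedup(double old_us, double new_us) {
  assert(new_us > 0.0);
  return old_us / new_us;
}

// Round a ratio to two decimals, returned as integer hundredths so that
// comparisons do not depend on floating-point equality.
long ToHundredths(double ratio) {
  return std::lround(ratio * 100.0);
}
```

For example, `ToHundredths(Speedup(4.936, 4.287))` gives 115, which corresponds to the 1.15 reported for the `[16, 32, 32, 32]` ReduceAny case.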

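2. Add ReduceHigherDimKernel and ReduceAnyKernel for higher performance in reduce_op.cu.h
Background: when reduce was adapted for adaptive_avg_pool with the fp16 type, a performance regression appeared. For the case 4 2048 64 128, the time rose from 153.44 us to 174 us, a regression of more than 10%.
Cause: investigation showed that after the reduceLastDim and ReduceAny code paths were merged, a reduce over only the last dimension performed extra index computation compared with before, which caused the regression.
Solution: on the CPU side, call a separate ReduceKernel for each reduce_type. This (1) reduces the reduce_type checks on the GPU, and (2) sets the index computation rule according to whether the reduce is over the last dimension.
Performance comparison after the change:
adaptive_avg_pool benchmark. The fp16 type is computed in fp32, so some slowdown is expected.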

| case | dtype | old (us) | new (us) | speed up |
|---|---|---|---|---|
| 4 2048 64 128 | fp32 | 303.55 | 304.69 | 1.00 |
| 4 2048 64 128 | fp16 | 153.44 | 155.83 | 0.98 |
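
The host-side dispatch described in the solution can be sketched on the CPU as follows. This is an illustration under assumptions, not the code in reduce_op.cu.h: the enum `ReducePath` and the functions `ChoosePath`, `SumLastDim`, `SumHigherDim`, `SumAny`, and `ReduceSum` are hypothetical names, and plain loops stand in for GPU kernels. It shows why the last-dimension path needs only `row * cols + c`, while the general path pays a div/mod decomposition per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical labels for the three host-side paths (not Paddle's real enum).
enum class ReducePath { kLastDim, kHigherDim, kAny };

// Decide the path once on the host, so no per-element branch is needed later.
ReducePath ChoosePath(const std::vector<int>& shape, const std::vector<int>& axes) {
  assert(!shape.empty() && !axes.empty());
  if (axes.size() != 1) return ReducePath::kAny;
  const int last = static_cast<int>(shape.size()) - 1;
  return axes[0] == last ? ReducePath::kLastDim : ReducePath::kHigherDim;
}

// Last-dimension path: the input offset is simply row * cols + c.
std::vector<float> SumLastDim(const std::vector<float>& x, int rows, int cols) {
  std::vector<float> out(rows, 0.0f);
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c) out[r] += x[r * cols + c];
  return out;
}

// Higher-dimension path: view the input as [outer, mid, inner], reduce mid.
std::vector<float> SumHigherDim(const std::vector<float>& x, int outer, int mid, int inner) {
  std::vector<float> out(outer * inner, 0.0f);
  for (int o = 0; o < outer; ++o)
    for (int m = 0; m < mid; ++m)
      for (int i = 0; i < inner; ++i)
        out[o * inner + i] += x[(o * mid + m) * inner + i];
  return out;
}

// General path: each element's flat index is decomposed with div/mod to
// find its output slot. This is the extra index work the PR describes.
std::vector<float> SumAny(const std::vector<float>& x, const std::vector<int>& shape,
                          const std::vector<int>& axes) {
  const int rank = static_cast<int>(shape.size());
  std::vector<bool> reduced(rank, false);
  for (int a : axes) reduced[a] = true;
  std::vector<int> out_stride(rank, 0);  // stays 0 for reduced dimensions
  int out_size = 1;
  for (int d = rank - 1; d >= 0; --d) {
    if (reduced[d]) continue;
    out_stride[d] = out_size;
    out_size *= shape[d];
  }
  std::vector<float> out(out_size, 0.0f);
  for (std::size_t idx = 0; idx < x.size(); ++idx) {
    std::size_t rem = idx;
    int out_idx = 0;
    for (int d = rank - 1; d >= 0; --d) {
      out_idx += static_cast<int>(rem % shape[d]) * out_stride[d];
      rem /= shape[d];
    }
    out[out_idx] += x[idx];
  }
  return out;
}

// Host-side entry point: pick the specialised routine from the reduce axes.
std::vector<float> ReduceSum(const std::vector<float>& x, const std::vector<int>& shape,
                             const std::vector<int>& axes) {
  const ReducePath path = ChoosePath(shape, axes);
  if (path == ReducePath::kAny) return SumAny(x, shape, axes);
  const int axis = axes[0];
  const int rank = static_cast<int>(shape.size());
  int outer = 1, inner = 1;
  for (int d = 0; d < axis; ++d) outer *= shape[d];
  for (int d = axis + 1; d < rank; ++d) inner *= shape[d];
  if (path == ReducePath::kLastDim) return SumLastDim(x, outer, shape[axis]);
  return SumHigherDim(x, outer, shape[axis], inner);
}
```

All three paths produce the same sums for the same axes; the difference is only in how much index arithmetic each element costs, which is the overhead the separate kernels are meant to avoid.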
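3. Add API comments and specify variable names
The main changes are:
  1. Standardize the name of the variable used only by reduce operations, renaming it to kReduceMaxThread.
  2. Consistently use block_offset for the starting position of the current block's data.
  3. Consistently use thread_offset for the starting position of the current thread's data.
  4. Add documentation for ReduceMode. kGlobalMode is a reduction across the threads of a block: it needs shared memory and thread synchronization, and one output depends on the data of every thread in the block. kLocalMode is a reduction within a single thread: there is no data dependency between threads, and each thread produces one result when it finishes.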
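
The difference between the two ReduceMode values, and the roles of block_offset and thread_offset, can be illustrated with a CPU simulation of one thread block. This is a sketch under assumptions: `kSimThreads`, `LocalModeSum`, and `GlobalModeSum` are hypothetical names, the contiguous per-thread data layout is assumed, and a `std::vector` stands in for shared memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block size for the simulation (assumed power of two);
// it is not the value of kReduceMaxThread.
constexpr int kSimThreads = 4;

// Local mode: each simulated thread reduces only its own slice, so there is
// no dependency between threads and the block yields one result per thread.
std::vector<float> LocalModeSum(const std::vector<float>& x, int block_id, int per_thread) {
  // Start of the data owned by this block.
  const int block_offset = block_id * kSimThreads * per_thread;
  assert(static_cast<std::size_t>(block_offset + kSimThreads * per_thread) <= x.size());
  std::vector<float> out(kSimThreads, 0.0f);
  for (int tid = 0; tid < kSimThreads; ++tid) {
    // Start of the data owned by this thread.
    const int thread_offset = block_offset + tid * per_thread;
    for (int i = 0; i < per_thread; ++i) out[tid] += x[thread_offset + i];
  }
  return out;
}

// Global mode: per-thread partials are combined across the block, so the
// single output depends on the data of every thread in the block.
float GlobalModeSum(const std::vector<float>& x, int block_id, int per_thread) {
  // This vector plays the role of shared memory holding one partial per thread.
  std::vector<float> shared = LocalModeSum(x, block_id, per_thread);
  for (int stride = kSimThreads / 2; stride > 0; stride /= 2) {
    for (int tid = 0; tid < stride; ++tid) shared[tid] += shared[tid + stride];
    // On a GPU, a barrier such as __syncthreads() is required between steps.
  }
  return shared[0];
}
```

With 16 inputs 0..15 and two elements per thread, block 0 gives the per-thread results 1, 5, 9, 13 in local mode and the single value 28 in global mode.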

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@xingfeng01
Contributor

LGTM

1 similar comment
@ZzSean
Contributor

ZzSean commented Sep 7, 2021

LGTM

@limin2021
Contributor

LGTM for modifications in attn_bias_add.cu.h.

Contributor

@lanxianghit lanxianghit left a comment

LGTM

Contributor

@Xreki Xreki left a comment

LGTM. I have a few code-level optimization suggestions; they can be addressed in a follow-up PR.

@AnnaTrainingG AnnaTrainingG changed the title Add ReduceHigherDimKernel and ReduceAnyKernel for higher performance in reduce_op.cu.h Modify the Reduce OP according to the kernel primitive API Sep 7, 2021
@AnnaTrainingG AnnaTrainingG changed the title Modify the Reduce OP according to the kernel primitive API Modify the Reduce OP according to the kernel primitive api Sep 7, 2021
@AnnaTrainingG AnnaTrainingG changed the title Modify the Reduce OP according to the kernel primitive api Modify the reduce op according to the kernel primitive api Sep 7, 2021
@Xreki Xreki merged commit 82b33be into PaddlePaddle:develop Sep 8, 2021
2742195759 pushed a commit to 2742195759/Paddle that referenced this pull request Sep 10, 2021
AnnaTrainingG added a commit to AnnaTrainingG/Paddle that referenced this pull request Sep 29, 2021