
modified reduce_max reduce_min reduce_prod for higher_performance and fix a bug in reduce_op.cuh #32974

Merged · 43 commits into PaddlePaddle:develop · Jun 22, 2021

Conversation

@AnnaTrainingG (Contributor) commented May 18, 2021

PR types

Function optimization

PR changes

OPs

Describe

Modified reduce_min, reduce_max, reduce_prod, reduce_all, and reduce_any.

ctest results:

```
Test project /paddle_test/commit/Paddle/build
    Start 709: test_max_op
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 7.51 sec

Test project /paddle_test/commit/Paddle/build
    Start 719: test_min_op
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 6.81 sec

Test project /paddle_test/commit/Paddle/build
    Start 826: test_prod_op
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 7.78 sec
```

Performance comparison, using max as an example (times in μs):

| axis | case | PyTorch (μs) | paddle_old (μs) | paddle_new (μs) | speedup old/new | speedup PyTorch/paddle_new | benchmark case? |
| --- | --- | --- | --- | --- | --- | --- | --- |
| axis=0 | [512, 2048] | 12.442 | 28.272 | 10.821 | 2.61 | 1.15 | |
| axis=0 | [128, 1024] | 5.595 | 5.181 | 3.711 | 1.40 | 1.51 | |
| axis=0 | [30522, 1024] | 162.77 | 1767.3 | 152.229 | 11.61 | 1.07 | |
| axis=0 | [1024, 16] | 4.703 | 2.471 | 3.509 | 0.70 | 1.34 | |
| axis=0 | [256, 12800] | 18.756 | 81.647 | 17.734 | 4.60 | 1.06 | |
| axis=0 | [256, 10240] | 15.742 | 59.888 | 15.379 | 3.89 | 1.02 | |
| axis=0 | [1024, 1280] | 11.625 | 33.204 | 8.399 | 3.95 | 1.38 | |
| axis=0 | [32768, 1280] | 205.95 | 3504.7 | 198.15 | 17.69 | 1.04 | |
| axis=0 | [30522, 10240] | 1414.6 | 32643 | 1437.523 | 22.71 | 0.98 | |
| axis=0 | [256, 10240] | 15.257 | 65.901 | 14.79 | 4.46 | 1.03 | |
| axis=0 | [1024, 1280] | 8.265 | 31.31 | 7.158 | 4.37 | 1.15 | |
| axis=0 | [32768, 1280] | 207.58 | 3501 | 198.297 | 17.66 | 1.05 | |
| axis=0 | [30522, 10240] | 1415.5 | 32554 | 1438.646 | 22.63 | 0.98 | |
| axis=0 | [2560, 10240] | 127.21 | 585.19 | 126.275 | 4.63 | 1.01 | |
| axis=0 | [10240, 1280] | 76.668 | 413.34 | 67.667 | 6.11 | 1.13 | |
| axis=0 | [32768, 2560] | 390.23 | 8323.7 | 383.609 | 21.70 | 1.02 | |
| axis=0 | [30522, 1024] | 160.21 | 1808.7 | 151.341 | 11.95 | 1.06 | |
| axis=0 | [16, 16, 1, 1] | 2.884 | 1.332 | 1.44 | 0.93 | 2.00 | |

Benchmark performance data:

| axis | case | PyTorch | paddle | paddle_new_last | old/new | PyTorch/new |
| --- | --- | --- | --- | --- | --- | --- |
| axis: [2, 3] | [16, 2048, 33, 33] | 171.1 | 199.8 | 164.36 | 1.22 | 1.04 |
| axis: [1] | [16, 8, 128] | 3.285 | 4.234 | 1.322 | 3.20 | 2.48 |
| axis: [0] | [16, 16, 1, 1] | 2.884 | 1.568 | 1.44 | 1.09 | 2.00 |
| axis: [] | [30522, 1024] | 146.45 | 143.12 | 142.99 | 1.00 | 1.02 |

Performance change of reduce_sum before and after optimization:

| reduce axis | speedup | vs. PyTorch |
| --- | --- | --- |
| axis = 0 | 1.4 ~ 22.7 | On par with or better than PyTorch. |
| axis = -1 | 1.0 ~ 1.3 | On par with or better than PyTorch; 2 of 17 cases are slower, at roughly 2× PyTorch's time. |
| axis = 1 | 2.44 ~ 24.88 | On par with or better than PyTorch; 1 of 17 cases is slower, at roughly 2× PyTorch's time. |
| axis = [] | 1.0 ~ 1.03 | On par with or better than PyTorch; 1 of 17 cases is slower, at roughly 2× PyTorch's time. |

@paddle-bot-old commented:

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@CLAassistant commented May 27, 2021:

CLA assistant check — all committers have signed the CLA.

@AnnaTrainingG changed the title from "Reduce max min prod all any" to "Reduce max min prod" on May 28, 2021
@AnnaTrainingG changed the title from "Reduce max min prod" to "modified reduce_max reduce_min reduce_prod for higher_performance and fix a bug in reduce_op.cuh" on May 28, 2021
Resolved (outdated) review threads:
- paddle/fluid/operators/reduce_ops/reduce_functor_op.h
- paddle/fluid/operators/reduce_ops/reduce_max_op.cu (3 threads)
- paddle/fluid/operators/reduce_ops/reduce_op.cuh (2 threads)
@xingfeng01 (Contributor):

LGTM
1 similar comment
@ZzSean (Contributor) commented Jun 21, 2021:

LGTM

@Xreki (Contributor) left a comment:

LGTM

```cpp
  }
}

// module function designed for global function
```
@Xreki (Contributor) commented Jun 22, 2021:
The templates could be simplified a bit more; some parameters do not need to be passed as template arguments, e.g. ReduceType. The if-check on ReduceType executes only once and is not inside a loop, so passing it as a regular function argument would not noticeably affect performance. Cutting down on templates should also shorten compile time somewhat.

As for TransformOp: it seems that LaunchReduceKernel and LaunchKernel do not need TransformOp as a template parameter? ReduceKernelFunction does appear to need it. In addition, the names LaunchReduceKernel and LaunchKernel lack distinctiveness and do not accurately convey what each function does.
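A minimal sketch of the first suggestion (hypothetical names, not the actual reduce_op.cuh code): ReduceType becomes a plain kernel argument instead of a template parameter. Each thread evaluates the branch once, outside the reduction loop, so performance is essentially unchanged while the number of template instantiations (and compile time) shrinks:

```cuda
// Hypothetical sketch, not PaddlePaddle code: ReduceType is passed at
// runtime rather than as a template parameter. The branch executes once
// per thread, outside the reduction loop.
enum ReduceType { kReduceLastDim, kReduceHigherDim };

template <typename T, typename ReduceOp>
__global__ void ReduceKernelSketch(const T* x, T* y, int rows, int cols,
                                   ReduceOp reducer, ReduceType type) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (type == kReduceLastDim) {
    // One thread per row: reduce along the innermost dimension.
    if (idx < rows) {
      T acc = x[idx * cols];
      for (int j = 1; j < cols; ++j) acc = reducer(acc, x[idx * cols + j]);
      y[idx] = acc;
    }
  } else {
    // One thread per column: reduce along the leading dimension.
    if (idx < cols) {
      T acc = x[idx];
      for (int i = 1; i < rows; ++i) acc = reducer(acc, x[i * cols + idx]);
      y[idx] = acc;
    }
  }
}

// Example functor, analogous to the max reduction in this PR.
struct MaxFunctor {
  template <typename T>
  __device__ T operator()(const T& a, const T& b) const {
    return a > b ? a : b;
  }
};
```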

```diff
@@ -141,21 +174,24 @@ struct ReduceConfig {
  void Run() {
```

Comment on L170: it is suggested that the input parameters all be changed to the const std::vector& type (avoiding a copy on every call).
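A sketch of that signature change (hypothetical, simplified struct; the real Run() computes the full reduce configuration):

```cpp
#include <vector>

// Sketch: taking the shape/axis vectors by const reference means Run()
// no longer copies them on every call; it only reads them.
struct ReduceConfigSketch {
  int reduce_num = 1;
  void Run(const std::vector<int>& x_dim, const std::vector<int>& reduce_dim) {
    reduce_num = 1;
    for (int axis : reduce_dim) reduce_num *= x_dim[axis];
  }
};
```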

```diff
@@ -523,22 +606,22 @@ static void launchKernel(const Tx* x_data, Ty* y_data,
  ReduceKernelFunction<
      Ty, Ty, ReduceOp, detail::IdentityFunctor<Ty>, 128, kRank, kReduceRank,
      ReduceType::kReduceHigherDim><<<grid, block, 0, stream>>>(
```

Comment on L597 - L599: could this just be written as CUB_REDUCE_TYPE_CASE(ReduceType::kReduceLastDim)? And if ReduceType is no longer a template parameter, this switch-case would not be needed at all.
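A self-contained sketch of what that macro could look like (hypothetical names throughout; LaunchKernelForType stands in for the actual kernel-launch code):

```cpp
#include <cstdio>

enum class ReduceType { kReduceLastDim, kReduceHigherDim };

// Hypothetical stand-in for the real templated kernel-launch helper.
template <ReduceType kType>
void LaunchKernelForType() { std::printf("launch %d\n", static_cast<int>(kType)); }

// Hypothetical macro: generates one switch case per ReduceType, replacing
// near-identical hand-written case blocks.
#define CUB_REDUCE_TYPE_CASE(reduce_type) \
  case (reduce_type):                     \
    LaunchKernelForType<(reduce_type)>(); \
    break;

void Dispatch(ReduceType type) {
  switch (type) {
    CUB_REDUCE_TYPE_CASE(ReduceType::kReduceLastDim)
    CUB_REDUCE_TYPE_CASE(ReduceType::kReduceHigherDim)
  }
}
```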

```diff
  // SetOutputData for ReduceHigherDim when should_reduce_again is true,
  // temp_output should be stored temp_data in output_data space or stored in
  // y_data;
- config.SetOutputData(y_data, x.place(), tmp);
+ framework::Tensor tmp;
+ config.SetOutputData(y_data, x.place(), &tmp);

  if (config.reduce_num == 1) {
    auto out_dims = y->dims();
    framework::TensorCopy(x, y->place(), y);
    y->Resize(out_dims);
    return;
  }
```

Could L684 - L689 be moved up, before L674 or L677?

```cpp
  }
};

template <typename T, template <typename, typename> class ReduceOp>
```

The implementations above may all need to be reused by other operators (e.g. the backward pass of broadcast), but ReduceCudaKernel is only used to implement the reduce_xxx operators, so it would be better not to put L749 - L771 in this header file.

```cpp
        int, ops::ProdFunctor>,
    ops::ReduceKernel<paddle::platform::CUDADeviceContext,
                      int64_t, ops::ProdFunctor>);
REGISTER_OP_CUDA_KERNEL(
```

Judging from the comment, this ifdef was originally added because reduce used to be implemented with Eigen, and Eigen's support for double was problematic. Now that everything has been switched to the cuda+cub implementation, perhaps the ifdef can be removed.
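If so, the registration would presumably just list the double kernel unconditionally; a sketch extrapolated from the quoted registration above (not the actual diff, and the op name is assumed from ProdFunctor):

```cpp
// Sketch: register the double kernel without the ifdef, since the
// Eigen-based path (whose double support was problematic) is gone.
REGISTER_OP_CUDA_KERNEL(
    reduce_prod, ops::ReduceKernel<paddle::platform::CUDADeviceContext,
                                   float, ops::ProdFunctor>,
    ops::ReduceKernel<paddle::platform::CUDADeviceContext, double,
                      ops::ProdFunctor>,
    ops::ReduceKernel<paddle::platform::CUDADeviceContext, int,
                      ops::ProdFunctor>,
    ops::ReduceKernel<paddle::platform::CUDADeviceContext, int64_t,
                      ops::ProdFunctor>);
```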

@Xreki merged commit 480b284 into PaddlePaddle:develop on Jun 22, 2021.