-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
modified reduce_max reduce_min reduce_prod for higher_performance and fix a bug in reduce_op.cuh #32974
modified reduce_max reduce_min reduce_prod for higher_performance and fix a bug in reduce_op.cuh #32974
Conversation
Thanks for your contribution! |
…iling123/Paddle into reduce_max_min_prod_all_any
…iling123/Paddle into reduce_max_min_prod_all_any
3069b09
to
2e8ad8f
Compare
4ce20bb
to
4bd9644
Compare
d2dea81
to
469e0a5
Compare
469e0a5
to
8700894
Compare
LGTM |
1 similar comment
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
} | ||
} | ||
|
||
// module function designed for global function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
感觉模板可以再简化一下,一些参数没有必要通过模板传,比如ReduceType
。关于ReduceType
的if
判断只执行一次,并没有在循环里面,所以通过输入参数传也不会多影响性能。减少一些模板,应该能够剪短一些编译时间。
对于TransformOp
,感觉LaunchReduceKernel
和LaunchKernel
这两个函数是不需要将TransformOp
作为模板的?ReduceKernelFunction
看起来是需要的。另外,LaunchReduceKernel
和LaunchKernel
函数命名缺乏辨识度,不能准确地表达函数的功能。
@@ -141,21 +174,24 @@ struct ReduceConfig { | |||
void Run() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L170的comment:输入参数都建议改成const std::vector &
类型。
@@ -523,22 +606,22 @@ static void launchKernel(const Tx* x_data, Ty* y_data, | |||
ReduceKernelFunction< | |||
Ty, Ty, ReduceOp, detail::IdentityFunctor<Ty>, 128, kRank, kReduceRank, | |||
ReduceType::kReduceHigherDim><<<grid, block, 0, stream>>>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L597 - L599 comment:直接写成CUB_REDUCE_TYPE_CASE(ReduceType::kReduceLastDim)
这样?若ReduceType
不作为模板,也就不需要这个swith case
了。
// SetOutputData for ReduceHigherDim when should_reduce_again is true, | ||
// temp_output should be stored temp_data in output_data space or stored in | ||
// y_data; | ||
config.SetOutputData(y_data, x.place(), tmp); | ||
framework::Tensor tmp; | ||
config.SetOutputData(y_data, x.place(), &tmp); | ||
|
||
if (config.reduce_num == 1) { | ||
auto out_dims = y->dims(); | ||
framework::TensorCopy(x, y->place(), y); | ||
y->Resize(out_dims); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
L684 - L689可以挪到L674或L677前面?
} | ||
}; | ||
|
||
template <typename T, template <typename, typename> class ReduceOp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
上面的实现都可能需要复用到别的算子里面(比如broadcast反向),但ReduceCudaKernel
只用于reduce_xxx算子的实现,所以L749 - L771最好不要放到这个头文件里面。
int, ops::ProdFunctor>, | ||
ops::ReduceKernel<paddle::platform::CUDADeviceContext, | ||
int64_t, ops::ProdFunctor>); | ||
REGISTER_OP_CUDA_KERNEL( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
从注释来看,原来之所以加这个ifdef
是因为原来的reduce采用Eigen
实现,而Eigen
对double
的支持有问题。我们已经全部改成了cuda+cub
的方式,或许这个ifdef
可以去掉。
PR types
Function optimization
PR changes
OPs
Describe
modified reduce_min reduce_max reduce_prod reduce_all reduce_any
ctest结果:
Test project /paddle_test/commit/Paddle/build
Start 709: test_max_op
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 7.51 sec
Test project /paddle_test/commit/Paddle/build
Start 719: test_min_op
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 6.81 sec
Test project /paddle_test/commit/Paddle/build
Start 826: test_prod_op
100% tests passed, 0 tests failed out of 1
Total Test time (real) = 7.78 sec
以max 为例进行性能比对:
benchmark性能数据如下: