
Modify the elementwise op according to the kernel primitive API #34456

Merged

Conversation

@AnnaTrainingG (Contributor) commented Jul 28, 2021

PR types

Function optimization

PR changes

APIs

Describe

Modify the elementwise op according to the kernel primitive API

1. Split ElementwiseVectorKernel in elementwise_op_impl.cu.h into 3 CUDA kernels according to the ET type, to fit the kernel primitive API (a simplified sketch follows this list).
2. Split ElementwiseBroadcastKernel in elementwise_op_broadcast.cu.h into 3 CUDA kernels according to the ET type, to fit the kernel primitive API.
3. Refactor the function-call structure in elementwise_op_broadcast.cu.h, define a BroadcastConfig struct, and simplify how BroadcastConfig is configured.
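A minimal, hypothetical CUDA sketch of the per-arity split in items 1 and 2 is shown below. It is not the PR's actual code: the kernel names, AddFunctor, and launch configuration are illustrative, and the real kernels read and write through the kernel primitive API helpers rather than indexing raw pointers directly.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// One kernel per arity (unary / binary / ternary), so each kernel has a fixed
// number of inputs and can call a fixed-arity compute primitive.
template <typename T, int VecSize, typename Functor>
__global__ void VectorizedUnaryKernel(const T *in, T *out, int n, Functor f) {
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * VecSize;
#pragma unroll
  for (int i = 0; i < VecSize; ++i) {
    if (idx + i < n) out[idx + i] = f(in[idx + i]);
  }
}

template <typename T, int VecSize, typename Functor>
__global__ void VectorizedBinaryKernel(const T *in0, const T *in1, T *out,
                                       int n, Functor f) {
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * VecSize;
#pragma unroll
  for (int i = 0; i < VecSize; ++i) {
    if (idx + i < n) out[idx + i] = f(in0[idx + i], in1[idx + i]);
  }
}

template <typename T, int VecSize, typename Functor>
__global__ void VectorizedTernaryKernel(const T *in0, const T *in1,
                                        const T *in2, T *out, int n,
                                        Functor f) {
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * VecSize;
#pragma unroll
  for (int i = 0; i < VecSize; ++i) {
    if (idx + i < n)
      out[idx + i] = f(in0[idx + i], in1[idx + i], in2[idx + i]);
  }
}

struct AddFunctor {
  __device__ float operator()(float a, float b) const { return a + b; }
};

int main() {
  constexpr int kN = 1 << 20;
  constexpr int kVecSize = 4;
  float *x, *y, *z;
  cudaMalloc(&x, kN * sizeof(float));
  cudaMalloc(&y, kN * sizeof(float));
  cudaMalloc(&z, kN * sizeof(float));
  cudaMemset(x, 0, kN * sizeof(float));
  cudaMemset(y, 0, kN * sizeof(float));

  const int threads = 256;
  const int blocks = (kN / kVecSize + threads - 1) / threads;
  // Binary case: out = x + y, each thread handling kVecSize elements.
  VectorizedBinaryKernel<float, kVecSize>
      <<<blocks, threads>>>(x, y, z, kN, AddFunctor());
  cudaDeviceSynchronize();
  printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));

  cudaFree(x);
  cudaFree(y);
  cudaFree(z);
  return 0;
}
```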

Performance: after the replacement, performance is on par with the original implementation, and some cases exceed the original performance.

same_dim add

| case | x_shape | dtype | old (us) | API (us) | pytorch (us) | speed up |
|---|---|---|---|---|---|---|
| 0 | [50L, 128L, 1000L] | fp32 | 92.16 | 92.21 | 92.361 | 1.00 |
| 1 | [50L, 128L, 1000L] | fp32 | 92.21 | 92.2 | 92.277 | 1.00 |
| 2 | [-1L, 2048L, 7L, 7L] | fp32 | 24.58 | 24.574 | 25.3 | 1.00 |
| 3 | [-1L, 2048L, -1L, -1L] | fp32 | 120.11 | 120.08 | 120.33 | 1.00 |
| 4 | [-1L, 1L, 513L, 513L] | fp32 | 61.256 | 61.376 | 62 | 1.00 |
| 5 | [512L, 896L, 4L, 12L] | fp32 | 311.73 | 311.9 | 312.4 | 1.00 |
| 6 | [512L, 896L, 4L, 12L] | fp16 | 158.29 | 158.27 | 158.76 | 1.00 |
| 8 | [32L, 1L, 1L, 128L] | fp16 | 1.317 | 1.29 | 1.275 | 1.02 |
broadcast_add

| case | x_shape | y_shape | dtype | old (us) | API (us) | speed up |
|---|---|---|---|---|---|---|
| 0 | [50L, 128L, 1000L] | [128L, 1000L] | fp32 | 64.65 | 64.382 | 1.00 |
| 1 | [50L, 128L, 1000L] | [1L, 128L, 1000L] | fp32 | 65.058 | 64.31 | 1.01 |
| 2 | [-1L, 2048L, 7L, 7L] | [-1L, 2048L] | fp32 | 18.332 | 17.637 | 1.04 |
| 3 | [-1L, 2048L, -1L, -1L] | [-1L, 2048L, -1L, -1L] | fp32 | 120.1 | 120.08 | 1.00 |
| 4 | [-1L, 1L, 513L, 513L] | [1L] | fp32 | 42.465 | 42.505 | 1.00 |
| 5 | [512L, 896L, 4L, 12L] | [512L, 896L, 4L, 1L] | fp32 | 225.92 | 223.15 | 1.01 |
| 6 | [512L, 896L, 4L, 12L] | [512L, 896L, 4L, 1L] | fp16 | 119.92 | 119.57 | 1.00 |
| 8 | [32L, 12L, 128L, 128L] | [32L, 1L, 1L, 128L] | fp16 | 34.866 | 34.505 | 1.01 |
| 9 | [32L, 1L, 1L, 128L] | [1L, 12L, 128L, 1L] | fp16 | 41.154 | 39.376 | 1.05 |

@CLAassistant commented Jul 28, 2021

CLA assistant check
All committers have signed the CLA.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@AnnaTrainingG changed the title from "Module api add block" to "Modify the elementwise op according to the kernel primitive API" Sep 2, 2021
@AnnaTrainingG (Contributor, Author) left a comment

Remove the template parameters that are not needed.

@xingfeng01
Contributor

LGTM

2 similar comments
@JamesLim-sy
Contributor

LGTM

@ZzSean
Contributor

ZzSean commented Sep 7, 2021

LGTM

@zhangting2020 zhangting2020 merged commit eae4bf5 into PaddlePaddle:develop Sep 7, 2021
namespace paddle {
namespace operators {

#define MAX_INPUT_NUM 3  // the max num of ET for BroadcastConfig

Define it inside ElementwiseType; you can define a final kMaxArity = 4.
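A small sketch of what this suggestion could look like; the existing enumerator names and values (kUnary, kBinary, kTernary) are assumed for illustration rather than quoted from the PR:

```cpp
// Fold the maximum arity into the ElementwiseType enum instead of keeping a
// separate MAX_INPUT_NUM macro. Enumerator values here are assumptions.
enum ElementwiseType { kUnary = 1, kBinary = 2, kTernary = 3, kMaxArity = 4 };

// The per-input config array could then be sized with the enum:
//   framework::Array<kps::details::BroadcastConfig<Rank>,
//                    ElementwiseType::kMaxArity> config_list;
```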

LoadVectorizedDataByDivmod(args[j], tid, j);
}
}
template <typename T, int VecSize, int ShapeSize, bool IsBoundary = false>

Suggest renaming ShapeSize -> Rank.

}
template <ElementwiseType ET, typename InT, typename OutT, int ShapeSize,
int VecSize, typename Functor, bool IsBoundary = false>
__device__ void DealSegment(

The function name is not appropriate.

broadcast_wrapper.LoadVectorizedData(args, tid);
template <typename InT, typename OutT, ElementwiseType ET, int VecSize,
int Size, typename Functor>
void LaunchKernel(const platform::CUDADeviceContext &ctx,

Use LaunchBroadcastKernel here, to keep the function name distinguishable.

framework::Tensor *out, Functor func,
DimensionsTransform merge_dims) {
int numel = out->numel();
const int threads = 256;

The number of threads used to be controlled by GetThreadsConfig, which could effectively adjust the thread configuration for some small cases.
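A hedged sketch of the kind of small-case heuristic this comment refers to; Paddle's actual GetThreadsConfig may compute this differently, and the function name and constants below are illustrative only:

```cpp
#include <cstdint>

// Illustrative heuristic: shrink the block size when there is not enough
// vectorized work to keep many fixed 256-thread blocks busy.
inline int GetThreadsConfigSketch(int64_t numel, int vec_size,
                                  int max_threads = 256) {
  int64_t vectorized_items = (numel + vec_size - 1) / vec_size;
  int threads = max_threads;
  // Halve the block size for small workloads; the target of 64 blocks' worth
  // of work is a made-up constant for illustration.
  while (threads > 64 &&
         vectorized_items < static_cast<int64_t>(threads) * 64) {
    threads /= 2;
  }
  return threads;
}
```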

OutT *out_data = out->data<OutT>();

framework::Array<kps::details::BroadcastConfig<Size>, MAX_INPUT_NUM>
configlists;

configlists -> config_list

InT args[ET][VecSize];
broadcast_wrapper.LoadVectorizedData(args, tid);
template <typename InT, typename OutT, ElementwiseType ET, int VecSize,
int Size, typename Functor>

Isn't Size here also Rank?

inline __device__ void LoadScalarizedData(InT args[], int tid) {
template <ElementwiseType ET, int VecSize, typename InT, typename OutT,
typename Functor, bool IsBoundary>
__device__ void DealSegment(

The function name also needs to be changed, and since the same-dims version of this function is nearly identical to the broadcast version, consider merging them.
