
【PTen】Add dot and matmul grad kernel in pten #38713

Merged

26 commits merged into PaddlePaddle:develop on Jan 11, 2022

Conversation

Contributor

@zyfncg commented Jan 5, 2022

PR types

Others

PR changes

Others

Describe

Migrate the first-, second-, and third-order backward kernels of dot and matmul into pten.

To adapt the PTen backward kernels to the framework, this PR also includes the following changes:

  1. In the original Op system, backward Ops carry no OpProto information and are handled differently from forward Ops. This PR adjusts the corresponding handling logic and configures GetExpectedPtenKernelArgs for the Op of every migrated backward kernel; this solution may be replaced later.
  2. Some DenseTensor inputs of the backward kernels can be empty, and the kernels branch on this condition internally. To express the check, such possibly-empty inputs are wrapped in paddle::optional<const DenseTensor&>, and support for this input type was added to pten (a toy illustration of items 2 and 5 follows this list).
  3. Added support for kernel output DenseTensors that may be NULL.
  4. Added complex-number conversion logic for calling PTen backward kernels in dygraph execution.
  5. Added a move assignment operator to DenseTensor: DenseTensor& operator=(DenseTensor&& other).
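
For concreteness, here is a minimal self-contained sketch of items 2 and 5 with toy types (not the real pten implementation); a plain pointer stands in for paddle::optional<const DenseTensor&> so the snippet compiles on its own:

```cpp
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Toy stand-in for pten::DenseTensor with the move assignment described in item 5.
struct DenseTensor {
  std::vector<float> data;

  DenseTensor() = default;
  DenseTensor(std::vector<float> d) : data(std::move(d)) {}

  // DenseTensor& operator=(DenseTensor&& other): steal the buffer instead of copying it.
  DenseTensor& operator=(DenseTensor&& other) noexcept {
    data = std::move(other.data);
    return *this;
  }
};

// Toy backward kernel: `dout` plays the role of an input that the migrated
// kernels wrap in paddle::optional<const DenseTensor&> (item 2); a raw pointer
// is used here only to keep the sketch self-contained.
void ToyGradKernel(const DenseTensor& y, const DenseTensor* dout, DenseTensor* dy) {
  if (dout != nullptr) {  // optional input present: compute a gradient
    std::vector<float> out(y.data.size());
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = y.data[i] * dout->data[i];
    *dy = DenseTensor(std::move(out));  // exercises the new move assignment
  } else {                // optional input absent: skip this gradient
    *dy = DenseTensor();
  }
}

int main() {
  DenseTensor y({1.f, 2.f, 3.f});
  DenseTensor dout({0.5f, 0.5f, 0.5f});
  DenseTensor dy;
  ToyGradKernel(y, &dout, &dy);
  for (float v : dy.data) std::cout << v << " ";  // prints: 0.5 1 1.5
  std::cout << "\n";
  return 0;
}
```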

@paddle-bot-old bot commented Jan 5, 2022

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@@ -560,17 +586,19 @@ static void PreparedOpRunPtImpl(
pt_kernel_context->ClearData();

// TODO(chenweihang): add debug flags later
// TODO(chenweihang): deal with complex cases later
if (framework::IsComplexType(kernel_type.data_type_)) {
Contributor

Could the pten kernel's data type be used here?

Contributor Author

Since neither the KernelSignature nor the Kernel data structure passed in carries data_type information, the data from kernel_type has to be used here.

Comment on lines +1893 to +1898
// Reuse the existing slot at start_idx if the kernel context already has one;
// otherwise append a new (null) output slot for this variable.
if (current_vector_size > start_idx) {
  pt_kernel_context_->SetOutputWithoutSetRange(start_idx, {nullptr});
} else {
  pt_kernel_context_->EmplaceBackOutputWithoutSetRange({nullptr});
}
end_idx = start_idx + 1;
Contributor

Please add some comments here.

Contributor Author

Done

Comment on lines 360 to 365
} else {
kernel_ctx->SetOutputWithoutSetRange(
start_idx + offset,
experimental::MakePtenTensorBaseFromVar(
outs_vector[offset]->MutableVar(), out_def));
}
Contributor

Is this branch actually reached?

Contributor Author

Yes, it is reached in dygraph mode.

Comment on lines 374 to 381
} else {
if (current_vector_size > start_idx) {
kernel_ctx->SetOutputWithoutSetRange(start_idx, {nullptr});
} else {
kernel_ctx->EmplaceBackOutputWithoutSetRange(
experimental::MakePtenTensorBaseFromVar(
outs_vector[offset]->MutableVar(), out_def));
kernel_ctx->EmplaceBackOutputWithoutSetRange({nullptr});
}
kernel_ctx->AssignOutputRange(std::make_pair(start_idx, start_idx + 1),
i);
Contributor

I suggest moving this logic to the top: check iter == outs.end(), handle it, and continue directly. That flattens the if/else nesting and makes the code easier to maintain and understand.

Contributor Author

Done
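
For reference, a standalone sketch of the restructuring suggested above, using toy containers instead of the real outs / kernel_ctx API: the missing-output case is handled first and the loop continues immediately, so the normal path needs no else nesting.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
  // Toy stand-ins for the op outputs and the kernel context's output slots.
  std::map<std::string, std::vector<int>> outs = {{"X@GRAD", {1, 2, 3}}};
  std::vector<std::string> output_names = {"X@GRAD", "Y@GRAD"};

  std::vector<const std::vector<int>*> kernel_outputs;
  for (const auto& name : output_names) {
    auto iter = outs.find(name);
    if (iter == outs.end()) {
      // Output not produced: record a null slot and continue, instead of
      // nesting the normal path inside an else branch.
      kernel_outputs.push_back(nullptr);
      continue;
    }
    kernel_outputs.push_back(&iter->second);
  }

  for (const auto* out : kernel_outputs) {
    std::cout << (out ? "tensor" : "null") << "\n";  // prints: tensor, null
  }
  return 0;
}
```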

paddle::platform::complex<float>,
paddle::platform::complex<double>) {}

PT_REGISTER_CTX_KERNEL(matmul_grad_grad,
Contributor

I suggest naming this consistently with the function: matmul_double_grad, and the same for alias_name.

Contributor Author

Done

Contributor

@chenwhql left a comment

LGTM

@zyfncg merged commit be81771 into PaddlePaddle:develop on Jan 11, 2022
Contributor

@Xreki commented Jan 13, 2022

I suspect this PR roughly doubled the backward time of linear:

  1. OP Benchmark data from Jan 6:
     [image]

The nvprof result for linear_2 is as follows:

run command: nvprof --profile-from-start off /work/.virtualenvs_cuda11.4/paddle_py38/bin/python /work/benchmark/api/dynamic_tests_v2/linear.py --api_name linear --task speed --framework paddle --testing_mode dynamic --json_file /work/benchmark/api/tests_v2/configs/linear.json --config_id 2 --backward True --use_gpu True --repeat 1000 --allow_adaptive_repeat True --profiler nvprof
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   36.07%  199.88ms      2000  99.938us  93.696us  136.29us  volta_sgemm_64x32_sliced1x4_tn
                   30.52%  169.10ms      2000  84.548us  81.408us  92.960us  volta_sgemm_64x32_sliced1x4_nn
                   27.48%  152.30ms      2000  76.148us  71.040us  86.752us  volta_sgemm_128x32_nt
                    1.96%  10.845ms      2000  5.4220us  5.1520us  10.528us  void splitKreduce_kernel<float, float, float, float, bool=1, bool=0>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *, float const *, void*, long, float*, int*)
                    1.41%  7.8399ms      2000  3.9190us  3.7430us  10.912us  void pten::ElementwiseBroadcastKernel<float, float, pten::funcs::AddFunctor<float>, int=2, int=1, int=4, int=2>(paddle::framework::Array<float const * restrict , pten::funcs::AddFunctor<float>>, paddle::framework<float*, int=2>, paddle::framework<bool, pten::funcs::AddFunctor<float>>, unsigned int, paddle::framework<pten::ElementwiseBroadcastKernel<float, float, pten::funcs::AddFunctor<float>::operators::kernel_primitives::details::BroadcastConfig<int=4>, int=2, int=1, int=4, int=2>, pten::funcs::AddFunctor<float>>, int, int, float)
                    1.34%  7.4067ms      2000  3.7030us  3.5510us  9.0560us  void pten::kernels::ReduceHigherDimKernel<float, float, float, paddle::operators::kernel_primitives::AddFunctor<float>, paddle::operators::kernel_primitives::IdentityFunctor<float, float>>(float const *, float*, float, paddle::operators::kernel_primitives::AddFunctor<float>, float, int, int, int, paddle::operators::kernel_primitives::DimConfig)
                    1.22%  6.7504ms      2000  3.3750us  3.2000us  9.4730us  [CUDA memcpy DtoD]

total gpu_time: 554.1447 ms
  2. OP Benchmark data from Jan 12:
     [image]

The nvprof result for linear_2 is as follows:

run command: nvprof --profile-from-start off /work/.virtualenvs_cuda11.4/paddle_py38/bin/python /work/benchmark/api/dynamic_tests_v2/linear.py --api_name linear --task speed --framework paddle --testing_mode dynamic --json_file /work/benchmark/api/tests_v2/configs/linear.json --config_id 2 --backward True --use_gpu True --repeat 1000 --allow_adaptive_repeat True --profiler nvprof
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   33.13%  275.88ms      4000  68.968us  5.6000us  136.93us  void paddle::platform::ForRangeElemwiseOp<paddle::operators::math::ConjFunctor<float, void>>(float, unsigned long)
                   24.55%  204.44ms      2000  102.22us  96.480us  125.73us  volta_sgemm_64x32_sliced1x4_tn
                   20.20%  168.18ms      2000  84.089us  80.863us  93.632us  volta_sgemm_64x32_sliced1x4_nn
                   18.17%  151.31ms      2000  75.654us  70.720us  81.568us  volta_sgemm_128x32_nt
                    1.31%  10.906ms      2000  5.4530us  5.1510us  11.425us  void splitKreduce_kernel<float, float, float, float, bool=1, bool=0>(cublasSplitKParams<float>, float const *, float const *, float*, float const *, float const *, float const *, void*, long, float*, int*)
                    0.94%  7.8099ms      2000  3.9040us  3.7110us  8.8640us  void pten::ElementwiseBroadcastKernel<float, float, pten::funcs::AddFunctor<float>, int=2, int=1, int=4, int=2>(paddle::framework::Array<float const * restrict , pten::funcs::AddFunctor<float>>, paddle::framework<float*, int=2>, paddle::framework<int, pten::funcs::AddFunctor<float>>, unsigned int, paddle::framework<pten::ElementwiseBroadcastKernel<float, float, pten::funcs::AddFunctor<float>::operators::kernel_primitives::details::BroadcastConfig<int=4>, int=2, int=1, int=4, int=2>, pten::funcs::AddFunctor<float>>, int, int, float)
                    0.89%  7.3781ms      2000  3.6890us  3.5190us  9.3440us  void pten::kernels::ReduceHigherDimKernel<float, float, float, paddle::operators::kernel_primitives::AddFunctor<float>, paddle::operators::kernel_primitives::IdentityFunctor<float, float>>(float const *, float*, float, paddle::operators::kernel_primitives::AddFunctor<float>, float, int, int, int, paddle::operators::kernel_primitives::DimConfig)
                    0.82%  6.8048ms      2000  3.4020us  3.2000us  8.9600us  [CUDA memcpy DtoD]

total gpu_time: 832.7196 ms

The new linear backward pass has an extra call to paddle::platform::ForRangeElemwiseOp<paddle::operators::math::ConjFunctor<float, void>>(float, unsigned long), but none of the linear configs use complex numbers. Please check the matmul computation logic.
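
To illustrate the expected behavior (a toy sketch under the assumption that the conjugate step can be skipped for real dtypes; not the actual Paddle fix): with a compile-time check, the conjugate pass is a no-op for float/double, so no extra kernel launch shows up for real-valued matmul grads.

```cpp
#include <complex>
#include <iostream>
#include <type_traits>
#include <vector>

// Skip the conjugate work entirely for real element types (requires C++17).
template <typename T>
void ConjIfComplex(std::vector<T>& data) {
  if constexpr (std::is_same_v<T, std::complex<float>> ||
                std::is_same_v<T, std::complex<double>>) {
    for (auto& v : data) v = std::conj(v);  // only complex types pay this cost
  }
  // real types: nothing to do, no extra loop/kernel is run
}

int main() {
  std::vector<float> real = {1.f, -2.f};
  std::vector<std::complex<float>> cplx = {{1.f, 2.f}};
  ConjIfComplex(real);  // no-op for float
  ConjIfComplex(cplx);  // conjugates in place
  std::cout << real[1] << " " << cplx[0].imag() << "\n";  // prints: -2 -2
  return 0;
}
```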

Contributor Author

@zyfncg commented Jan 13, 2022

Got it, I'll look into it.

@zyfncg deleted the pten_matmul_grad branch on January 13, 2022 06:46