
[PaddlePaddle Hackathon 4 No.17] Add cummax / cummin APIs to Paddle #401

Merged: 7 commits, Mar 29, 2023
Changes from 2 commits
64 changes: 56 additions & 8 deletions rfcs/APIs/20220316_api_design_for_cummax.md
@@ -14,7 +14,7 @@

cummax refers to computing the cumulative maximum, i.e.
$$
y_i = \max(x_1, x_2, x_3, \cdots , x_i)
$$
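
As a quick illustration of the definition above, a minimal Python sketch using only the standard library (values are illustrative):

```python
from itertools import accumulate

x = [2, 5, 3, 7, 1]
# y[i] = max(x[0], ..., x[i])
y = list(accumulate(x, max))
print(y)  # [2, 5, 5, 7, 7]
```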

PyTorch, NumPy, and Pandas provide similar operators.
@@ -50,11 +50,11 @@ Keyword Arguments
```
That is, the inputs are a Tensor and a specified dimension, and the result is a pair of outputs: the cumulative maximum values and their indices.
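
A minimal usage sketch of `torch.cummax` (the tensor values shown are illustrative):

```python
import torch

x = torch.tensor([1.0, 3.0, 2.0, 5.0, 4.0])
values, indices = torch.cummax(x, dim=0)
# values:  tensor([1., 3., 3., 5., 5.])
# indices: tensor([0, 1, 1, 3, 3])
```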
Contributor commented: This description is also a bit confusing; please revise it.

Contributor Author replied: Revised.


A related issue, [Cumulative Maximum · Issue #20240 · pytorch/pytorch (github.com)](https://github.com/pytorch/pytorch/issues/20240), mentions that `logcumsumexp` depends on the `cummax` functionality.

### Implementation Approach

In terms of implementation, PyTorch's CPU version loops over the tensor and assigns values ([CPU](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/ReduceOps.cpp#L769)), while the CUDA version calls PyTorch's own scan_with_indices function ([GPU](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/ScanKernels.cpp#L45)).
The core code is as follows.
CPU:

@@ -140,6 +140,24 @@ void launch_cummax_cuda_kernel(const TensorBase& self, const TensorBase& values,
}
```


~~~cpp
template<typename scalar_t, typename BinaryFunction>
void scan_dim_with_indices(const TensorBase& self, const TensorBase& values, const TensorBase& indices,
     int64_t dim, scalar_t init, BinaryFunction binary_op) {
  int ndim = self.dim();
  auto self_ = self.expect_contiguous();
  TORCH_INTERNAL_ASSERT(values.is_contiguous() && indices.is_contiguous());
  if (dim == ndim - 1) {
    scan_innermost_dim_with_indices<scalar_t>(*self_, values, indices, init, binary_op);
  } else {
    scan_outer_dim_with_indices<scalar_t>(*self_, values, indices, dim, init, binary_op);
  }
}
~~~

The code for `scan_innermost_dim_with_indices` and `scan_outer_dim_with_indices` is fairly long; their job is to perform the cumulative operation in parallel along different dimensions, and the parallel-scan part of the implementation is worth referring to.
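
For intuition, a sequential NumPy sketch of what such a scan-with-indices computes along a given axis (the real kernels parallelize this loop; the function name and tie-breaking rule here are illustrative, not PyTorch's exact behavior):

```python
import numpy as np

def cummax_with_indices(x, axis):
    # Reference (sequential) cumulative max that also tracks argmax indices.
    x = np.asarray(x)
    values = np.empty_like(x)
    indices = np.empty(x.shape, dtype=np.int64)
    # Move the scanned axis to the front; these are views into the outputs.
    xt = np.moveaxis(x, axis, 0)
    vt = np.moveaxis(values, axis, 0)
    it = np.moveaxis(indices, axis, 0)
    vt[0], it[0] = xt[0], 0
    for i in range(1, xt.shape[0]):
        take_new = xt[i] >= vt[i - 1]          # ties keep the newer index here
        vt[i] = np.where(take_new, xt[i], vt[i - 1])
        it[i] = np.where(take_new, i, it[i - 1])
    return values, indices

vals, idxs = cummax_with_indices([[1, 4, 2], [3, 0, 5]], axis=1)
# vals: [[1 4 4], [3 3 5]]   idxs: [[0 1 1], [0 0 2]]
```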

## NumPy

NumPy's API with similar functionality is `numpy.maximum.accumulate()`; see the documentation at [numpy.ufunc.accumulate — NumPy v1.22 Manual](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.accumulate.html).
@@ -148,8 +166,7 @@ NumPy's strategy is to provide a more compatible implementation by composing
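
A small sketch of the composed NumPy route (note that, unlike `torch.cummax`, it does not return indices):

```python
import numpy as np

x = np.array([[1, 4, 2],
              [3, 0, 5]])
# ufunc.accumulate applies the binary ufunc cumulatively along an axis.
print(np.maximum.accumulate(x, axis=1))
# [[1 4 4]
#  [3 3 5]]
```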

## Pandas

Pandas also provides this API: [pandas.DataFrame.cummax](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cummax.html#pandas-dataframe-cummax).

It is described as:

@@ -219,7 +236,6 @@ Return cumulative maximum of Series or DataFrame.
else:
seen_na[lab, j] = 1
out[i, j] = val

```
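
A brief usage sketch of the Pandas API (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [2.0, 1.0, 3.0], "b": [5.0, 7.0, 4.0]})
print(df.cummax())      # cumulative max down each column (axis=0 by default)
#      a    b
# 0  2.0  5.0
# 1  2.0  7.0
# 2  3.0  7.0
```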

# 4. Comparative Analysis
@@ -241,7 +257,39 @@ The API is designed as `paddle.cummax(x, axis, dtype, name)` and `paddle.Tensor.cummax(ax

## Low-level OP Design

CPU:
Forward: reuse the ScanKernel implemented for the cumsum operator as the core, adding a few details on top.
Backward: define a CumminGradKernel function that computes the subgradient of cummin, or follow the subgradient approach used for other non-differentiable functions (a NumPy sketch of this scatter-style gradient follows below).

GPU:
Forward: reuse the BlockScanKernel implemented for the cumsum operator as the core, adding a few details on top.
Backward: define a CumminGradKernel function that computes the subgradient of cummin, or follow the subgradient approach used for other non-differentiable functions.
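
As a sketch of the subgradient computation mentioned above (in NumPy rather than a Paddle kernel; the function and variable names are illustrative): each output position routes its gradient back to the input element that produced the running extremum, using the indices saved by the forward pass.

```python
import numpy as np

def cummax_grad(indices, out_grad, axis, x_shape):
    # Scatter each output gradient back to the argmax position recorded in
    # the forward pass; input positions never selected get zero gradient.
    x_grad = np.zeros(x_shape, dtype=out_grad.dtype)
    idx = list(np.indices(out_grad.shape))
    idx[axis] = indices          # redirect the scanned axis to the argmax index
    np.add.at(x_grad, tuple(idx), out_grad)
    return x_grad
```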

Forward kernel signature:

~~~cpp
KernelSignature CummaxOpArgumentMapping(
    const ArgumentMappingContext& ctx) {
  return KernelSignature("cummax",
                         {"X"},
                         {"axis", "flatten", "exclusive", "reverse"},
                         {"Out"});
}
PD_REGISTER_ARG_MAPPING_FN(cummax, phi::CummaxOpArgumentMapping);
~~~

Backward kernel signature:

~~~cpp
KernelSignature CummaxGradOpArgumentMapping(
    const ArgumentMappingContext& ctx) {
  return KernelSignature("cummax_grad",
                         {"X", "Out", "Out@GRAD"},
                         {"axis", "flatten", "exclusive", "reverse"},
                         {"X@GRAD"});
}
PD_REGISTER_ARG_MAPPING_FN(cummax_grad, phi::CummaxGradOpArgumentMapping);
~~~

## API Implementation Plan

@@ -277,4 +325,4 @@ The Python interface is implemented in `paddle/tensor/math.py`.

# 8. Impact

This is an independent new API and has no impact on other modules.