
[PaddlePaddle Hackathon 4 No.17] Add cummax / cummin APIs to Paddle #401

Merged: 7 commits, Mar 29, 2023
Changes from 2 commits
64 changes: 56 additions & 8 deletions rfcs/APIs/20220316_api_design_for_cummax.md
@@ -14,7 +14,7 @@

cummax refers to computing the cumulative maximum, i.e.
$$
y_i = \max(x_1, x_2, x_3, \cdots , x_i)
$$
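
As a quick illustration of the definition above, a minimal Python sketch using only the standard library (values are illustrative):

```python
from itertools import accumulate

x = [2, 5, 3, 7, 1]
# y[i] = max(x[0], ..., x[i])
y = list(accumulate(x, max))
print(y)  # [2, 5, 5, 7, 7]
```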

PyTorch, NumPy, and Pandas provide similar operators.
@@ -50,11 +50,11 @@ Keyword Arguments
```
That is, the inputs are a Tensor and a specified dimension, and the result is a pair of outputs: the cumulative maximum values and their indices.
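
A minimal usage sketch of `torch.cummax` (the tensor values shown are illustrative):

```python
import torch

x = torch.tensor([1.0, 3.0, 2.0, 5.0, 4.0])
values, indices = torch.cummax(x, dim=0)
# values:  tensor([1., 3., 3., 5., 5.])
# indices: tensor([0, 1, 1, 3, 3])
```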
Contributor commented: This description is also a bit confusing; please revise it.

Contributor Author replied: Revised.


A related issue, [Cumulative Maximum · Issue #20240 · pytorch/pytorch (github.com)](https://github.com/pytorch/pytorch/issues/20240), mentions that `logcumsumexp` depends on the `cummax` functionality.

### Implementation Approach

In terms of implementation, PyTorch's CPU version loops over the tensor and assigns values ([CPU](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/ReduceOps.cpp#L769)), while the CUDA version calls PyTorch's own scan_with_indices function ([GPU](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/ScanKernels.cpp#L45)).
The core code is as follows.
CPU:

@@ -140,6 +140,24 @@ void launch_cummax_cuda_kernel(const TensorBase& self, const TensorBase& values,
}
```


~~~cpp
template<typename scalar_t, typename BinaryFunction>
void scan_dim_with_indices(const TensorBase& self, const TensorBase& values, const TensorBase& indices,
     int64_t dim, scalar_t init, BinaryFunction binary_op) {
  int ndim = self.dim();
  auto self_ = self.expect_contiguous();
  TORCH_INTERNAL_ASSERT(values.is_contiguous() && indices.is_contiguous());
  if (dim == ndim - 1) {
    scan_innermost_dim_with_indices<scalar_t>(*self_, values, indices, init, binary_op);
  } else {
    scan_outer_dim_with_indices<scalar_t>(*self_, values, indices, dim, init, binary_op);
  }
}
~~~

The code for `scan_innermost_dim_with_indices` and `scan_outer_dim_with_indices` is fairly long; their job is to perform the cumulative operation in parallel along different dimensions, and the parallel-scan part of the implementation is worth referring to.
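
For intuition, a sequential NumPy sketch of what such a scan-with-indices computes along a given axis (the real kernels parallelize this loop; the function name and tie-breaking rule here are illustrative, not PyTorch's exact behavior):

```python
import numpy as np

def cummax_with_indices(x, axis):
    # Reference (sequential) cumulative max that also tracks argmax indices.
    x = np.asarray(x)
    values = np.empty_like(x)
    indices = np.empty(x.shape, dtype=np.int64)
    # Move the scanned axis to the front; these are views into the outputs.
    xt = np.moveaxis(x, axis, 0)
    vt = np.moveaxis(values, axis, 0)
    it = np.moveaxis(indices, axis, 0)
    vt[0], it[0] = xt[0], 0
    for i in range(1, xt.shape[0]):
        take_new = xt[i] >= vt[i - 1]          # ties keep the newer index here
        vt[i] = np.where(take_new, xt[i], vt[i - 1])
        it[i] = np.where(take_new, i, it[i - 1])
    return values, indices

vals, idxs = cummax_with_indices([[1, 4, 2], [3, 0, 5]], axis=1)
# vals: [[1 4 4], [3 3 5]]   idxs: [[0 1 1], [0 0 2]]
```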

## NumPy

NumPy's API with similar functionality is `numpy.maximum.accumulate()`; see the documentation at [numpy.ufunc.accumulate — NumPy v1.22 Manual](https://numpy.org/doc/stable/reference/generated/numpy.ufunc.accumulate.html).
@@ -148,8 +166,7 @@ NumPy's strategy is to provide a more compatible implementation by composing
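
A small sketch of the composed NumPy route (note that, unlike `torch.cummax`, it does not return indices):

```python
import numpy as np

x = np.array([[1, 4, 2],
              [3, 0, 5]])
# ufunc.accumulate applies the binary ufunc cumulatively along an axis.
print(np.maximum.accumulate(x, axis=1))
# [[1 4 4]
#  [3 3 5]]
```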

## Pandas

Pandas also provides this API: [pandas.DataFrame.cummax](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.cummax.html#pandas-dataframe-cummax).

It is described as:

@@ -219,7 +236,6 @@ Return cumulative maximum of Series or DataFrame.
else:
seen_na[lab, j] = 1
out[i, j] = val

```
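
A brief usage sketch of the Pandas API (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [2.0, 1.0, 3.0], "b": [5.0, 7.0, 4.0]})
print(df.cummax())      # cumulative max down each column (axis=0 by default)
#      a    b
# 0  2.0  5.0
# 1  2.0  7.0
# 2  3.0  7.0
```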

# 4. Comparative Analysis
@@ -241,7 +257,39 @@ The API is designed as `paddle.cummax(x, axis, dtype, name)` and `paddle.Tensor.cummax(ax

## Low-level OP Design

CPU:
Forward: reuse the ScanKernel implemented for the cumsum operator as the core, adding a few details on top.
Backward: define a CumminGradKernel function that computes the subgradient of cummin, or follow the subgradient approach used for other non-differentiable functions (a NumPy sketch of this scatter-style gradient follows below).

GPU:
Forward: reuse the BlockScanKernel implemented for the cumsum operator as the core, adding a few details on top.
Backward: define a CumminGradKernel function that computes the subgradient of cummin, or follow the subgradient approach used for other non-differentiable functions.
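
As a sketch of the subgradient computation mentioned above (in NumPy rather than a Paddle kernel; the function and variable names are illustrative): each output position routes its gradient back to the input element that produced the running extremum, using the indices saved by the forward pass.

```python
import numpy as np

def cummax_grad(indices, out_grad, axis, x_shape):
    # Scatter each output gradient back to the argmax position recorded in
    # the forward pass; input positions never selected get zero gradient.
    x_grad = np.zeros(x_shape, dtype=out_grad.dtype)
    idx = list(np.indices(out_grad.shape))
    idx[axis] = indices          # redirect the scanned axis to the argmax index
    np.add.at(x_grad, tuple(idx), out_grad)
    return x_grad
```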

Forward kernel signature:

~~~cpp
KernelSignature CummaxOpArgumentMapping(
    const ArgumentMappingContext& ctx) {
  return KernelSignature("cummax",
                         {"X"},
                         {"axis", "flatten", "exclusive", "reverse"},
                         {"Out"});
}
PD_REGISTER_ARG_MAPPING_FN(cummax, phi::CummaxOpArgumentMapping);
~~~

Backward kernel signature:

~~~cpp
KernelSignature CummaxGradOpArgumentMapping(
    const ArgumentMappingContext& ctx) {
  return KernelSignature("cummax_grad",
                         {"X", "Out", "Out@GRAD"},
                         {"axis", "flatten", "exclusive", "reverse"},
                         {"X@GRAD"});
}
PD_REGISTER_ARG_MAPPING_FN(cummax_grad, phi::CummaxGradOpArgumentMapping);
~~~

## API Implementation Plan

@@ -277,4 +325,4 @@ The Python interface is implemented in `paddle/tensor/math.py`.

# 8. Impact

This is an independent new API and has no impact on other modules.