
cuBlasLt Epilogue To Fuse Linear + ReLU|GeLU #39437

Merged
38 commits merged into PaddlePaddle:develop from cublaslt_epilogue on Mar 7, 2022

Conversation

mingxu1067
Collaborator

@mingxu1067 mingxu1067 commented Feb 10, 2022

PR types

New features

PR changes

OPs

Describe

  1. Added fused_gemm_epilogue_op to compute Matmul + ElementwiseAdd + ReLU|GeLU.
  2. Added fused_gemm_epilogue_grad_op to compute ElementwiseAdd_grad + Matmul_grad + [ReLU|GeLU]_grad.
  3. Added a class member fuse_gemm_epilogue to BuildStrategy to enable fuse_gemm_epilogue_pass.
  4. CUDA 11.6+ is required.
  5. Usage Example
import paddle


class MultiFCLayer(paddle.nn.Layer):
    def __init__(self, hidden, Activation):
        super(MultiFCLayer, self).__init__()
        self.linear1 = paddle.nn.Linear(hidden, hidden)
        self.linear2 = paddle.nn.Linear(hidden, hidden)
        self.linear3 = paddle.nn.Linear(hidden, hidden)

        self.relu1 = Activation()
        self.relu2 = Activation()
        self.relu3 = Activation()

    def forward(self, x, matmul_y, ele_y):
        output = self.linear1(x)
        output = self.relu1(output)
        output = self.linear2(output)

        output1 = paddle.matmul(output, matmul_y)
        output = self.linear3(output)
        output = self.relu2(output)

        output = paddle.matmul(output, matmul_y)
        output = paddle.add(output, ele_y)
        output = self.relu3(output)
        output = paddle.add(output, output1)
        return output

paddle.enable_static()

batch = 64
seqlen = 128
hidden = 768

main_prog = paddle.static.Program()
startup_prog = paddle.static.Program()

with paddle.static.program_guard(main_prog, startup_prog):
    data = paddle.static.data(
        name="_data",
        shape=[-1, seqlen, hidden],
        dtype='float32')
    matmul_y = paddle.static.data(
        name="_matmul_y",
        shape=[1, hidden, hidden],
        dtype='float32')
    ele_y = paddle.static.data(
        name="_ele_y", shape=[hidden, ], dtype='float32')

    multi_fc_layer = MultiFCLayer(hidden, paddle.nn.ReLU)
    with paddle.static.amp.fp16_guard():
        out = multi_fc_layer(data, matmul_y, ele_y)
        loss = paddle.mean(out)
        paddle.static.append_backward(loss=loss)

build_strategy = paddle.static.BuildStrategy()
build_strategy.fuse_gemm_epilogue = True
program = paddle.static.CompiledProgram(main_prog)
program = program.with_data_parallel(
    loss_name=loss.name,
    build_strategy=build_strategy,
    places=paddle.static.cuda_places())

# 3 subgraphs are fused into fused_gemm_epilogue ops
# 3 subgraphs are fused into fused_gemm_epilogue_grad ops
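For completeness, a minimal sketch (not part of the PR) of how the compiled program above could be run. It assumes a CUDA build of Paddle and reuses the variables from the example; the feed keys match the paddle.static.data names:

import numpy as np

exe = paddle.static.Executor(paddle.CUDAPlace(0))
exe.run(startup_prog)

feed = {
    "_data": np.random.random((batch, seqlen, hidden)).astype("float32"),
    "_matmul_y": np.random.random((1, hidden, hidden)).astype("float32"),
    "_ele_y": np.random.random((hidden,)).astype("float32"),
}
# After compilation with fuse_gemm_epilogue=True, the fused ops are used
# transparently; the run call itself is unchanged.
loss_val, = exe.run(program, feed=feed, fetch_list=[loss.name])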

1. Added fused_gemm_epilogue op to leverage the cuBlasLt epilogue.
2. Support fusing Act(X*Y + bias), where X's dims >= 2 and Y's dims should be 2.
3. Act currently only supports ReLU (GeLU will be added in the future).
1. Added LinearAct into graph_pattern_detector.* to define the pattern in (2.).
2. LinearAct is used to detect act(element_add(matmul_v2(x, w), bias)); a concrete sketch of this op sequence follows the list below.
3. Act currently only supports ReLU (GeLU will be supported in the future).
1. Added FuseGemmEpiloguePass to handle nn.Linear + Act{ReLU}
fusion (GeLU will be supported in the future).
2. Only matmul_v2 coming from nn.Linear is supported.
1. Added GeLU support to fused_gemm_epilogue op.
2. Added EpilogueSingleton to cache the auxiliary pointer.
3. Added related UTs.
1. Added support of fwd graphs with grad_ops linking to LinearAct.
2. Added related changes to fuse_gemm_epilogue_pass for the above
modification.
1. Added matmul_v2 + ele_add pattern to LinearActPattern.
2. Added matmul_v2 + ele_add support to fuse_gemm_epilogue_pass.
1. Added fused_gemm_epilogue_grad to support backward epilogue fusion.
1. Added backward fusion pass for Linear(Act(x)).
2. Added backward fusion pass for Linear(x).
1. Made arguments of some functions pass by reference.
2. Removed redundant code.
3. Changed code to follow the Google code style.
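As a rough illustration (not code from this PR), the unfused op sequence that the LinearAct pattern targets looks like the following in dygraph-style Python; under static graph with fuse_gemm_epilogue enabled, the matmul_v2 + elementwise_add + activation triple is replaced by a single fused_gemm_epilogue op. The shapes are arbitrary example values:

import paddle

x = paddle.randn([64, 128, 768])   # X: dims >= 2
w = paddle.randn([768, 768])       # Y: dims == 2
bias = paddle.randn([768])

tmp = paddle.matmul(x, w)              # matmul_v2
tmp = paddle.add(tmp, bias)            # elementwise_add
out = paddle.nn.functional.relu(tmp)   # Act: ReLU (GeLU added later in this PR)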
@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

1. Modified the way to get the cublasLt handle in device_context to be
consistent with the latest changes in develop.
1. Require CUDA 11.6+.
2. Removed fuse_gemm_epilogue-related tests when CUDA < 11.6 (a sketch of such a version guard follows below).
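A minimal sketch of the kind of version guard this implies (hypothetical test name; the PR's actual UTs may differ). It assumes paddle.version.cuda() returns the CUDA version Paddle was built with as a "major.minor" string:

import unittest
import paddle


def cuda_ge_11_6():
    # Assumption: paddle.version.cuda() gives e.g. "11.6" on CUDA builds.
    if not paddle.is_compiled_with_cuda():
        return False
    major, minor = paddle.version.cuda().split(".")[:2]
    return (int(major), int(minor)) >= (11, 6)


@unittest.skipIf(not cuda_ge_11_6(),
                 "fused_gemm_epilogue requires CUDA 11.6+")
class TestFusedGemmEpilogueGuard(unittest.TestCase):
    def test_runs_only_on_cuda_11_6_plus(self):
        self.assertTrue(cuda_ge_11_6())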
ir::Graph *FuseGemmEpiloguePass::FuseLinearBwd(ir::Graph *graph,
                                               bool is_first_gemm) const {
Collaborator

What is is_first_gemm for? What is the difference between the first GEMM and the others?

From the following code, it seems that is_first_gemm == true means the gradient of X is not needed?

Collaborator Author

Done, changed to without_x_gradient.

paddle/fluid/operators/fused/fused_gemm_epilogue_op.cc (outdated, resolved)
memory::allocation::AllocationPtr auxiliary = nullptr;
};

class EpilogueSingleton {
Collaborator

In my understanding, EpilogueSingleton is used to store a memory buffer which is written by the forward cublasLt API. This memory buffer must be passed to the backward cublasLt API without any modification. Therefore, you use a map here to save the name-to-memory-buffer mapping, where the name is the activation output name. Am I right?

I prefer using something like ReserveSpace in the batch_norm op. It is not encouraged to save the variable name inside the op attribute, which makes graph dependency analysis, etc., difficult.

Collaborator Author

Done

1. Changed the argument name is_first_gemm to without_x_gradient for
clarity.
2. Applied PADDLE_THROW in fused_gemm_epilogue_op.
@paddle-bot-old

Sorry to inform you that fe8a560's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

1. Applied ReserveSpace to replace Epilogue for passing auxiliary
pointers between FWD and BWD.
1. Added act op count checking in UTs.
2. Fixed an issue when fusing the backward of ReLU(Linear(X)).
3. TODO: solve GeLU fusion issues.
1. Modified graph_pattern_detector to fit Linear with either GeLU or
ReLU.
2. Modified the data range in UTs to allow negative values.
void Make() override {
AddInput("X", "The input tensor X of Out = Act((X * Y) + bias).");
AddInput("Y", "The input tensor Y of Out = Act((X * Y) + bias).");
AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias).");
Collaborator

bias->Bias?

Collaborator Author

Done

AddInput("Y", "The input tensor Y of Out = Act((X * Y) + bias).");
AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias).");

AddOutput("out", "The output tensor Out of Out = Act((X * Y) + bias).");
Collaborator

out->Out?

Collaborator Author

Done

AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias).");

AddOutput("out", "The output tensor Out of Out = Act((X * Y) + bias).");
AddOutput("reserve_space",
Collaborator

reserve_space->ReserveSpace?

Collaborator Author

Done

"The input grad tensor to Out of Out = (Act(X) * Y) + bias");
AddInput("X", "The input tensor X of Out = (Act(X) * Y) + bias");
AddInput("Y", "The input tensor Y of Out = (Act(X) * Y) + bias");
AddInput("reserve_space",
Collaborator

reserve_space->ReserveSpace?

Collaborator Author

Done

VarDesc reserve_space(patterns::PDNodeName(scope_name, "reserve_space"));
auto *reserve_space_node = g->CreateVarNode(&reserve_space);

EpiloguePassActivationCache::Instance().InsertFusedActivation(
Collaborator
@sneaxiy sneaxiy Feb 28, 2022

How about making EpiloguePassActivationCache::Instance a local variable instead of a singleton? I mean, you could change the declarations of FuseGemmEpiloguePass::FuseLinearActFwd and FuseGemmEpiloguePass::FuseLinearActBwd to:

ir::Graph *FuseGemmEpiloguePass::FuseLinearActFwd(
    ir::Graph *graph, const std::unordered_set<std::string> &act_types,
    bool is_training, bool is_act_grad_x_from_act, EpiloguePassActivationCache *cache) const;

ir::Graph *FuseGemmEpiloguePass::FuseLinearActBwd(
    ir::Graph *graph, const std::unordered_set<std::string> &act_grad_types,
    bool is_act_grad_x_from_act, const EpiloguePassActivationCache &cache) const;

Collaborator Author

Done.
Made EpiloguePassActivationCache a local variable and passed it to FuseLinearActFwd and FuseLinearActBwd.
Used a pointer rather than a reference, due to a request from the pre-commit hooks.

1. bias -> Bias.
2. out -> Out.
3. reserve_space -> ReserveSpace.
1. Removed singleton in EpiloguePassActivationCache.
2. Made EpiloguePassActivationCache an argument to each pass
function.
Collaborator
@sneaxiy sneaxiy left a comment

LGTM.

@sneaxiy sneaxiy merged commit 2a3d9ec into PaddlePaddle:develop Mar 7, 2022
@mingxu1067 mingxu1067 deleted the cublaslt_epilogue branch March 8, 2022 02:55