
cuBlasLt Epilogue To Fuse Linear + ReLU|GeLU #39437

Merged
38 commits merged into PaddlePaddle:develop from cublaslt_epilogue on Mar 7, 2022

Conversation

mingxu1067
Collaborator

@mingxu1067 mingxu1067 commented Feb 10, 2022

PR types

New features

PR changes

OPs

Describe

  1. Added fused_gemm_epilogue_op to compute Matmul + ElementwiseAdd + ReLU|GeLU.
  2. Added fused_gemm_epilogue_grad_op to compute ElementwiseAdd_grad + Matmul_grad + [ReLU|GeLU]_grad.
  3. Added a class member fuse_gemm_epilogue to BuildStrategy to enable fuse_gemm_epilogue_pass.
  4. CUDA 11.6+ is required.
  5. Usage Example
import paddle


class MultiFCLayer(paddle.nn.Layer):
    def __init__(self, hidden, Activation):
        super(MultiFCLayer, self).__init__()
        self.linear1 = paddle.nn.Linear(hidden, hidden)
        self.linear2 = paddle.nn.Linear(hidden, hidden)
        self.linear3 = paddle.nn.Linear(hidden, hidden)

        self.relu1 = Activation()
        self.relu2 = Activation()
        self.relu3 = Activation()

    def forward(self, x, matmul_y, ele_y):
        output = self.linear1(x)
        output = self.relu1(output)
        output = self.linear2(output)

        output1 = paddle.matmul(output, matmul_y)
        output = self.linear3(output)
        output = self.relu2(output)

        output = paddle.matmul(output, matmul_y)
        output = paddle.add(output, ele_y)
        output = self.relu3(output)
        output = paddle.add(output, output1)
        return output

paddle.enable_static()

batch = 64
seqlen = 128
hidden = 768

main_prog = paddle.static.Program()
startup_prog = paddle.static.Program()

with paddle.static.program_guard(main_prog, startup_prog):
    data = paddle.static.data(
        name="_data",
        shape=[-1, seqlen, hidden],
        dtype='float32')
    matmul_y = paddle.static.data(
        name="_matmul_y",
        shape=[1, hidden, hidden],
        dtype='float32')
    ele_y = paddle.static.data(
        name="_ele_y", shape=[hidden, ], dtype='float32')

    multi_fc_layer = MultiFCLayer(hidden, paddle.nn.ReLU)
    with paddle.static.amp.fp16_guard():
        out = multi_fc_layer(data, matmul_y, ele_y)
        loss = paddle.mean(out)
        paddle.static.append_backward(loss=loss)

build_strategy = paddle.static.BuildStrategy()
build_strategy.fuse_gemm_epilogue = True
program = paddle.static.CompiledProgram(main_prog)
program = program.with_data_parallel(
    loss_name=loss.name,
    build_strategy=build_strategy,
    places=paddle.static.cuda_places())

# 3 subgraphs are fused into fused_gemm_epilogue ops
# 3 subgraphs are fused into fused_gemm_epilogue_grad ops
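For completeness, a minimal sketch (not part of the PR) of how the compiled program above could be run. It assumes a CUDA build of Paddle and reuses the variables from the example; the feed keys match the paddle.static.data names:

import numpy as np

exe = paddle.static.Executor(paddle.CUDAPlace(0))
exe.run(startup_prog)

feed = {
    "_data": np.random.random((batch, seqlen, hidden)).astype("float32"),
    "_matmul_y": np.random.random((1, hidden, hidden)).astype("float32"),
    "_ele_y": np.random.random((hidden,)).astype("float32"),
}
# After compilation with fuse_gemm_epilogue=True, the fused ops are used
# transparently; the run call itself is unchanged.
loss_val, = exe.run(program, feed=feed, fetch_list=[loss.name])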

1. Added fused_gemm_epilogue op to leverage the cuBlasLt epilogue.
2. Support fusing Act(X*Y + bias), where X's dims >= 2 and Y's dims should be 2.
3. Act currently only supports ReLU (GeLU will be added in the future).
1. Added LinearAct into graph_pattern_detector.* to define the pattern in (2.).
2. LinearAct is used to detect act(element_add(matmul_v2(x, w), bias)); a concrete sketch of this op sequence follows the list below.
3. Act currently only supports ReLU (GeLU will be supported in the future).
1. Added FuseGemmEpiloguePass to handle nn.Linear + Act{ReLU}
fusion (GeLU will be supported in the future).
2. Only matmul_v2 coming from nn.Linear is supported.
1. Added GeLU support to fused_gemm_epilogue op.
2. Added EpilogueSingleton to cache the auxiliary pointer.
3. Added related UTs.
1. Added support of fwd graphs with grad_ops linking to LinearAct.
2. Added related changes to fuse_gemm_epilogue_pass for the above
modification.
1. Added matmul_v2 + ele_add pattern to LinearActPattern.
2. Added matmul_v2 + ele_add support to fuse_gemm_epilogue_pass.
1. Added fused_gemm_epilogue_grad to support backward epilogue fusion.
1. Added backward fusion pass for Linear(Act(x)).
2. Added backward fusion pass for Linear(x).
1. Made arguments of some functions pass by reference.
2. Removed redundant code.
3. Changed code to follow the Google code style.
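As a rough illustration (not code from this PR), the unfused op sequence that the LinearAct pattern targets looks like the following in dygraph-style Python; under static graph with fuse_gemm_epilogue enabled, the matmul_v2 + elementwise_add + activation triple is replaced by a single fused_gemm_epilogue op. The shapes are arbitrary example values:

import paddle

x = paddle.randn([64, 128, 768])   # X: dims >= 2
w = paddle.randn([768, 768])       # Y: dims == 2
bias = paddle.randn([768])

tmp = paddle.matmul(x, w)              # matmul_v2
tmp = paddle.add(tmp, bias)            # elementwise_add
out = paddle.nn.functional.relu(tmp)   # Act: ReLU (GeLU added later in this PR)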
@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

1. Modified the way to get the cublasLt handle in device_context to be
consistent with the latest changes in develop.
1. Require CUDA 11.6+.
2. Removed fuse_gemm_epilogue-related tests when CUDA < 11.6 (a sketch of such a version guard follows below).
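A minimal sketch of the kind of version guard this implies (hypothetical test name; the PR's actual UTs may differ). It assumes paddle.version.cuda() returns the CUDA version Paddle was built with as a "major.minor" string:

import unittest
import paddle


def cuda_ge_11_6():
    # Assumption: paddle.version.cuda() gives e.g. "11.6" on CUDA builds.
    if not paddle.is_compiled_with_cuda():
        return False
    major, minor = paddle.version.cuda().split(".")[:2]
    return (int(major), int(minor)) >= (11, 6)


@unittest.skipIf(not cuda_ge_11_6(),
                 "fused_gemm_epilogue requires CUDA 11.6+")
class TestFusedGemmEpilogueGuard(unittest.TestCase):
    def test_runs_only_on_cuda_11_6_plus(self):
        self.assertTrue(cuda_ge_11_6())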
ir::Graph *FuseGemmEpiloguePass::FuseLinearBwd(ir::Graph *graph,
                                               bool is_first_gemm) const {
Collaborator

What is is_first_gemm for? What is the difference between the first GEMM and the others?

From the following code, it seems that is_first_gemm == true means the gradient of X is not needed?

Collaborator Author

Done, changed to without_x_gradient.

paddle/fluid/operators/fused/fused_gemm_epilogue_op.cc (outdated, resolved)
memory::allocation::AllocationPtr auxiliary = nullptr;
};

class EpilogueSingleton {
Collaborator

In my understanding, EpilogueSingleton is used to store a memory buffer which is written by the forward cublasLt API. This memory buffer must be passed to the backward cublasLt API without any modification. Therefore, you use a map here to save the name-to-memory-buffer mapping, where the name is the activation output name. Am I right?

I prefer using something like ReserveSpace in the batch_norm op. It is not encouraged to save the variable name inside the op attribute, which makes graph dependency analysis, etc., difficult.

Collaborator Author

Done

1. Changed the argument name is_first_gemm to without_x_gradient for
clarity.
2. Applied PADDLE_THROW in fused_gemm_epilogue_op.
@paddle-bot-old

Sorry to inform you that fe8a560's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

1. Applied ReserveSpace to replace Epilogue for passing auxiliary
pointers between FWD and BWD.
1. Added act op count checking in UTs.
2. Fixed an issue when fusing the backward of ReLU(Linear(X)).
3. TODO: solve GeLU fusion issues.
1. Modified graph_pattern_detector to fit Linear with either GeLU or
ReLU.
2. Modified the data range in UTs to allow negative values.
void Make() override {
AddInput("X", "The input tensor X of Out = Act((X * Y) + bias).");
AddInput("Y", "The input tensor Y of Out = Act((X * Y) + bias).");
AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias).");
Collaborator

bias->Bias?

Collaborator Author

Done

AddInput("Y", "The input tensor Y of Out = Act((X * Y) + bias).");
AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias).");

AddOutput("out", "The output tensor Out of Out = Act((X * Y) + bias).");
Collaborator

out->Out?

Collaborator Author

Done

AddInput("bias", "The input tensor bias of Out = Act((X * Y) + bias).");

AddOutput("out", "The output tensor Out of Out = Act((X * Y) + bias).");
AddOutput("reserve_space",
Collaborator

reserve_space->ReserveSpace?

Collaborator Author

Done

"The input grad tensor to Out of Out = (Act(X) * Y) + bias");
AddInput("X", "The input tensor X of Out = (Act(X) * Y) + bias");
AddInput("Y", "The input tensor Y of Out = (Act(X) * Y) + bias");
AddInput("reserve_space",
Collaborator

reserve_space->ReserveSpace?

Collaborator Author

Done

VarDesc reserve_space(patterns::PDNodeName(scope_name, "reserve_space"));
auto *reserve_space_node = g->CreateVarNode(&reserve_space);

EpiloguePassActivationCache::Instance().InsertFusedActivation(
Collaborator
@sneaxiy sneaxiy Feb 28, 2022

How about making EpiloguePassActivationCache::Instance a local variable instead of a singleton? I mean, you could change the declarations of FuseGemmEpiloguePass::FuseLinearActFwd and FuseGemmEpiloguePass::FuseLinearActBwd to:

ir::Graph *FuseGemmEpiloguePass::FuseLinearActFwd(
    ir::Graph *graph, const std::unordered_set<std::string> &act_types,
    bool is_training, bool is_act_grad_x_from_act, EpiloguePassActivationCache *cache) const;

ir::Graph *FuseGemmEpiloguePass::FuseLinearActBwd(
    ir::Graph *graph, const std::unordered_set<std::string> &act_grad_types,
    bool is_act_grad_x_from_act, const EpiloguePassActivationCache &cache) const;

Collaborator Author

Done.
Made EpiloguePassActivationCache a local variable and passed it to FuseLinearActFwd and FuseLinearActBwd.
Used a pointer rather than a reference, due to a request from the pre-commit hooks.

1. bias -> Bias.
2. out -> Out.
3. reserve_space -> ReserveSpace.
1. Removed singleton in EpiloguePassActivationCache.
2. Made EpiloguePassActivationCache an argument to each pass
function.
Collaborator
@sneaxiy sneaxiy left a comment

LGTM.

@sneaxiy sneaxiy merged commit 2a3d9ec into PaddlePaddle:develop Mar 7, 2022
@mingxu1067 mingxu1067 deleted the cublaslt_epilogue branch March 8, 2022 02:55