
Refactor dot op's CPU kernel for better performance #32589

Merged
merged 9 commits into develop on May 7, 2021

Conversation

Contributor

@tongxin tongxin commented Apr 26, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Rewrote the dot op's CPU kernel and measured a more than 10x performance improvement.

The core of the kernel is a simple loop nest performing a sequence of sum reductions. Although the code is straightforward, the branch in the inner loop is enough to defeat many loop optimizations. Moreover, a reduction (+=) into a heap-dereferenced value cannot be localized automatically by the compiler, so we localize it manually by accumulating into a local variable.
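The manual localization described above can be sketched as follows. This is a minimal standalone illustration, not Paddle's actual kernel code; `dot_before` and `dot_after` and all names are hypothetical:

```cpp
#include <cstddef>

// Before (sketch): the reduction accumulates through a pointer into the
// output buffer. The compiler must assume *out may alias x or y, so the
// accumulator stays in memory and the loop remains scalar.
float dot_before(const float* x, const float* y, float* out, std::size_t n) {
  *out = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    *out += x[i] * y[i];  // load-add-store to the heap on every iteration
  }
  return *out;
}

// After (sketch): accumulate into a local scalar and store once at the end.
// With -O3 (and reassociation enabled for float sums, e.g. -ffast-math),
// GCC can vectorize this loop.
float dot_after(const float* x, const float* y, float* out, std::size_t n) {
  float ss = 0.0f;  // localized accumulator, kept in a register
  for (std::size_t i = 0; i < n; ++i) {
    ss += x[i] * y[i];
  }
  *out = ss;
  return *out;
}
```

Because the local `ss` provably cannot alias the inputs, the compiler is free to keep the partial sum in a vector register and reduce it once after the loop.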

The following compares the GIMPLE IR of the inner loop before and after the rewrite.

Before the rewrite:

;;   basic block 63, loop depth 1, count 430033602 (estimated locally), maybe hot
;;    prev block 62, next block 64, flags: (NEW, REACHABLE, VISITED)
;;    pred:       61 [50.0% (guessed)]  count:430033601 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;   starting at line 282
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] # RANGE ~[2147483648, 18446744073709551614]
  _14 = (long unsigned intD.16) ind_208;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] # RANGE [0, 18446744073709551612] NONZERO 18446744073709551612
  _15 = _14 * 4;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] # PT = nonlocal escaped null
  _16 = _124 + _15;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] # VUSE <.MEM_103>
  _17 = [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] *_16;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] _25 = _17 + _70;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:9] # .MEM_67 = VDEF <.MEM_103>
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:9] *_16 = _25;
;;    succ:       64 [always (guessed)]  count:430033602 (estimated locally) (FALLTHRU,EXECUTABLE)    

After the rewrite:

;;   basic block 61, loop depth 1, count 94607386 (estimated locally), maybe hot
;;    prev block 60, next block 62, flags: (NEW, REACHABLE, VISITED)
;;    pred:       60 [89.0% (guessed)]  count:94607386 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;   starting at line -1
  if (_14 <= 6)
    goto <bb 64>; [10.00%]
  else
    goto <bb 62>; [90.00%]
;;    succ:       62 [90.0% (guessed)]  count:85146647 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;                64 [10.0% (guessed)]  count:9460739 (estimated locally) (TRUE_VALUE,EXECUTABLE)

;;   basic block 62, loop depth 2, count 510879883 (estimated locally), maybe hot
;;    prev block 61, next block 63, flags: (NEW, REACHABLE, VISITED)
;;    pred:       61 [90.0% (guessed)]  count:85146647 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;                62 [83.3% (adjusted)]  count:425733237 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;   starting at line 272, discriminator 2
  # ss_201 = PHI <[/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:271:9] 0.0(61), [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:33] ss_56(62)>
  # RANGE [0, 8589934528] NONZERO 8589934591
  # ivtmp.2840_53 = PHI <0(61), ivtmp.2840_206(62)>
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:47] # VUSE <.MEM_198>
  vect__6.2831_215 = MEM[base: x__207, index: ivtmp.2840_53, offset: 0B];
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:47] # VUSE <.MEM_198>
  vect__8.2834_218 = MEM[base: y__204, index: ivtmp.2840_53, offset: 0B];
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:47] vect__55.2835_219 = vect__6.2831_215 * vect__8.2834_218;
  stmp_ss_56.2836_220 = BIT_FIELD_REF <vect__55.2835_219, 32, 0>;
  stmp_ss_56.2836_221 = ss_201 + stmp_ss_56.2836_220;
  stmp_ss_56.2836_222 = BIT_FIELD_REF <vect__55.2835_219, 32, 32>;
  stmp_ss_56.2836_223 = stmp_ss_56.2836_221 + stmp_ss_56.2836_222;
  stmp_ss_56.2836_224 = BIT_FIELD_REF <vect__55.2835_219, 32, 64>;
  stmp_ss_56.2836_225 = stmp_ss_56.2836_223 + stmp_ss_56.2836_224;
  stmp_ss_56.2836_226 = BIT_FIELD_REF <vect__55.2835_219, 32, 96>;
  stmp_ss_56.2836_227 = stmp_ss_56.2836_225 + stmp_ss_56.2836_226;
  stmp_ss_56.2836_228 = BIT_FIELD_REF <vect__55.2835_219, 32, 128>;
  stmp_ss_56.2836_229 = stmp_ss_56.2836_227 + stmp_ss_56.2836_228;
  stmp_ss_56.2836_230 = BIT_FIELD_REF <vect__55.2835_219, 32, 160>;
  stmp_ss_56.2836_231 = stmp_ss_56.2836_229 + stmp_ss_56.2836_230;
  stmp_ss_56.2836_232 = BIT_FIELD_REF <vect__55.2835_219, 32, 192>;
  stmp_ss_56.2836_233 = stmp_ss_56.2836_231 + stmp_ss_56.2836_232;
  stmp_ss_56.2836_234 = BIT_FIELD_REF <vect__55.2835_219, 32, 224>;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:33] ss_56 = stmp_ss_56.2836_233 + stmp_ss_56.2836_234;
  # RANGE [32, 8589934560] NONZERO 8589934591
  ivtmp.2840_206 = ivtmp.2840_53 + 32;

  if (_203 == ivtmp.2840_206)
    goto <bb 63>; [16.67%]
  else
    goto <bb 62>; [83.33%]

The reduction is now carried in SSA temporaries, and the vector loads (vect__*) followed by the BIT_FIELD_REF reduction tree show that the compiler successfully vectorized the loop.

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@CLAassistant

CLAassistant commented Apr 26, 2021

CLA assistant check
All committers have signed the CLA.

@paddle-bot-old

paddle-bot-old bot commented May 5, 2021

Sorry to inform you that 988c1d1's CI results are more than 7 days old. To prevent PR conflicts, you need to re-run all CIs manually.

Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM

Contributor

@Xreki Xreki left a comment

LGTM for op benchmark ci

@Aurelius84 Aurelius84 merged commit 97a9552 into PaddlePaddle:develop May 7, 2021