Refactor `dot` op's CPU kernel for better performance #32589

tongxin · 2021-04-26T12:41:47Z

PR types

Performance optimization

PR changes

OPs

Describe

Rewrote the `dot` op's CPU kernel and saw a more than 10x performance improvement.

The main part of the kernel is a trivial loop nest that performs a sequence of sum-reductions. Although the code is simple, the branch in the inner loop is enough to block many kinds of loop optimization. Moreover, the compiler cannot automatically promote the reduction (`+=`) into a heap-dereferenced value to a local variable, so we have to do that manually.

The following compares the IR of the inner loop before and after the rewrite.

Before rewrite:

;;   basic block 63, loop depth 1, count 430033602 (estimated locally), maybe hot
;;    prev block 62, next block 64, flags: (NEW, REACHABLE, VISITED)
;;    pred:       61 [50.0% (guessed)]  count:430033601 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;   starting at line 282
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] # RANGE ~[2147483648, 18446744073709551614]
  _14 = (long unsigned intD.16) ind_208;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] # RANGE [0, 18446744073709551612] NONZERO 18446744073709551612
  _15 = _14 * 4;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] # PT = nonlocal escaped null
  _16 = _124 + _15;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] # VUSE <.MEM_103>
  _17 = [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] *_16;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:23] _25 = _17 + _70;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:9] # .MEM_67 = VDEF <.MEM_103>
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:282:9] *_16 = _25;
;;    succ:       64 [always (guessed)]  count:430033602 (estimated locally) (FALLTHRU,EXECUTABLE)

After rewrite:

;;   basic block 61, loop depth 1, count 94607386 (estimated locally), maybe hot
;;    prev block 60, next block 62, flags: (NEW, REACHABLE, VISITED)
;;    pred:       60 [89.0% (guessed)]  count:94607386 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;   starting at line -1
  if (_14 <= 6)
    goto <bb 64>; [10.00%]
  else
    goto <bb 62>; [90.00%]
;;    succ:       62 [90.0% (guessed)]  count:85146647 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;                64 [10.0% (guessed)]  count:9460739 (estimated locally) (TRUE_VALUE,EXECUTABLE)

;;   basic block 62, loop depth 2, count 510879883 (estimated locally), maybe hot
;;    prev block 61, next block 63, flags: (NEW, REACHABLE, VISITED)
;;    pred:       61 [90.0% (guessed)]  count:85146647 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;                62 [83.3% (adjusted)]  count:425733237 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;   starting at line 272, discriminator 2
  # ss_201 = PHI <[/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:271:9] 0.0(61), [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:33] ss_56(62)>
  # RANGE [0, 8589934528] NONZERO 8589934591
  # ivtmp.2840_53 = PHI <0(61), ivtmp.2840_206(62)>
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:47] # VUSE <.MEM_198>
  vect__6.2831_215 = MEM[base: x__207, index: ivtmp.2840_53, offset: 0B];
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:47] # VUSE <.MEM_198>
  vect__8.2834_218 = MEM[base: y__204, index: ivtmp.2840_53, offset: 0B];
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:47] vect__55.2835_219 = vect__6.2831_215 * vect__8.2834_218;
  stmp_ss_56.2836_220 = BIT_FIELD_REF <vect__55.2835_219, 32, 0>;
  stmp_ss_56.2836_221 = ss_201 + stmp_ss_56.2836_220;
  stmp_ss_56.2836_222 = BIT_FIELD_REF <vect__55.2835_219, 32, 32>;
  stmp_ss_56.2836_223 = stmp_ss_56.2836_221 + stmp_ss_56.2836_222;
  stmp_ss_56.2836_224 = BIT_FIELD_REF <vect__55.2835_219, 32, 64>;
  stmp_ss_56.2836_225 = stmp_ss_56.2836_223 + stmp_ss_56.2836_224;
  stmp_ss_56.2836_226 = BIT_FIELD_REF <vect__55.2835_219, 32, 96>;
  stmp_ss_56.2836_227 = stmp_ss_56.2836_225 + stmp_ss_56.2836_226;
  stmp_ss_56.2836_228 = BIT_FIELD_REF <vect__55.2835_219, 32, 128>;
  stmp_ss_56.2836_229 = stmp_ss_56.2836_227 + stmp_ss_56.2836_228;
  stmp_ss_56.2836_230 = BIT_FIELD_REF <vect__55.2835_219, 32, 160>;
  stmp_ss_56.2836_231 = stmp_ss_56.2836_229 + stmp_ss_56.2836_230;
  stmp_ss_56.2836_232 = BIT_FIELD_REF <vect__55.2835_219, 32, 192>;
  stmp_ss_56.2836_233 = stmp_ss_56.2836_231 + stmp_ss_56.2836_232;
  stmp_ss_56.2836_234 = BIT_FIELD_REF <vect__55.2835_219, 32, 224>;
  [/home/baitongxin/Paddle/paddle/fluid/operators/dot_op.h:272:33] ss_56 = stmp_ss_56.2836_233 + stmp_ss_56.2836_234;
  # RANGE [32, 8589934560] NONZERO 8589934591
  ivtmp.2840_206 = ivtmp.2840_53 + 32;

  if (_203 == ivtmp.2840_206)
    goto <bb 63>; [16.67%]
  else
    goto <bb 62>; [83.33%]

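The IR after the rewrite shows that the compiler successfully vectorized the loop: each iteration loads and multiplies eight 32-bit elements at once (`vect__6`, `vect__8`, `vect__55`) and advances the induction variable by 32 bytes.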

paddle-bot-old · 2021-04-26T12:41:50Z

Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

CLAassistant · 2021-04-26T12:41:52Z

All committers have signed the CLA.

paddle-bot-old · 2021-05-05T02:35:30Z

Sorry to inform you that 988c1d1's CIs passed more than 7 days ago. To prevent PR conflicts, you need to re-run all CIs manually.

XiaoguangHu01

LGTM

Xreki

LGTM for op benchmark ci

tongxin added 2 commits April 25, 2021 12:14

OP dot: refactor CPU kernels and get better loop performance.

e611858

Merge remote-tracking branch 'upstream/develop' into develop

440adb9

tongxin added 6 commits April 27, 2021 04:00

Minor fix on code format.

01230a3

Merge remote-tracking branch 'upstream/develop' into develop

7305975

Merge remote-tracking branch 'upstream/develop' into develop

242b808

Merge branch 'develop' of https://github.com/tongxin/Paddle into develop

54a946f

Fixed minor errors.

2f9ae3b

Merge remote-tracking branch 'upstream/develop' into develop

988c1d1

Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into…

df80e4a

… develop

XiaoguangHu01 approved these changes May 6, 2021

Xreki approved these changes May 6, 2021

Aurelius84 merged commit 97a9552 into PaddlePaddle:develop May 7, 2021
