
[TOPI] VNNI support for batch matmul #10332

Merged: 9 commits into apache:main on Feb 23, 2022

Conversation

@masahi (Member) commented on Feb 21, 2022

Following #10230, I added VNNI support for batch_matmul as well. The cool part is that I reuse the same dense schedule from #10230 to schedule the GEMM part and parallelize over the batch dimension. See the perf results in #10332 (comment)
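Below is a minimal sketch of that scheduling idea (an illustration, not the code added in this PR): hand the GEMM tiles to the reused dense VNNI schedule and parallelize over the fused batch/outer axes, which is where the `ax0.ax1.outer.ax2.outer.fused.fused` loop in the lowered IR below comes from. `dense_vnni_schedule_inner` is a hypothetical stand-in for the helper reused from #10230.

```python
import tvm
from tvm import te

def schedule_batch_matmul_vnni(outs, dense_vnni_schedule_inner):
    # Assumes outs[0] is a batch_matmul-like compute with spatial axes (batch, i, j).
    s = te.create_schedule([x.op for x in outs])
    C = outs[0]
    batch, i, j = s[C].op.axis

    # Placeholder split of the row axis; in the real schedule the tiling and the
    # VNNI tensorization come from the dense schedule reused from #10230.
    io, ii = s[C].split(i, factor=32)

    # Fuse the batch axis with the outer row tiles and run the fused loop in
    # parallel, so each thread works on one (batch, row-tile) slice.
    fused = s[C].fuse(batch, io)
    s[C].parallel(fused)

    # Hypothetical hook standing in for the reused dense VNNI GEMM schedule.
    dense_vnni_schedule_inner(s, C, ii, j)
    return s
```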

After this PR, I'll add int8 + int8 input support to VNNI dense and batch_matmul (UPDATE: done). That will allow us to benchmark e2e performance on the QAT BERT made possible by @Icemist in #10239.

Unlike the dense case, the second input to batch_matmul is typically not a constant tensor, so I don't use alter_layout with a compile-time layout transform. Instead, the layout transform is done at runtime, and the lowered IR for batch_matmul + post ops looks like this:

  parallel (ax0.ax1.outer.ax2.outer.fused.fused, 0, 128) {
    // attr [T_layout_trans] storage_alignment = 128
    let T_layout_trans = tir.TVMBackendAllocWorkspace(1, dev_id, (uint64)1536, 0, 8)
    allocate compute[int32x16 * 1], storage_scope = global
    for (ax2, 0, 24) {
      let cse_var_2 = (ax2*64)
      let cse_var_1 = ((ax0.ax1.outer.ax2.outer.fused.fused*1536) + (ax2*4))
      T_layout_trans[ramp(cse_var_2, 1, 4)] = placeholder[ramp(cse_var_1, 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 4), 1, 4)] = placeholder[ramp((cse_var_1 + 96), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 8), 1, 4)] = placeholder[ramp((cse_var_1 + 192), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 12), 1, 4)] = placeholder[ramp((cse_var_1 + 288), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 16), 1, 4)] = placeholder[ramp((cse_var_1 + 384), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 20), 1, 4)] = placeholder[ramp((cse_var_1 + 480), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 24), 1, 4)] = placeholder[ramp((cse_var_1 + 576), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 28), 1, 4)] = placeholder[ramp((cse_var_1 + 672), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 32), 1, 4)] = placeholder[ramp((cse_var_1 + 768), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 36), 1, 4)] = placeholder[ramp((cse_var_1 + 864), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 40), 1, 4)] = placeholder[ramp((cse_var_1 + 960), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 44), 1, 4)] = placeholder[ramp((cse_var_1 + 1056), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 48), 1, 4)] = placeholder[ramp((cse_var_1 + 1152), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 52), 1, 4)] = placeholder[ramp((cse_var_1 + 1248), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 56), 1, 4)] = placeholder[ramp((cse_var_1 + 1344), 1, 4)]
      T_layout_trans[ramp((cse_var_2 + 60), 1, 4)] = placeholder[ramp((cse_var_1 + 1440), 1, 4)]
    }
    for (ax1.inner, 0, 32) {
      let cse_var_3 = (((tir.shift_right(ax0.ax1.outer.ax2.outer.fused.fused, 3)*4096) + (ax1.inner*128)) + (tir.bitwise_and(ax0.ax1.outer.ax2.outer.fused.fused, 7)*16))
      compute[ramp(0, 1, 16)] = x16(0)
      for (k.outer, 0, 24) {
        compute[ramp(0, 1, 16)] = (tir.call_llvm_pure_intrin((uint32)9785, (uint32)0, x16(0), x16(tir.reinterpret(placeholder[ramp((((tir.shift_right(ax0.ax1.outer.ax2.outer.fused.fused, 3)*3072) + (ax1.inner*96)) + (k.outer*4)), 1, 4)])), tir.reinterpret(T_layout_trans[ramp((k.outer*64), 1, 64)])) + compute[ramp(0, 1, 16)])
      }
      T_add[ramp(cse_var_3, 1, 16)] = (compute[ramp(0, 1, 16)] + placeholder[ramp(cse_var_3, 1, 16)])
    }
  }

Future work can explore eliminating the runtime layout transform, or pipelining the layout transform with the compute to hide its overhead.
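As a usage illustration (not part of this PR), the sketch below builds an int8 batch_matmul through Relay on a VNNI-capable CPU so the path described above gets exercised; the shapes, dtypes, and `-mcpu` string are assumptions chosen for the example.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

B, M, N, K = 8, 128, 768, 768
x = relay.var("x", shape=(B, M, K), dtype="uint8")
# With the default transpose_b=True, the second operand has shape (B, N, K).
y = relay.var("y", shape=(B, N, K), dtype="int8")
out = relay.nn.batch_matmul(x, y, out_dtype="int32")
mod = tvm.IRModule.from_expr(relay.Function([x, y], out))

# Any VNNI-capable -mcpu works here; cascadelake is just an example.
target = "llvm -mcpu=cascadelake"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)

dev = tvm.cpu(0)
rt = graph_executor.GraphModule(lib["default"](dev))
rt.set_input("x", np.random.randint(0, 64, size=(B, M, K)).astype("uint8"))
rt.set_input("y", np.random.randint(-64, 64, size=(B, N, K)).astype("int8"))
rt.run()
res = rt.get_output(0).numpy()  # (B, M, N) int32 result
```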

@elvin-n @mbrookhart @tkonolige @junrushao1994 @vinx13

@elvin-n (Contributor) commented on Feb 21, 2022

Could you please share float batch_matmul perf data vs. the newly introduced int8 batch_matmul?

@elvin-n (Contributor) left a comment

LGTM

@masahi (Member, Author) commented on Feb 21, 2022

OK, here is the GOPS comparison between the new VNNI implementation and the existing generic code. Note that the VNNI numbers were obtained after only one or two minutes of tuning, while the existing generic implementation has a very large tuning space and took more than 12 hours under the same tuning options to reach these numbers. The script is at https://github.com/masahi/int8_experiment/blob/main/relay_bench.py

This is on a Rocket Lake i5-11400 @ 2.60 GHz, 6 threads.

| B | M | N | K | TVM VNNI (new), GOPS | TVM existing (old), GOPS |
|---|---|---|---|---:|---:|
| 8 | 64 | 800 | 320 | 1863.0 | 471.9 |
| 8 | 64 | 768 | 512 | 1957.2 | 254.2 |
| 8 | 16 | 256 | 512 | 481.8 | 249.4 |
| 8 | 128 | 128 | 128 | 1940.8 | 372.8 |
| 8 | 256 | 512 | 256 | 2381.0 | 496.8 |
| 8 | 1024 | 1024 | 1024 | 2275.1 | 219.5 |
| 8 | 128 | 768 | 3072 | 1449.9 | 219.9 |
| 8 | 128 | 768 | 768 | 1883.4 | 234.4 |
| 8 | 128 | 3072 | 768 | 1595.6 | 196.1 |
| 16 | 384 | 384 | 64 | 2487.8 | 418.9 |
| 16 | 384 | 64 | 384 | 2441.7 | 301.4 |
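For reference, the GOPS figures are total multiply-add work divided by runtime; a minimal sketch of the arithmetic, assuming a multiply-accumulate counts as two operations (the 0.64 ms runtime below is illustrative, not a measurement from this table):

```python
# Count 2*B*M*N*K operations (one multiply plus one add per inner-product term)
# and divide by the measured time in seconds.
def gops(B, M, N, K, seconds):
    return 2.0 * B * M * N * K / seconds / 1e9

print(gops(8, 128, 768, 768, 0.64e-3))  # ~1887 GOPS
```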

@tmoreau89 (Contributor) left a comment

Thank you @masahi, the speedups you've reported are extremely impressive! LGTM

@masahi masahi merged commit 8947729 into apache:main Feb 23, 2022
pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022
* add test

* compute added

* schedule works

* reuse dense_vnni schedule

* try an alternative approach to scheduling layout transform

* introduce a tunable knob to decide if compute_root

* check transpose condition

* support s8 + s8 input

* pylint