[TOPI] VNNI support for int8 dense #10230

Merged: 19 commits merged into apache:main on Feb 15, 2022

Conversation

@masahi (Member) commented on Feb 11, 2022

I started off with the test code in `test_fc_int8_acc32()` and simplified it a bit. AutoTVM tuning is supported, but only one tunable parameter is exposed for now. I'm curious what kind of further scheduling would be worthwhile beyond the very simple one I have now (a sketch of the single-knob setup is shown below).
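
To make the tuning setup concrete, here is a minimal sketch of a single-knob AutoTVM template. It is not the schedule in this PR: the template name, knob name, and knob values are made up for illustration, and the compute is a plain (non-VNNI) int8 dense.

```python
from tvm import te, autotvm

# Hypothetical single-knob template, for illustration only.
@autotvm.template("sketch_dense_int8_single_knob")
def dense_int8_single_knob(M, N, K):
    data = te.placeholder((M, K), name="data", dtype="uint8")
    weight = te.placeholder((N, K), name="weight", dtype="int8")
    k = te.reduce_axis((0, K), name="k")
    out = te.compute(
        (M, N),
        lambda i, j: te.sum(
            data[i, k].astype("int32") * weight[j, k].astype("int32"), axis=k
        ),
        name="dense",
    )
    s = te.create_schedule(out.op)

    cfg = autotvm.get_config()
    # The single tunable parameter: how the outer (row) loop is tiled
    # before parallelization.
    cfg.define_knob("tile_m", [1, 2, 4, 8, 16])
    i, j = s[out].op.axis
    io, _ = s[out].split(i, factor=cfg["tile_m"].val)
    s[out].parallel(io)
    return s, [data, weight, out]
```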

Moreover, I rely on AlterOpLayout to enable this op, via the following check:

if (
    target_has_vnni(mcpu)
    and data_tensor.dtype == "uint8"
    and weight_tensor.dtype == "int8"
    and weight_tensor.shape[0] % 16 == 0
    and weight_tensor.shape[1] % 4 == 0
):
    # TODO(masahi): Support int8 x int8 case
    # NC16n4c packs the (N, K) weight into 16x4 blocks: 16 output channels by
    # 4 reduction elements, matching the 16-lane, 4-way int8 dot product of VNNI.
    weight_layout = "NC16n4c"
    return relay.nn.contrib_dense_pack(inputs[0], inputs[1], weight_layout, None, out_dtype)
Currently, however, this pass is disabled during autotvm task extraction (see below and #10171):
# Alter op layout code has been written expecting that tuning is applied
# without it, so we disable AlterOpLayout to maintain that behavior.
with tvm.transform.PassContext(opt_level=opt_level, disabled_pass={"AlterOpLayout"}):
So users need to manually invoke AlterOpLayout before extracting tasks; a sketch of that workflow follows. I refuse to add an ugly code path to work around this strange issue the way the existing code does (cc @tkonolige).
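
For reference, a sketch of the user-side workflow this implies: apply AlterOpLayout yourself, then extract tasks from the already-altered module. This is an assumed workflow, not code from this PR, and the target string is just an example of a VNNI-capable CPU.

```python
import tvm
from tvm import relay, autotvm

# Assumed user-side workaround: run AlterOpLayout manually so nn.dense is
# rewritten to nn.contrib_dense_pack before AutoTVM extracts tasks.
target = "llvm -mcpu=cascadelake"  # example VNNI-capable target

def extract_tasks_with_altered_layout(mod, params):
    with tvm.target.Target(target):
        seq = tvm.transform.Sequential(
            [
                relay.transform.InferType(),
                relay.transform.AlterOpLayout(),
            ]
        )
        mod = seq(mod)
    # Extraction now sees the packed dense op, so that is what gets tuned.
    return autotvm.task.extract_from_program(mod, params=params, target=target)
```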

cc @vinx13 @junrushao1994 @mbrookhart @tkonolige @elvin-n

Current perf results (also see more results in #10230 (comment))

Compared against FBGEMM using their benchmark executable: https://github.com/pytorch/FBGEMM/blob/main/bench/GEMMsBenchmark.cc

The CPU is a Tiger Lake i7-1195G7 @ 2.90 GHz; all numbers are giga-ops per second (GOPS).

I haven't spent much time on performance tuning, but the results look promising. Performance on the bigger workloads doesn't look great and might need further investigation.

Also, I found that AutoTVM tuning (only one knob) helped single-threaded performance on some workloads, but it didn't help multi-threaded performance at all.
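
For reference, the GOPS figures below can be converted back to wall-clock time using the usual 2·M·N·K op count for a GEMM (assumed to match FBGEMM's benchmark convention):

```python
# Assumes the benchmark counts 2*M*N*K ops per GEMM (multiply + add).
def gops(m, n, k, elapsed_sec):
    return 2.0 * m * n * k / elapsed_sec / 1e9

# Example: 1024x1024x1024 at ~362 GOPS corresponds to
# 2 * 1024**3 / 362e9 ≈ 0.0059 s, i.e. about 5.9 ms per GEMM.
```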

Single thread

| M | N | K | TVM (GOPS) | FBGEMM (GOPS) |
|------|------|------|------------|---------------|
| 64 | 800 | 320 | 409.9 | 267.4 |
| 64 | 768 | 512 | 393.3 | 355.7 |
| 16 | 256 | 512 | 402.1 | 95.5 |
| 128 | 128 | 128 | 287.1 | 99.0 |
| 256 | 512 | 256 | 356.8 | 344.1 |
| 1024 | 1024 | 1024 | 362.2 | 452.2 |
| 128 | 768 | 3072 | 242.5 | 423.1 |
| 128 | 768 | 768 | 369.2 | 371.0 |
| 128 | 3072 | 768 | 276.5 | 398.9 |

4 threads

| M | N | K | TVM (GOPS) | FBGEMM (GOPS) |
|------|------|------|------------|---------------|
| 64 | 800 | 320 | 758.6 | 191.6 |
| 64 | 768 | 512 | 760.6 | 326.5 |
| 16 | 256 | 512 | 673.6 | 39.0 |
| 128 | 128 | 128 | 676.3 | 42.0 |
| 256 | 512 | 256 | 707.0 | 375.0 |
| 1024 | 1024 | 1024 | 690.9 | 1609.7 |
| 128 | 768 | 3072 | 510.5 | 763.7 |
| 128 | 768 | 768 | 679.1 | 658.1 |
| 128 | 3072 | 768 | 659.5 | 835.1 |

@elvin-n (Contributor) left a comment:

LGTM

@masahi (Member, Author) commented on Feb 14, 2022

Another perf result, this time on a desktop CPU, an i5-11400 @ 2.60 GHz, with 6 threads.

TVM is showing excellent performance!

| M | N | K | TVM (GOPS) | FBGEMM (GOPS) |
|------|------|------|------------|---------------|
| 64 | 800 | 320 | 2254.9 | 259.2 |
| 64 | 768 | 512 | 2459.8 | 485.3 |
| 16 | 256 | 512 | 1511.1 | 59.8 |
| 128 | 128 | 128 | 1655.3 | 57.0 |
| 256 | 512 | 256 | 2487.7 | 573.6 |
| 1024 | 1024 | 1024 | 2604.4 | 2520.1 |
| 128 | 768 | 3072 | 1846.4 | 1864.0 |
| 128 | 768 | 768 | 2579.1 | 1012.5 |
| 128 | 3072 | 768 | 2571.1 | 1900.8 |

@junrushao (Member) left a comment:

Thanks! This is just amazing!!

@masahi merged commit 0009a30 into apache:main on Feb 15, 2022
ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022
* wip
* revert for now
* simplify blocking
* add bench script
* update type rel
* refactor tests
* end to end compilation working
* paralleize outer loop
* add shape check
* fused schedule first cut
* restore original test
* black
* add vnni check
* add relay test
* skip on ci
* check dtype
* lint
* make it tunable
* minor cleanup

pfk-beta pushed a commit to pfk-beta/tvm that referenced this pull request Apr 11, 2022