[TOPI] VNNI support for batch matmul #10332
Conversation
Could you please share the float batch_matmul perf data vs the newly introduced int8 batch_matmul?
LGTM
Ok, here is the comparison of GOPS between the new VNNI impl and the existing generic code. Note that the VNNI numbers were obtained after only 1 or 2 minutes of tuning, while the generic ones have a very large tuning space and it took more than 12 hours to get these numbers under the same tuning option. The script is at https://github.com/masahi/int8_experiment/blob/main/relay_bench.py. This is on Rocket Lake.
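For reference, GOPS figures like the ones compared above can be derived directly from the matmul shape and the measured runtime. A minimal sketch (the helper name and example shape are illustrative, not taken from the benchmark script):

```python
def batch_matmul_gops(batch, m, n, k, seconds):
    # A batch_matmul of shapes (batch, m, k) x (batch, n, k) performs
    # 2 * batch * m * n * k multiply-accumulate operations in total.
    ops = 2 * batch * m * n * k
    return ops / seconds / 1e9

# Hypothetical example: 8 batches of 128x128x64 measured at 0.1 ms
print(batch_matmul_gops(8, 128, 128, 64, 0.0001))
```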
Thank you @masahi, the speedups you've reported are extremely impressive! LGTM
* add test
* compute added
* schedule works
* reuse dense_vnni schedule
* try an alternative approach to scheduling layout transform
* introduce a tunable knob to decide if compute_root
* check transpose condition
* support s8 + s8 input
* pylint
Following #10230, I added VNNI support for `batch_matmul` as well. The cool part is that I reuse the same `dense` schedule in #10230 to schedule the GEMM part, and parallelize over the batch dimension. See the perf result in #10332 (comment).

After this PR, I'll add `int8, int8` support to VNNI `dense` and `batch_matmul` (UPDATE: Done) - that will allow us to benchmark e2e performance on QAT BERT, made possible by @Icemist in #10239.
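Since the VNNI instruction natively takes unsigned x signed byte operands (which is why `s8 + s8` input needs extra handling), it may help to spell out what one lane of the underlying `vpdpbusd` instruction computes. A hedged pure-Python sketch; the function name and lane modeling are illustrative:

```python
def vpdpbusd_lane(acc, a4, b4):
    # One 32-bit lane of AVX-512 VNNI vpdpbusd: a4 holds four unsigned
    # bytes (0..255), b4 four signed bytes (-128..127); their four
    # products are accumulated into a signed 32-bit lane (the
    # non-saturating form wraps modulo 2**32).
    s = acc + sum(u * v for u, v in zip(a4, b4))
    return (s + 2**31) % 2**32 - 2**31
```

A full 16x4 microkernel repeats this over 16 output columns; signed x signed input cannot be fed to this directly, which is the kind of case the `s8 + s8` commit above addresses.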
Unlike the `dense` case, the second input to `batch_matmul` is typically not a constant tensor. So I don't use `alter_layout` and compile-time layout transform. Instead, the layout transform is done at runtime. So the lowered IR for `batch_matmul` + post ops looks like:

Future work can explore possibilities for eliminating the runtime layout transform, or pipelining layout transform and compute to hide the overhead.
@elvin-n @mbrookhart @tkonolige @junrushao1994 @vinx13