Add optimized GEMM implementation for vector-matrix multiplications #61
Conversation
This makes it more flexible by allowing the starting value to be changed, which is useful when iterating over tiles in a matrix, for example.
When attempting to use the `SimdFloat` trait outside of the `rten_vecmath` crate, I encountered a situation where performance was very poor due to functions not being inlined. The latest Rust releases should inline small functions automatically, but I believe this may not happen when the function also has target-feature annotations. To keep things simple, just mark all the methods that wrap intrinsics as `#[inline]`.
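For illustration, a minimal sketch of the pattern, assuming a hypothetical `simd_add` wrapper (not the actual rten_vecmath code): without `#[inline]`, a cross-crate caller may pay a function call per intrinsic.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m256, _mm256_add_ps};

// Wrapper around a single intrinsic. The `#[target_feature]` attribute
// appears to inhibit automatic cross-crate inlining of small functions,
// so `#[inline]` is added explicitly.
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx")]
unsafe fn simd_add(a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}
```

Note that the compiler rejects `#[inline(always)]` on `#[target_feature]` functions, so plain `#[inline]` is the strongest hint available here.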
This allows writing generic SIMD code in the main rten lib, for functions that don't really belong in rten_vecmath.
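A rough sketch of what this enables, with an assumed trait surface (names and signatures here are illustrative, not rten_vecmath's exact API): the algorithm is written once against the trait and can live in any crate.

```rust
/// Illustrative SIMD float abstraction. The methods are unsafe because
/// the caller must ensure the implementation's target features are
/// supported by the current CPU.
pub trait SimdFloat: Copy {
    /// Number of lanes in the vector.
    const LEN: usize;
    /// Broadcast a scalar to all lanes.
    unsafe fn splat(val: f32) -> Self;
    /// Load `Self::LEN` elements from `ptr`.
    unsafe fn load(ptr: *const f32) -> Self;
    /// Lane-wise `self * a + b`.
    unsafe fn mul_add(self, a: Self, b: Self) -> Self;
    /// Horizontal sum of all lanes.
    unsafe fn sum(self) -> f32;
}

/// Dot product written once, generic over the SIMD vector type.
///
/// Safety: caller must ensure `S` is supported by the current CPU.
pub unsafe fn simd_dot<S: SimdFloat>(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = S::splat(0.);
    let n = a.len() - a.len() % S::LEN;
    for i in (0..n).step_by(S::LEN) {
        let av = S::load(a.as_ptr().add(i));
        let bv = S::load(b.as_ptr().add(i));
        acc = av.mul_add(bv, acc);
    }
    // Handle the remainder with scalar code.
    let mut sum = acc.sum();
    for i in n..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```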
Force-pushed from c3e77e4 to 525d59f
Tested on an AWS c6g instance, performance on Arm v8 is better than before (about 2x on the …)
See also https://arxiv.org/abs/2302.08417 for a discussion of how BLIS handles matrix multiplications when one or both problem dimensions are small. It describes a more general solution than the one implemented here, handling the case where the "A" input is a skinny matrix rather than just a vector.
Force-pushed from 2e0645d to 87449cd
When the LHS / "A" matrix argument to a matrix multiplication is a row vector, use an optimized vector-matrix ("gemv" in BLAS terminology) function which avoids the overhead of packing, since packing costs would otherwise consume most of the execution time. Vector-matrix multiplications are common in autoregressive transformers, including LLMs. Rather than implement the same algorithm separately for each architecture, the generic SIMD wrappers in rten-vecmath are reused. Tested on an Intel i5-1038NG7, this improves the performance of the vector-matrix product in the `bench_gemm` benchmark by ~4x (5 -> 20 GFLOPS).
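For reference, a scalar sketch of the fast path's overall shape (the PR's actual kernel is built on the generic SIMD wrappers; this just shows how packing is avoided by streaming directly over B's rows):

```rust
/// Compute `out = a * b` where `a` is a row vector of length `k` and `b`
/// is a row-major `k x n` matrix. Neither input is packed; each element
/// of `a` scales one contiguous row of `b`.
fn gemv(out: &mut [f32], a: &[f32], b: &[f32], k: usize, n: usize) {
    assert_eq!(a.len(), k);
    assert_eq!(b.len(), k * n);
    assert_eq!(out.len(), n);

    out.fill(0.);
    for (&a_elt, b_row) in a.iter().zip(b.chunks_exact(n)) {
        // The inner loop walks contiguous memory in both `out` and
        // `b_row`, which vectorizes well.
        for (out_elt, &b_elt) in out.iter_mut().zip(b_row) {
            *out_elt += a_elt * b_elt;
        }
    }
}
```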
There is more work to do to tune the block sizes and generalize this to cases where one or both input matrices are skinny (small M / N dims) but not a vector. The current version is still a 2x or more improvement, so I'm going to land it.
When the LHS / "A" matrix argument to a matrix multiplication is a row vector,
use an optimized vector-matrix ("gemv" in BLAS terminology) function which avoids the overhead of
packing, as otherwise packing costs consume most of the execution time.
Vector-matrix multiplications are common in autoregressive transformers,
including LLMs.
Rather than implement the same algorithm separately for each architecture, the
generic SIMD wrappers in rten-vecmath are reused.
Tested on an Intel i5-1038NG7, this improves the performance of the
vector-matrix product in the
bench_gemm
benchmark by ~4x (5 -> 20 GFLOPS).TODO:
alpha
andbeta
values in the gemv fast-path. These are currently ignored and treated as if they are (1, 1).alpha
/beta
value combinations
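For context, a sketch of the BLAS-style scaling the first TODO item refers to; the helper name here is hypothetical, and the fast path currently behaves as if both factors were 1:

```rust
/// Apply `out = alpha * dot + beta * out`, where `dot` holds the raw
/// vector-matrix products. Hypothetical helper for illustration only.
fn apply_alpha_beta(out: &mut [f32], dot: &[f32], alpha: f32, beta: f32) {
    for (out_elt, &d) in out.iter_mut().zip(dot) {
        *out_elt = alpha * d + beta * *out_elt;
    }
}
```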