
Add optimized GEMM implementation for vector-matrix multiplications #61

Merged
robertknight merged 5 commits into main from gemv-opt2 on Mar 23, 2024

Conversation

robertknight (Owner) commented Mar 20, 2024

When the LHS / "A" matrix argument to a matrix multiplication is a row vector,
use an optimized vector-matrix ("gemv" in BLAS terminology) function which
skips the packing step entirely, since packing costs would otherwise consume
most of the execution time.

Vector-matrix multiplications are common in autoregressive transformers,
including LLMs.

Rather than implement the same algorithm separately for each architecture, the
generic SIMD wrappers in rten-vecmath are reused.
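
A minimal sketch of the structure, with hypothetical names rather than the actual rten code: each element of the vector scales one row of B, so B is consumed in its natural row-major order and no packed copy of either operand is needed.

```rust
// Hypothetical sketch of a vector-matrix product y = a * B, where `a` has
// length K and `b` is a row-major K x N matrix. Names and signatures are
// illustrative, not the actual rten-vecmath API.
fn gemv_sketch(y: &mut [f32], a: &[f32], b: &[f32], k: usize, n: usize) {
    assert_eq!(a.len(), k);
    assert_eq!(b.len(), k * n);
    assert_eq!(y.len(), n);

    y.fill(0.0);
    for (&a_k, b_row) in a.iter().zip(b.chunks_exact(n)) {
        // Each element of `a` scales one row of B. In the SIMD version,
        // `a_k` is broadcast into a vector register and this inner loop
        // becomes a fused multiply-add over N-wide chunks of the row.
        for (y_j, &b_kj) in y.iter_mut().zip(b_row) {
            *y_j += a_k * b_kj;
        }
    }
}
```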

Tested on an Intel i5-1038NG7, this improves the performance of the
vector-matrix product in the bench_gemm benchmark by ~4x (5 -> 20 GFLOPS).

TODO:

  • Compare performance on Arm
  • Handle alpha and beta values in the gemv fast-path. These are currently ignored and treated as if they were (1, 1); see the sketch after this list.
  • Add tests for various alpha / beta value combinations
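
For reference, the full BLAS gemv contract is y = alpha * (a · B) + beta * y, and the fast path currently behaves as if alpha = beta = 1. A hedged sketch of the missing scaling step (hypothetical helper, not the PR's code):

```rust
// Hypothetical helper applying BLAS-style alpha/beta scaling to a computed
// vector-matrix product. The fast path in this PR currently skips this step,
// behaving as if alpha == 1.0 and beta == 1.0.
fn apply_alpha_beta(y: &mut [f32], product: &[f32], alpha: f32, beta: f32) {
    for (y_j, &p_j) in y.iter_mut().zip(product) {
        *y_j = alpha * p_j + beta * *y_j;
    }
}
```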

This makes it more flexible by allowing the starting value to be changed. This
is useful when iterating over tiles in a matrix, for example.

When attempting to use the `SimdFloat` trait outside of the `rten_vecmath` crate,
I encountered a situation where performance was very poor due to functions not
being inlined. Recent Rust releases should inline small functions
automatically, but I believe this may not happen when the function also has
target-feature annotations. To keep things simple, just mark all the methods
that wrap intrinsics as `#[inline]`.
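
A minimal illustration of the pattern (hypothetical wrapper type, not the actual rten-vecmath code): without an explicit `#[inline]`, a cross-crate call to a one-instruction wrapper like this may remain a real function call, which is disastrous inside a hot SIMD loop.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m256, _mm256_add_ps};

/// Hypothetical wrapper over an AVX register, illustrating the pattern.
#[cfg(target_arch = "x86_64")]
#[derive(Copy, Clone)]
struct F32x8(__m256);

#[cfg(target_arch = "x86_64")]
impl F32x8 {
    /// `#[inline]` makes the definition available for inlining across crate
    /// boundaries; without it, this single-instruction wrapper may compile
    /// to an actual call when combined with `#[target_feature]`.
    #[inline]
    #[target_feature(enable = "avx")]
    unsafe fn add(self, rhs: Self) -> Self {
        Self(_mm256_add_ps(self.0, rhs.0))
    }
}
```
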
This allows writing generic SIMD code in the main rten lib, for functions that
don't really belong in rten_vecmath.
robertknight force-pushed the gemv-opt2 branch 4 times, most recently from c3e77e4 to 525d59f on March 20, 2024 09:04
robertknight (Owner, Author) commented:

Tested on an AWS c6g instance, Arm v8 performance is better than before (about 2x on the bench_gemm benchmark) but still much slower than BLIS at the same input size (by a factor of ~2.5x), measured using a modified gemm-benchmark.

robertknight (Owner, Author) commented Mar 21, 2024

See also https://arxiv.org/abs/2302.08417 for a discussion of how BLIS handles matrix multiplications when one or both problem dimensions are small. It describes a more general solution than the one implemented here: it also covers the case where the "A" input is a skinny matrix rather than just a vector.

robertknight marked this pull request as ready for review on March 22, 2024 06:22
robertknight force-pushed the gemv-opt2 branch 2 times, most recently from 2e0645d to 87449cd on March 22, 2024 07:22
robertknight (Owner, Author) commented:

There is more work to do to tune the block sizes and generalize this to cases where one or both input matrices are skinny (small M / N dims) but not a vector. The current version is still a 2x or more improvement, so I'm going to land it.

robertknight merged commit 8e81747 into main on Mar 23, 2024 (2 checks passed)
robertknight deleted the gemv-opt2 branch on March 23, 2024 07:34