Add optimized GEMM implementation for vector-matrix multiplications #61
Conversation
This makes it more flexible by allowing the starting value to be changed, which is useful when iterating over tiles in a matrix, for example.
When attempting to use the `SimdFloat` trait outside of the `rten_vecmath` crate, I encountered a situation where performance was very poor due to functions not being inlined. The latest Rust releases should inline small functions automatically, but I believe this may not happen when the function also has target-feature annotations. To keep things simple, just mark all the methods that wrap intrinsics as `#[inline]`.
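For illustration, a minimal sketch of the pattern, assuming a hypothetical `simd_add` wrapper (not the actual rten_vecmath code): without `#[inline]`, a cross-crate caller may pay a function call per intrinsic.

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{__m256, _mm256_add_ps};

// Wrapper around a single intrinsic. The `#[target_feature]` attribute
// appears to inhibit automatic cross-crate inlining of small functions,
// so `#[inline]` is added explicitly.
#[cfg(target_arch = "x86_64")]
#[inline]
#[target_feature(enable = "avx")]
unsafe fn simd_add(a: __m256, b: __m256) -> __m256 {
    _mm256_add_ps(a, b)
}
```

Note that the compiler rejects `#[inline(always)]` on `#[target_feature]` functions, so plain `#[inline]` is the strongest hint available here.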
This allows writing generic SIMD code in the main rten lib, for functions that don't really belong in rten_vecmath.
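A rough sketch of what this enables, with an assumed trait surface (names and signatures here are illustrative, not rten_vecmath's exact API): the algorithm is written once against the trait and can live in any crate.

```rust
/// Illustrative SIMD float abstraction. The methods are unsafe because
/// the caller must ensure the implementation's target features are
/// supported by the current CPU.
pub trait SimdFloat: Copy {
    /// Number of lanes in the vector.
    const LEN: usize;
    /// Broadcast a scalar to all lanes.
    unsafe fn splat(val: f32) -> Self;
    /// Load `Self::LEN` elements from `ptr`.
    unsafe fn load(ptr: *const f32) -> Self;
    /// Lane-wise `self * a + b`.
    unsafe fn mul_add(self, a: Self, b: Self) -> Self;
    /// Horizontal sum of all lanes.
    unsafe fn sum(self) -> f32;
}

/// Dot product written once, generic over the SIMD vector type.
///
/// Safety: caller must ensure `S` is supported by the current CPU.
pub unsafe fn simd_dot<S: SimdFloat>(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = S::splat(0.);
    let n = a.len() - a.len() % S::LEN;
    for i in (0..n).step_by(S::LEN) {
        let av = S::load(a.as_ptr().add(i));
        let bv = S::load(b.as_ptr().add(i));
        acc = av.mul_add(bv, acc);
    }
    // Handle the remainder with scalar code.
    let mut sum = acc.sum();
    for i in n..a.len() {
        sum += a[i] * b[i];
    }
    sum
}
```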
Force-pushed from c3e77e4 to 525d59f
Tested on an AWS c6g instance, performance on Arm v8 is better than before (about 2x on the …)
See also https://arxiv.org/abs/2302.08417 for a discussion of how BLIS handles matrix multiplications when one or both problem dimensions are small. It describes a more general solution than the one implemented here, handling the case where the "A" input is a skinny matrix rather than just a vector.
Force-pushed from 2e0645d to 87449cd
When the LHS / "A" matrix argument to a matrix multiplication is a row vector, use an optimized vector-matrix ("gemv" in BLAS terminology) function which avoids the overhead of packing, since packing costs would otherwise consume most of the execution time. Vector-matrix multiplications are common in autoregressive transformers, including LLMs. Rather than implement the same algorithm separately for each architecture, the generic SIMD wrappers in rten-vecmath are reused. Tested on an Intel i5-1038NG7, this improves the performance of the vector-matrix product in the `bench_gemm` benchmark by ~4x (5 -> 20 GFLOPS).
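For reference, a scalar sketch of the fast path's overall shape (the PR's actual kernel is built on the generic SIMD wrappers; this just shows how packing is avoided by streaming directly over B's rows):

```rust
/// Compute `out = a * b` where `a` is a row vector of length `k` and `b`
/// is a row-major `k x n` matrix. Neither input is packed; each element
/// of `a` scales one contiguous row of `b`.
fn gemv(out: &mut [f32], a: &[f32], b: &[f32], k: usize, n: usize) {
    assert_eq!(a.len(), k);
    assert_eq!(b.len(), k * n);
    assert_eq!(out.len(), n);

    out.fill(0.);
    for (&a_elt, b_row) in a.iter().zip(b.chunks_exact(n)) {
        // The inner loop walks contiguous memory in both `out` and
        // `b_row`, which vectorizes well.
        for (out_elt, &b_elt) in out.iter_mut().zip(b_row) {
            *out_elt += a_elt * b_elt;
        }
    }
}
```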
There is more work to do to tune the block sizes and generalize this to cases where one or both input matrices are skinny (small M / N dims) but not a vector. The current version is still a 2x or more improvement, so I'm going to land it.
When the LHS / "A" matrix argument to a matrix multiplication is a row vector,
use an optimized vector-matrix ("gemv" in BLAS terminology) function which avoids the overhead of
packing, as otherwise packing costs consume most of the execution time.
Vector-matrix multiplications are common in autoregressive transformers,
including LLMs.
Rather than implement the same algorithm separately for each architecture, the
generic SIMD wrappers in rten-vecmath are reused.
Tested on an Intel i5-1038NG7, this improves the performance of the
vector-matrix product in the
bench_gemm
benchmark by ~4x (5 -> 20 GFLOPS).TODO:
alpha
andbeta
values in the gemv fast-path. These are currently ignored and treated as if they are (1, 1).alpha
/beta
value combinations
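For context, a sketch of the BLAS-style scaling the first TODO item refers to; the helper name here is hypothetical, and the fast path currently behaves as if both factors were 1:

```rust
/// Apply `out = alpha * dot + beta * out`, where `dot` holds the raw
/// vector-matrix products. Hypothetical helper for illustration only.
fn apply_alpha_beta(out: &mut [f32], dot: &[f32], alpha: f32, beta: f32) {
    for (out_elt, &d) in out.iter_mut().zip(dot) {
        *out_elt = alpha * d + beta * *out_elt;
    }
}
```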