Optimize im2col - avoid redundant zeroing of packing buffer, improve lookup table construction #165

robertknight · 2024-05-10T07:46:24Z

This adds a couple of further optimizations for im2col:

Avoid redundant zeroing.

If the packing buffer needed for im2col operations needs to be resized, avoid zeroing it, since the packing code will overwrite all elements.

As part of this change, change the callers of the packing functions in gemm.rs to always pass an output slice of exactly the correct size, avoiding the need for each packing function to zero unused space at the end, in order to fulfill the contract that the entire slice is initialized after the packing code returns. Previously gemm.rs would always pass an output slice of the same size for every block to be packed. However the last row/column block may be smaller, with unused space at the end.
Improve efficiency of lookup table construction in VirtualIm2Col::new. See commit for details. But in short, avoid divisions where possible.

On the first item, In the Piper models, convolutions get called with progressively larger input images as you get into the deeper layers. I noticed a stall of ~12 ms or so about two-thirds of the way through evaluation, while a packing buffer was resized and zeroed in several threads. Avoiding the zeroing saves maybe 3-4ms. It would be even better if the buffer was just allocated with the largest size it will need at the start though.

If the packing buffer needed for im2col operations needs to be resized, avoid zeroing it, since the packing code will overwrite all elements. As part of this change, change the caller of the packing functions in `gemm.rs` to always pass an output slice of exactly the correct size, avoiding the need for each packing function to zero unused space at the end, in order to fulfil the contract that the entire slice is initialized after the packing code returns. Previously gemm.rs would always pass an output slice of the same size for every block to be packed. However the last row/column block may be smaller, with unused space at the end. The `VirtualMatrix` trait is now unsafe due to the added requirement that implementers must initialize the entire buffer passed to them. An alternative would be to require implementations to return the input as an `&mut [f32]` and have the caller verify that it matches.

Rewrite the construction of lookup tables in `VirtualIm2Col::new` to avoid calculating the same Y offsets repeatedly, and avoid integer divisions as much as possible. This drops worker thread time in `VirtualIm2Col::new` from ~2% to <0.1% in the Piper models.

robertknight added 3 commits May 10, 2024 08:43

Correct a couple of comments in RowOffsets struct

6d784c2

robertknight force-pushed the im2col-pack-buffer-uninit branch from d43b3f7 to 2cb1982 Compare May 10, 2024 17:12

robertknight changed the title ~~Avoid redundant zeroing of im2col packing buffer~~ Optimize im2col - avoid redundant zeroing of packing buffer, improve lookup table construction May 10, 2024

robertknight merged commit c9fde5b into main May 10, 2024
2 checks passed

robertknight deleted the im2col-pack-buffer-uninit branch May 10, 2024 17:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize im2col - avoid redundant zeroing of packing buffer, improve lookup table construction #165

Optimize im2col - avoid redundant zeroing of packing buffer, improve lookup table construction #165

robertknight commented May 10, 2024 •

edited

Loading

Optimize im2col - avoid redundant zeroing of packing buffer, improve lookup table construction #165

Optimize im2col - avoid redundant zeroing of packing buffer, improve lookup table construction #165

Conversation

robertknight commented May 10, 2024 • edited Loading

robertknight commented May 10, 2024 •

edited

Loading