Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize im2col - avoid redundant zeroing of packing buffer, improve lookup table construction #165

Merged
merged 3 commits into from
May 10, 2024

Conversation

robertknight
Copy link
Owner

@robertknight robertknight commented May 10, 2024

This adds a couple of further optimizations for im2col:

  • Avoid redundant zeroing.

    If the packing buffer needed for im2col operations needs to be resized, avoid zeroing it, since the packing code will overwrite all elements.

    As part of this change, change the callers of the packing functions in gemm.rs to always pass an output slice of exactly the correct size, avoiding the need for each packing function to zero unused space at the end, in order to fulfill the contract that the entire slice is initialized after the packing code returns. Previously gemm.rs would always pass an output slice of the same size for every block to be packed. However the last row/column block may be smaller, with unused space at the end.

  • Improve efficiency of lookup table construction in VirtualIm2Col::new. See commit for details. But in short, avoid divisions where possible.

On the first item, In the Piper models, convolutions get called with progressively larger input images as you get into the deeper layers. I noticed a stall of ~12 ms or so about two-thirds of the way through evaluation, while a packing buffer was resized and zeroed in several threads. Avoiding the zeroing saves maybe 3-4ms. It would be even better if the buffer was just allocated with the largest size it will need at the start though.

If the packing buffer needed for im2col operations needs to be resized, avoid
zeroing it, since the packing code will overwrite all elements.

As part of this change, change the caller of the packing functions in `gemm.rs`
to always pass an output slice of exactly the correct size, avoiding the need
for each packing function to zero unused space at the end, in order to fulfil
the contract that the entire slice is initialized after the packing code
returns. Previously gemm.rs would always pass an output slice of the same size
for every block to be packed. However the last row/column block may be smaller,
with unused space at the end.

The `VirtualMatrix` trait is now unsafe due to the added requirement that
implementers must initialize the entire buffer passed to them. An alternative
would be to require implementations to return the input as an `&mut [f32]`
and have the caller verify that it matches.
Rewrite the construction of lookup tables in `VirtualIm2Col::new` to avoid
calculating the same Y offsets repeatedly, and avoid integer divisions as much
as possible.

This drops worker thread time in `VirtualIm2Col::new` from ~2% to <0.1% in the
Piper models.
@robertknight robertknight changed the title Avoid redundant zeroing of im2col packing buffer Optimize im2col - avoid redundant zeroing of packing buffer, improve lookup table construction May 10, 2024
@robertknight robertknight merged commit c9fde5b into main May 10, 2024
2 checks passed
@robertknight robertknight deleted the im2col-pack-buffer-uninit branch May 10, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant