Avoid prepacking weights in RNN operators for short input sequences #74
The RNN operators (LSTM, GRU) prepacked weights before looping over the sequence, in order to amortize packing costs across timesteps. If the sequence is short, however, the operator may run much faster without prepacking, since prepacking requires extra memory allocation and an extra copy of the weights.
In the TTS system at https://github.com/robertknight/xd-tts/tree/rten-inference, the decoder loop executes the LSTM with a sequence length of 1 on each iteration. In that case the prepacking work is completely wasted. By disabling prepacking for short sequences, audio generation time improves from ~1.6x realtime to ~0.5x realtime (lower is faster). This gets much closer to ONNX Runtime's generation speed (~0.25x realtime).
The current threshold for enabling prepacking is a short sequence length that seems reasonable as a heuristic. It could be refined by benchmarking across a variety of sequence lengths.
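The decision described above can be sketched as a simple length check before the sequence loop. This is an illustrative sketch only: the names (`should_prepack`, `PREPACK_MIN_SEQ_LEN`) and the threshold value are assumptions, not rten's actual API or chosen constant.

```rust
/// Illustrative threshold below which prepacking is skipped.
/// The actual value used by rten may differ and could be tuned
/// by benchmarking different sequence lengths.
const PREPACK_MIN_SEQ_LEN: usize = 4;

/// Decide whether prepacking the RNN weights pays off.
///
/// Prepacking has a fixed up-front cost (extra memory plus a copy of
/// the weights into the packed layout), which is only amortized when
/// the sequence loop runs for enough timesteps.
fn should_prepack(seq_len: usize) -> bool {
    seq_len >= PREPACK_MIN_SEQ_LEN
}

fn main() {
    // A TTS decoder loop runs the LSTM one timestep at a time, so
    // prepacking would be pure overhead there.
    assert!(!should_prepack(1));

    // Long sequences amortize the packing cost, so prepacking wins.
    assert!(should_prepack(128));

    println!("ok");
}
```

The benefit of gating on sequence length is that long-sequence workloads keep the amortized speedup from packed weights, while single-step decoder loops avoid paying the packing cost on every operator invocation.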