Question about partial_run #237

Answered by robertknight
igor-yusupov asked this question in Q&A

The partial_run optimization works in any situation where some model inputs stay the same across each iteration of the decoder loop, and parts of the graph depend only on those inputs. For encoder-decoder transformer models this should always include the cross-attention inputs. For this model I think using partial_run with the n_layer_cross_k and n_layer_cross_v inputs should work, but you'll have to try it.
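For illustration, here is a rough sketch of what that decode loop could look like. The n_layer_cross_k / n_layer_cross_v names are the model inputs mentioned above, but the tokens/logits node names and the exact run/partial_run signatures are assumptions about rten's API (which may differ between versions), so treat this as the pattern rather than working code:

```rust
use rten::Model;
use rten_tensor::prelude::*;
use rten_tensor::NdTensor;

// Sketch only: node names other than n_layer_cross_k / n_layer_cross_v are
// hypothetical, and the run/partial_run signatures may differ by version.
fn decode(
    decoder: &Model,
    cross_k: NdTensor<f32, 4>, // encoder output fed to n_layer_cross_k
    cross_v: NdTensor<f32, 4>, // encoder output fed to n_layer_cross_v
) -> Result<(), Box<dyn std::error::Error>> {
    let tokens_id = decoder.find_node("tokens").ok_or("no tokens input")?;
    let cross_k_id = decoder.find_node("n_layer_cross_k").ok_or("no cross_k")?;
    let cross_v_id = decoder.find_node("n_layer_cross_v").ok_or("no cross_v")?;
    let logits_id = decoder.find_node("logits").ok_or("no logits output")?;

    // One-time pre-pass: evaluate every subgraph that can be computed from
    // the cross-attention inputs alone. partial_run returns (node, value)
    // pairs for the intermediate results it was able to compute.
    let cached = decoder.partial_run(
        vec![
            (cross_k_id, cross_k.view().into()),
            (cross_v_id, cross_v.view().into()),
        ],
        &[logits_id],
        None,
    )?;

    let mut tokens: Vec<i32> = vec![/* start-of-transcript token */];
    loop {
        let token_input = NdTensor::from_data([1, tokens.len()], tokens.clone());

        // Per-step inputs plus the cached values. Feeding the cached entries
        // back in lets the runtime skip the cross-attention subgraphs.
        let mut inputs = vec![(tokens_id, token_input.view().into())];
        inputs.extend(cached.iter().map(|(id, value)| (*id, value.into())));

        let _logits = decoder.run(inputs, &[logits_id], None)?;
        // ... pick the next token from the logits, push it onto `tokens`,
        // and break when the end-of-text token is produced ...
        break; // placeholder so the sketch terminates
    }
    Ok(())
}
```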

By the way, I recently added a new rten-generate crate to this repo which simplifies using transformer decoder models and applies the key optimizations (KV cache, partial_run, etc.). See the rten-examples/src/gpt2.rs demo. It won't work for this…
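For reference, usage in that demo looks roughly like the following. This is a paraphrase of gpt2.rs from memory, so the names (Generator::from_model, with_prompt) and the token type are assumptions that may not match your version exactly, and tokenizer setup is elided:

```rust
use rten::Model;
use rten_generate::Generator;

// Sketch based on the gpt2.rs example; exact names/signatures may differ.
fn generate(prompt_ids: &[u32]) -> Result<(), Box<dyn std::error::Error>> {
    let model = Model::load_file("model.rten")?; // hypothetical model path

    // Generator wires up the KV-cache and partial_run handling internally,
    // so the caller only supplies the prompt and consumes tokens.
    let generator = Generator::from_model(&model)?
        .with_prompt(prompt_ids)
        .take(100); // cap the number of generated tokens

    for token in generator {
        let token = token?;
        println!("{token}");
    }
    Ok(())
}
```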
