Add a chatbot demo. This uses the Qwen2 model, since it provides best-in-class 0.5B and 1.5B parameter sizes, and its tokenization is conveniently a derivative of the GPT-2 tokenization, which rten-text supports. Larger models would produce better results, but since RTen currently supports only fp32 precision, such models become slow due to memory bandwidth requirements. 1.5B is about the largest size that produces "usable" speed on my Intel i5.
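The memory-bandwidth limit can be sketched with back-of-envelope arithmetic: at fp32, every generated token has to stream the entire weight set from RAM, so bandwidth divided by model size gives an upper bound on decode speed. The bandwidth figure below is an illustrative assumption, not a measurement of any particular machine.

```rust
// Back-of-envelope decode speed for a memory-bandwidth-bound model.
// The bandwidth number is an assumed figure for illustration only.
fn tokens_per_second(params: u64, bytes_per_param: u64, bandwidth_bytes_per_s: u64) -> f64 {
    // Each decoded token reads every weight from memory once.
    let bytes_per_token = params * bytes_per_param;
    bandwidth_bytes_per_s as f64 / bytes_per_token as f64
}

fn main() {
    // Qwen2 1.5B at fp32: ~6 GB of weights streamed per token.
    assert_eq!(1_500_000_000u64 * 4, 6_000_000_000);

    // Assuming ~30 GB/s effective memory bandwidth on a laptop-class CPU:
    let tps = tokens_per_second(1_500_000_000, 4, 30_000_000_000);
    println!("~{:.1} tokens/s upper bound", tps); // ~5.0 tokens/s
}
```

This is why halving the bytes per parameter (fp16/int8) roughly doubles the decode-speed ceiling, independent of compute.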
In the process it was necessary to add some capabilities to rten-generate and rten-text to better support chat-like applications:
- Added `Generator::append_prompt`, to support appending a new prompt (e.g. the next chat turn) to an in-progress generation.
- Fixed the size of the `attention_mask` input. It is supposed to have the length of the input sequence, not just the new input IDs. Previously it was incorrect after the initial step, but the code worked because on subsequent steps it had size 1, and that gets broadcast to whatever size is required.
- Support for special tokens such as `<|im_end|>` and `<|endoftext|>`.
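The interaction between `append_prompt` and the `attention_mask` fix can be sketched with a toy bookkeeping struct (a hypothetical simplification for illustration; the real rten-generate types and signatures differ):

```rust
// Toy model of the generator bookkeeping described above.
// Names and fields are illustrative, not the actual rten-generate API.
struct Generator {
    /// All token IDs seen so far (prompt + generated + appended turns).
    sequence: Vec<u32>,
    /// Number of positions already processed into the KV cache.
    cached: usize,
}

impl Generator {
    fn new(prompt: &[u32]) -> Self {
        Generator { sequence: prompt.to_vec(), cached: 0 }
    }

    /// Append a new user turn after generation has started (chat use case).
    fn append_prompt(&mut self, tokens: &[u32]) {
        self.sequence.extend_from_slice(tokens);
    }

    /// Inputs for the next model step: only the new token IDs are fed,
    /// but the attention mask must cover the *entire* sequence so far.
    fn next_step_inputs(&mut self) -> (Vec<u32>, Vec<i32>) {
        let new_ids = self.sequence[self.cached..].to_vec();
        // The bug was sizing this to new_ids.len(); size 1 happened to
        // broadcast correctly, masking the error in single-token steps.
        let attention_mask: Vec<i32> = vec![1; self.sequence.len()];
        self.cached = self.sequence.len();
        (new_ids, attention_mask)
    }
}

fn main() {
    let mut g = Generator::new(&[1, 2, 3]);
    let (ids, mask) = g.next_step_inputs();
    assert_eq!((ids.len(), mask.len()), (3, 3));

    // A second chat turn appended mid-conversation:
    g.append_prompt(&[4, 5]);
    let (ids, mask) = g.next_step_inputs();
    assert_eq!(ids.len(), 2);  // only the new IDs are fed to the model
    assert_eq!(mask.len(), 5); // but the mask spans the whole sequence
    println!("ok");
}
```

A multi-token `append_prompt` step is exactly where a mask of size 1 would no longer broadcast to the right shape, which is why the chat demo surfaced the bug.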
TODO:

- `append_prompt`
- Replace dummy NFC normalization with proper implementation (deferred for later)