Add chatbot example using Qwen2 #282

Merged: robertknight merged 8 commits into main from qwen-chat on Jul 16, 2024
Conversation

@robertknight (Owner) commented Jul 15, 2024

Add a chatbot demo. This uses the Qwen2 model since it provides 0.5b and 1.5b sizes that are best-in-class, and its tokenization is conveniently a derivative of the GPT-2 tokenization, which rten-text supports. Larger models would produce better results, but since RTen only supports fp32 precision at present, such models become slow due to memory bandwidth requirements. 1.5b is about the largest size that produces "usable" speed on my Intel i5.

In the process it was necessary to add some capabilities to rten-generate and rten-text to better support chat-like applications:

  • Support adding user input to the model input after the initial generation, via Generator::append_prompt
  • Fix generation of the attention_mask input. It is supposed to have the length of the full input sequence, not just the new input IDs. Previously it was incorrect after the initial step, but the code worked because on subsequent steps the mask had size 1, which gets broadcast to whatever size is required.
  • Support multiple stop tokens. Qwen2 uses <|im_end|> and <|endoftext|>
  • Support temperature in top-K sampling (see the sketch after this list)
  • Add fake support for NFC normalization in text encoding
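
To make the temperature change concrete, here is a minimal standalone sketch of temperature-scaled top-K sampling. This is illustrative only, not rten-generate's actual implementation; it assumes non-empty, finite logits and a temperature greater than zero:

```rust
/// Sample a token index from `logits` using top-K sampling at the given
/// temperature. `uniform` is a random draw in [0, 1).
fn sample_top_k(logits: &[f32], k: usize, temperature: f32, uniform: f32) -> usize {
    // Temperature scaling: values < 1.0 sharpen the distribution (more
    // deterministic), values > 1.0 flatten it (more random).
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();

    // Indices of the K largest scaled logits, best first.
    let mut idx: Vec<usize> = (0..scaled.len()).collect();
    idx.sort_by(|&a, &b| scaled[b].partial_cmp(&scaled[a]).unwrap());
    idx.truncate(k.max(1));

    // Softmax over the retained logits, subtracting the max for stability.
    let max = scaled[idx[0]];
    let exps: Vec<f32> = idx.iter().map(|&i| (scaled[i] - max).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Invert the CDF at `uniform` to pick an index.
    let mut acc = 0.0;
    for (pos, &e) in exps.iter().enumerate() {
        acc += e / total;
        if uniform < acc {
            return idx[pos];
        }
    }
    *idx.last().unwrap()
}
```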

TODO

  • Tests for append_prompt
  • Replace dummy NFC normalization with proper implementation (Deferred for later)
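
For reference, Qwen2's instruction-tuned models use a ChatML-style turn format, which is what the demo's prompt construction needs to produce and why `<|im_end|>` matters as a stop token. A sketch of one turn (the system prompt shown is illustrative):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

Generation runs from the final `<|im_start|>assistant` line until the model emits `<|im_end|>` or `<|endoftext|>`.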

Commit messages

Zero is a value that is "safe" for most inputs with an `_ids` suffix, such as `position_ids`.

Fix generation of the `attention_mask` input. In HuggingFace models this input's sequence axis is expected to have the same size as the sequence length, not the length of the input IDs being provided at the current step. The length was correct for the initial prompt but wrong for subsequent generation steps. However, when only one new token was added during iterative decoding, the `attention_mask` worked despite being the wrong size, because 1-sized inputs are broadcast by various operators. When appending multiple tokens after the initial generation, such as when adding a tokenized chat message from a user, this broadcasting failed.
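
The fix amounts to sizing the mask to the whole sequence processed so far. A minimal sketch (illustrative, not the actual rten-generate code):

```rust
/// Build the `attention_mask` for a generation step. It must span the full
/// sequence processed so far (prompt + generated + appended tokens), not
/// just the IDs fed at the current step.
fn attention_mask(total_seq_len: usize) -> Vec<i32> {
    vec![1; total_seq_len] // attend to every position
}
```

With the old behaviour, a step that fed one new token produced a length-1 mask, which happened to broadcast; feeding a multi-token chat message produced a short mask against a longer sequence, which does not broadcast.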
Fix a design mistake where using `stop_on_token` would cause generation to stop silently without propagating the error to the caller.

Support multiple end-of-turn token IDs. Qwen2, for example, can emit either `<|endoftext|>` or `<|im_end|>` tokens.
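
A sketch of what multi-stop-token handling looks like in a decode loop (standalone illustration; `decode_until_stop` is a hypothetical helper, not part of rten-generate):

```rust
use std::collections::HashSet;

/// Collect sampled tokens until any configured end-of-turn ID appears
/// (e.g. Qwen2's `<|im_end|>` or `<|endoftext|>`) or a length cap is hit.
fn decode_until_stop(
    mut next_token: impl FnMut() -> u32,
    stop_ids: &HashSet<u32>,
    max_tokens: usize,
) -> Vec<u32> {
    let mut out = Vec::new();
    for _ in 0..max_tokens {
        let tok = next_token();
        if stop_ids.contains(&tok) {
            break; // end of turn
        }
        out.push(tok);
    }
    out
}
```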
Support appending tokens to the model input after the initial generation, via `Generator::append_prompt`. This is useful in chat applications, for example, where generation alternates between iterative decoding and feeding in tokenized user input.
Add fake support for NFC normalization in text encoding. This allows loading `tokenizer.json` files which specify NFC normalization, but doesn't actually implement the normalization yet. This is OK as long as the input text doesn't require it.
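
The stop-gap boils down to accepting the NFC entry when parsing `tokenizer.json` but applying an identity transform. A sketch of the idea (`nfc_normalize` is a hypothetical name):

```rust
/// Stand-in for NFC normalization: returns the input unchanged. Safe only
/// for text that is already in NFC form, which plain ASCII always is.
fn nfc_normalize(text: &str) -> String {
    text.to_string() // TODO: replace with a real NFC implementation
}
```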
Qwen2 was chosen as an initial chatbot example because its tokenization is very
similar to the already-supported GPT-2 and it is one of the best very small
instruction-tuned models.
robertknight marked this pull request as ready for review Jul 16, 2024 06:38
robertknight merged commit 5ef6e67 into main Jul 16, 2024
2 checks passed
robertknight deleted the qwen-chat branch Jul 16, 2024 06:39