Fix for slow the bug tokenizer adding spaces to single id decodes #32564

Open
wants to merge 25 commits into
base: main

Conversation

DuyguA
Contributor

@DuyguA DuyguA commented Aug 9, 2024

What does this PR do?

Quick fix for a tokenizer bug: slow tokenizers add spaces in between when the input to decode is a single id.

Fixes #29489
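
For readers unfamiliar with this class of bug, here is a minimal toy sketch of one way a single id and a one-element list can decode differently. Everything in it (`TOY_VOCAB`, `ids_to_tokens`, `naive_decode`, `fixed_decode`) is hypothetical and written for illustration only; it is not the transformers implementation, and the mechanism shown is an assumption rather than something stated in this PR.

```python
# Toy illustration only: hypothetical vocabulary and helpers, NOT the transformers code.
TOY_VOCAB = {0: "<s>", 1: "hello", 2: "world"}


def ids_to_tokens(ids):
    """Return a str for a single int id, or a list of str for a list of ids."""
    if isinstance(ids, int):
        return TOY_VOCAB[ids]
    return [TOY_VOCAB[i] for i in ids]


def naive_decode(ids):
    # Assumed failure mode: for an int input, `tokens` is a str, so the join
    # iterates over its characters and puts a space between each of them.
    tokens = ids_to_tokens(ids)
    return " ".join(tokens)


def fixed_decode(ids):
    # Normalize a single id to a one-element list before converting,
    # so both input shapes take the same path.
    if isinstance(ids, int):
        ids = [ids]
    return " ".join(ids_to_tokens(ids))


print(repr(naive_decode(1)))    # single id, naive path
print(repr(naive_decode([1])))  # one-element list, naive path
print(repr(fixed_decode(1)))    # single id, normalized path
```

In this sketch the inconsistency disappears once the single id is wrapped in a list up front, which is the kind of consistency the linked issue asks for between decoding an id and decoding a list containing that id.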

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@ArthurZucker

@DuyguA DuyguA changed the title Fix for slow tokenizer adding spaces to Fix for slow the bug tokenizer adding spaces to single id decodes Aug 9, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LysandreJik
Member

cc @itazap as well!

Contributor

@itazap itazap left a comment

Thanks for the quick update 🤗 Thanks for merging the tests! Left a few comments about the single special token case, let me know what you think!

@DuyguA
Contributor Author

DuyguA commented Aug 29, 2024

> Thanks for the quick update 🤗 Thanks for merging the tests! Left a few comments about the single special token case, let me know what you think!

No worries, I'll do the changes 😉

Contributor

@itazap itazap left a comment

Thank you! Nice tests 🤗

@DuyguA
Contributor Author

DuyguA commented Sep 9, 2024

@ArthurZucker and @LysandreJik merge time please 😉

Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
DuyguA and others added 4 commits September 10, 2024 11:23
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>
@DuyguA
Contributor Author

DuyguA commented Sep 17, 2024

Gentle ping @itazap, can we do the merge? Some commits from main were failing on this branch, but it looks like everything is fixed now. Can we merge before any more breaking changes come in? 😁 😁 😬

Development

Successfully merging this pull request may close these issues.

[Tokenizer] Inconsistent behavior when decoding a single ID and a list of the single ID
4 participants