Fix for the bug of slow tokenizers adding spaces to single id decodes #32564

DuyguA · 2024-08-09T11:12:21Z

What does this PR do?

Quick fix for a tokenizer bug: slow tokenizers add spaces into the decoded output when the input is a single id.

Fixes #29489
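
For illustration, here is a minimal sketch of how a bug like this can arise and how an early type check avoids it. It is a toy stand-in, not the actual `transformers` code: `VOCAB`, `decode_buggy`, and `decode_fixed` are hypothetical names, and the assumption is that the id-to-token conversion returns a plain string for a single id and a list for a list of ids.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary; not taken from any real tokenizer.
VOCAB: Dict[int, str] = {0: "<s>", 1: "hello", 2: "world"}


def convert_ids_to_tokens(ids: Union[int, List[int]]) -> Union[str, List[str]]:
    """Return a str for a single id and a list of str for a list of ids."""
    if isinstance(ids, int):
        return VOCAB[ids]
    return [VOCAB[i] for i in ids]


def decode_buggy(ids: Union[int, List[int]]) -> str:
    tokens = convert_ids_to_tokens(ids)
    # Iterating over a str yields its characters, so a single id is
    # split into characters before the space-join.
    return " ".join(token for token in tokens)


def decode_fixed(ids: Union[int, List[int]]) -> str:
    tokens = convert_ids_to_tokens(ids)
    # Wrap a lone token in a list so the join sees exactly one token.
    if isinstance(tokens, str):
        tokens = [tokens]
    return " ".join(tokens)


print(repr(decode_buggy(1)))       # 'h e l l o'
print(repr(decode_fixed(1)))       # 'hello'
print(repr(decode_fixed([1, 2])))  # 'hello world'
```

Lists of ids behave the same in both versions; only the single-id path differs.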

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@ArthurZucker

HuggingFaceDocBuilderDev · 2024-08-26T16:01:39Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

LysandreJik · 2024-08-27T11:43:13Z

cc @itazap as well!

src/transformers/tokenization_utils.py

tests/tokenization/test_tokenization_utils.py

itazap

Thanks for the quick update 🤗 Thanks for merging the tests! Left a few comments about the single special token case, let me know what you think!

DuyguA · 2024-08-29T13:54:47Z

> Thanks for the quick update 🤗 Thanks for merging the tests! Left a few comments about the single special token case, let me know what you think!

No worries, I'll make the changes 😉

itazap

Thank you! Nice tests 🤗

DuyguA · 2024-09-09T11:27:13Z

@ArthurZucker and @LysandreJik merge time please 😉

tests/tokenization/test_tokenization_utils.py

src/transformers/tokenization_utils.py

Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>

tests/tokenization/test_tokenization_utils.py

Co-authored-by: Ita Zaporozhets <31893021+itazap@users.noreply.github.com>

DuyguA · 2024-09-17T08:40:52Z

Gentle ping @itazap, can we do the merge? Some commits from main were failing on this branch, but it looks like they are all fixed now. Can we merge before any more breaking changes come in? 😁 😁 😬

DuyguA added 4 commits August 9, 2024 10:00

_decode signature change and quick return

74da1b4

added bunch of decoding tests

6f4c1e6

signature match and return

7ddc3ca

added tests for decoding

15c2d9e

DuyguA changed the title ~~Fix for slow tokenizer adding spaces to~~ Fix for slow the bug tokenizer adding spaces to single id decodes Aug 9, 2024

ArthurZucker requested a review from itazap August 27, 2024 14:06

itazap reviewed Aug 27, 2024

src/transformers/tokenization_utils.py (resolved)

tests/tokenization/test_tokenization_utils.py (resolved)

DuyguA added 5 commits August 28, 2024 10:22

merged decoding test

3716cfd

more tests for special tokens

251a5ac

cosmetics

2e82e67

fixed param

97d5cb1

ruffed the file

f5da92b

ArthurZucker requested a review from itazap August 29, 2024 09:33

itazap reviewed Aug 29, 2024

tests/tokenization/test_tokenization_utils.py (resolved)

itazap reviewed Aug 29, 2024


DuyguA added 3 commits September 9, 2024 09:30

refinement for single special tokens

2b097eb

added test for single special tokens

420993c

Merge branch 'huggingface:main' into fix/tokenizer-decoding-space

689b93e

itazap approved these changes Sep 9, 2024
