
different sizes of dictionaries in different models #85

Open
bariluz93 opened this issue Nov 27, 2022 · 1 comment

Comments

@bariluz93

Hi,
I use different tokenizers for different languages:

Helsinki-NLP/opus-mt-en-de
Helsinki-NLP/opus-mt-en-he
Helsinki-NLP/opus-mt-en-ru
Helsinki-NLP/opus-mt-en-es

I see that the English parts of the dictionaries are different
for example
tokenizer_he.tokenize("housekeeper") outputs
['▁housekeeper']
and
tokenizer_es.tokenize("housekeeper") outputs
['▁house', 'keeper']

I'd like to know the reason for this difference.
Were the models trained on different datasets?
Thank you
Bar
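
For reference, a minimal sketch of how this comparison can be reproduced with Hugging Face transformers (model names taken from the list above; the commented outputs are the ones reported in this issue):

```python
from transformers import AutoTokenizer

# Each Marian model ships with its own SentencePiece vocabulary,
# so the same English word may be segmented differently per model.
tokenizer_he = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-he")
tokenizer_es = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

print(tokenizer_he.tokenize("housekeeper"))  # ['▁housekeeper']
print(tokenizer_es.tokenize("housekeeper"))  # ['▁house', 'keeper']
```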

@jorgtied
Member

Yes, every model has its own SentencePiece model, trained separately on each side of the bitext used for training that model.
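
A quick way to see this, sketched below under the assumption that the models are loaded via transformers, is to compare the vocabulary sizes of the four tokenizers directly; each SentencePiece model yields its own vocabulary, which is why the dictionary sizes differ across models:

```python
from transformers import AutoTokenizer

# Compare vocabulary sizes across the models listed in this issue.
# Sizes differ because each model's SentencePiece model was trained
# on its own bitext, as noted above.
for name in [
    "Helsinki-NLP/opus-mt-en-de",
    "Helsinki-NLP/opus-mt-en-he",
    "Helsinki-NLP/opus-mt-en-ru",
    "Helsinki-NLP/opus-mt-en-es",
]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab size = {tok.vocab_size}")
```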
