
different sizes of dictionaries in different models #85

Open
bariluz93 opened this issue Nov 27, 2022 · 1 comment

Comments

@bariluz93

Hi,
I use different tokenizers for different languages:

Helsinki-NLP/opus-mt-en-de
Helsinki-NLP/opus-mt-en-he
Helsinki-NLP/opus-mt-en-ru
Helsinki-NLP/opus-mt-en-es

I see that the English parts of the dictionaries are different
for example
tokenizer_he.tokenize("housekeeper") outputs
['▁housekeeper']
and
tokenizer_es.tokenize("housekeeper") outputs
['▁house', 'keeper']

I'd like to know the reason for this difference.
Were the models trained on different datasets?
Thank you
Bar
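
For reference, a minimal sketch of how this comparison can be reproduced with Hugging Face transformers (model names taken from the list above; the commented outputs are the ones reported in this issue):

```python
from transformers import AutoTokenizer

# Each Marian model ships with its own SentencePiece vocabulary,
# so the same English word may be segmented differently per model.
tokenizer_he = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-he")
tokenizer_es = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

print(tokenizer_he.tokenize("housekeeper"))  # ['▁housekeeper']
print(tokenizer_es.tokenize("housekeeper"))  # ['▁house', 'keeper']
```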

@jorgtied
Member

Yes, every model has its own SentencePiece model, trained separately on each side of the bitext used for training that model.
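
A quick way to see this, sketched below under the assumption that the models are loaded via transformers, is to compare the vocabulary sizes of the four tokenizers directly; each SentencePiece model yields its own vocabulary, which is why the dictionary sizes differ across models:

```python
from transformers import AutoTokenizer

# Compare vocabulary sizes across the models listed in this issue.
# Sizes differ because each model's SentencePiece model was trained
# on its own bitext, as noted above.
for name in [
    "Helsinki-NLP/opus-mt-en-de",
    "Helsinki-NLP/opus-mt-en-he",
    "Helsinki-NLP/opus-mt-en-ru",
    "Helsinki-NLP/opus-mt-en-es",
]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: vocab size = {tok.vocab_size}")
```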
