
rinna RoBERTa's max_length is 510 not 512? #3

Closed
masayakondo opened this issue Sep 2, 2021 · 4 comments

Comments

@masayakondo

Hi, I have been using rinna RoBERTa for a while now, and I have a question.
The max_length of rinna RoBERTa is 510 (not 512), right?
Is this intended? If so, why is max_length 510 instead of 512?

rinna RoBERTa's padding_idx is 3 (not 1), so I think the position ids start at padding_idx + 1 = 4. But the size of position_embeddings in rinna RoBERTa is (514, 768), and if I actually feed in text of length 512, I get an index error.
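To spell out the arithmetic (a rough sketch, assuming positions start at padding_idx + 1 as described above; the numbers are the ones reported in this issue):

# Rough sketch of the arithmetic above; assumes positions start at padding_idx + 1.
padding_idx = 3
num_position_embeddings = 514   # first dimension of position_embeddings

seq_len = 512
last_position_id = padding_idx + seq_len               # 515
print(last_position_id > num_position_embeddings - 1)  # True -> out-of-range index

# Largest length that still fits: padding_idx + L <= 513, i.e. L <= 510
print(num_position_embeddings - 1 - padding_idx)       # 510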

@ZHAOTING
Contributor

ZHAOTING commented Sep 2, 2021

Hi @masayakondo, the maximum length of rinna/japanese-roberta-base is 514, which aligns with the first dimension of the position_embeddings.
padding_idx has nothing to do with the maximum length, since it only interacts with the word_embeddings.

So I believe there should not be any error when inputting a 512-token sequence. Could you please share the code that causes the error? Thanks!
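(For reference, a quick way to check the shapes being discussed; this is just a sketch and assumes the standard transformers RoBERTa attribute names.)

from transformers import AutoModel

model = AutoModel.from_pretrained('rinna/japanese-roberta-base')

# Inspect the embedding tables mentioned above.
print(model.embeddings.position_embeddings.weight.size())  # expected: torch.Size([514, 768])
print(model.embeddings.word_embeddings.padding_idx)        # expected: 3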

@masayakondo
Author

Hi @ZHAOTING, thank you for your reply.
For example, when I ran the following code, I got an index error.

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('rinna/japanese-roberta-base')

# sample sentence of 512 token ids: "▁", then "ド" × 510, then "</s>"
input_ids = torch.tensor([9] + [100 for _ in range(510)] + [2]).unsqueeze(0)
print(input_ids.size()) # torch.Size([1, 512])
model(input_ids)
# IndexError: index out of range in self

In the case of RoBERTa, judging from how huggingface's code constructs position_ids, I thought that padding_idx and position_embeddings, or padding_idx and the sentence length, were related.
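(For reference, that logic — create_position_ids_from_input_ids in transformers' modeling_roberta.py — looks roughly like the following; this is a paraphrase, so the exact code may differ between versions.)

import torch

# Paraphrase of transformers' create_position_ids_from_input_ids (modeling_roberta.py).
def create_position_ids_from_input_ids(input_ids, padding_idx):
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = torch.cumsum(mask, dim=1).type_as(mask) * mask
    # Non-padding tokens get positions padding_idx + 1, padding_idx + 2, ...
    return incremental_indices.long() + padding_idx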

I am very sorry if my comment is misguided. Thanks!

@ZHAOTING
Contributor

ZHAOTING commented Sep 2, 2021

You are correct about huggingface's roberta code! I didn't notice how they construct position_ids when it is not explicitly provided.
To be honest, I don't understand why they start with padding_idx instead of 0 when constructing position_ids, and I think it is wrong.

To use our model properly, please try constructing position_ids yourself and passing it as an argument along with input_ids. Hope it helps.

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained('rinna/japanese-roberta-base')

input_ids = torch.tensor([9] + [100 for _ in range(510)] + [2]).unsqueeze(0)

# Construct position ids explicitly, starting from 0, instead of relying on
# the default behaviour (which starts from padding_idx + 1).
max_seq_len = input_ids.size(1)
position_ids = torch.LongTensor(list(range(0, max_seq_len))).unsqueeze(0)

output = model(input_ids, position_ids=position_ids)
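(With an explicit position_ids starting from 0, a 512-token input uses indices 0–511, which stay within the 514-entry position_embeddings table; under the default behaviour, where positions start at padding_idx + 1 = 4, the usable length is capped at 510.)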

@masayakondo
Author

To be honest, I don't understand why they start with padding_idx instead of 0 when constructing position_ids, and I think it is wrong.

Yeah, I think you're right, too...

To use our model properly, please try constructing position_ids yourself and passing it as an argument along with input_ids. Hope it helps.

Thank you for the advice, I will follow it. Thanks!
