add bits-per-byte calculation to levanter #729
Conversation
Great. I'm not used to reading levanter code, so it's sometimes hard to figure out what the types/dimensions of some of the tensors are... The annotations in the comments (e.g., [Batch, Pos]) are very helpful, but perhaps this could be done more pervasively?
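For illustration, this is the kind of shape comment meant here, using hypothetical names rather than code from this PR:

```python
import jax
import jax.numpy as jnp

def per_token_loss(logits, targets):
    # logits: [Batch, Pos, Vocab]; targets: [Batch, Pos]
    log_probs = jax.nn.log_softmax(logits)  # [Batch, Pos, Vocab]
    picked = jnp.take_along_axis(log_probs, targets[..., None], axis=-1)  # [Batch, Pos, 1]
    return -jnp.squeeze(picked, axis=-1)  # [Batch, Pos]
```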
The other thing I was curious about is how this bpb implementation (which seems subtle) compares with other ones (e.g., lm-evaluation-harness).
```python
def byte_length_of_token(tokenizer, idx: int) -> int:
    # this is a pain because we want the prefix spaces, but we don't want extra noise for bytes
```
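For context, a minimal sketch of one way such a helper can work, assuming a Hugging Face tokenizer; the body is illustrative, not the PR's actual implementation:

```python
def byte_length_of_token_sketch(tokenizer, idx: int) -> int:
    # Decode the token after a fixed one-byte prefix token and subtract the
    # prefix's bytes, so SentencePiece-style leading spaces survive decoding.
    prefix_id = tokenizer("a", add_special_tokens=False)["input_ids"][-1]
    prefix_bytes = len(tokenizer.decode([prefix_id]).encode("utf-8"))
    both_bytes = len(tokenizer.decode([prefix_id, idx]).encode("utf-8"))
    # Caveat: byte-fallback tokens (e.g. <0x0A>) can decode to replacement
    # characters, so a robust version needs to special-case them.
    return both_bytes - prefix_bytes
```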
Does this correspond to other implementations that need to compute bpb? It would be good to reference them and comment on whether we're doing the same thing.
lm-eval-harness does the more obvious thing where you just take the whole untokenized string and get its length, but our eval pipeline starts from the tokenized and chunked sequences, so we have to back it out. I imagine it's not exactly the same value, but it's close enough, and we can use lm-eval-harness to get "real" numbers for reporting if we're worried about it: https://github.com/EleutherAI/lm-evaluation-harness/blob/fb963f0f0a5b28b69763590bb59676072cf43a01/lm_eval/tasks/french_bench/preprocess_wikitext.py#L39-L48
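For reference, both routes target the same quantity: total cross-entropy in bits divided by total UTF-8 bytes. A minimal sketch with illustrative names, not the PR's actual code:

```python
import math

def bits_per_byte(token_nlls_nats: list[float], token_byte_lengths: list[int]) -> float:
    # Sum per-token negative log-likelihoods, convert nats -> bits, and
    # normalize by the number of UTF-8 bytes the tokens cover.
    total_bits = sum(token_nlls_nats) / math.log(2)
    return total_bits / sum(token_byte_lengths)
```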
```diff
@@ -22,3 +24,27 @@ def test_load_tokenizer_in_memory_fs():
     )
     tokenizer = load_tokenizer("memory://foo/")
     assert len(tokenizer) == 5027


 @skip_if_hf_model_not_accessible("meta-llama/Llama-2-7b-hf")
```
Should we add some tests with more funky Unicode characters and tokens that don't align on character boundaries?
sure
good call on the extra tests. caught an issue with the single-byte tokens. i now test every token in the llama vocab
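A sketch of the kind of vocab sweep described, assuming the PR's byte_length_of_token helper is in scope; hypothetical test code, not the PR's actual test:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
for idx in range(tokenizer.vocab_size):
    # Every token, including single-byte fallback tokens like <0x0A>,
    # should report a sensible (non-negative) byte length.
    n = byte_length_of_token(tokenizer, idx)
    assert n >= 0, f"token {idx} reported byte length {n}"
```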