add bits-per-byte calculation to levanter #729

Merged: 11 commits into main from bpb, Sep 18, 2024

Conversation

@dlwh (Member) commented on Sep 15, 2024

No description provided.

@percyliang left a comment

Great. I'm not used to reading levanter code, so it's sometimes hard to figure out what the type/dimensions of some of the tensors are... The annotations in the comments (e.g., [Batch, Pos]) are very helpful, but perhaps this could be done more pervasively?

The other thing I was curious about is how this bpb implementation (which seems subtle) compares with other ones (e.g., lm-evaluation-harness).
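As a side note for readers unfamiliar with the annotation style mentioned above, here is a hypothetical sketch of what shape comments like `[Batch, Pos]` can look like in eval-style code. The function and argument names are made up for illustration and are not taken from this PR:

```python
# Hypothetical illustration (not code from this PR) of shape-in-comments annotations.
from typing import Tuple
import jax.numpy as jnp

def bits_and_bytes(loss: jnp.ndarray,        # [Batch, Pos] per-token loss, in nats
                   mask: jnp.ndarray,        # [Batch, Pos] 1.0 for real tokens, 0.0 for padding
                   token_bytes: jnp.ndarray  # [Batch, Pos] UTF-8 byte length of each token
                   ) -> Tuple[jnp.ndarray, jnp.ndarray]:
    """Reduce per-token losses to the totals needed for bits-per-byte."""
    total_bits = jnp.sum(loss * mask) / jnp.log(2.0)   # [] scalar
    total_bytes = jnp.sum(token_bytes * mask)          # [] scalar
    return total_bits, total_bytes
```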

Two review threads on src/levanter/eval.py were resolved.


Review comment on the new helper in the diff:

    def byte_length_of_token(tokenizer, idx: int) -> int:
        # this is a pain because we want the prefix spaces, but we don't want extra noise for bytes

@percyliang commented:

Does this correspond to other implementations that need to compute bpb? Would be good to reference and comment on whether we're doing the same thing.

@dlwh (Member Author) replied:

lm-eval-harness is doing the more obvious thing, where you just take the whole untokenized string and get its length, but our eval pipeline starts from the tokenized and chunked sequences, so we have to back it out. I imagine it's not exactly the same value, but it's close enough, and we can use lm-eval-harness to get "real" numbers for reporting if we're worried about it: https://github.com/EleutherAI/lm-evaluation-harness/blob/fb963f0f0a5b28b69763590bb59676072cf43a01/lm_eval/tasks/french_bench/preprocess_wikitext.py#L39-L48
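For concreteness, here is a rough sketch of the two approaches being compared. This is an illustration under assumptions, not the code merged in this PR: `byte_length_of_token` below is a stand-in for the PR's helper of the same name, and the throwaway-prefix trick is just one plausible way to preserve leading-space behavior.

```python
# Rough sketch of the two byte-counting strategies; illustrative only, not the PR's code.
import math
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# lm-eval-harness style: measure the whole untokenized string directly.
def bytes_of_text(text: str) -> int:
    return len(text.encode("utf-8"))

# Back-out style (hypothetical): recover each token's byte length from the
# already-tokenized sequence. Decoding behind a throwaway prefix token is one
# way to keep the prefix-space behavior mentioned in the diff comment.
def byte_length_of_token(tokenizer, idx: int) -> int:
    prefix = tokenizer.encode("a", add_special_tokens=False)
    with_token = tokenizer.decode(prefix + [idx])
    without_token = tokenizer.decode(prefix)
    return len(with_token.encode("utf-8")) - len(without_token.encode("utf-8"))

# Either way, bits-per-byte divides the total loss (converted from nats to bits)
# by the total number of bytes:
def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    return total_loss_nats / (math.log(2) * total_bytes)
```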

Review comment on the tests diff:

    @@ -22,3 +24,27 @@ def test_load_tokenizer_in_memory_fs():
        )
        tokenizer = load_tokenizer("memory://foo/")
        assert len(tokenizer) == 5027


    @skip_if_hf_model_not_accessible("meta-llama/Llama-2-7b-hf")

@percyliang commented:

Should we add some tests with more funky Unicode characters and tokens that don't align on character boundaries?

@dlwh (Member Author) replied:

Sure.

@dlwh (Member Author) added:

Good call on the extra tests. They caught an issue with the single-byte tokens; I now test every token in the Llama vocab.
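A hedged sketch of what an exhaustive test over the vocab could look like; the test names, the tolerance, and the reuse of the `byte_length_of_token` sketch above are all assumptions rather than the tests actually added:

```python
# Illustrative exhaustive vocab tests; not the tests added in this PR.
# Assumes a byte_length_of_token helper like the sketch above is in scope.
from transformers import AutoTokenizer

def test_every_token_has_a_sane_byte_length():
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    for idx in range(len(tokenizer)):
        # covers single-byte / byte-fallback tokens as well as multi-byte Unicode pieces
        assert byte_length_of_token(tokenizer, idx) >= 0

def test_token_bytes_roughly_reconstruct_string_bytes():
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    text = "héllo 世界 🦙"  # tokens here won't align with character boundaries
    ids = tokenizer.encode(text, add_special_tokens=False)
    total = sum(byte_length_of_token(tokenizer, idx) for idx in ids)
    # allow a little slack for tokenizer normalization around spaces
    assert abs(total - len(text.encode("utf-8"))) <= 2
```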

A further review thread on src/levanter/eval.py was marked outdated and resolved.
@dlwh merged commit 07b3f16 into main on Sep 18, 2024 (7 of 8 checks passed).
@dlwh deleted the bpb branch on September 18, 2024 at 05:28.