add bits-per-byte calculation to levanter #729

Merged: 11 commits into main from bpb, Sep 18, 2024

Conversation

@dlwh (Member) commented on Sep 15, 2024

No description provided.

@percyliang left a comment

Great. I'm not used to reading levanter code, so it's sometimes hard to figure out what the type/dimensions of some of the tensors are... The annotations in the comments (e.g., [Batch, Pos]) are very helpful, but perhaps this could be done more pervasively?

The other thing I was curious about is how this bpb implementation (which seems subtle) compares with other ones (e.g., lm-evaluation-harness).
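As a side note for readers unfamiliar with the annotation style mentioned above, here is a hypothetical sketch of what shape comments like `[Batch, Pos]` can look like in eval-style code. The function and argument names are made up for illustration and are not taken from this PR:

```python
# Hypothetical illustration (not code from this PR) of shape-in-comments annotations.
from typing import Tuple
import jax.numpy as jnp

def bits_and_bytes(loss: jnp.ndarray,        # [Batch, Pos] per-token loss, in nats
                   mask: jnp.ndarray,        # [Batch, Pos] 1.0 for real tokens, 0.0 for padding
                   token_bytes: jnp.ndarray  # [Batch, Pos] UTF-8 byte length of each token
                   ) -> Tuple[jnp.ndarray, jnp.ndarray]:
    """Reduce per-token losses to the totals needed for bits-per-byte."""
    total_bits = jnp.sum(loss * mask) / jnp.log(2.0)   # [] scalar
    total_bytes = jnp.sum(token_bytes * mask)          # [] scalar
    return total_bits, total_bytes
```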

Two review threads on src/levanter/eval.py were resolved.


Review comment on the new helper in the diff:

    def byte_length_of_token(tokenizer, idx: int) -> int:
        # this is a pain because we want the prefix spaces, but we don't want extra noise for bytes

@percyliang commented:

Does this correspond to other implementations that need to compute bpb? Would be good to reference and comment on whether we're doing the same thing.

@dlwh (Member Author) replied:

lm-eval-harness is doing the more obvious thing, where you just take the whole untokenized string and get its length, but our eval pipeline starts from the tokenized and chunked sequences, so we have to back it out. I imagine it's not exactly the same value, but it's close enough, and we can use lm-eval-harness to get "real" numbers for reporting if we're worried about it: https://github.com/EleutherAI/lm-evaluation-harness/blob/fb963f0f0a5b28b69763590bb59676072cf43a01/lm_eval/tasks/french_bench/preprocess_wikitext.py#L39-L48
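For concreteness, here is a rough sketch of the two approaches being compared. This is an illustration under assumptions, not the code merged in this PR: `byte_length_of_token` below is a stand-in for the PR's helper of the same name, and the throwaway-prefix trick is just one plausible way to preserve leading-space behavior.

```python
# Rough sketch of the two byte-counting strategies; illustrative only, not the PR's code.
import math
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# lm-eval-harness style: measure the whole untokenized string directly.
def bytes_of_text(text: str) -> int:
    return len(text.encode("utf-8"))

# Back-out style (hypothetical): recover each token's byte length from the
# already-tokenized sequence. Decoding behind a throwaway prefix token is one
# way to keep the prefix-space behavior mentioned in the diff comment.
def byte_length_of_token(tokenizer, idx: int) -> int:
    prefix = tokenizer.encode("a", add_special_tokens=False)
    with_token = tokenizer.decode(prefix + [idx])
    without_token = tokenizer.decode(prefix)
    return len(with_token.encode("utf-8")) - len(without_token.encode("utf-8"))

# Either way, bits-per-byte divides the total loss (converted from nats to bits)
# by the total number of bytes:
def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    return total_loss_nats / (math.log(2) * total_bytes)
```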

Review comment on the tests diff:

    @@ -22,3 +24,27 @@ def test_load_tokenizer_in_memory_fs():
        )
        tokenizer = load_tokenizer("memory://foo/")
        assert len(tokenizer) == 5027


    @skip_if_hf_model_not_accessible("meta-llama/Llama-2-7b-hf")

@percyliang commented:

Should we add some tests with more funky Unicode characters and tokens that don't align on character boundaries?

@dlwh (Member Author) replied:

Sure.

@dlwh (Member Author) added:

Good call on the extra tests. They caught an issue with the single-byte tokens; I now test every token in the Llama vocab.
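A hedged sketch of what an exhaustive test over the vocab could look like; the test names, the tolerance, and the reuse of the `byte_length_of_token` sketch above are all assumptions rather than the tests actually added:

```python
# Illustrative exhaustive vocab tests; not the tests added in this PR.
# Assumes a byte_length_of_token helper like the sketch above is in scope.
from transformers import AutoTokenizer

def test_every_token_has_a_sane_byte_length():
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    for idx in range(len(tokenizer)):
        # covers single-byte / byte-fallback tokens as well as multi-byte Unicode pieces
        assert byte_length_of_token(tokenizer, idx) >= 0

def test_token_bytes_roughly_reconstruct_string_bytes():
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    text = "héllo 世界 🦙"  # tokens here won't align with character boundaries
    ids = tokenizer.encode(text, add_special_tokens=False)
    total = sum(byte_length_of_token(tokenizer, idx) for idx in ids)
    # allow a little slack for tokenizer normalization around spaces
    assert abs(total - len(text.encode("utf-8"))) <= 2
```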

A further review thread on src/levanter/eval.py was marked outdated and resolved.
@dlwh merged commit 07b3f16 into main on Sep 18, 2024 (7 of 8 checks passed).
@dlwh deleted the bpb branch on September 18, 2024 at 05:28.