WIP Completely rework dataset/cache system: instant resume, perfect shuffle, stable mixtures and more #716
Conversation
(I recognize this is a massive change. I'd mostly just appreciate a look at the design doc and maybe the audio bits)
Looks great overall to me: the user code is definitely easier to follow in the new design as well. I didn't take a good look through the TensorStore stuff but will try to take a peek later this week.
I noted a few nits but otherwise seems sensible to me!
src/levanter/data/dataset.py
Outdated
    raise NotImplementedError("...")

class Dataset(DatasetBase[T_co]):
Random nit (haven't read through all the way yet): do we need both Sync & AsyncDatasets? It seems like the dominant use is AsyncDataset, so we could simplify things if we only need to think about that.
If sync is needed for convenience, maybe a SyncDatasetWrapper that runs a thread to pull from an async dataset would work?
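A minimal sketch of the wrapper idea above, assuming the async dataset exposes `async_getitem` (which appears elsewhere in this PR) and an `async_len` coroutine (an illustrative name here, not necessarily the real API): run a private event loop on a background thread and block on it from the sync side.

```python
import asyncio
import threading


class SyncDatasetWrapper:
    """Hypothetical sketch: expose an async dataset through a blocking API
    by running a private event loop on a daemon thread."""

    def __init__(self, async_dataset):
        self.dataset = async_dataset
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def _run(self, coro):
        # Schedule the coroutine on the background loop and block until done.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def __getitem__(self, index):
        return self._run(self.dataset.async_getitem(index))

    def __len__(self):
        return self._run(self.dataset.async_len())
```

This keeps a single "real" implementation (the async one) while still offering a convenient sync view for scripts and tests.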
I use it in one "real" place out of laziness. I think getting rid of it might make sense
Leaving it in as a convenience, but moving it and making it clear it's dispreferred
src/levanter/data/dataset.py
Outdated
    raise NotImplementedError

@abc.abstractmethod
async def length_is_known(self) -> bool:
Nit: async_has_len? It's a bit odd to have the 2 different conventions.
Yeah, I like that
src/levanter/data/dataset.py
Outdated
    """

@abc.abstractmethod
def has_len(self) -> bool:
Nit: is this synonymous with current_len() is None? Maybe combine/remove.
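A sketch of the combination suggested above, under the assumption that `current_len()` returns `None` while the length is unknown: `has_len` then needs no separate abstract method, it can be derived. (The class and subclass names here are illustrative, not the actual Levanter API.)

```python
class DatasetBase:
    """Hypothetical sketch: current_len() returns None until the length is
    known, and has_len() is derived from it rather than abstract."""

    def current_len(self):
        raise NotImplementedError

    def has_len(self):
        # Derived default: subclasses only need to implement current_len().
        return self.current_len() is not None


class KnownLenDataset(DatasetBase):
    def current_len(self):
        return 3


class UnknownLenDataset(DatasetBase):
    def current_len(self):
        return None
```

This removes one method from the abstract surface and makes the two queries impossible to leave inconsistent.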
good point!
    return self._run_coroutine(self.dataset.async_getitem(index))

class AsyncifiedDataset(AsyncDataset[T_co]):
(Yeah might be missing something, but this only seems used to wrap the SequenceDataset, but you can use the "native" ListAsyncDataset instead everywhere.)
Early on in Levanter's development, we made the decision to support "quick start" training, where we can start
training while we are still building the cache. This is helpful when iterating on the data pipeline
and removes a step from the training process. This implies that we need to support simultaneous reading and writing
Likely doesn't affect the design, but, is it sensible to think of the dataset as a lazy construction and have the cache be a streaming "log" of the training data? Then I could sensibly write something like:
ds = data_from_jsonl()
ds = ds.map(my_transform)
ds = ds.cache("cache/mydir")
ds = ds.seek(1234)
for i, batch in enumerate(ds):
    ...
I actually started down this road a while back and I should probably finish it. I think I'll come back and add that layer later?
Of course, definitely can be postponed. I don't think it would make a big difference, just flatten out the "build_and_load" style logic a bit.
Introduces a massive rework of Levanter's cache system to support instant resume, perfect shuffle, stable mixtures and such.
The basic idea is to use TensorStore to store all of our data as a kind of janky column store (implemented in JaggedArrayStore) and pytrees of such (implemented in TreeStore).
TensorStore provides efficient storage and access to very large arrays. We still support streaming from an in-progress cache via a new AsyncDataset class.
I've successfully tested this on the Pile and, modulo the usual issues with the Llama tokenizer on long documents/books, it behaves well.
Closes #626 #311 #119 #34
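For readers unfamiliar with the layout, here is a conceptual sketch of the jagged "column store" idea behind JaggedArrayStore (this is an illustration, not the actual implementation, which persists the arrays with TensorStore): variable-length rows are stored as one flat buffer plus an offsets array, which is what makes O(1) random access, exact resume, and out-of-order writes tractable.

```python
class JaggedArray:
    """Conceptual sketch (not the real JaggedArrayStore): variable-length
    rows stored as a flat data buffer plus row offsets. The same two arrays
    can be persisted as chunked on-disk arrays for random access."""

    def __init__(self):
        self.data = []      # flat concatenation of all rows
        self.offsets = [0]  # data[offsets[i]:offsets[i+1]] is row i

    def append(self, row):
        self.data.extend(row)
        self.offsets.append(len(self.data))

    def __getitem__(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]]

    def __len__(self):
        return len(self.offsets) - 1
```

Reading row i touches only its slice of the flat buffer, so resuming at an arbitrary position doesn't require replaying the stream.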