
Make data loading sufficiently random #34

Closed
4 tasks
dlwh opened this issue Oct 11, 2022 · 3 comments

Comments

@dlwh
Member

dlwh commented Oct 11, 2022

I wrote a design doc here outlining desiderata and the current status, along with a potential plan for moving forward. (Very open to other designs!)

The basic issue is that if docs are super long or not randomized when they come in, performance seems to suffer substantially.

Sub issues assuming we go with the plan above:

  • implement a shuffle buffer
  • implement seek in IndexedDataset
  • implement JumpingDataset
  • figure out serialization of datasets
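The shuffle buffer from the first sub-issue above can be sketched roughly like this (a minimal sketch of the standard streaming-shuffle technique; the function name and signature are hypothetical, not Levanter's actual API):

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def shuffle_buffer(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximate streaming shuffle: fill a bounded buffer, then for each
    incoming item emit a random buffered element and replace it with the
    new item; finally drain the remaining buffer in random order."""
    rng = random.Random(seed)
    buf: list[T] = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buf[idx]
            buf[idx] = item
    rng.shuffle(buf)
    yield from buf
```

Each item is emitted exactly once; randomness quality improves with `buffer_size`, which is the usual tradeoff for this technique.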
@dlwh
Member Author

dlwh commented Oct 11, 2022

one random note is that according to https://medium.com/@duhroach/optimal-size-of-a-cloud-storage-fetch-8c270b511016 you should aim to read ~1MB chunks, though that's measured against the HTTP API, so presumably the native API is more efficient?
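Reading in ~1MB chunks is straightforward with any file-like object; a minimal sketch (the helper name and constant are illustrative, not from the linked article or Levanter):

```python
CHUNK_SIZE = 1 << 20  # ~1 MiB, the fetch size the linked benchmark suggests

def read_in_chunks(f, chunk_size: int = CHUNK_SIZE):
    """Yield successive fixed-size chunks from a binary file-like object."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

The same loop works against a local file, an fsspec/GCS file object, or an in-memory buffer.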

@dlwh
Member Author

dlwh commented Oct 11, 2022

(also maybe i should just accept pre-declaring how big the sequences are and reprocessing whenever you change that so we can shuffle up front.)
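With the sequence length declared up front, an up-front global shuffle reduces to permuting fixed-length rows; a sketch under that assumption (function name hypothetical):

```python
import numpy as np

def shuffled_sequences(tokens: np.ndarray, seq_len: int, seed: int = 0) -> np.ndarray:
    """Chop a flat token stream into fixed-length rows and permute the rows.
    A full shuffle is possible exactly because seq_len is known in advance;
    changing seq_len would require reprocessing, as the comment above notes."""
    n = len(tokens) // seq_len
    rows = tokens[: n * seq_len].reshape(n, seq_len)
    rng = np.random.default_rng(seed)
    return rows[rng.permutation(n)]
```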

@dlwh
Member Author

dlwh commented Mar 28, 2023

@dlwh dlwh added the project label Mar 29, 2023
dlwh added a commit that referenced this issue Sep 5, 2024
…le, stable mixtures and more (#716)

Introduces a massive rework of Levanter's cache system to support instant resume, perfect shuffle, stable mixtures and such.

The basic idea is to use TensorStore to store all of our data as a kind of janky column store (implemented in JaggedArrayStore) and pytrees of such (implemented in TreeStore).

TensorStore provides efficient storage and access to very large arrays. We still support streaming from an in-progress cache via a new AsyncDataset class.
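The jagged column-store idea can be illustrated with a toy in-memory version: a flat data array plus an offsets array, so row `i` is `data[offsets[i]:offsets[i+1]]`. This is only a sketch of the layout; the real JaggedArrayStore is backed by TensorStore, and this class name and API are hypothetical:

```python
import numpy as np

class JaggedArray:
    """Toy jagged column store: variable-length rows packed into one flat
    array, indexed by an offsets array (sketch of the JaggedArrayStore
    layout, not the TensorStore-backed implementation)."""

    def __init__(self, rows):
        self.data = np.concatenate(rows)
        lengths = np.array([len(r) for r in rows])
        self.offsets = np.concatenate([[0], np.cumsum(lengths)])

    def __getitem__(self, i):
        # Row i is the half-open slice between consecutive offsets.
        return self.data[self.offsets[i] : self.offsets[i + 1]]

    def __len__(self):
        return len(self.offsets) - 1
```

This layout makes random access (and hence seeking and shuffling) O(1) per row, which is what enables the instant-resume and perfect-shuffle properties described above.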

I've successfully tested this on the Pile and, modulo the usual issues with the Llama tokenizer on long documents/books, it behaves well.

Closes #626 #311 #119 #34
@dlwh dlwh closed this as completed Sep 5, 2024