
Make data loading sufficiently random #34

Closed
4 tasks
dlwh opened this issue Oct 11, 2022 · 3 comments

Comments

@dlwh
Member

dlwh commented Oct 11, 2022

I wrote a design doc here outlining desiderata and the current status, along with a potential plan for moving forward. (Very open to other designs!)

The basic issue is that if docs are super long or not randomized when they come in, performance seems to suffer substantially.

Sub issues assuming we go with the plan above:

  • implement a shuffle buffer
  • implement seek in IndexedDataset
  • implement JumpingDataset
  • figure out serialization of datasets
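The shuffle buffer from the first sub-issue above can be sketched roughly like this (a minimal sketch of the standard streaming-shuffle technique; the function name and signature are hypothetical, not Levanter's actual API):

```python
import random
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

def shuffle_buffer(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Approximate streaming shuffle: fill a bounded buffer, then for each
    incoming item emit a random buffered element and replace it with the
    new item; finally drain the remaining buffer in random order."""
    rng = random.Random(seed)
    buf: list[T] = []
    for item in items:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buf[idx]
            buf[idx] = item
    rng.shuffle(buf)
    yield from buf
```

Each item is emitted exactly once; randomness quality improves with `buffer_size`, which is the usual tradeoff for this technique.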
@dlwh
Member Author

dlwh commented Oct 11, 2022

one random note is that according to https://medium.com/@duhroach/optimal-size-of-a-cloud-storage-fetch-8c270b511016 you should aim to read ~1MB chunks, though that's measured against the HTTP API, so presumably the native API is more efficient?
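Reading in ~1MB chunks is straightforward with any file-like object; a minimal sketch (the helper name and constant are illustrative, not from the linked article or Levanter):

```python
CHUNK_SIZE = 1 << 20  # ~1 MiB, the fetch size the linked benchmark suggests

def read_in_chunks(f, chunk_size: int = CHUNK_SIZE):
    """Yield successive fixed-size chunks from a binary file-like object."""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

The same loop works against a local file, an fsspec/GCS file object, or an in-memory buffer.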

@dlwh
Member Author

dlwh commented Oct 11, 2022

(also maybe i should just accept pre-declaring how big the sequences are and reprocessing whenever you change that so we can shuffle up front.)
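With the sequence length declared up front, an up-front global shuffle reduces to permuting fixed-length rows; a sketch under that assumption (function name hypothetical):

```python
import numpy as np

def shuffled_sequences(tokens: np.ndarray, seq_len: int, seed: int = 0) -> np.ndarray:
    """Chop a flat token stream into fixed-length rows and permute the rows.
    A full shuffle is possible exactly because seq_len is known in advance;
    changing seq_len would require reprocessing, as the comment above notes."""
    n = len(tokens) // seq_len
    rows = tokens[: n * seq_len].reshape(n, seq_len)
    rng = np.random.default_rng(seed)
    return rows[rng.permutation(n)]
```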

@dlwh
Member Author

dlwh commented Mar 28, 2023

@dlwh dlwh added the project label Mar 29, 2023
dlwh added a commit that referenced this issue Sep 5, 2024
…le, stable mixtures and more (#716)

Introduces a massive rework of Levanter's cache system to support instant resume, perfect shuffle, stable mixtures and such.

The basic idea is to use TensorStore to store all of our data as a kind of janky column store (implemented in JaggedArrayStore) and pytrees of such (implemented in TreeStore).

TensorStore provides efficient storage and access to very large arrays. We still support streaming from an in-progress cache via a new AsyncDataset class.
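The jagged column-store idea can be illustrated with a toy in-memory version: a flat data array plus an offsets array, so row `i` is `data[offsets[i]:offsets[i+1]]`. This is only a sketch of the layout; the real JaggedArrayStore is backed by TensorStore, and this class name and API are hypothetical:

```python
import numpy as np

class JaggedArray:
    """Toy jagged column store: variable-length rows packed into one flat
    array, indexed by an offsets array (sketch of the JaggedArrayStore
    layout, not the TensorStore-backed implementation)."""

    def __init__(self, rows):
        self.data = np.concatenate(rows)
        lengths = np.array([len(r) for r in rows])
        self.offsets = np.concatenate([[0], np.cumsum(lengths)])

    def __getitem__(self, i):
        # Row i is the half-open slice between consecutive offsets.
        return self.data[self.offsets[i] : self.offsets[i + 1]]

    def __len__(self):
        return len(self.offsets) - 1
```

This layout makes random access (and hence seeking and shuffling) O(1) per row, which is what enables the instant-resume and perfect-shuffle properties described above.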

I've successfully tested this on the Pile and, modulo the usual issues with the Llama tokenizer on long documents/books, it behaves well.

Closes #626 #311 #119 #34
@dlwh dlwh closed this as completed Sep 5, 2024