Make data loading sufficiently random #34
One random note: according to https://medium.com/@duhroach/optimal-size-of-a-cloud-storage-fetch-8c270b511016, you should aim to read ~1MB chunks, though that figure is measured against the HTTP API, so presumably the native API is more efficient?
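As a minimal sketch of the ~1MB-fetch idea above (the function name and the chunk size constant are illustrative, not part of any library's API): read a binary stream in fixed-size chunks rather than one byte range per record.

```python
CHUNK_SIZE = 1 << 20  # ~1 MiB, per the fetch-size advice above

def read_in_chunks(f, chunk_size=CHUNK_SIZE):
    # Yield successive chunks from an open binary stream (a local file, or a
    # stream handed back by a cloud-storage client library).
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        yield chunk
```

The same pattern applies whether the stream comes from `open(path, "rb")` or a storage client's download stream.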
(Also, maybe I should just accept pre-declaring how big the sequences are, and reprocessing whenever that changes, so we can shuffle up front.)
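To illustrate why pre-declared sequence lengths enable an up-front shuffle (a sketch with hypothetical helper names, not Levanter's actual code): with fixed-size records, every sequence's offset is computable, so a single permutation of indices gives a perfect shuffle with pure random access.

```python
import numpy as np

def shuffled_indices(num_sequences, seed=0):
    # With a fixed, pre-declared sequence length, a full permutation computed
    # up front is a perfect shuffle: no shuffle buffer needed at read time.
    rng = np.random.default_rng(seed)
    return rng.permutation(num_sequences)

def sequence_offset(index, seq_len, bytes_per_token=2):
    # Fixed-record layout: sequence `index` starts at a directly computable
    # byte offset, so shuffled reads are just random access.
    return index * seq_len * bytes_per_token
```

The trade-off named in the comment is that changing `seq_len` invalidates the layout and forces reprocessing.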
Updated design doc:
…le, stable mixtures and more (#716) Introduces a massive rework of Levanter's cache system to support instant resume, perfect shuffle, stable mixtures, and such. The basic idea is to use TensorStore to store all of our data as a kind of janky column store (implemented in JaggedArrayStore) and pytrees of such (implemented in TreeStore). TensorStore provides efficient storage and access to very large arrays. We still support streaming from an in-progress cache via a new AsyncDataset class. I've successfully tested this on the Pile and, modulo the usual issues with the llama tokenizer on long documents/books, it behaves well. Closes #626 #311 #119 #34
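The column-store idea behind JaggedArrayStore can be sketched as a flat data array plus an offsets array (this toy class is purely illustrative and much simpler than the TensorStore-backed implementation the PR describes):

```python
import numpy as np

class JaggedArray:
    """Toy column store for variable-length rows: one flat data array plus an
    offsets array, in the spirit of (but far simpler than) JaggedArrayStore."""

    def __init__(self, rows):
        self.data = np.concatenate([np.asarray(r) for r in rows])
        lengths = [len(r) for r in rows]
        # offsets[i] is where row i starts in the flat data array
        self.offsets = np.concatenate([[0], np.cumsum(lengths)])

    def __getitem__(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]]

    def __len__(self):
        return len(self.offsets) - 1
```

Because row boundaries live in a small offsets array, any row can be fetched with one contiguous read of the big array, which is what makes a perfect shuffle over a huge on-disk cache cheap.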
I wrote a design doc here outlining desiderata and the current status, along with a potential plan for moving forward. (Very open to other designs!)
The basic issue is that if docs are super long or not randomized when they come in, performance seems to suffer substantially.
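One common mitigation for data arriving in a correlated order is an approximate shuffle buffer (a generic sketch, not the plan from the design doc): hold a window of items and emit a random one as each new item streams in.

```python
import random

def shuffle_buffer(stream, buffer_size, seed=0):
    # Approximate shuffle: keep up to `buffer_size` items in memory and emit a
    # uniformly random one each time a new item arrives. Larger buffers give
    # better mixing of documents that arrive in a correlated order.
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            j = rng.randrange(len(buf))
            buf[j], buf[-1] = buf[-1], buf[j]
            yield buf.pop()
    rng.shuffle(buf)
    yield from buf
```

Note this only approximates a true shuffle; super-long documents still dominate whatever window they fall in, which is part of why the design doc pushes toward a perfect up-front shuffle instead.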
Sub-issues, assuming we go with the plan above: