WIP Completely rework dataset/cache system: instant resume, perfect shuffle, stable mixtures and more #716
Conversation
(I recognize this is a massive change. I'd mostly just appreciate a look at the design doc and maybe the audio bits)
Looks great overall to me: the user code is definitely easier to follow in the new design as well. I didn't take a good look through the TensorStore stuff but will try to take a peek later this week.
I noted a few nits but otherwise seems sensible to me!
src/levanter/data/dataset.py
Outdated
    raise NotImplementedError("...")

class Dataset(DatasetBase[T_co]):
Random nit (haven't read through all the way yet): do we need both Sync & AsyncDatasets? It seems like the dominant use is AsyncDataset, so we could simplify things if we only need to think about that.
If sync is needed for convenience, maybe a SyncDatasetWrapper that runs a thread to pull from an async dataset would work?
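A minimal sketch of the wrapper idea above, assuming the async dataset exposes `async_getitem` (which appears elsewhere in this PR) and an `async_len` coroutine (an illustrative name here, not necessarily the real API): run a private event loop on a background thread and block on it from the sync side.

```python
import asyncio
import threading


class SyncDatasetWrapper:
    """Hypothetical sketch: expose an async dataset through a blocking API
    by running a private event loop on a daemon thread."""

    def __init__(self, async_dataset):
        self.dataset = async_dataset
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def _run(self, coro):
        # Schedule the coroutine on the background loop and block until done.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def __getitem__(self, index):
        return self._run(self.dataset.async_getitem(index))

    def __len__(self):
        return self._run(self.dataset.async_len())
```

This keeps a single "real" implementation (the async one) while still offering a convenient sync view for scripts and tests.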
I use it in one "real" place out of laziness. I think getting rid of it might make sense
Leaving it in as a convenience, but moving it and making it clear it's dispreferred
src/levanter/data/dataset.py
Outdated
    raise NotImplementedError

@abc.abstractmethod
async def length_is_known(self) -> bool:
Nit: async_has_len? It's a bit odd to have the 2 different conventions.
Yeah, I like that
src/levanter/data/dataset.py
Outdated
    """

@abc.abstractmethod
def has_len(self) -> bool:
Nit: is this synonymous with current_len() is None? Maybe combine/remove.
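A sketch of the combination suggested above, under the assumption that `current_len()` returns `None` while the length is unknown: `has_len` then needs no separate abstract method, it can be derived. (The class and subclass names here are illustrative, not the actual Levanter API.)

```python
class DatasetBase:
    """Hypothetical sketch: current_len() returns None until the length is
    known, and has_len() is derived from it rather than abstract."""

    def current_len(self):
        raise NotImplementedError

    def has_len(self):
        # Derived default: subclasses only need to implement current_len().
        return self.current_len() is not None


class KnownLenDataset(DatasetBase):
    def current_len(self):
        return 3


class UnknownLenDataset(DatasetBase):
    def current_len(self):
        return None
```

This removes one method from the abstract surface and makes the two queries impossible to leave inconsistent.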
good point!
    return self._run_coroutine(self.dataset.async_getitem(index))

class AsyncifiedDataset(AsyncDataset[T_co]):
(Yeah might be missing something, but this only seems used to wrap the SequenceDataset, but you can use the "native" ListAsyncDataset instead everywhere.)
Early on in Levanter's development, we made the decision to support "quick start" training, where we can start
training while we are still building the cache. This is helpful when iterating on the data pipeline
and removes a step from the training process. This implies that we need to support simultaneous reading and writing
Likely doesn't affect the design, but, is it sensible to think of the dataset as a lazy construction and have the cache be a streaming "log" of the training data? Then I could sensibly write something like:
ds = data_from_jsonl()
ds = ds.map(my_transform)
ds = ds.cache("cache/mydir")
ds = ds.seek(1234)
for i, batch in enumerate(ds):
    ...
I actually started down this road a while back and I should probably finish it. I think I'll come back and add that layer later?
Of course, definitely can be postponed. I don't think it would make a big difference, just flatten out the "build_and_load" style logic a bit.
Introduces a massive rework of Levanter's cache system to support instant resume, perfect shuffle, stable mixtures and such.
The basic idea is to use TensorStore to store all of our data as a kind of janky column store (implemented in JaggedArrayStore) and pytrees of such (implemented in TreeStore).
TensorStore provides efficient storage and access to very large arrays. We still support streaming from an in-progress cache via a new AsyncDataset class.
I've successfully tested this on the Pile and, modulo the usual issues with the Llama tokenizer on long documents/books, it behaves well.
Closes #626 #311 #119 #34
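For readers unfamiliar with the layout, here is a conceptual sketch of the jagged "column store" idea behind JaggedArrayStore (this is an illustration, not the actual implementation, which persists the arrays with TensorStore): variable-length rows are stored as one flat buffer plus an offsets array, which is what makes O(1) random access, exact resume, and out-of-order writes tractable.

```python
class JaggedArray:
    """Conceptual sketch (not the real JaggedArrayStore): variable-length
    rows stored as a flat data buffer plus row offsets. The same two arrays
    can be persisted as chunked on-disk arrays for random access."""

    def __init__(self):
        self.data = []      # flat concatenation of all rows
        self.offsets = [0]  # data[offsets[i]:offsets[i+1]] is row i

    def append(self, row):
        self.data.extend(row)
        self.offsets.append(len(self.data))

    def __getitem__(self, i):
        return self.data[self.offsets[i]:self.offsets[i + 1]]

    def __len__(self):
        return len(self.offsets) - 1
```

Reading row i touches only its slice of the flat buffer, so resuming at an arbitrary position doesn't require replaying the stream.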