
WIP Completely rework dataset/cache system: instant resume, perfect shuffle, stable mixtures and more #716

Merged: 125 commits into main from jagged_cache on Sep 5, 2024

Conversation

dlwh (Member) commented on Sep 3, 2024

Introduces a massive rework of Levanter's cache system to support instant resume, perfect shuffle, stable mixtures and such.

The basic idea is to use TensorStore to store all of our data as a kind of janky column store (implemented in JaggedArrayStore) and pytrees of such (implemented in TreeStore).

TensorStore provides efficient storage and access to very large arrays. We still support streaming from an in-progress cache via a new AsyncDataset class.
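To make the layout concrete, here is a minimal sketch of the jagged-array idea (illustrative only; the class and field names are made up, and the in-memory NumPy backing stands in for the TensorStore-backed arrays the real JaggedArrayStore uses): variable-length rows are stored as one flat data array plus an offsets array, so random access by row index is just two offset lookups and one contiguous slice.

import numpy as np

class JaggedArraySketch:
    # Illustrative stand-in for the flat-data + offsets layout.
    # The real JaggedArrayStore persists these arrays with TensorStore.
    def __init__(self):
        self.data = np.zeros(0, dtype=np.int32)     # all tokens, concatenated
        self.offsets = np.zeros(1, dtype=np.int64)  # row i spans offsets[i]:offsets[i+1]

    def append(self, row):
        self.data = np.concatenate([self.data, np.asarray(row, dtype=np.int32)])
        self.offsets = np.append(self.offsets, self.offsets[-1] + len(row))

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        return self.data[self.offsets[i] : self.offsets[i + 1]]

A pytree of these (one per field, e.g. input_ids) gives the column-store shape that TreeStore wraps.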

I've successfully tested this on the Pile and, modulo the usual issues with the llama tokenizer on long documents/books, it behaves well.

Closes #626 #311 #119 #34

dlwh (Member, Author) commented on Sep 4, 2024

(I recognize this is a massive change. I'd mostly just appreciate a look at the design doc and maybe the audio bits)

rjpower (Collaborator) left a comment:

Looks great overall to me: the user code is definitely easier to follow in the new design as well. I didn't take a good look through the TensorStore stuff but will try to take a peek later this week.

I noted a few nits but otherwise seems sensible to me!

raise NotImplementedError("...")


class Dataset(DatasetBase[T_co]):
rjpower (Collaborator) commented:

Random nit (haven't read through all the way yet): do we need both Sync & AsyncDatasets? It seems like the dominant use is AsyncDataset, so it could simplify things if we only need to think about that.

If sync is needed for convenience, maybe a SyncDatasetWrapper which runs a thread to pull from an async dataset would work?
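For what it's worth, a rough sketch of what that wrapper could look like (hypothetical, not a concrete API proposal; it assumes the async dataset exposes async_getitem as in this PR and an async_len-style length method): run an event loop on a daemon thread and block on each call.

import asyncio
import threading

class SyncDatasetWrapper:
    # Blocking facade over an async dataset; illustrative only.
    def __init__(self, async_dataset):
        self._dataset = async_dataset
        self._loop = asyncio.new_event_loop()
        # Background thread keeps the loop alive so coroutines can be submitted to it.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def _run(self, coro):
        # Submit to the background loop and block until the result is ready.
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result()

    def __getitem__(self, index):
        return self._run(self._dataset.async_getitem(index))

    def __len__(self):
        return self._run(self._dataset.async_len())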

dlwh (Member, Author) replied:

I use it in one "real" place out of laziness. I think getting rid of it might make sense.

dlwh (Member, Author) replied:

Leaving it in as a convenience, but moving it and making it clear it's dispreferred.

raise NotImplementedError

@abc.abstractmethod
async def length_is_known(self) -> bool:
rjpower (Collaborator) commented:

Nit: async_has_len? It's a bit odd to have the 2 different conventions.

dlwh (Member, Author) replied:

Yeah, I like that.

"""

@abc.abstractmethod
def has_len(self) -> bool:
rjpower (Collaborator) commented:

nit: Is this synonymous with current_len() is None? Maybe combine/remove.

dlwh (Member, Author) replied:

good point!
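For illustration, a hypothetical consolidation (a sketch only, not the final API) could drop the separate abstract method and derive it from current_len:

import abc
from typing import Optional

class AsyncDatasetSketch(abc.ABC):
    # Sketch, not the real Levanter class.
    @abc.abstractmethod
    async def current_len(self) -> Optional[int]:
        # Number of items currently available, or None if not yet known.
        ...

    async def has_len(self) -> bool:
        # Derived rather than a second abstract method.
        return (await self.current_len()) is not None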

return self._run_coroutine(self.dataset.async_getitem(index))


class AsyncifiedDataset(AsyncDataset[T_co]):
rjpower (Collaborator) commented:

(I might be missing something, but this only seems to be used to wrap SequenceDataset; you could use the "native" ListAsyncDataset instead everywhere.)

src/levanter/data/dataset.py: three additional review threads (outdated, resolved)

Early on in Levanter's development, we made the decision to support "quick start" training, where we can start
training while we are still building the cache. This is helpful when iterating on the data pipeline
and removes a step from the training process. This implies that we need to support simultaneous reading and writing
rjpower (Collaborator) commented:

Likely doesn't affect the design, but is it sensible to think of the dataset as a lazy construction and have the cache be a streaming "log" of the training data? Then I could sensibly write something like:

ds = data_from_jsonl()
ds = ds.map(my_transform)
ds = ds.cache("cache/mydir")
ds = ds.seek(1234)

for i, batch in enumerate(ds):
    ...

dlwh (Member, Author) replied:

I actually started down this road a while back and I should probably finish it. I think I'll come back and add that layer later?

rjpower (Collaborator) replied:

Of course, definitely can be postponed. I don't think it would make a big difference, just flatten out the "build_and_load" style logic a bit.

dlwh merged commit fbe27bc into main on Sep 5, 2024 (8 checks passed).
dlwh deleted the jagged_cache branch on September 5, 2024 at 18:10.

Linked issue closed by this pull request: Instant Data loader resumes