Allow Stream's repeat option to cycle through entire dataset before repeating, when shuffle=True #521

Open
m-harmonic opened this issue Dec 6, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@m-harmonic

m-harmonic commented Dec 6, 2023

🚀 Feature Request

I am using the repeat option when creating a stream, i.e. Stream(repeat=2), together with random shuffling, i.e. StreamingDataset(shuffle=True). There appears to be no guarantee that every sample is surfaced once before any sample repeats. The ideal behavior for my use case is to go through every sample once, in shuffled order, before seeing any sample a second time. Is there already a way to achieve this, and if not, would it be possible to add it? Thanks!

Motivation

For various reasons I am constructing datasets that should have samples duplicated a certain number of times, but each sample should be seen once before any are seen a second time.
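To make the requested ordering concrete, here is a minimal sketch in plain Python, independent of the streaming library. It builds an index order from `repeat` consecutive, independently shuffled passes, so every sample appears once before any appears a second time. The function name `pass_shuffled_order` and its parameters are hypothetical, and an integer repeat count is assumed.

```python
import random


def pass_shuffled_order(num_samples, repeat, seed=0):
    """Return sample indices as `repeat` back-to-back shuffled passes.

    Hypothetical helper: each pass is a full permutation of the dataset,
    so pass k finishes before any sample is emitted for pass k + 1.
    """
    rng = random.Random(seed)
    order = []
    for _ in range(repeat):
        one_pass = list(range(num_samples))
        rng.shuffle(one_pass)  # independent permutation for this pass
        order.extend(one_pass)
    return order


order = pass_shuffled_order(num_samples=5, repeat=2, seed=0)
first_pass, second_pass = order[:5], order[5:]
```

Each of `first_pass` and `second_pass` contains all five indices exactly once, which is the constraint described above.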

[Optional] Implementation

Additional context

@m-harmonic m-harmonic added the enhancement New feature or request label Dec 6, 2023
@m-harmonic m-harmonic changed the title Allow Stream's repeat option to cycle through entire dataset before repeating Allow Stream's repeat option to cycle through entire dataset before repeating, when shuffle=True Dec 6, 2023
@karan6181
Collaborator

@m-harmonic Does keeping repeat=1 and iterating over multiple epochs help in any way? Or are you using multiple streams where each stream has repeat >= 1?

@m-harmonic
Author

@karan6181 Yes exactly, we do have cases where we have multiple streams some of which have multiple repeats. Separately we are also experiencing a problem that is forcing us to train within a single epoch, so duplicating the data within one epoch is the workaround we're trying to use. Do you think there is a possible fix, or an easy solution?
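For the multi-stream case within a single epoch, the desired ordering could be sketched as a series of rounds. This is plain Python and not the library's implementation: `round_based_order`, `stream_sizes`, and `stream_repeats` are hypothetical names, and integer repeat counts are assumed. Round r contains every sample from each stream whose repeat count exceeds r, and each round is shuffled on its own.

```python
import random


def round_based_order(stream_sizes, stream_repeats, seed=0):
    """Order (stream_id, sample_id) pairs in shuffled rounds.

    Hypothetical sketch: a stream with repeat = k contributes all of its
    samples to rounds 0..k-1, so no sample is seen an (r + 1)-th time
    before round r has finished.
    """
    rng = random.Random(seed)
    order = []
    for r in range(max(stream_repeats, default=0)):
        round_items = [
            (stream_id, sample_id)
            for stream_id, (size, rep) in enumerate(zip(stream_sizes, stream_repeats))
            if rep > r  # stream still has passes left in this round
            for sample_id in range(size)
        ]
        rng.shuffle(round_items)  # shuffle across streams within the round
        order.extend(round_items)
    return order


# Stream 0 has 3 samples repeated twice; stream 1 has 2 samples seen once.
epoch = round_based_order(stream_sizes=[3, 2], stream_repeats=[2, 1], seed=0)
```

With these inputs the epoch has eight entries: the first five cover every sample of both streams once, and the last three are the second pass over stream 0.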

@karan6181
Collaborator

Hey, @m-harmonic, thanks for the clarification. Unfortunately, we don't support that use case at the moment. Why do you need each sample to be seen once before any is seen a second time? Have you tried our new shuffling algorithms, py1e and py1br, which provide excellent shuffle quality? I doubt that you will see any convergence issues with sample ordering. I recommend using our new streaming simulator to find the correct set of hyperparameters for the best performance.

@karan6181
Collaborator

karan6181 commented Jan 10, 2024

@m-harmonic Can you also explain why you would want the repeated samples to show up after going through the original dataset? Can you please share your use case and what exactly you are trying to do? Thanks!

@karan6181
Collaborator

@m-harmonic Gentle reminder on the above question.
