Why so many faces? #2
Comments
Thanks for your interest in our work. Our dataset is organized by K-means cluster labels, so the samples shown on HuggingFace mostly contain faces probably because they come from a cluster centered on a human-related concept. The dataset was also balanced across the K-means labels during sampling.
Got it. So does this mean that the parquet files of the released dataset are organized with K-means clustering labels without any shuffling? If so, this may cause some biased training problems with WebDataset, since WebDataset does not fully shuffle the data.
Yes, the released dataset is ordered by the clustering labels. If WebDataset is used for training, the data needs to be shuffled before packing. Our released model was trained on the shuffled dataset. Thanks for the reminder. We will add more tips to the README.
That's great. Currently, when downloading the parquet dataset with img2dataset, the images in one tar file probably share similar labels. It would be very useful if a script could be provided to shuffle the downloaded data.
Sure, it can be shuffled using pandas. Here is a code snippet example for shuffling.
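The snippet itself is missing from this copy of the thread. As a rough sketch of the idea (not the maintainers' original code), assuming the metadata parquet files sit in one local directory and fit in memory, a global shuffle with pandas could look like the following. The function names, the `shuffled_*.parquet` output naming, and the `rows_per_file` parameter are illustrative choices, and `read_parquet`/`to_parquet` need a parquet engine such as pyarrow installed.

```python
from pathlib import Path

import pandas as pd


def shuffle_rows(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with its rows in random order (reproducible via seed)."""
    return df.sample(frac=1.0, random_state=seed).reset_index(drop=True)


def shuffle_parquet_dir(src_dir: str, dst_dir: str,
                        rows_per_file: int = 100_000, seed: int = 0) -> int:
    """Globally shuffle every *.parquet file in src_dir and rewrite into dst_dir.

    Assumption: all metadata rows fit in memory at once.
    Returns the number of parquet files written.
    """
    files = sorted(Path(src_dir).glob("*.parquet"))
    if not files:
        raise FileNotFoundError(f"no parquet files found in {src_dir}")

    # Load everything, then shuffle across file boundaries so that rows
    # from different K-means clusters end up mixed together.
    df = pd.concat([pd.read_parquet(f) for f in files], ignore_index=True)
    df = shuffle_rows(df, seed)

    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for start in range(0, len(df), rows_per_file):
        chunk = df.iloc[start:start + rows_per_file]
        # Hypothetical output naming, chosen only for this sketch.
        chunk.to_parquet(out / f"shuffled_{written:05d}.parquet", index=False)
        written += 1
    return written
```

Pointing img2dataset at the shuffled directory instead of the original files should then produce tar shards that mix clusters, before any additional shuffling WebDataset applies at training time.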
Hi, nice work!
I noticed that the example samples shown on HuggingFace mostly consist of human faces. Does the actual distribution of the dataset look like this? If so, does this result in significant bias?