Why so many faces? #2

Open
YifanXu74 opened this issue Jan 9, 2024 · 5 comments

Comments

@YifanXu74

Hi, nice work!

I noticed that the example samples shown on HuggingFace mostly consist of human faces. Does the actual distribution of the dataset look like this? If so, does it introduce significant bias?

@linyq17
Collaborator

linyq17 commented Jan 10, 2024

Thanks for your interest in our work. Our dataset is organized by K-means clustering labels, so the samples shown on HuggingFace likely all come from a cluster centered on the human-face concept. The dataset has also been balanced over the K-means labels during sampling.

@YifanXu74
Author

YifanXu74 commented Jan 10, 2024

Got it. So does this mean that the parquet files of the released dataset are ordered by K-means clustering labels without any shuffling? If so, this may cause biased training with WebDataset, since WebDataset does not fully shuffle the data.
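To see why this matters: WebDataset shuffles with a fixed-size streaming buffer rather than a full global shuffle, so when the data arrives ordered by cluster, a buffer smaller than a cluster barely mixes the labels. Below is a minimal stdlib sketch of that failure mode; the buffered_shuffle helper and the cluster sizes are illustrative assumptions, not the actual dataset or the WebDataset API.

```python
import random

def buffered_shuffle(stream, bufsize, seed=0):
    """Streaming shuffle with a fixed-size buffer (the same idea as
    WebDataset's shuffle buffer): keep up to `bufsize` items, emit a
    random one each time the buffer is full."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= bufsize:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)  # drain the remainder
    yield from buf

# Toy data ordered by cluster label: 1000 samples of cluster 0,
# then 1000 of cluster 1, and so on.
data = [c for c in range(10) for _ in range(1000)]

# A buffer much smaller than one cluster barely mixes labels: by the time
# the first 500 items are emitted, only cluster-0 samples have been read,
# so every one of those 500 outputs is still from cluster 0.
small = list(buffered_shuffle(data, bufsize=100))
print(all(x == 0 for x in small[:500]))
```

This is why shuffling globally before packing the tar shards (as suggested below in this thread) is necessary; a shuffle buffer alone cannot repair cluster-ordered shards.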

@linyq17
Collaborator

linyq17 commented Jan 10, 2024

Yes, the released dataset is ordered by the clustering labels. If WebDataset is used for training, the data needs to be shuffled before packing. Our released model was trained on the shuffled dataset. Thanks for the reminder; we will add a note to the README.

@YifanXu74
Author

That's great. Currently, when downloading the parquet dataset with img2dataset, the images in one tar file probably share similar labels. It would be very useful if a script were provided to shuffle the downloaded data.

@linyq17
Collaborator

linyq17 commented Jan 11, 2024

Sure, it can be shuffled with pandas. Here is a code snippet for shuffling:

import os
import concurrent.futures
import pandas as pd

def read_shuffle_multiple_parquet_files(parquet_dir):
    # Collect every parquet shard in the directory.
    files = [os.path.join(parquet_dir, f)
             for f in os.listdir(parquet_dir)
             if f.endswith('.parquet.snappy')]
    # Read the shards in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as executor:
        dfs = list(executor.map(pd.read_parquet, files))
    # Concatenate all rows and shuffle them globally.
    df = pd.concat(dfs, ignore_index=True)
    df = df.sample(frac=1).reset_index(drop=True)
    df.to_parquet('out.parquet.snappy', compression='snappy')
