Why so many faces? #2

Open
YifanXu74 opened this issue Jan 9, 2024 · 5 comments

Comments

@YifanXu74

Hi, nice work!

I noticed that the example samples shown on HuggingFace mostly consist of human faces. Does the actual distribution of the dataset look like this? If so, does it introduce significant bias?

@linyq17
Collaborator

linyq17 commented Jan 10, 2024

Thanks for your interest in our work. Our dataset is organized by K-means clustering labels, so the samples shown on HuggingFace likely all come from a cluster centered on the human-face concept. The dataset has also been balanced over the K-means labels during sampling.

@YifanXu74
Author

YifanXu74 commented Jan 10, 2024

Got it. So does this mean that the parquet files of the released dataset are ordered by K-means clustering labels without any shuffling? If so, this may cause biased training with WebDataset, since WebDataset does not fully shuffle the data.
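To see why this matters: WebDataset shuffles with a fixed-size streaming buffer rather than a full global shuffle, so when the data arrives ordered by cluster, a buffer smaller than a cluster barely mixes the labels. Below is a minimal stdlib sketch of that failure mode; the buffered_shuffle helper and the cluster sizes are illustrative assumptions, not the actual dataset or the WebDataset API.

```python
import random

def buffered_shuffle(stream, bufsize, seed=0):
    """Streaming shuffle with a fixed-size buffer (the same idea as
    WebDataset's shuffle buffer): keep up to `bufsize` items, emit a
    random one each time the buffer is full."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= bufsize:
            yield buf.pop(rng.randrange(len(buf)))
    rng.shuffle(buf)  # drain the remainder
    yield from buf

# Toy data ordered by cluster label: 1000 samples of cluster 0,
# then 1000 of cluster 1, and so on.
data = [c for c in range(10) for _ in range(1000)]

# A buffer much smaller than one cluster barely mixes labels: by the time
# the first 500 items are emitted, only cluster-0 samples have been read,
# so every one of those 500 outputs is still from cluster 0.
small = list(buffered_shuffle(data, bufsize=100))
print(all(x == 0 for x in small[:500]))
```

This is why shuffling globally before packing the tar shards (as suggested below in this thread) is necessary; a shuffle buffer alone cannot repair cluster-ordered shards.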

@linyq17
Collaborator

linyq17 commented Jan 10, 2024

Yes, the released dataset is ordered by the clustering labels. If WebDataset is used for training, the data needs to be shuffled before packing. Our released model was trained on the shuffled dataset. Thanks for the reminder; we will add a note to the README.

@YifanXu74
Author

That's great. Currently, when downloading the parquet dataset with img2dataset, the images in one tar file probably share similar labels. It would be very useful if a script were provided to shuffle the downloaded data.

@linyq17
Collaborator

linyq17 commented Jan 11, 2024

Sure, it can be shuffled with pandas. Here is a code snippet for shuffling:

import os
import concurrent.futures
import pandas as pd

def read_shuffle_multiple_parquet_files(parquet_dir):
    # Collect every parquet shard in the directory.
    files = [os.path.join(parquet_dir, f)
             for f in os.listdir(parquet_dir)
             if f.endswith('.parquet.snappy')]
    # Read the shards in parallel.
    with concurrent.futures.ThreadPoolExecutor(max_workers=64) as executor:
        dfs = list(executor.map(pd.read_parquet, files))
    # Concatenate all rows and shuffle them globally.
    df = pd.concat(dfs, ignore_index=True)
    df = df.sample(frac=1).reset_index(drop=True)
    df.to_parquet('out.parquet.snappy', compression='snappy')
