GroupViT: Semantic Segmentation Emerges from Text Supervision

GroupViT is a framework for learning semantic segmentation purely from text captions without using any mask supervision. It learns to perform bottom-up heirarchical spatial grouping of semantically-related visual regions. This repository is the official implementation of GroupViT introduced in the paper:

GroupViT: Semantic Segmentation Emerges from Text Supervision, Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang, CVPR 2022.

Visual Results

Links

Jiarui Xu's Project Page (with additonal visual results)
arXiv Page

Citation

If you find our work useful in your research, please cite:

@article{xu2022groupvit,
  author    = {Xu, Jiarui and De Mello, Shalini and Liu, Sifei and Byeon, Wonmin and Breuel, Thomas and Kautz, Jan and Wang, Xiaolong},
  title     = {GroupViT: Semantic Segmentation Emerges from Text Supervision},
  journal   = {arXiv preprint arXiv:2202.11094},
  year      = {2022},
}

Environmental Setup

Python 3.7
PyTorch 1.8
webdataset 0.1.103
mmsegmentation 0.18.0
timm 0.4.12

Instructions:

conda create -n groupvit python=3.7 -y
conda activate groupvit
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install mmcv-full==1.3.14 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
pip install mmsegmentation==0.18.0
pip install webdataset==0.1.103
pip install timm==0.4.12
git clone https://github.com/NVIDIA/apex
cd && apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
pip install opencv-python==4.4.0.46 termcolor==1.1.0 diffdist einops omegaconf
pip install nltk ftfy regex tqdm

Demo

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the web demo:
Run the demo on Google Colab:
To run the demo from the command line:

python demo/demo_seg.py --cfg configs/group_vit_gcc_yfcc_30e.yml --resume /path/to/checkpoint --vis input_pred_label final_group --input demo/examples/voc.jpg --output_dir demo/output

The output is saved in demo/output/.

Benchmark Results

	Zero-shot Classification	Zero-shot Segmentation
config	ImageNet	Pascal VOC	Pascal Context	COCO
GCC + YFCC (cfg)	43.7	52.3	22.4	24.3
GCC + RedCaps (cfg)	51.6	50.8	23.7	27.5

Pre-trained weights group_vit_gcc_yfcc_30e-879422e0.pth and group_vit_gcc_redcap_30e-3dd09a76.pth for these models are provided by Jiarui Xu here.

Data Preparation

During training, we use webdataset for scalable data loading. To convert image text pairs into the webdataset format, we use the img2dataset tool to download and preprocess the dataset.

For inference, we use mmsegmentation for semantic segmentation testing, evaluation and visualization on Pascal VOC, Pascal Context and COCO datasets.

The overall file structure is as follows:

GroupViT
├── local_data
│   ├── gcc3m_shards
│   │   ├── gcc-train-000000.tar
│   │   ├── ...
│   │   ├── gcc-train-000436.tar
│   ├── gcc12m_shards
│   │   ├── gcc-conceptual-12m-000000.tar
│   │   ├── ...
│   │   ├── gcc-conceptual-12m-001943.tar
│   ├── yfcc14m_shards
│   │   ├── yfcc14m-000000.tar
│   │   ├── ...
│   │   ├── yfcc14m-001888.tar
│   ├── redcap12m_shards
│   │   ├── redcap12m-000000.tar
│   │   ├── ...
│   │   ├── redcap12m-001211.tar
│   ├── imagenet_shards
│   │   ├── imagenet-val-000000.tar
│   │   ├── ...
│   │   ├── imagenet-val-000049.tar
│   ├── VOCdevkit
│   │   ├── VOC2012
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClass
│   │   │   ├── ImageSets
│   │   │   │   ├── Segmentation
│   │   ├── VOC2010
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClassContext
│   │   │   ├── ImageSets
│   │   │   │   ├── SegmentationContext
│   │   │   │   │   ├── train.txt
│   │   │   │   │   ├── val.txt
│   │   │   ├── trainval_merged.json
│   │   ├── VOCaug
│   │   │   ├── dataset
│   │   │   │   ├── cls
│   ├── coco
│   │   ├── images
│   │   │   ├── train2017
│   │   │   ├── val2017
│   │   ├── annotations
│   │   │   ├── train2017
│   │   │   ├── val2017

The instructions for preparing each dataset are as follows.

GCC3M

Please download the training split annotation file from Conceptual Caption 12M and name it as gcc3m.tsv.

Then run img2dataset to download the image text pairs and save them in the webdataset format.

sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset\
            --output_folder local_data/gcc3m_shards
            --processes_count 16 --thread_count 64
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*

Please refer to img2dataset CC3M tutorial for more details.

GCC12M

Please download the annotation file from Conceptual Caption 12M and name it as gcc12m.tsv.

Then run img2dataset to download the image text pairs and save them in the webdataset format.

sed -i '1s/^/caption\turl\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset\
            --output_folder local_data/gcc12m_shards \
            --processes_count 16 --thread_count 64
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-conceptual-12m-/' local_data/gcc12m_shards/*

Please refer to img2dataset CC12M tutorial for more details.

YFCC14M

Please follow the CLIP Data Preparation instructions to download the YFCC14M subset.

wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
bunzip2 yfcc100m_subset_data.tsv.bz2

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
configs		configs
convert_dataset		convert_dataset
datasets		datasets
demo		demo
figs		figs
models		models
segmentation		segmentation
tools		tools
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
main_group_vit.py		main_group_vit.py
main_seg.py		main_seg.py
setup.cfg		setup.cfg

License

NVlabs/GroupViT

Folders and files

Latest commit

History

Repository files navigation