This repository is the official PyTorch implementation of Seer, introduced in the paper:
Seer: Language Instructed Video Prediction with Latent Diffusion Models. ICLR 2024
Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, and Yang Gao
- Python 3.8
- PyTorch 1.12.0
- Other dependencies
Create a new conda environment.
conda create -n seer python=3.8 -y
conda activate seer
Install the following packages (note: only accelerate, transformers, xformers, and diffusers require the indicated versions):
pip install -r requirements.txt
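As a quick sanity check that the pinned libraries resolved correctly, a snippet like the following should run without errors (assuming a CUDA-capable machine; xformers may warn otherwise):

```python
# optional post-install check of the core dependencies
import torch, accelerate, transformers, xformers, diffusers

print("torch", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__, "| transformers", transformers.__version__)
```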
The default datasets used to fine-tune and evaluate Seer are Something-Something v2 (Sthv2), Bridge Data, and Epic Kitchens. See MMAction2 for detailed steps on extracting video frames.
The overall file structure in Sthv2:
data
├── annotations
│   ├── train.json
│   ├── validation.json
│   └── test.json
├── rawframes
│   ├── 1
│   │   ├── img_00001.jpg
│   │   ├── img_00002.jpg
│   │   ├── img_00003.jpg
│   │   └── ...
│   ├── 2
│   └── ...
└── videos
    ├── 1.mp4
    ├── 2.mp4
    └── ...
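Before training, a minimal sketch like this can verify the layout (assuming the data root is ./data; adjust paths to your setup):

```python
# hedged sanity check for the Sthv2 layout above
from pathlib import Path

root = Path("data")
for ann in ("train.json", "validation.json", "test.json"):
    assert (root / "annotations" / ann).is_file(), f"missing annotations/{ann}"
clips = [d for d in (root / "rawframes").iterdir() if d.is_dir()]
print(f"{len(clips)} clips with extracted frames")
```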
The overall file structure in Bridge Data (subfolder names follow the path list in dataset/path_id_bridgedata.txt):
data
└── rawframes
    ├── close_brown1fbox_flap
    │   ├── 0
    │   │   ├── img_00001.jpg
    │   │   ├── img_00002.jpg
    │   │   ├── img_00003.jpg
    │   │   └── ...
    │   ├── 1
    │   ├── 2
    │   └── ...
    ├── close_small4fbox_flaps
    └── ...
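Since the subfolder names come from dataset/path_id_bridgedata.txt, a hedged check (assuming that file lists one relative path per line) could look like:

```python
# report any Bridge Data subfolder listed in the path file but missing on disk
from pathlib import Path

root = Path("data/rawframes")
with open("dataset/path_id_bridgedata.txt") as f:
    for name in (line.strip() for line in f if line.strip()):
        if not (root / name).is_dir():
            print("missing:", name)
```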
The overall file structure in Epic Kitchens:
data
├── P01
│   └── rgb_frames
│       ├── P01_01
│       │   ├── frame_0000000001.jpg
│       │   ├── frame_0000000002.jpg
│       │   └── ...
│       ├── P01_02
│       └── ...
├── P02
└── ...
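Note the two frame-naming conventions in the trees above; a small helper makes the mapping from frame index to filename explicit:

```python
# Sthv2/Bridge Data use img_%05d.jpg, Epic Kitchens uses frame_%010d.jpg
def frame_name(idx: int, dataset: str) -> str:
    if dataset == "epic":
        return f"frame_{idx:010d}.jpg"
    return f"img_{idx:05d}.jpg"

print(frame_name(1, "sthv2"))  # img_00001.jpg
print(frame_name(1, "epic"))   # frame_0000000001.jpg
```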
1. Download the initialization checkpoint of the FSText model (download), then place it under store_pth/fstext_init.
2. Fine-tune on 24GB NVIDIA 3090 GPUs by running:
accelerate launch train.py --config ./configs/train.yaml
The default Stable Diffusion version is runwayml/stable-diffusion-v1-5.
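Fine-tuning starts from these weights; one way to pre-fetch them with diffusers (a sketch of the download step, not necessarily how train.py loads them) is:

```python
# cache the base Stable Diffusion v1.5 weights from the Hugging Face Hub
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
```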
The checkpoints fine-tuned on Something-Something v2 (Sthv2), Bridge Data, and Epic Kitchens can be downloaded as follows:
| Dataset | Training steps | Num ref. frames | Num frames | Link |
|---|---|---|---|---|
| Sthv2 | 200k | 2 | 12 | [checkpoints] |
| Bridge Data | 80k | 1 | 16 | [checkpoints] |
| Epic Kitchens | 80k | 1 | 16 | [checkpoints] |
| Sthv2+Bridge | 200k+80k | 1 | 16 | [checkpoints] |
After downloading a checkpoint file, place it under the outputs/ folder and set the output_dir attribute in inference.yaml or eval.yaml.
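A hedged helper to confirm the config points at the downloaded checkpoint (this assumes output_dir is a top-level key in the YAML and that pyyaml is installed):

```python
# verify that output_dir in configs/inference.yaml exists on disk
from pathlib import Path
import yaml

cfg = yaml.safe_load(open("configs/inference.yaml"))
print("checkpoint dir exists:", Path(cfg["output_dir"]).exists())
```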
This is the inference stage of Seer. To sample batches of video clips from a dataset and visualize the results, run (make sure the checkpoint file exists):
accelerate launch inference.py --config ./configs/inference.yaml
To sample a video clip from an indicated image:
python inference_img.py \
--config="./configs/inference_base.yaml" \
--image_path="{image_name}.jpg|{sec_img(optional)}.jpg" \
--input_text_prompts="{your input text}"
For example, with one reference frame:
python inference_img.py \
    --config="./configs/inference_base.yaml" \
    --image_path="./src/figs/book.jpg" \
    --input_text_prompts="close book"
Or with two reference frames:
python inference_img.py \
    --config="./configs/inference_base.yaml" \
    --image_path="./src/figs/book.jpg|./src/figs/book.jpg" \
    --input_text_prompts="close book"
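The | in --image_path simply separates one or two reference frames; conceptually the argument is parsed like this (a sketch, the real parsing lives in inference_img.py):

```python
# split the pipe-separated argument into individual reference-frame paths
image_path = "./src/figs/book.jpg|./src/figs/book.jpg"
ref_frames = image_path.split("|")  # one entry per reference frame
print(f"{len(ref_frames)} reference frame(s):", ref_frames)
```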
(Hint: we recommend the Sthv2+Bridge checkpoint for improved performance on zero-shot video prediction tasks.)
The FVD/KVD evaluation is based on the VideoGPT implementation. To compute these metrics, set compute_fvd: True in eval.yaml and run:
accelerate launch eval.py --config ./configs/eval.yaml
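For reference, FVD builds on the Fréchet distance between Gaussians fitted to features of real and generated clips; a minimal sketch of that distance (not the repo's exact implementation, which follows VideoGPT) is:

```python
# Frechet distance between two Gaussians N(mu_r, sigma_r) and N(mu_g, sigma_g)
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    covmean = sqrtm(sigma_r @ sigma_g).real  # matrix square root of the covariance product
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```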
If you find this repository useful, please consider giving a star ⭐ and citation:
@article{gu2023seer,
  author  = {Gu, Xianfan and Wen, Chuan and Ye, Weirui and Song, Jiaming and Gao, Yang},
  title   = {Seer: Language Instructed Video Prediction with Latent Diffusion Models},
  journal = {arXiv preprint arXiv:2303.14897},
  year    = {2023},
}
This code builds on Diffusers and is modified from Tune-A-Video.