This repository is the official PyTorch implementation of Seer, introduced in the paper:
Seer: Language Instructed Video Prediction with Latent Diffusion Models. ICLR 2024
Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, and Yang Gao
- Python 3.8
- PyTorch 1.12.0
- Other dependencies
Create a new conda environment.
conda create -n seer python=3.8 -y
conda activate seer
Install the following packages (note: only accelerate, transformers, xformers, and diffusers require the indicated versions):
pip install -r requirements.txt
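As a quick sanity check that the pinned libraries resolved correctly, a snippet like the following should run without errors (assuming a CUDA-capable machine; xformers may warn otherwise):

```python
# optional post-install check of the core dependencies
import torch, accelerate, transformers, xformers, diffusers

print("torch", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("diffusers", diffusers.__version__, "| transformers", transformers.__version__)
```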
The default datasets used to fine-tune and evaluate Seer are Something-Something v2 (Sthv2), Bridge Data, and Epic Kitchens. See MMAction2 for detailed steps on extracting video frames.
The overall file structure in Sthv2:
data
├── annotations
│   ├── train.json
│   ├── validation.json
│   └── test.json
├── rawframes
│   ├── 1
│   │   ├── img_00001.jpg
│   │   ├── img_00002.jpg
│   │   ├── img_00003.jpg
│   │   └── ...
│   ├── 2
│   └── ...
└── videos
    ├── 1.mp4
    ├── 2.mp4
    └── ...
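Before training, a minimal sketch like this can verify the layout (assuming the data root is ./data; adjust paths to your setup):

```python
# hedged sanity check for the Sthv2 layout above
from pathlib import Path

root = Path("data")
for ann in ("train.json", "validation.json", "test.json"):
    assert (root / "annotations" / ann).is_file(), f"missing annotations/{ann}"
clips = [d for d in (root / "rawframes").iterdir() if d.is_dir()]
print(f"{len(clips)} clips with extracted frames")
```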
The overall file structure in Bridge Data (subfolder names follow the path list in dataset/path_id_bridgedata.txt):
data
└── rawframes
    ├── close_brown1fbox_flap
    │   ├── 0
    │   │   ├── img_00001.jpg
    │   │   ├── img_00002.jpg
    │   │   ├── img_00003.jpg
    │   │   └── ...
    │   ├── 1
    │   ├── 2
    │   └── ...
    ├── close_small4fbox_flaps
    └── ...
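Since the subfolder names come from dataset/path_id_bridgedata.txt, a hedged check (assuming that file lists one relative path per line) could look like:

```python
# report any Bridge Data subfolder listed in the path file but missing on disk
from pathlib import Path

root = Path("data/rawframes")
with open("dataset/path_id_bridgedata.txt") as f:
    for name in (line.strip() for line in f if line.strip()):
        if not (root / name).is_dir():
            print("missing:", name)
```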
The overall file structure in Epic Kitchens:
data
├── P01
│   └── rgb_frames
│       ├── P01_01
│       │   ├── frame_0000000001.jpg
│       │   ├── frame_0000000002.jpg
│       │   └── ...
│       ├── P01_02
│       └── ...
├── P02
└── ...
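Note the two frame-naming conventions in the trees above; a small helper makes the mapping from frame index to filename explicit:

```python
# Sthv2/Bridge Data use img_%05d.jpg, Epic Kitchens uses frame_%010d.jpg
def frame_name(idx: int, dataset: str) -> str:
    if dataset == "epic":
        return f"frame_{idx:010d}.jpg"
    return f"img_{idx:05d}.jpg"

print(frame_name(1, "sthv2"))  # img_00001.jpg
print(frame_name(1, "epic"))   # frame_0000000001.jpg
```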
1. Download the initialization checkpoint of the FSText model (download), then place it under store_pth/fstext_init.
2. Fine-tune on 24GB NVIDIA 3090 GPUs by running:
accelerate launch train.py --config ./configs/train.yaml
The default Stable Diffusion version is runwayml/stable-diffusion-v1-5.
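Fine-tuning starts from these weights; one way to pre-fetch them with diffusers (a sketch of the download step, not necessarily how train.py loads them) is:

```python
# cache the base Stable Diffusion v1.5 weights from the Hugging Face Hub
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
```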
The checkpoints fine-tuned on Something-Something v2 (Sthv2), Bridge Data, and Epic Kitchens can be downloaded as follows:
| Dataset | Training steps | Num ref. frames | Num frames | Link |
|---|---|---|---|---|
| Sthv2 | 200k | 2 | 12 | [checkpoints] |
| Bridge Data | 80k | 1 | 16 | [checkpoints] |
| Epic Kitchens | 80k | 1 | 16 | [checkpoints] |
| Sthv2+Bridge | 200k+80k | 1 | 16 | [checkpoints] |
After downloading a checkpoint file, place it under the outputs/ folder and set the output_dir attribute in inference.yaml or eval.yaml.
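A hedged helper to confirm the config points at the downloaded checkpoint (this assumes output_dir is a top-level key in the YAML and that pyyaml is installed):

```python
# verify that output_dir in configs/inference.yaml exists on disk
from pathlib import Path
import yaml

cfg = yaml.safe_load(open("configs/inference.yaml"))
print("checkpoint dir exists:", Path(cfg["output_dir"]).exists())
```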
This is the inference stage of Seer. To sample batches of video clips from a dataset and visualize the results, run (make sure the checkpoint file exists):
accelerate launch inference.py --config ./configs/inference.yaml
To sample a video clip from an indicated image:
python inference_img.py \
--config="./configs/inference_base.yaml" \
--image_path="{image_name}.jpg|{sec_img(optional)}.jpg" \
--input_text_prompts="{your input text}"
For example, with one reference frame:
python inference_img.py \
    --config="./configs/inference_base.yaml" \
    --image_path="./src/figs/book.jpg" \
    --input_text_prompts="close book"
Or with two reference frames:
python inference_img.py \
    --config="./configs/inference_base.yaml" \
    --image_path="./src/figs/book.jpg|./src/figs/book.jpg" \
    --input_text_prompts="close book"
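The | in --image_path simply separates one or two reference frames; conceptually the argument is parsed like this (a sketch, the real parsing lives in inference_img.py):

```python
# split the pipe-separated argument into individual reference-frame paths
image_path = "./src/figs/book.jpg|./src/figs/book.jpg"
ref_frames = image_path.split("|")  # one entry per reference frame
print(f"{len(ref_frames)} reference frame(s):", ref_frames)
```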
(Hint: we recommend the Sthv2+Bridge checkpoint for improved performance on zero-shot video prediction tasks.)
The FVD/KVD evaluation is based on the VideoGPT implementation. To compute these metrics, set compute_fvd: True in eval.yaml and run:
accelerate launch eval.py --config ./configs/eval.yaml
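For reference, FVD builds on the Fréchet distance between Gaussians fitted to features of real and generated clips; a minimal sketch of that distance (not the repo's exact implementation, which follows VideoGPT) is:

```python
# Frechet distance between two Gaussians N(mu_r, sigma_r) and N(mu_g, sigma_g)
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    covmean = sqrtm(sigma_r @ sigma_g).real  # matrix square root of the covariance product
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```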
If you find this repository useful, please consider giving a star ⭐ and citation:
@article{gu2023seer,
  author  = {Gu, Xianfan and Wen, Chuan and Ye, Weirui and Song, Jiaming and Gao, Yang},
  title   = {Seer: Language Instructed Video Prediction with Latent Diffusion Models},
  journal = {arXiv preprint arXiv:2303.14897},
  year    = {2023},
}
This code builds on Diffusers and is modified from Tune-A-Video.