Structured Diffusion Guidance (ICLR 2023)

We propose a method to fuse language structures into diffusion guidance for compositional text-to-image generation.

This is the official codebase for Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis.

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
Weixi Feng1, Xuehai He2, Tsu-Jui Fu1, Varun Jampani3, Arjun Akula3, Pradyumna Narayana3, Sugato Basu3, Xin Eric Wang2, William Yang Wang1
1UCSB, 2UCSC, 3Google

Update:

Apr. 4th: updated links, uploaded benchmarks and GLIP eval scripts, updated bibtex.

Setup

Clone this repository and then create a conda environment with:

conda env create -f environment.yaml
conda activate structure_diffusion

If you already have a Stable Diffusion environment, you can instead run the following commands:

pip install stanza nltk scenegraphparser tqdm matplotlib
pip install -e .
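The constituency parsing used by the demo script is backed by Stanza, whose English models are downloaded on first use. If you prefer to fetch them ahead of time (for example, on a machine without network access at inference time), a sketch of a pre-download step, assuming the default English package is sufficient, is:

python -c "import stanza; stanza.download('en')"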

Inference

This repository supports Stable Diffusion v1.4 for now. Please refer to the official stable-diffusion repository to download the pre-trained model and put it under models/ldm/stable-diffusion-v1/. Our method is training-free and can be applied directly to the pre-trained Stable Diffusion checkpoint.
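For reference, the expected checkpoint layout follows the official Stable Diffusion repository. Assuming you downloaded sd-v1-4.ckpt, a sketch of placing it (the model.ckpt filename mirrors the official repo's convention and is an assumption here) is:

mkdir -p models/ldm/stable-diffusion-v1/
ln -s <path/to/sd-v1-4.ckpt> models/ldm/stable-diffusion-v1/model.ckpt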

To generate an image, run:

python scripts/txt2img_demo.py --prompt "A red teddy bear in a christmas hat sitting next to a glass" --plms --parser_type constituency

By default, the guidance scale is set to 7.5 and the output image size is 512x512. Only PLMS sampling with a batch size of 1 is supported for now. Apart from the default Stable Diffusion arguments, we add --parser_type, --conjunction, and --save_attn_maps.

usage: txt2img_demo.py [-h] [--prompt [PROMPT]] ...
                       [--parser_type {constituency,scene_graph}] [--conjunction] [--save_attn_maps]

optional arguments:
    ...
  --parser_type {constituency,scene_graph}
  --conjunction         If True, the input prompt is a conjunction of two concepts like "A and B"
  --save_attn_maps      If True, the attention maps will be saved as a .pth file with the same name as the image
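For example, to switch to the scene graph parser (and optionally save the attention maps), an illustrative invocation would be:

python scripts/txt2img_demo.py --prompt "A brown bench in front of an old white building" --plms --parser_type scene_graph --save_attn_maps

The prompt here is only an example; any compositional prompt works the same way.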

Without the --conjunction flag, the model applies one key and multiple values for each cross-attention layer. For concept conjunction prompts (e.g., "A and B"), you can run:
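A sketch of such a run, with an illustrative "A and B"-style prompt, is:

python scripts/txt2img_demo.py --prompt "A red car and a white sheep" --plms --parser_type constituency --conjunction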