We propose a method to fuse language structures into diffusion guidance for compositional text-to-image generation.
This is the official codebase for Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis.
Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis
Weixi Feng1,
Xuehai He2,
Tsu-Jui Fu1,
Varun Jampani3,
Arjun Akula3,
Pradyumna Narayana3,
Sugato Basu3,
Xin Eric Wang2,
William Yang Wang1
1UCSB, 2UCSC, 3Google
Apr. 4th: updated links, uploaded benchmarks and GLIP eval scripts, updated BibTeX.
Clone this repository and then create a conda environment with:
conda env create -f environment.yaml
conda activate structure_diffusion
If you already have a stable diffusion environment, you can run the following commands:
pip install stanza nltk scenegraphparser tqdm matplotlib
pip install -e .
This repository currently supports Stable Diffusion v1.4. Please refer to the official stable-diffusion repository to download the pre-trained model and put it under models/ldm/stable-diffusion-v1/.
Our method is training-free and can be applied directly to a pre-trained Stable Diffusion checkpoint.
To generate an image, run
python scripts/txt2img_demo.py --prompt "A red teddy bear in a christmas hat sitting next to a glass" --plms --parser_type constituency
By default, the guidance scale is set to 7.5 and the output image size is 512x512. We currently support only PLMS sampling with a batch size of 1.
Apart from the default Stable Diffusion arguments, we add --parser_type and --conjunction.
usage: txt2img_demo.py [-h] [--prompt [PROMPT]] ...
[--parser_type {constituency,scene_graph}] [--conjunction] [--save_attn_maps]
optional arguments:
...
--parser_type {constituency,scene_graph}
--conjunction If True, the input prompt is a conjunction of two concepts like "A and B"
--save_attn_maps If True, the attention maps will be saved as a .pth file with the same name as the image
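Conceptually, `--parser_type constituency` extracts the noun-phrase spans of the prompt from a constituency parse and treats each span as a separate text segment. The minimal sketch below illustrates that extraction step on a hand-written bracketed parse; it is only an illustration of the idea — the actual code uses stanza's constituency parser to produce the tree, and the function names here are hypothetical.

```python
# Sketch of noun-phrase extraction from a bracketed constituency parse.
# The parse string is hand-written for illustration; in the repo, the
# tree comes from stanza's constituency parser.

def parse_sexpr(s):
    """Parse a bracketed tree string into (label, children) tuples; leaves are words."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        if tokens[i] == "(":
            label = tokens[i + 1]
            children, i = [], i + 2
            while tokens[i] != ")":
                child, i = read(i)
                children.append(child)
            return (label, children), i + 1
        return tokens[i], i + 1

    tree, _ = read(0)
    return tree

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

def noun_phrases(node, out=None):
    """Collect every NP span (including nested ones) as a string."""
    out = [] if out is None else out
    if not isinstance(node, str):
        label, children = node
        if label == "NP":
            out.append(" ".join(leaves(node)))
        for child in children:
            noun_phrases(child, out)
    return out

parse = ("(S (NP (DT a) (JJ red) (NN teddy) (NN bear)) "
         "(PP (IN in) (NP (DT a) (NN christmas) (NN hat))))")
print(noun_phrases(parse_sexpr(parse)))
```

Each extracted span ("a red teddy bear", "a christmas hat") is then encoded separately by the text encoder and used to build the additional value sequences described below.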
Without specifying the --conjunction argument, the model applies one key and multiple values for each cross-attention layer.
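The "one key, multiple values" idea can be sketched as follows: a single attention map is computed from the full-prompt key, then reused with the value sequences encoded from each parsed segment, and the resulting outputs are averaged. This is a simplified NumPy sketch of that computation, not the repo's implementation — the function name, shapes, and single-head setup are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def structured_cross_attention(q, k, value_list):
    """Single-head sketch: one attention map from the full-prompt key,
    reused across the value sequences of each parsed segment, then averaged.

    q: (n_pixels, d)   image-side queries
    k: (n_tokens, d)   keys from the full prompt
    value_list: list of (n_tokens, d) value sequences, one per segment
    """
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # one shared attention map
    outs = [attn @ v for v in value_list]           # reuse the map per value set
    return np.mean(outs, axis=0)                    # average the outputs
```

With a single value sequence (the full prompt only), this reduces to standard cross-attention.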
For concept conjunction prompts, you can run: