
Finetuning for multiple classes #114

Closed
NamburiSrinath opened this issue Oct 25, 2022 · 6 comments

Comments

@NamburiSrinath

Hi,

I tried playing with Stable Diffusion (https://github.com/huggingface/diffusers) to generate images, but didn't achieve good-quality results.

Example prompt: "Aeroplane"
[generated image: plane]

Note: some generated images are of nice quality (e.g. [generated image: plane2]).

I came across your repo and found that I can use "Textual Inversion" to fine-tune and get the concept transferred. But I have multiple classes (ship, aeroplane, etc.) and would like to know how I can fine-tune on multiple classes.

In short, for inference you mention that we need to prompt with "A photo of *", but I would like to use "A photo of an aeroplane", "A photo of a ship", etc. after fine-tuning the model. (I checked this and am curious to know whether this repo can work for my case - #8)

Thanks in advance
Srinath

@rinongal
Owner

Hi,

If you're using this implementation, you can just train one model for each concept individually and then use the merge embeddings script to combine the learned embeddings into a single file. There are instructions for that in the readme.

You can also change your placeholder token from * to 'ship' or 'plane'. Look at either the config file or main.py's run arguments for how to do this.
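
For example, a rough sketch of that workflow (run names, paths, and checkpoint file names below are placeholders; check the README and main.py for the authoritative argument list):

# Train one embedding per concept, each with its own placeholder string:
python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
               -t \
               --actual_resume models/ldm/text2img-large/model.ckpt \
               --placeholder_string "ship" \
               --init_word "ship" \
               -n ship_run \
               --gpus 0, \
               --data_root train_data/ship
# (repeat with --placeholder_string "plane", --data_root train_data/plane, and so on)

# Merge the per-concept embedding files into a single file:
python merge_embeddings.py \
    --manager_ckpts logs/<ship_run_dir>/checkpoints/embeddings_gs-XXXX.pt \
                    logs/<plane_run_dir>/checkpoints/embeddings_gs-XXXX.pt \
    --output_path ship_plane.pt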

@NamburiSrinath
Author

Thanks for your response @rinongal. I tried to invert two classes, "airplane" and "truck".

Inversion command (for airplane):

python main.py --base configs/latent-diffusion/txt2img-1p4B-finetune.yaml \
               -t \
               --actual_resume models/ldm/text2img-large/model.ckpt \
               --placeholder_string "airplane" \
               -n airplane_run_1 \
               --gpus 0, \
               --data_root train_data/airplane \
               --init_word "airplane"

And the inference command is:

python scripts/txt2img.py --ddim_eta 0.0 \
                          --n_samples 8 \
                          --n_iter 2 \
                          --scale 10.0 \
                          --ddim_steps 50 \
                          --embedding_path /hdd2/srinath/textual_inversion/logs/airplane2022-10-25T23-13-42_truck_run_1/checkpoints/embeddings_gs-6099.pt \
                          --ckpt_path models/ldm/text2img-large/model.ckpt \
                          --prompt "a photo of airplane"

The images in train_data/airplane are:
[5 training images]

And the images generated in outputs/samples are:
[generated samples: a-photo-of-*]

which I believe is a fair generation of an airplane, capturing the style/features present in the above 5 images.

But I have a few questions and need suggestions from your end:

  1. I am having some difficulty understanding the checkpoints structure. Attached are the checkpoints for airplane:

[screenshot: contents of the checkpoints folder]

  • Can you explain what the number beside embeddings_gs-xxxx.pt means (is it related to the epoch?). This is to understand when the model converged, how many epochs it ran for, and how to make sure I am using the best .pt file.
  • How are the images in the images/train folder generated, and what exactly are they for? I understand that they are generated to be used as input, but can you elaborate a bit / point to a reference where I can get more details?
  2. How can I generalize this behaviour? i.e. suppose I want different colors of airplanes, how can I do that: by prompt engineering ("a photo of airplane in blue") and/or by having images with the relevant styles in train_data/airplane (i.e. making sure there is a blue airplane in the train data)?

  3. I also tried the merge script that you suggested (the command I ran is):

python merge_embeddings.py \
--manager_ckpts /hdd2/srinath/textual_inversion/logs/airplane2022-10-25T22-56-37_airplane_run_1/checkpoints/embeddings_gs-6099.pt \
/hdd2/srinath/textual_inversion/logs/truck2022-10-25T23-13-42_truck_run_1/checkpoints/embeddings_gs-6099.pt \
--output_path airplane_truck.pt

and it did generate the airplane_truck.pt file. Now my question is:

  4. Can I safely assume that airplane_truck.pt is better than the individual .pt embeddings at generating images of airplanes and trucks when we pass a prompt?

Thanks a lot for your time :)
Srinath

@rinongal
Owner

  1. The number in the _gs-xxxx suffix is the step number (the number of training iterations). This is not the number of epochs, but you can estimate the number of epochs by dividing this number by your number of training images.
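     By that estimate, for example, the embeddings_gs-6099.pt checkpoint above, trained on 5 images, works out to roughly 6099 / 5 ≈ 1220 epochs.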

You can find an explanation of the logged images here: #19 and here: #34

  2. Generalizing: If you are using LDM, you can just change the text you use for generation, so "a photo of airplane in blue" should be fine. If you see that it fails to change, try using an earlier training checkpoint, since you may have overfit (typically ~5000 steps should be good enough). You do not need to have blue airplanes in your data (see the example commands at the end of this comment).

If you are using Stable Diffusion, its text encoder is significantly weaker and more prone to overfitting, which may make some modifications harder. In that case, you may have to use more complex prompts or some prompt-weighting method (for which I'd recommend the AUTOMATIC1111 WebUI).

3(+4). The merged embedding file is just putting both of your new words into one single file. It won't create better trucks than just the truck file, and it won't create better airplanes than just the airplane file. It just stores them in a way that lets you access both at the same time.
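
As a concrete sketch of points 2 and 3(+4) (the log directory and checkpoint file names below are placeholders; use whichever files you actually have):

# 2. Generalizing with LDM: change the prompt, and optionally point at an
#    earlier (less overfit) embedding checkpoint:
python scripts/txt2img.py --ddim_eta 0.0 --n_samples 8 --n_iter 2 \
                          --scale 10.0 --ddim_steps 50 \
                          --embedding_path logs/<airplane_run_dir>/checkpoints/embeddings_gs-XXXX.pt \
                          --ckpt_path models/ldm/text2img-large/model.ckpt \
                          --prompt "a photo of airplane in blue"

# 3(+4). Using the merged file: point --embedding_path at airplane_truck.pt and
#        prompt for either word; quality should match the individual embeddings.
python scripts/txt2img.py --ddim_eta 0.0 --n_samples 8 --n_iter 2 \
                          --scale 10.0 --ddim_steps 50 \
                          --embedding_path airplane_truck.pt \
                          --ckpt_path models/ldm/text2img-large/model.ckpt \
                          --prompt "a photo of truck"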

@NamburiSrinath
Author

Thank you so much. I have additional questions that are not related to this thread, so I'm closing it :)

@yuxu915 commented Apr 16, 2023

@rinongal @NamburiSrinath Hi, I have a related question: I'm using Stable Diffusion and I find it very hard to modify the result. For example, below are images generated with the prompts "a photo of *" and "a photo of * in river". The embedding is at step 499 and should not be overfit.
[generated images for both prompts]

@yuxu915 commented Apr 16, 2023

@NamburiSrinath Hi, I'd like to ask about the merge of airplane and truck: are the output images a mixture of airplanes and trucks, or separate airplanes and trucks? Thank you.
