
How to train the first stage? #3

Open
Weifeng-Chen opened this issue Sep 19, 2022 · 15 comments

Comments

@Weifeng-Chen

Hi, do you train your text encoder in the CLIP way, without latent diffusion? Or do you train it together with the diffusion model? What are the loss function and the other details? Would you like to share more about the training?

@ScottishFold007

> Hi, do you train your text encoder in the CLIP way, without latent diffusion? Or do you train it together with the diffusion model? What are the loss function and the other details? Would you like to share more about the training?

I did this by directly training the CLIP model while freezing the ViT weights, which lets you increase the batch size: with 64-character captions, a batch size of 96 fits on a Colab P100. Easy!
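For reference, here is a minimal sketch of that recipe: freeze the CLIP image tower and train only a Chinese text tower with the standard contrastive loss. The model names, the extra projection layer, and the hyperparameters below are illustrative assumptions, not the commenter's exact setup.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPVisionModelWithProjection, BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen image tower: the original CLIP ViT-L/14 plus its projection head (768-d output).
vision = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14").to(device).eval()
vision.requires_grad_(False)

# Trainable Chinese text tower, projected into the same 768-d embedding space.
tokenizer = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
text = BertModel.from_pretrained("hfl/chinese-roberta-wwm-ext").to(device)
text_proj = torch.nn.Linear(text.config.hidden_size,
                            vision.config.projection_dim).to(device)
logit_scale = torch.nn.Parameter(torch.tensor(2.6593, device=device))  # ln(1/0.07), as in CLIP

optimizer = torch.optim.AdamW(
    list(text.parameters()) + list(text_proj.parameters()) + [logit_scale], lr=1e-5)

def clip_step(pixel_values, captions):
    """One contrastive step; image i and caption i are the only positive pair."""
    with torch.no_grad():
        img = vision(pixel_values=pixel_values).image_embeds             # (B, 768)
    tok = tokenizer(captions, padding=True, truncation=True, max_length=64,
                    return_tensors="pt").to(device)
    txt = text_proj(text(**tok).pooler_output)                           # (B, 768)
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = logit_scale.exp() * txt @ img.t()                           # (B, B) similarities
    labels = torch.arange(logits.size(0), device=device)
    loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```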

@Weifeng-Chen
Author

Weifeng-Chen commented Sep 19, 2022 via email

@Weifeng-Chen
Author

> Hi, do you train your text encoder in the CLIP way, without latent diffusion? Or do you train it together with the diffusion model? What are the loss function and the other details? Would you like to share more about the training?

> I did this by directly training the CLIP model while freezing the ViT weights, which lets you increase the batch size: with 64-character captions, a batch size of 96 fits on a Colab P100. Easy!

Oh, I have trained my own Chinese CLIP as well. So what I need to do next is jointly train it with the latent diffusion model?

@ScottishFold007

ScottishFold007 commented Sep 19, 2022 via email

@Weifeng-Chen
Author

> Let's just switch to Chinese. After my training finished, Chinese prompts produce reasonably good images; they just aren't very "Chinese" in style. Once you finish the first stage, do you get decent results, or are the outputs from Chinese prompts very poor, with the semantics not understood?


I may have to retrain... Because of the cross-attention mechanism here, my model's hidden dimension probably doesn't match. Were you able to take your trained CLIP and do Chinese generation directly, without any further fine-tuning?

@ScottishFold007

ScottishFold007 commented Sep 19, 2022 via email

@ScottishFold007

ScottishFold007 commented Sep 19, 2022 via email

@Weifeng-Chen
Author

> The hidden dimension has to be 768, and it's best to use CLIP's text encoder (unidirectional causal attention) rather than a BERT encoder; aligning directly to ViT-L transfers very well.

I trained CLIP's text encoder at first, but its downstream zero-shot performance came out mediocre (mainly a vocabulary issue). For generation, though, downstream quality seems hard to evaluate. I'll try aligning the dimensions later and see how it goes. Thanks a lot!!
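For context, the "downstream zero-shot" number discussed in this thread is the usual ImageNet-style evaluation: embed one Chinese prompt per class once, then classify each image by cosine similarity. A rough sketch, assuming the feature tensors come from whichever Chinese CLIP is being evaluated:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_accuracy(image_embeds, labels, class_text_embeds):
    """image_embeds: (N, D) image features; labels: (N,) ground-truth class ids;
    class_text_embeds: (C, D) features of one Chinese prompt per class."""
    img = F.normalize(image_embeds, dim=-1)
    txt = F.normalize(class_text_embeds, dim=-1)
    pred = (img @ txt.t()).argmax(dim=-1)           # nearest class prompt by cosine similarity
    return (pred == labels).float().mean().item()   # the ~0.2 vs ~0.4 figures cited later are this metric
```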

@ScottishFold007

> The hidden dimension has to be 768, and it's best to use CLIP's text encoder (unidirectional causal attention) rather than a BERT encoder; aligning directly to ViT-L transfers very well.

Yeah, that's exactly what I did this time. But purely in terms of how well CLIP adapts to Chinese, pairing a Chinese BERT with CLIP's ViT and then fine-tuning works even better.

@ScottishFold007

> The hidden dimension has to be 768, and it's best to use CLIP's text encoder (unidirectional causal attention) rather than a BERT encoder; aligning directly to ViT-L transfers very well.

That said, to get SD to produce properly "Chinese-style" results, you still have to go through the second stage: jointly train the text encoder / UNet / VAE, with some of them optionally frozen.
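A minimal sketch of that second stage with diffusers, under the assumptions that the VAE stays frozen, the UNet is trained with the usual latent-diffusion noise-prediction loss, and the (Chinese) text encoder is either frozen or trained jointly. The checkpoint name and hyperparameters are placeholders, not the commenters' setup.

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

vae.requires_grad_(False)            # frozen, as suggested above
text_encoder.requires_grad_(False)   # or leave trainable for joint training
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def train_step(pixel_values, captions):
    """One denoising-loss step on a batch of (image, caption) pairs."""
    with torch.no_grad():
        latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215
        tokens = tokenizer(captions, padding="max_length", truncation=True,
                           max_length=77, return_tensors="pt")
        cond = text_encoder(tokens.input_ids).last_hidden_state   # (B, 77, 768)
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (latents.size(0),))
    noisy = scheduler.add_noise(latents, noise, t)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    loss = F.mse_loss(pred, noise)                  # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```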

@Weifeng-Chen
Author

> The hidden dimension has to be 768, and it's best to use CLIP's text encoder (unidirectional causal attention) rather than a BERT encoder; aligning directly to ViT-L transfers very well.

> I trained CLIP's text encoder at first, but its downstream zero-shot performance came out mediocre (mainly a vocabulary issue). For generation, though, downstream quality seems hard to evaluate. I'll try aligning the dimensions later and see how it goes. Thanks a lot!!

> I directly hacked the vocab and the tokenizer, swapping the vocab for BERT's, and then applied a few tricks:
>
>   1. Changed [CLS] XXXXXX [SEP] [PAD] [PAD] into [CLS] XXXXXX [PAD] [PAD] [PAD]
>   2. Copied the original CLIP weight for [BOS] to [CLS], and the original CLIP weight for [EOS] to [PAD], to align them
>   3. This way I can fine-tune directly from the original CLIP weights and migrate very quickly to the Chinese BERT vocab

Hmm, how much data did you fine-tune on? I trained directly on over a hundred million samples, so loading a pretrained Chinese RoBERTa works much better. I recall that native CLIP without changing the vocab got roughly 0.2-something, while Chinese RoBERTa reached 0.4-something (on a Chinese translation of ImageNet-1k). I also tried keeping CLIP's weights and only swapping the vocab, and that didn't work well either, so your trick may indeed be quite important (but this experiment is rather expensive, haha; I may have to try it on a smaller dataset).
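One plausible way to implement the weight-copying trick quoted above: keep the CLIP text encoder, replace its token-embedding table with one sized for a BERT vocabulary, and copy CLIP's [BOS]/[EOS] rows into BERT's [CLS]/[PAD] slots. The model names are illustrative, and details beyond the embedding table (for example, where CLIP pools its text features) would still need matching care.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer, BertTokenizer

clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
bert_tok = BertTokenizer.from_pretrained("bert-base-chinese")

old_emb = clip_text.get_input_embeddings().weight.data   # (49408, 768)
hidden = old_emb.size(1)

# New embedding table sized for the BERT vocab, randomly initialised.
new_emb = torch.nn.Embedding(bert_tok.vocab_size, hidden)
torch.nn.init.normal_(new_emb.weight, std=0.02)

# Copy the special-token rows so sequences shaped like
# [CLS] xxxx [PAD] [PAD] ... behave like CLIP's <bos> xxxx <eos> ...
new_emb.weight.data[bert_tok.cls_token_id] = old_emb[clip_tok.bos_token_id]
new_emb.weight.data[bert_tok.pad_token_id] = old_emb[clip_tok.eos_token_id]

clip_text.set_input_embeddings(new_emb)
clip_text.config.vocab_size = bert_tok.vocab_size
# From here, fine-tune clip_text on Chinese image-text pairs as usual.
```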

@ScottishFold007

Do you train both the ViT and the text weights? I only unfreeze the text model's weights and keep the ViT frozen, otherwise 16 GB of VRAM can't handle it; the results are still decent, with tens of millions of training samples.

@Weifeng-Chen
Author

> I used the Chinese subset of LAION-5B, about 100 million samples. Is the score you mean the similarity-matching score? If so, native CLIP's matching scores are all quite low, yet it guides stable diffusion well.

I had no idea the LAION Chinese subset was that large. I used Noah's open-source Wukong and 360's open-source Zero. I later used the model to guide Disco Diffusion, and it can generate plenty of images from the Chinese-internet domain. But for stable diffusion, since I converted the dimensionality earlier, my model can't be plugged in directly...

@JunnYu

JunnYu commented Sep 26, 2022

Could existing Chinese CLIP text encoder weights be used, e.g. https://github.com/PaddlePaddle/ERNIE/tree/ernie-kit-open-v1.0/Research/ERNIE-ViL2 ? (This is a bidirectional language model, not a unidirectional one.) Its hidden states are 768-dimensional.
If I freeze this ERNIE-ViL2 text encoder and the VAE, is fine-tuning only the UNet enough?

@ScottishFold007

> Could existing Chinese CLIP text encoder weights be used, e.g. https://github.com/PaddlePaddle/ERNIE/tree/ernie-kit-open-v1.0/Research/ERNIE-ViL2 ? (This is a bidirectional language model, not a unidirectional one.) Its hidden states are 768-dimensional. If I freeze this ERNIE-ViL2 text encoder and the VAE, is fine-tuning only the UNet enough?

Probably not. The existing SD was trained against OpenAI's CLIP, so the Baidu model won't work as a drop-in; in the second stage the UNet, VAE, and CLIP still have to be trained in connection with each other.
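On the dimension part of this question: SD v1's UNet cross-attention is built for 768-d text features (the width of CLIP ViT-L/14's text encoder), so any replacement text encoder must match that width or be projected to it, and as noted above the UNet still needs fine-tuning to follow the new embedding space. ERNIE-ViL2 is a PaddlePaddle model, so a 768-d Chinese RoBERTa stands in for it in this hedged sketch:

```python
from diffusers import UNet2DConditionModel
from transformers import AutoModel

unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet")
# Stand-in for "any 768-d bidirectional Chinese encoder"; ERNIE-ViL2 itself is a Paddle model.
candidate = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")

print("UNet cross-attention dim:", unet.config.cross_attention_dim)  # 768 for SD v1.x
print("candidate hidden size:   ", candidate.config.hidden_size)     # 768 here
assert candidate.config.hidden_size == unet.config.cross_attention_dim, \
    "widths differ: add a projection layer or pick a 768-d text encoder"
```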
