
32kHz Vocos Multi Speaker Model Training Log #48

Open
LEECHOONGHO opened this issue Feb 22, 2024 · 13 comments
Comments

@LEECHOONGHO

LEECHOONGHO commented Feb 22, 2024

Training Loss, Generated Outputs.

I hope this will be a reference for model training.

https://api.wandb.ai/links/xi-speech-team/k0kdfwch

@patriotyk

Do you have standard TensorBoard logs? It would be interesting to compare.

@LEECHOONGHO
Author

@patriotyk Sorry, I've changed the code to log to the WandB server. I have no local log files or TensorBoard logs.

@patriotyk

patriotyk commented Apr 16, 2024

What is your validation loss on the last checkpoint? It is encoded into the checkpoint file name. I have been training a 44100 Hz model for almost a week already and the loss is still going down.

@Jon-Zbw

Jon-Zbw commented Apr 22, 2024

> Training Loss, Generated Outputs.
>
> I hope this will be a reference for model training.
>
> https://api.wandb.ai/links/xi-speech-team/k0kdfwch

Thanks for your work. Could you share the 32 kHz model training details, such as your EnCodec model? I found pretrained models at 24 kHz and 48 kHz, so I guess you resample 32 kHz audio to 24 kHz or 48 kHz for the pretrained EnCodec model, then resample back to 32 kHz?

@LEECHOONGHO
Author

> Training Loss, Generated Outputs.
> I hope this will be a reference for model training.
> https://api.wandb.ai/links/xi-speech-team/k0kdfwch

> Thanks for your work. Could you share the 32 kHz model training details, such as your EnCodec model? I found pretrained models at 24 kHz and 48 kHz, so I guess you resample 32 kHz audio to 24 kHz or 48 kHz for the pretrained EnCodec model, then resample back to 32 kHz?

Sorry for the confusion.
I trained a mel vocoder, not a decoder for EnCodec.

But I do plan to train a "Mel-EnCodec" in the future (a mel-spectrogram-to-RVQ encoder, with a Vocos decoder, for varied speech data).

@LEECHOONGHO
Author

LEECHOONGHO commented Apr 23, 2024

> Do you have standard TensorBoard logs? It would be interesting to compare.

> What is your validation loss on the last checkpoint? It is encoded into the checkpoint file name. I have been training a 44100 Hz model for almost a week already and the loss is still going down.

I estimated the mel loss and the generator loss on a newly collected dataset; they were 0.0942 and 2.82, respectively.
Because of the dataset's size, evaluating on the eval dataset shows no difference from sampled training data.

How is your model's output quality? Any artifacts?
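For readers comparing numbers: the mel loss reported in threads like this is typically a mean absolute (L1) distance between the log-mel spectrograms of real and generated audio. A toy sketch of that distance term; `mel_l1_loss` is an illustrative name, not a function from the vocos codebase:

```python
def mel_l1_loss(mel_hat, mel):
    """Mean absolute error between two flattened (log-)mel spectrograms.

    Toy stand-in for the spectral distance a vocoder trainer reports.
    """
    if len(mel_hat) != len(mel):
        raise ValueError("spectrograms must have the same number of bins")
    return sum(abs(a - b) for a, b in zip(mel_hat, mel)) / len(mel)

# Identical spectrograms give zero loss; diverging ones a positive value.
print(mel_l1_loss([0.1, 0.2, 0.3], [0.1, 0.2, 0.3]))  # 0.0
```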

@patriotyk

> I estimated the mel loss and the generator loss on a newly collected dataset; they were 0.0942 and 2.82, respectively. Because of the dataset's size, evaluating on the eval dataset shows no difference from sampled training data.
>
> How is your model's output quality? Any artifacts?

I am still training (third week). It is very slow. I will update with my results when it finishes.

@Mahmoud-ghareeb

Mahmoud-ghareeb commented May 7, 2024

How much data do we need for training?

@patriotyk

patriotyk commented May 11, 2024

@LEECHOONGHO I have published my model here: https://huggingface.co/patriotyk/vocos-mel-hifigan-compat-44100khz
It sounds great, and there are metrics.
@Mahmoud-ghareeb My model was trained on 800+ hours of audio. A vocoder doesn't require text transcripts, so you can easily use audiobooks for training. You don't even need to cut them on silence, because vocos internally splits the provided audio into smaller segments anyway.
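The "no pre-cutting needed" point can be sketched like this: a dataloader crops fixed-length training segments at random offsets from each long recording. This is a hedged illustration of that idea, not the actual vocos implementation (`random_segments` is a made-up name):

```python
import random

def random_segments(waveform, segment_len, num_segments, seed=None):
    """Crop fixed-length training segments at random offsets,
    mimicking how a vocoder dataloader slices long recordings."""
    rng = random.Random(seed)
    if len(waveform) < segment_len:
        raise ValueError("recording shorter than one segment")
    max_start = len(waveform) - segment_len
    return [waveform[s:s + segment_len]
            for s in (rng.randint(0, max_start) for _ in range(num_segments))]

# Stand-in for ~1 second of samples; yields four 16384-sample crops.
audio = list(range(48_000))
segments = random_segments(audio, 16_384, 4, seed=0)
assert all(len(s) == 16_384 for s in segments)
```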

@Mahmoud-ghareeb

Great work, @patriotyk! Thank you so much.

@bzp83

bzp83 commented Jun 13, 2024

> @LEECHOONGHO I have published my model here: https://huggingface.co/patriotyk/vocos-mel-hifigan-compat-44100khz It sounds great, and there are metrics. @Mahmoud-ghareeb My model was trained on 800+ hours of audio. A vocoder doesn't require text transcripts, so you can easily use audiobooks for training. You don't even need to cut them on silence, because vocos internally splits the provided audio into smaller segments anyway.

I'm new to this... Could you please tell me what the purpose of sharing the model is? When I try to use it with a wav file, the output is very close to the original input file, so I'm confused here.

Thank you

@patriotyk

patriotyk commented Jun 13, 2024

This model generates audio from mel spectrograms. The functionality you tried just generates a mel from audio and then audio back from the mel. But real TTS systems generate mels directly from text, and then the vocoder generates the audio.

@bzp83

bzp83 commented Jun 13, 2024

Ah, OK, so generating a mel from audio is different from what TTS systems do? Is there any code snippet that would let me test the model you trained (and possibly others)? Thank you!
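For anyone landing here with the same question, a minimal round-trip sketch, assuming the pip-installable `vocos` package (with `Vocos.from_pretrained`, `feature_extractor`, and `decode`, as shown in the project README) plus `torch`/`torchaudio` installed; the repo id is patriotyk's checkpoint from above. In a real TTS stack the mel would come from an acoustic model instead of `feature_extractor`:

```python
def reconstruct(path, repo="patriotyk/vocos-mel-hifigan-compat-44100khz"):
    """Round-trip a wav through the vocoder: audio -> mel -> audio.

    Imports are deferred so the sketch is readable without the heavy
    dependencies installed.
    """
    import torch
    import torchaudio
    from vocos import Vocos

    vocos = Vocos.from_pretrained(repo)
    y, sr = torchaudio.load(path)
    if y.size(0) > 1:                    # mix down to mono
        y = y.mean(dim=0, keepdim=True)
    y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=44100)
    with torch.no_grad():
        mel = vocos.feature_extractor(y)  # audio -> mel spectrogram
        return vocos.decode(mel)          # mel -> audio
```

Comparing `reconstruct("some.wav")` against the input is exactly the "output close to the original" behaviour described above; it checks the vocoder, not a full TTS pipeline.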

5 participants