
Quality #1

Closed
abylouw opened this issue Apr 23, 2024 · 14 comments

Comments
abylouw commented Apr 23, 2024

Hi,

Thank you for creating this repo and implementing the architecture from the paper. I have been looking at the paper and was going to attempt an implementation myself.

Do you have any preliminary results available? Do you think it is better than, for example, iSTFTNet or MB-MelGAN?

wetdog (Owner) commented Apr 24, 2024

I haven't compared it with iSTFTNet or MB-MelGAN yet, but I'll try to listen to those models on the same sample. Here is a sample from a first model:
src: https://drive.google.com/file/d/1awn-oHt-wycZFyB7c_tQtC8hIewbrwR1/view?usp=drive_link
wavenext: https://drive.google.com/file/d/1jUDebB0oxuzo7VMu8pQEII3UXdtv0kLs/view?usp=drive_link

I'll probably try to increase the alpha of the MRD loss to 1.0, as suggested in gemelo-ai/vocos#48.
I'll also post the weights when the training ends.
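For readers unfamiliar with the "alpha" being discussed: in Vocos-style trainers, the multi-resolution discriminator (MRD) loss is typically folded into the total generator loss with a scalar coefficient. A hedged sketch of the idea; function and parameter names are illustrative, not the actual API of this repo or of gemelo-ai/vocos:

```python
# Hedged sketch: how an MRD loss weight ("alpha") typically enters the
# total generator loss. All names and values here are illustrative.
def total_generator_loss(mel_loss, adv_loss, feat_match_loss, mrd_loss,
                         mrd_alpha=1.0):
    # Raising mrd_alpha from a small default (e.g. 0.1) to 1.0 makes the
    # MRD's gradient contribution comparable to the other loss terms.
    return mel_loss + adv_loss + feat_match_loss + mrd_alpha * mrd_loss
```

With `mrd_alpha=1.0` the discriminator term is weighted equally with the rest, which is the change being proposed above.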

@egorsmkv

@patriotyk

I am training your vocos-matcha with MRD loss = 1.0 at 44100 Hz, so it is very slow. After almost 1M iterations it still sounds slightly worse than your https://huggingface.co/BSC-LT/vocos-mel-22khz, and the metrics are also still worse.
The following logs are from about 3 weeks of continuous training on 2 RTX 3090s with a dataset of about 800 hours:
[screenshot: training logs, 2024-05-02]

mush42 commented Jun 6, 2024

Hi @wetdog

What's the status of this implementation in terms of quality and speed?
Do you have pre-trained weights available?

I have great expectations for this repo 🙂
Best of luck!

wetdog (Owner) commented Jun 7, 2024

Hi @mush42, I finished training the mel version this week. In terms of quality it achieves better periodicity, PESQ score, and pitch loss than Vocos trained on the same datasets. You can find the weights here: https://huggingface.co/BSC-LT/wavenext-mel

Also, I fixed some things in the EnCodec experiment this week and it is now training. For these trainings I used mel features compatible with HiFi-GAN, but it is probably worth training a 24 kHz version using the same features as the original Vocos. Let me know if you have any doubts.
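The two feature front-ends mentioned here differ mainly in their STFT/mel parameters. A hedged comparison, with values assumed from the commonly published HiFi-GAN (22.05 kHz) and Vocos (24 kHz) recipes rather than taken from this thread:

```python
# Assumed parameter sets, for illustration only; verify against the
# actual configs of the checkpoints before relying on them.
HIFIGAN_COMPAT_MELS = dict(sample_rate=22050, n_fft=1024, hop_length=256,
                           win_length=1024, n_mels=80, f_min=0, f_max=8000)
VOCOS_24K_MELS = dict(sample_rate=24000, n_fft=1024, hop_length=256,
                      win_length=1024, n_mels=100, f_min=0, f_max=None)
```

The practical consequence: a vocoder trained on one front-end cannot consume mels produced by the other without retraining, which is why feature compatibility with the acoustic model (HiFi-GAN-style vs. Vocos-style) matters here.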

wetdog (Owner) commented Jun 7, 2024

@egorsmkv Great work! I would probably use your versions to run some metrics and compare the quality of those vocoders.

@patriotyk

@wetdog I have added your WaveNext pretrained model to my Hugging Face app that runs the pflowtts model, but unfortunately it does not sound very good. There are 4 vocoders that generate waveforms from the same mel spectrogram produced by pflowtts, and WaveNext sounds similar to HiFi-GAN but slightly worse. There is also a 44100 Hz Vocos vocoder trained from your implementation, and it sounds the best. You can check it here: https://huggingface.co/spaces/patriotyk/pflowtts_ukr_demo

wetdog (Owner) commented Jun 7, 2024

@patriotyk Thanks for the quick implementation. Do you think this could be due to the dataset it was trained on? I used LibriTTS for this run, but I would like to try a version with Common Phone (https://arxiv.org/abs/2201.05912) to make it more "universal".

mush42 commented Jun 7, 2024

Hi

Heavy TTS user here.
I don't agree with @patriotyk on this.
My initial testing shows that WaveNext is significantly better than Vocos, in both inference speed and synthesis quality.

Specifically, there is an audible hissing noise in the audio vocoded by Vocos, probably an ISTFT artifact.

Here's a sample of an unseen speaker, where Matcha-TTS is used to generate the mel spectrogram.
vocos-vs-wavenext.zip

Best
Musharraf

@patriotyk

@wetdog I don't know, but it seems so. I will try pflowtts trained on LibriTTS and we will see.

@mush42 Which dataset was your Matcha-TTS trained on? Also, your Vocos sample sounds really bad; which pretrained model are you using there? In my app, 'BSC-LT/vocos-mel-22khz' sounds much better.

mush42 commented Jun 7, 2024

@patriotyk
Matcha was trained on the Hi-Fi-CAPTAIN US English female dataset.
I'm using an ONNX model converted from this model, with a custom ISTFT implementation that uses CNNs (in order to be ONNX-exportable).
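For context on the CNN-based ISTFT trick: the per-frame inverse DFT can be written as a fixed-weight 1-D convolution (cosine/sine basis kernels) and the overlap-add as a transposed convolution, both of which ONNX can export. A minimal pure-Python sketch of the underlying math, with a rectangular window for simplicity; this is illustrative only, not mush42's actual implementation:

```python
import math

def inverse_dft_frame(re, im, n_fft):
    """Inverse real DFT of a one-sided spectrum with n_fft//2 + 1 bins."""
    frame = []
    for n in range(n_fft):
        acc = 0.0
        for k in range(n_fft // 2 + 1):
            # Interior bins count twice: they stand in for their conjugates.
            w = 1.0 if k in (0, n_fft // 2) else 2.0
            ang = 2.0 * math.pi * k * n / n_fft
            acc += w * (re[k] * math.cos(ang) - im[k] * math.sin(ang))
        frame.append(acc / n_fft)
    return frame

def istft_overlap_add(spec_re, spec_im, n_fft, hop):
    """spec_re/spec_im: per-frame one-sided spectra (lists of lists)."""
    out = [0.0] * ((len(spec_re) - 1) * hop + n_fft)
    for t, (re, im) in enumerate(zip(spec_re, spec_im)):
        frame = inverse_dft_frame(re, im, n_fft)
        for n in range(n_fft):
            out[t * hop + n] += frame[n]  # overlap-add into the output
    return out
```

In an exportable PyTorch version, `inverse_dft_frame` would become a `Conv1d`/linear layer with the fixed cos/sin basis as weights and `istft_overlap_add` a `ConvTranspose1d` with stride equal to the hop, plus proper windowing and normalization.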

fd873630 commented Jun 26, 2024

Hi @wetdog!

Could you please share the .ckpt checkpoint file in addition to the .bin checkpoint you provided?

I want to fine-tune, but the .bin checkpoint contains only the generator!
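For context on why the .bin alone can't be fine-tuned: a full training checkpoint (.ckpt) stores generator, discriminator, and optimizer state, while an inference-only export typically keeps just the generator weights. A sketch with hypothetical key names, not the actual checkpoint layout of wavenext-mel:

```python
# Illustrative only: key names are hypothetical.
full_ckpt = {
    "state_dict": {
        "generator.conv.weight": "tensor-placeholder",
        "multiperioddisc.conv.weight": "tensor-placeholder",
        "multiresdisc.conv.weight": "tensor-placeholder",
    },
    "optimizer_states": ["opt-state-placeholder"],  # needed to resume training
}

def export_generator_only(ckpt):
    """Produce an inference-only state dict (what a .bin typically holds)."""
    return {k: v for k, v in ckpt["state_dict"].items()
            if k.startswith("generator.")}
```

Resuming adversarial training needs the discriminators and optimizer states too, which is why the full .ckpt is being requested here.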

wetdog (Owner) commented Jun 28, 2024

@fd873630 I just uploaded the ckpt; you can find it here: https://huggingface.co/BSC-LT/wavenext-mel/blob/main/wavenext_2M_libritt_r.ckpt

abylouw closed this as completed Jul 5, 2024
mush42 commented Jul 5, 2024

@wetdog
Thanks for open-sourcing your work, guys.
Really appreciate it.

Best
Musharraf
