Multi language training #2214
-
Maybe you can use the "multilingual_cleaners" cleaner with your multi-language dataset?
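For what it's worth, a minimal sketch of what that could look like in a Coqui TTS config (assuming the `VitsConfig` / `text_cleaner` fields; adapt to your actual model config):

```python
# Hedged sketch: select the shared multilingual cleaner in the config
# instead of a single-language one. Field names follow the Coqui TTS
# config API; everything else here is an assumption.
from TTS.tts.configs.vits_config import VitsConfig

config = VitsConfig(
    text_cleaner="multilingual_cleaners",  # one cleaner for all languages
    use_phonemes=True,
)
```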
-
Looking forward to hearing how it evolves over time once you reach more steps. By the way, the answer from Edresson you linked is very interesting: I've always wondered how all those losses should be interpreted! You also wrote that you trained for nearly 1M steps. How long did that take, and what are your hardware and batch size? With an i5 2400 (~10 years old) / 16 GB RAM / RTX 3090 I reach ~80k steps per 24 h (batch size 32), so I'd have to wait ~13 days to hopefully reach the same quality as yours (the dataset is my own, with ~1500 samples).
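For reference, the 13-day figure is just target steps divided by throughput; a trivial sketch (nothing Coqui-specific):

```python
# ETA estimate from the numbers above: ~1M target steps at ~80k steps/day.
target_steps = 1_000_000
steps_per_day = 80_000  # measured throughput at batch size 32
print(f"~{target_steps / steps_per_day:.1f} days")  # -> ~12.5 days
```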
-
FWIW, on my machine it's closer to 90k steps per 24 h, as I've just measured it accurately on TensorBoard. So yeah, you've got room for improvement!
-
Can you incrementally add languages or speakers without losing the older languages?
-
No idea what I did wrong, but now it is working. I just used another checkpoint and then recomputed the phoneme cache dir. Now the German output is not corrupted anymore :) Thanks again @Ca-ressemble-a-du-fake for the support.
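In case it helps anyone hitting the same corruption: recomputing the cache just means deleting the directory so the trainer rebuilds it on the next run. A minimal sketch, assuming the path matches your config's `phoneme_cache_path` (the path below is hypothetical):

```python
import shutil
from pathlib import Path

# Hypothetical location; use the phoneme_cache_path from your own config.
phoneme_cache = Path("output/phoneme_cache")
if phoneme_cache.exists():
    shutil.rmtree(phoneme_cache)  # trainer re-phonemizes everything on the next run
```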
-
Hi @Bebaam, did you train with both German and English data, or did you just continue training on German data from an English checkpoint?
-
Hey everybody,
I am currently trying to train a multi-speaker, multi-language model with phonemes. I've read before that I can start with one language and add new languages over time by training explicitly on the corresponding dataset (I searched a lot but unfortunately can't find that thread, only #1859 (comment)). This would be perfect for me, because I could then set the proper phoneme language and a language-dependent cleaner for each stage.
However, after training the model for close to 1 million steps on German (with the Thorsten voice and one other), where the quality is OK, adding English with the LJSpeech dataset, using the English espeak phonemizer and an English cleaner, immediately destroys both the ability to speak German and the German speakers in general (passing any of the German speakers as 'speaker_idx' at inference makes no difference). Do I need to manually add something to force the model to keep the speaker/language information?
As a possible alternative, if the method above does not work, maybe it is better to train on the whole dataset, containing both German and English, from the start? As erogol pointed out in #1590 (comment), the phonemizer would be ready for that. But then I wouldn't be able to pick the cleaner per language, would I? (A rough sketch of this route follows below.)
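To make the alternative concrete, here is a hedged sketch of a combined German + English setup (field names per the Coqui TTS config API as I understand it; paths, formatters, and dimensions are assumptions, not a confirmed recipe):

```python
from TTS.config.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig

# One dataset config per language; `language` tells the phonemizer
# which phoneme set to use for each dataset.
german = BaseDatasetConfig(formatter="thorsten", path="data/thorsten-de", language="de")
english = BaseDatasetConfig(formatter="ljspeech", path="data/LJSpeech-1.1", language="en")

config = VitsConfig(
    datasets=[german, english],
    text_cleaner="multilingual_cleaners",  # shared cleaner instead of per-language ones
    use_phonemes=True,
    phonemizer="espeak",
)
# Condition the model on language so German and English don't collide.
config.model_args.use_language_embedding = True
config.model_args.embedded_language_dim = 4
```

The trade-off in the question is real: a single shared cleaner means giving up per-language cleaning rules, but the per-dataset `language` field should still keep phonemization language-aware.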
Thanks in advance for any help and insights.