
File structure for training (encoder, synthesizer (vocoder)) #934

Open
Dannypeja opened this issue Dec 2, 2021 · 24 comments
@Dannypeja

I want to train my own model on the Mozilla Common Voice dataset.
All .mp3s are delivered in one folder with accompanying .tsv lists. I understand that the corresponding .txt has to reside next to each utterance.
But what about the folder structure? Can I leave all .mp3s in that one folder, or do I have to split them into one subdirectory per speaker (I'd hate to do that)?

I would be very thankful if somebody could help me with the code adjustments since I am quite new to all of this :)

@Bebaam

Bebaam commented Dec 3, 2021

#437 (comment)

I would suggest using that structure. You can just sort the files according to the respective speaker; that information should be contained in the CommonVoice dataset.
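A minimal sketch of that sorting step, assuming the usual CommonVoice layout (a clips folder plus a validated.tsv carrying client_id, path, and sentence columns); the function name and all paths here are placeholders:

```python
import csv
import shutil
import tempfile
from pathlib import Path

def sort_by_speaker(tsv: Path, clips: Path, out: Path) -> int:
    """Copy each clip into out/<client_id>/ and write the transcript
    alongside it; return the number of clips placed."""
    placed = 0
    with tsv.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            src = clips / row["path"]
            if not src.exists():
                continue
            speaker_dir = out / row["client_id"]
            speaker_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, speaker_dir / src.name)
            # transcript next to the audio, as discussed in this thread
            (speaker_dir / (src.stem + ".txt")).write_text(
                row["sentence"], encoding="utf-8")
            placed += 1
    return placed

# tiny demo on throwaway data
root = Path(tempfile.mkdtemp())
(root / "clips").mkdir()
(root / "clips" / "c1.mp3").write_bytes(b"\x00")
(root / "validated.tsv").write_text(
    "client_id\tpath\tsentence\nspk0\tc1.mp3\thallo welt\n", encoding="utf-8")
print(sort_by_speaker(root / "validated.tsv", root / "clips", root / "out"))  # prints 1
```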

@Dannypeja

So the trainers for all components use the folder structure to distinguish speakers?

@Bebaam

Bebaam commented Dec 4, 2021

Yes, except that the encoder does not need text files. That's why its training data is easier to obtain. You can use 10,000+ speakers for the encoder; they don't need the best quality, and the encoder even benefits from some lower-quality files, so it can make use of noisy audio too. But for the synthesizer you should only use good-quality audio, with, I think, at least 300 speakers.

@Dannypeja

Will it be a problem that CommonVoice is mp3 and M-AILABS is wav if I want to use them in the same training?

@Bebaam

Bebaam commented Dec 4, 2021

I'm not sure, but I don't think it would be a problem. You could easily convert CommonVoice to wav, though it will take a few hours.

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Dec 6, 2021

You don't need to split them into one subdirectory for each speaker, but the folder levels (hierarchy) need to be the same.

@Dannypeja

I don't understand. Bebaam confirmed earlier that the speaker folders are used to distinguish different speakers. So are you saying the opposite?
What do you mean by folder hierarchy? Could you give an example? Would it be possible to simply leave all recordings of all 15,000 speakers in one directory and add the .txts alongside them?

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Dec 6, 2021

I have already trained 3 times with different datasets, and I am doing a fourth one right now, just using this folder hierarchy:

*datasets_root
    * subfolder
        * folder1
            * folder2
                * folder3
                    * audio
                    * text
                    * audio
                    * text
                    * .
                    * .
                    * .

You can leave all 15,000 recordings in one directory with the txts; I did that.
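If everything sits in one directory, a quick sanity check that every audio file has its transcript next to it can look like this (a sketch; the .wav/.txt naming and the helper name are assumptions):

```python
import tempfile
from pathlib import Path

def missing_transcripts(audio_dir: Path) -> list:
    """Return audio files that lack a sibling .txt transcript."""
    return [w for w in sorted(audio_dir.rglob("*.wav"))
            if not w.with_suffix(".txt").exists()]

# tiny demo on throwaway data
d = Path(tempfile.mkdtemp())
(d / "a.wav").touch(); (d / "a.txt").touch()   # properly paired
(d / "b.wav").touch()                          # transcript missing
print([p.name for p in missing_transcripts(d)])  # prints ['b.wav']
```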

@Dannypeja

Thanks a lot!
Just to further clarify: is it really essential that you have folders 1, 2, and 3, or are those just an example?

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Dec 6, 2021

Yes, it is! I tried without them and it didn't recognize the audios! You can give them any names you want; just remember to include the names of the subfolder and folder1 in the command, for example: --datasets_name subfolder --subfolders folder1

@Bebaam

Bebaam commented Dec 7, 2021

Did you train an encoder, or did you always just train a synthesizer? As I understood it, the encoder needs to distinguish between different speakers with the help of the folder structure, but I didn't look into it.

@AlexSteveChungAlvarez

I just trained the synthesizer. But I think the encoder uses a similarity function to distinguish between different speakers; I'm not sure whether the folder structure has anything to do with it.

@Dannypeja

I am having severe issues with preparing the CommonVoice dataset.

  • First I convert the mp3s to wav. That takes an eternity (a little over 24 hours). At least preprocessing seems to be a lot faster that way.
  • Next I mimic the file hierarchy (root/speaker/book/wav/files.*) and place the .txts next to the wavs. This also takes forever and seemingly introduces a lot of corrupt files.
  • Cleaning corrupt files (those that are 0 kB) consumes time.
  • Then, after preprocessing, the data is unusable. Seemingly there were still some corrupt things inside.

Result: I am left with nothing but wasted time. It seems that file corruption is a big issue. Maybe my script for sorting the files into folders is to blame.

Any ideas?
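The 0 kB cleanup step described above can be sketched like this (the function name and paths are made up for illustration):

```python
import tempfile
from pathlib import Path

def remove_empty_files(root: Path) -> int:
    """Delete 0-byte files under root; return how many were removed."""
    removed = 0
    for f in root.rglob("*"):
        if f.is_file() and f.stat().st_size == 0:
            f.unlink()
            removed += 1
    return removed

# demo on a throwaway directory
demo = Path(tempfile.mkdtemp())
(demo / "ok.wav").write_bytes(b"RIFF")   # non-empty file survives
(demo / "broken.wav").touch()            # 0 kB file gets removed
print(remove_empty_files(demo))          # prints 1
```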

@AlexSteveChungAlvarez

For which language are you using it? Right now @Andredenise and I are using it to train the synthesizer in Spanish (#941), and the preprocessing part took about 2 days. We are now in the training part.

I prepared the dataset without converting the mp3s to wav, since that would indeed take a lot of time (there are about 196,006 files). The part of mimicking the file hierarchy also takes a lot of time, but it didn't introduce corrupt files for me (all the text files are 1 kB). If you get corrupt files in this step, most probably the preprocessing won't work well (it already happened to me, as I explained here: #789 (comment)).

Maybe you can look at my script to prepare the dataset and figure out how to solve your issue: https://github.com/AlexSteveChungAlvarez/Real-Time-Voice-Cloning/blob/master/split_transcript.py; the cvcorpus function is the one that prepares CommonVoice and puts the audios and texts directly into the file hierarchy. Before, I tried with the audios from validated.tsv; now I am trying with the audios from train.tsv. Everything looks good so far; we started training today.

@Bebaam

Bebaam commented Dec 12, 2021

How do you mimic the file hierarchy? It should just be moving files into directories, which took around 15 minutes on my SSD, I think.

@Bebaam

Bebaam commented Dec 12, 2021

Furthermore, I kept only around 100 files per speaker; this should balance the dataset a bit.
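That capping step could look roughly like this (a sketch; the 100-file default comes from the comment above, while the helper name, the .wav/.txt naming, and random selection are assumptions):

```python
import random
import tempfile
from pathlib import Path

def cap_speaker(speaker_dir: Path, limit: int = 100) -> int:
    """Keep at most `limit` randomly chosen wavs in a speaker folder
    (plus their .txt transcripts); return how many wavs were deleted."""
    files = sorted(speaker_dir.glob("*.wav"))
    if len(files) <= limit:
        return 0
    keep = set(random.sample(files, limit))
    for f in files:
        if f not in keep:
            f.unlink()
            # drop the matching transcript too, if present
            f.with_suffix(".txt").unlink(missing_ok=True)
    return len(files) - limit

# demo: 5 utterances, cap at 3
d = Path(tempfile.mkdtemp())
for i in range(5):
    (d / f"utt{i}.wav").touch()
print(cap_speaker(d, limit=3))  # prints 2
```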

@Dannypeja

@AlexSteveChungAlvarez thank you for the script. I am a Python noob, but I should be able to adapt it for my German dataset. I believe the preprocessing will be much faster using wavs, and since I have the 500 GB of space I will continue using them. Maybe I will get fewer corrupt files; maybe the files already got corrupted when transcoding to wav. I don't know.

@Bebaam I am copying the files with an awk script. Maybe you are faster since you are moving rather than copying, but I wanted to preserve the original wavs so I can re-sort them for the synthesizer later.

@everyone: what are the differences between the .tsv files of the CV datasets? What is train.tsv? I always use validated.tsv and sort by up- and downvotes as needed.

#!/usr/bin/awk -f

BEGIN {
	FS = "\t"
	src = "de/wavs/"
	dist = "de/processed/"
	while (("cat de/validated-wav.tsv" | getline) > 0)
	{
		# skip clips with fewer than 2 upvotes or any downvotes
		if ($4 < 2 || $5 > 0) continue
		client_id = $1
		wavpath = $2
		sub(/\.wav$/, ".txt", $2)   # anchor to the extension, not the first "wav" in the name
		sentence = $3
		if (system("test -e " src wavpath) == 0)
		{
			outdir = dist client_id "/book0/wavs/"
			system("mkdir -p " outdir)
			system("cp " src wavpath " " outdir)
			# write the transcript from awk itself: an unquoted
			# "echo sentence" lets the shell mangle quotes, ampersands
			# etc. and produces corrupt .txt files
			txtfile = outdir $2
			print sentence > txtfile
			close(txtfile)
			printf("Created entries for %s\n", client_id)
		}
	}
}

@Bebaam

Bebaam commented Dec 12, 2021

Okay, copying should be much slower than moving. See #941 (comment): that is why I use train.tsv for the synthesizer and vocoder and validated.tsv for the encoder.

@Bebaam

Bebaam commented Dec 12, 2021

I think chances are good that the data got corrupted when converting, as copying shouldn't do any harm. How did you do the conversion? I just used pydub, something like:

import os
from pydub import AudioSegment

src = elem                                # path to the source .mp3
dst = os.path.splitext(elem)[0] + ".wav"  # splitext is safer than split(".") for paths containing dots
# convert mp3 to wav
sound = AudioSegment.from_mp3(src)
sound.export(dst, format="wav")
os.remove(elem)                           # drop the original mp3

@Dannypeja

Dannypeja commented Dec 12, 2021

I used ffmpeg and a shell script, as I am not fluent with Python scripting.
I haven't read up on how Python handles filesystem work.

Only some files got corrupted, so I cannot reproduce why. I will surely use @AlexSteveChungAlvarez's script for sorting, since he has proven its stability. Can you tell me if it sorts the files into speaker directories? It doesn't look like it. You also mentioned before that you are not doing that, but @Bebaam said it is required for encoder training.

#!/bin/bash
# Batch-convert $srcDir/*.mp3 to $destDir/*.wav with ffmpeg.

srcExt=mp3
destExt=wav

srcDir=$1
destDir=$2

mkdir -p "$destDir"

for filename in "$srcDir"/*.$srcExt; do
        basePath=${filename%.*}     # strip extension
        baseName=${basePath##*/}    # strip leading directories
        ffmpeg -i "$filename" "$destDir/$baseName.$destExt"
done

@Bebaam

Bebaam commented Dec 12, 2021

As far as I know the directory structure is mandatory, but I am not sure for which model. As @AlexSteveChungAlvarez stated, the structure wasn't necessary (at least for the synthesizer), so if the model works with that, you could use the script.

Furthermore, maybe just training a synthesizer is enough; I don't know. But if I had a 3090, I would want to try training from scratch :D

@AlexSteveChungAlvarez

Can you tell me if it sorts the files into speaker directories? It doesn't look like it.

It doesn't; it just puts everything into one directory, as you may have noticed in the code itself. That code was adapted from a script that bluefish shared with me in a past issue; I just modified it to work with CommonVoice.

@Bebaam

Bebaam commented Dec 12, 2021

Regarding the need for a proper structure, I refer to this: #431 (comment)

@pauortegariera

Hello, I want to use a Spanish dataset from Argentina. Can this implementation be adapted for that? Any information is welcome. Thanks a lot!
