
File structure for training (encoder, synthesizer (vocoder)) #934

Open
Dannypeja opened this issue Dec 2, 2021 · 24 comments
@Dannypeja

I want to train my own model on the Mozilla Common Voice dataset.
All .mp3s are delivered in one folder with accompanying .tsv lists. I understand that the corresponding .txt has to reside next to each utterance.
But what about the folder structure? Can I leave all .mp3s in that one folder, or do I have to split them into one subdirectory per speaker (I'd hate to do that)?

I would be very thankful if somebody could help me with the code adjustments since I am quite new to all of this :)

@Bebaam

Bebaam commented Dec 3, 2021

#437 (comment)

I would suggest using that structure. You can just sort the files according to the respective speaker; that information should be contained in the CommonVoice dataset.
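A minimal sketch of that sorting step, assuming the usual CommonVoice layout (a clips folder plus a validated.tsv carrying client_id, path, and sentence columns); the function name and all paths here are placeholders:

```python
import csv
import shutil
import tempfile
from pathlib import Path

def sort_by_speaker(tsv: Path, clips: Path, out: Path) -> int:
    """Copy each clip into out/<client_id>/ and write the transcript
    alongside it; return the number of clips placed."""
    placed = 0
    with tsv.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            src = clips / row["path"]
            if not src.exists():
                continue
            speaker_dir = out / row["client_id"]
            speaker_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, speaker_dir / src.name)
            # transcript next to the audio, as discussed in this thread
            (speaker_dir / (src.stem + ".txt")).write_text(
                row["sentence"], encoding="utf-8")
            placed += 1
    return placed

# tiny demo on throwaway data
root = Path(tempfile.mkdtemp())
(root / "clips").mkdir()
(root / "clips" / "c1.mp3").write_bytes(b"\x00")
(root / "validated.tsv").write_text(
    "client_id\tpath\tsentence\nspk0\tc1.mp3\thallo welt\n", encoding="utf-8")
print(sort_by_speaker(root / "validated.tsv", root / "clips", root / "out"))  # prints 1
```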

@Dannypeja

So the trainers for all components use the folder structure to distinguish speakers?

@Bebaam

Bebaam commented Dec 4, 2021

Yes, except that the encoder does not need text files. That's why its training data is easier to obtain. You can use 10,000+ speakers for the encoder; they don't need the best quality, and the encoder even benefits from some lower-quality files, so it can make use of noisy audio too. But for the synthesizer you should only use good-quality audio, with, I think, at least 300 speakers.

@Dannypeja

Will it be a problem that CommonVoice is mp3 and M-AILABS is wav if I want to use them in the same training?

@Bebaam

Bebaam commented Dec 4, 2021

I'm not sure, but I don't think it would be a problem. You could easily convert CommonVoice to wav, though it will take a few hours.

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Dec 6, 2021

You don't need to split them into one subdirectory for each speaker, but the folder levels (hierarchy) need to be the same.

@Dannypeja

I don't understand. Bebaam confirmed earlier that the speaker folders are used to distinguish different speakers. So are you saying the opposite?
What do you mean by folder hierarchy? Could you give an example? Would it be possible to simply leave all recordings of all 15,000 speakers in one directory and add the .txts alongside them?

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Dec 6, 2021

I have already trained 3 times with different datasets, and I am doing a fourth one right now, just using this folder hierarchy:

*datasets_root
    * subfolder
        * folder1
            * folder2
                * folder3
                    * audio
                    * text
                    * audio
                    * text
                    * .
                    * .
                    * .

You can leave all 15,000 recordings in one directory with the txts; I did that.
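If everything sits in one directory, a quick sanity check that every audio file has its transcript next to it can look like this (a sketch; the .wav/.txt naming and the helper name are assumptions):

```python
import tempfile
from pathlib import Path

def missing_transcripts(audio_dir: Path) -> list:
    """Return audio files that lack a sibling .txt transcript."""
    return [w for w in sorted(audio_dir.rglob("*.wav"))
            if not w.with_suffix(".txt").exists()]

# tiny demo on throwaway data
d = Path(tempfile.mkdtemp())
(d / "a.wav").touch(); (d / "a.txt").touch()   # properly paired
(d / "b.wav").touch()                          # transcript missing
print([p.name for p in missing_transcripts(d)])  # prints ['b.wav']
```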

@Dannypeja

Thanks a lot!
Just to further clarify: is it really essential that you have folders 1, 2, and 3, or are those just an example?

@AlexSteveChungAlvarez

AlexSteveChungAlvarez commented Dec 6, 2021

Yes, it is! I tried without them and it didn't recognize the audios! You can give them any names you want; just remember to include the names of the subfolder and folder1 in the command, for example: --datasets_name subfolder --subfolders folder1

@Bebaam

Bebaam commented Dec 7, 2021

Did you train an encoder, or did you always just train a synthesizer? As I understood it, the encoder needs to distinguish between different speakers with the help of the folder structure, but I didn't look into it.

@AlexSteveChungAlvarez

I just trained the synthesizer. But I think the encoder uses a similarity function to distinguish between different speakers; I'm not sure whether the folder structure has anything to do with it.

@Dannypeja

I am having severe issues with preparing the CommonVoice dataset.

  • First I convert the mp3s to wav. That takes an eternity (a little over 24 hours). At least preprocessing seems to be a lot faster that way.
  • Next I mimic the file hierarchy (root/speaker/book/wav/files.*) and place the .txts next to the wavs. This also takes forever and seemingly introduces a lot of corrupt files.
  • Cleaning corrupt files (those that are 0 kB) consumes time.
  • Then, after preprocessing, the data is unusable. Seemingly there were still some corrupt things inside.

Result: I am left with nothing but wasted time. It seems that file corruption is a big issue. Maybe my script for sorting the files into folders is to blame.

Any ideas?
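The 0 kB cleanup step described above can be sketched like this (the function name and paths are made up for illustration):

```python
import tempfile
from pathlib import Path

def remove_empty_files(root: Path) -> int:
    """Delete 0-byte files under root; return how many were removed."""
    removed = 0
    for f in root.rglob("*"):
        if f.is_file() and f.stat().st_size == 0:
            f.unlink()
            removed += 1
    return removed

# demo on a throwaway directory
demo = Path(tempfile.mkdtemp())
(demo / "ok.wav").write_bytes(b"RIFF")   # non-empty file survives
(demo / "broken.wav").touch()            # 0 kB file gets removed
print(remove_empty_files(demo))          # prints 1
```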

@AlexSteveChungAlvarez

For which language are you using it? Right now @Andredenise and I are using it to train the synthesizer in Spanish (#941), and the preprocessing part took about 2 days. We are now in the training part.

I prepared the dataset without converting the mp3s to wav, since that would indeed take a lot of time (there are about 196,006 files). The part of mimicking the file hierarchy also takes a lot of time, but it didn't introduce corrupt files for me (all the text files are 1 kB). If you get corrupt files in this step, most probably the preprocessing won't work well (it already happened to me, as I explained here: #789 (comment)).

Maybe you can look at my script to prepare the dataset and figure out how to solve your issue: https://github.com/AlexSteveChungAlvarez/Real-Time-Voice-Cloning/blob/master/split_transcript.py; the cvcorpus function is the one that prepares CommonVoice and puts the audios and texts directly into the file hierarchy. Before, I tried with the audios from validated.tsv; now I am trying with the audios from train.tsv. Everything looks good so far; we started training today.

@Bebaam

Bebaam commented Dec 12, 2021

How do you mimic the file hierarchy? It should just be moving files into directories, which took around 15 minutes on my SSD, I think.

@Bebaam

Bebaam commented Dec 12, 2021

Furthermore, I kept only around 100 files per speaker; this should balance the dataset a bit.
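That capping step could look roughly like this (a sketch; the 100-file default comes from the comment above, while the helper name, the .wav/.txt naming, and random selection are assumptions):

```python
import random
import tempfile
from pathlib import Path

def cap_speaker(speaker_dir: Path, limit: int = 100) -> int:
    """Keep at most `limit` randomly chosen wavs in a speaker folder
    (plus their .txt transcripts); return how many wavs were deleted."""
    files = sorted(speaker_dir.glob("*.wav"))
    if len(files) <= limit:
        return 0
    keep = set(random.sample(files, limit))
    for f in files:
        if f not in keep:
            f.unlink()
            # drop the matching transcript too, if present
            f.with_suffix(".txt").unlink(missing_ok=True)
    return len(files) - limit

# demo: 5 utterances, cap at 3
d = Path(tempfile.mkdtemp())
for i in range(5):
    (d / f"utt{i}.wav").touch()
print(cap_speaker(d, limit=3))  # prints 2
```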

@Dannypeja

@AlexSteveChungAlvarez thank you for the script. I am a Python noob, but I should be able to adapt it for my German dataset. I believe the preprocessing will be much faster using wavs, and since I have the 500 GB of space I will continue using them. Maybe I will get fewer corrupt files; maybe the files already got corrupted when transcoding to wav. I don't know.

@Bebaam I am copying the files with an awk script. Maybe you are faster since you are moving rather than copying, but I wanted to preserve the original wavs so I can re-sort them for the synthesizer later.

@everyone: what are the differences between the .tsv files of the CV datasets? What is train.tsv? I always use validated.tsv and sort by up- and downvotes as needed.

#!/usr/bin/awk -f

BEGIN {
	FS = "\t"
	src = "de/wavs/"
	dist = "de/processed/"
	while (("cat de/validated-wav.tsv" | getline) > 0)
	{
		# skip clips with fewer than 2 upvotes or any downvotes
		if ($4 < 2 || $5 > 0) continue
		client_id = $1
		wavpath = $2
		sub(/\.wav$/, ".txt", $2)   # anchor to the extension, not the first "wav" in the name
		sentence = $3
		if (system("test -e " src wavpath) == 0)
		{
			outdir = dist client_id "/book0/wavs/"
			system("mkdir -p " outdir)
			system("cp " src wavpath " " outdir)
			# write the transcript from awk itself: an unquoted
			# "echo sentence" lets the shell mangle quotes, ampersands
			# etc. and produces corrupt .txt files
			txtfile = outdir $2
			print sentence > txtfile
			close(txtfile)
			printf("Created entries for %s\n", client_id)
		}
	}
}

@Bebaam

Bebaam commented Dec 12, 2021

Okay, copying should be much slower than moving. See #941 (comment): that is why I use train.tsv for the synthesizer and vocoder and validated.tsv for the encoder.

@Bebaam

Bebaam commented Dec 12, 2021

I think chances are good that the data got corrupted when converting, as copying shouldn't do any harm. How did you do the conversion? I just used pydub, something like:

import os
from pydub import AudioSegment

src = elem                                # path to the source .mp3
dst = os.path.splitext(elem)[0] + ".wav"  # splitext is safer than split(".") for paths containing dots
# convert mp3 to wav
sound = AudioSegment.from_mp3(src)
sound.export(dst, format="wav")
os.remove(elem)                           # drop the original mp3

@Dannypeja

Dannypeja commented Dec 12, 2021

I used ffmpeg and a shell script, as I am not fluent with Python scripting.
I haven't read up on how Python handles filesystem work.

Only some files got corrupted, so I cannot reproduce why. I will surely use @AlexSteveChungAlvarez's script for sorting, since he has proven its stability. Can you tell me if it sorts the files into speaker directories? It doesn't look like it. You also mentioned before that you are not doing that, but @Bebaam said it is required for encoder training.

#!/bin/bash
# Batch-convert $srcDir/*.mp3 to $destDir/*.wav with ffmpeg.

srcExt=mp3
destExt=wav

srcDir=$1
destDir=$2

mkdir -p "$destDir"

for filename in "$srcDir"/*.$srcExt; do
        basePath=${filename%.*}     # strip extension
        baseName=${basePath##*/}    # strip leading directories
        ffmpeg -i "$filename" "$destDir/$baseName.$destExt"
done

@Bebaam

Bebaam commented Dec 12, 2021

As far as I know the directory structure is mandatory, but I am not sure for which model. As @AlexSteveChungAlvarez stated, the structure wasn't necessary (at least for the synthesizer), so if the model works with that, you could use the script.

Furthermore, maybe just training a synthesizer is enough; I don't know. But if I had a 3090, I would want to try training from scratch :D

@AlexSteveChungAlvarez

Can you tell me if it sorts the files into speaker directories? It doesn't look like it.

It doesn't; it just puts everything into one directory, as you may have noticed in the code itself. That code was adapted from a script that bluefish shared with me in a past issue; I just modified it to work with CommonVoice.

@Bebaam

Bebaam commented Dec 12, 2021

Regarding the need for a proper structure, I refer to this: #431 (comment)

@pauortegariera

Hello, I want to use a Spanish dataset from Argentina. Can this implementation be adapted for that? Any information is welcome. Thanks a lot!
