File structure for training (encoder, synthesizer (vocoder)) #934
I would suggest using that structure. You can just sort the files according to the respective speaker. This information should be contained in the CommonVoice dataset. |
So the trainers for all components use the folder structure to distinguish speakers? |
Yes, except that the encoder does not need text files. That's why it is easier to obtain training data for it: you can use 10,000+ speakers for the encoder. They don't need the best quality, and the encoder even benefits from some lower-quality files, since it learns to make use of noisy audio too. But for the synthesizer you should only use good-quality audio, with, I think, at least 300 speakers. |
Will it be a problem that CommonVoice is mp3 and mailabs is wav if I want to use them in the same training? |
I'm not sure, but I don't think it would be a problem. In any case, you could easily convert CommonVoice to wav; it will take a few hours. |
You don't need to split them into one subdirectory for each speaker, but the folder levels (hierarchy) need to be the same. |
I don't understand. Bebaam confirmed earlier that the speaker folders are used to recognise different speakers. So are you saying the opposite? |
I already trained 3 times with different datasets, I am doing a fourth one right now, just by using this folder hierarchy:
You can leave the whole 15000 recordings in one directory with the txts, I did that. |
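The screenshot of the folder hierarchy didn't survive the thread export, but based on the discussion (one folder per speaker, a book level, and transcripts sitting next to the wavs) it can be sketched like this. This is only an illustration of the layout being described; the names `speaker0001`, `book0`, and `utt0001` are placeholders, not names the preprocessor requires:

```python
import tempfile
from pathlib import Path

def make_speaker_tree(root, speaker_id, utterances):
    """Create <root>/<speaker_id>/book0/wavs/ and place each utterance's
    transcript (and a placeholder .wav) side by side inside it."""
    wav_dir = Path(root) / speaker_id / "book0" / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)
    for utt_id, text in utterances:
        (wav_dir / (utt_id + ".txt")).write_text(text, encoding="utf-8")
        (wav_dir / (utt_id + ".wav")).touch()  # real audio would be copied here
    return wav_dir

root = tempfile.mkdtemp()
out = make_speaker_tree(root, "speaker0001", [("utt0001", "hello world")])
print(sorted(p.name for p in out.iterdir()))  # ['utt0001.txt', 'utt0001.wav']
```

The key point from the thread is only that the folder *depth* is consistent and that each wav has its transcript next to it; whether every speaker truly needs their own directory is exactly what is being debated below.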
Thanks a lot! |
Yes, it is! I tried without them and it didn't recognize the audio files! You can give them any name you want, just remember to include the names of the subfolder and folder1 in the command, for example: --datasets_name subfolder --subfolders folder1 |
Did you train an encoder or did you always just train a syn? As I understood, the encoder will need to distinguish between different speakers with help of the folder structure, but I didn't go into it. |
I just trained the synthesizer. But I think the encoder uses a similarity function to distinguish between different speakers, I'm not sure if the folder structure has something to do with it. |
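The "similarity function" idea can be illustrated with a toy sketch: the encoder maps utterances to embedding vectors, and speakers are compared by how close those embeddings are, typically via cosine similarity. This is not the repo's actual GE2E code, and the vectors below are made up purely for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Two utterances from the same speaker should embed close together,
# while an utterance from a different speaker lands further away:
same_a = [0.9, 0.1, 0.2]
same_b = [0.8, 0.2, 0.1]
other = [-0.1, 0.9, 0.3]
print(cosine_similarity(same_a, same_b) > cosine_similarity(same_a, other))  # True
```

Under this view the folder structure matters for the encoder only insofar as the trainer needs *some* way to know which utterances share a speaker when forming its training batches.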
I am having severe issues with preparing the CommonVoice dataset.
Result: I am left with nothing but wasted time. It seems that file corruption is a big issue. Maybe my script for sorting the files into folders is shit. Any ideas? |
For which language are you using it? Right now @Andredenise and I are using it to train the synthesizer in Spanish (#941), and the preprocessing part took around 2 days. Now we are in the training part. I prepared the dataset without converting the mp3s to wav, since that would indeed take a lot of time (there are around 196,006 files). The part of mimicking the file hierarchy also takes a lot of time, but it didn't give me corrupt files (all the text files are 1 kb). If you get corrupt files in this step, the preprocessing most probably won't work well (it already happened to me, as I explained here: #789 (comment)). Maybe you can look at my script for preparing the dataset and figure out how to solve your issue: https://github.com/AlexSteveChungAlvarez/Real-Time-Voice-Cloning/blob/master/split_transcript.py. The cvcorpus function is the one that prepares CommonVoice and puts the audios and texts directly into the file hierarchy. Before, I tried with the audios from validated.tsv; now I am trying with the audios from train.tsv. Everything seems good so far; we started training today. |
How do you mimic the file hierarchy? It should usually be just moving files into directories, which I think took around 15 minutes on my SSD. |
Furthermore, I kept only, I think, 100 files per speaker; this should balance the dataset a bit. |
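The per-speaker cap described above is a simple balancing step. A minimal sketch of it in Python (the function name and the dictionary-of-lists input shape are my own; the thread only states "about 100 files per speaker"):

```python
import random

def cap_per_speaker(files_by_speaker, cap=100, seed=0):
    """Keep at most `cap` randomly chosen files per speaker, so prolific
    speakers don't dominate the training set."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return {
        speaker: (rng.sample(files, cap) if len(files) > cap else list(files))
        for speaker, files in files_by_speaker.items()
    }

# e.g. one prolific speaker and one rare one:
files = {"spk_a": ["a_%d.wav" % i for i in range(500)], "spk_b": ["b_0.wav"]}
capped = cap_per_speaker(files, cap=100)
print(len(capped["spk_a"]), len(capped["spk_b"]))  # 100 1
```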
@AlexSteveChungAlvarez thank you for the script. I am a Python noob, but I should be able to adapt it for my German dataset. Okay, I believe the preprocessing will be much faster using wavs, and since I have the 500 GB of space I will continue using them. Maybe I will get fewer corrupt files; maybe the files already got corrupted when transcoding to wav, I don't know. @Bebaam I am copying files with an awk script. Maybe you are faster since you are not copying but moving, but I wanted to preserve the original wavs so I can re-sort them for the synthesizer later. @everyone: what are the differences between the .tsvs of the CV datasets? What is train.tsv? I always use validated.tsv and filter by up- and downvotes as I see fit.

#!/usr/bin/awk -f
BEGIN {
    FS = "\t"
    src = "de/wavs/"
    dist = "de/processed/"
    print dist
    # TSV columns: client_id, path, sentence, up_votes, down_votes,
    # age, gender, accent, locale, segment
    while (("cat de/validated-wav.tsv" | getline) > 0)
    {
        if ($1 == "client_id") continue           # skip the header row
        if ($4 < 2 || $5 > 0) continue            # need >= 2 upvotes, 0 downvotes
        client_id = $1
        wavpath = $2
        sub(/\.wav$/, ".txt", $2)                 # $2 now holds the transcript name
        sentence = $3
        if (system("test -e " src wavpath) == 0)
        {
            system("mkdir -p " dist client_id "/book0/wavs/")
            system("cp " src wavpath " " dist client_id "/book0/wavs/")
            # write the transcript from awk directly instead of shelling out to
            # echo, so quotes and other shell metacharacters in the sentence
            # cannot break or corrupt the file
            txtfile = dist client_id "/book0/wavs/" $2
            printf("%s\n", sentence) > txtfile
            close(txtfile)
            printf("Created entries for %s\n", client_id)
        }
    }
} |
Okay, copying should be much slower than moving. #941 (comment) is why I use train.tsv for syn+voc and validated.tsv for the encoder. |
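The vote filter used in the awk script above (skip when `$4 < 2 || $5 > 0`, i.e. keep clips with at least 2 upvotes and no downvotes) can also be expressed in a few lines of Python against the CommonVoice TSV. The function name and thresholds here are illustrative, not part of the repo:

```python
import csv
import io

def filter_clips(tsv_text, min_up=2, max_down=0):
    """Yield CommonVoice TSV rows with at least `min_up` upvotes and at most
    `max_down` downvotes (the same filter as the awk `$4 < 2 || $5 > 0` skip)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if int(row["up_votes"]) >= min_up and int(row["down_votes"]) <= max_down:
            yield row

# A tiny made-up sample in the validated.tsv column layout:
sample = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\n"
    "spk1\ta.wav\thello there\t3\t0\n"
    "spk2\tb.wav\ttoo few votes\t1\t0\n"
    "spk3\tc.wav\thas a downvote\t4\t1\n"
)
print([row["path"] for row in filter_clips(sample)])  # ['a.wav']
```

For a real dataset you would pass `open("validated.tsv").read()` (or switch the function to take a file handle) instead of an inline string.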
I think chances are good that the data got corrupted when converting, as copying shouldn't do any harm. How did you do the conversion? I just used pydub, something like:
|
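The pydub snippet that followed didn't survive the thread export. As a rough reconstruction of the kind of conversion being described (pydub's `AudioSegment` API is real, but the function names and the 16 kHz mono settings are my assumptions; pydub also needs an ffmpeg binary on the PATH):

```python
from pathlib import Path

def wav_name(mp3_path):
    """Target .wav filename for a source .mp3."""
    return Path(mp3_path).stem + ".wav"

def convert_dir(src_dir, dst_dir, sample_rate=16000):
    """Convert every .mp3 in src_dir to mono .wav files in dst_dir."""
    from pydub import AudioSegment  # imported lazily; requires pydub + ffmpeg
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for mp3 in sorted(Path(src_dir).glob("*.mp3")):
        audio = AudioSegment.from_mp3(mp3)
        audio = audio.set_frame_rate(sample_rate).set_channels(1)
        audio.export(dst / wav_name(mp3), format="wav")

print(wav_name("de/clips/common_voice_de_123.mp3"))  # common_voice_de_123.wav
```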
I used ffmpeg and a shell script, as I am not fluent with Python scripting. Only some files got corrupted, so I cannot reproduce why. I will surely use @AlexSteveChungAlvarez's script for sorting, since he has proven its stability. Can you tell me if it sorts the files into speaker directories? It doesn't look like it. You also mentioned before that you are not doing that, but @Bebaam said it is required for encoder training.

#!/bin/bash
srcExt=mp3
destExt=wav
srcDir=$1
destDir=$2
for filename in "$srcDir"/*.$srcExt; do
basePath=${filename%.*}
baseName=${basePath##*/}
ffmpeg -i "$filename" "$destDir"/"$baseName"."$destExt"
done |
As far as I know the directory structure is mandatory, but I am not sure for which model. As @AlexSteveChungAlvarez stated, the structure wasn't necessary (at least for the synthesizer), so if the model works with that, you could use the script. Furthermore, maybe just training a synthesizer may be enough; I don't know. But if I had a 3090, I would want to try training from scratch :D |
It doesn't, it just puts everything into one directory, as you may have noticed in the code itself. That code was made from a script that bluefish shared in one past issue with me, I just modified it to be used with commonvoice. |
For the need of a proper structure I refer to this: #431 (comment) |
Hello, I want to use a dataset in Spanish from Argentina, can this implementation be adapted for that? Any information is welcome. Thanks a lot ! |
I want to train my own model on the Mozilla Common Voice dataset.
All .mp3s are delivered in one folder with accompanying .tsv lists. I understood that the corresponding .txt has to reside next to each utterance.
But what about the folder structure? Can I leave all .mp3s in that one folder, or do I have to split them into one subdirectory for every speaker (I'd hate to do that)?
I would be very thankful if somebody could help me with the code adjustments, since I am quite new to all of this :)