Description
I'm new to this topic, hence I have a few questions:
Your documentation of the jarvis model says the following:
The model was trained on approximately ~31,000 hours of negative data, with the approximate composition shown below:
~10,000 hours of noise, music, and speech from the ACAV100M dataset
~10,000 hours from the Common Voice 11 dataset, representing multiple languages
~10,000 hours of podcasts downloaded from the Podcastindex database
~1,000 hours of music from the Free Music Archive dataset
How is this to be interpreted?
In the config, if I specify no custom_negative_phrases, then the function generate_adversarial_texts will create adversarial_texts matching n_samples / n_samples_val in count. Those files are then saved. Let's assume we end up with 200k negative files.
Those 200k files, with the config "augmentation_rounds" set to 2, result in 400k augmented files. Here is my point: the augment_clips method takes the background clip paths and the RIR path, but not every file in the RIR and background clip folders is used to augment an input file. In simple words: if I have 1 million files in those folders but only 5 files to augment, then, if I interpret it right, those 5 files will each be augmented with a random pick from the 1 million files (see my sketch below).
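In code terms, this is roughly how I picture the augmentation step (just my own sketch of my understanding, not the actual augment_clips implementation; the folder names are placeholders):

```python
import random
from pathlib import Path

# Placeholder folders -- in my setup these contain on the order of 1 million files
background_paths = list(Path("background_clips").glob("**/*.wav"))
rir_paths = list(Path("rirs").glob("**/*.wav"))

def augment_one(clip_path, background_paths, rir_paths):
    """Augment a single generated clip with ONE random background clip and ONE random RIR."""
    background = random.choice(background_paths)  # random draw from the whole pool
    rir = random.choice(rir_paths)                # random draw from the whole pool
    # ... mix clip_path with `background` at a random SNR, convolve with `rir`, save result ...
    return clip_path, background, rir

# Only the clips listed here get augmented; the big pools above are merely sampled from
clips_to_augment = list(Path("generated_clips").glob("*.wav"))  # e.g. only 5 files
augmented = [augment_one(c, background_paths, rir_paths) for c in clips_to_augment]
```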
So how should I interpret this negative data in terms of hours? Was it just the pool of data available for augmentation, or what is the deal? I have used your advanced training pipeline to train a model with standard "inputs" but an increased pool of background clips. I trained with these settings:
```yaml
n_samples: 200000 # Number of samples to create with the piper sample generator -> positive and negative samples 200000
n_samples_val: 40000 # Number of samples to create with the piper sample generator -> positive and negative testing samples 40000
steps: 400000
target_accuracy: 0.8
target_false_positives_per_hour: 0.2
target_phrase:
- hey truncated
target_recall: 0.9 # Target recall for the target phrase -> in simple words, how many of the target phrases should be detected
tts_batch_size: 50
```
But this results in something you can't use: it doesn't pick up the wake word. I set the detection threshold to 0.9, but the prediction score usually peaks at only about 0.1 ...
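For reference, this is roughly how I am checking the scores, following the usage shown in the openWakeWord README (the file names are placeholders):

```python
import numpy as np
from scipy.io import wavfile
from openwakeword.model import Model

# Placeholder paths for my trained model and a test recording containing the wake word
oww = Model(wakeword_models=["my_model.onnx"])
sample_rate, audio = wavfile.read("test_recording.wav")  # 16 kHz, 16-bit mono

frame_size = 1280  # 80 ms frames at 16 kHz, as in the openWakeWord examples
for start in range(0, len(audio) - frame_size + 1, frame_size):
    scores = oww.predict(audio[start:start + frame_size])  # dict: model name -> score
    print(scores)  # with my model this rarely goes above ~0.1, far below my 0.9 threshold
```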
Maybe you can explain this, or do you have any tips on improving it? I have taken 9 voice models from https://huggingface.co/rhasspy/piper-voices/tree/main