>>> Yilmaz_Ay
[April 13, 2020, 7:22am]
Hi All,
I trained Tacotron2 on my dataset, which consists of about 27 hours of audio clips of 10 seconds length at a 16000 Hz sample rate. Training took about four and a half days. At this stage the test audios still sound a little bit robotic, some words are missing in some of the test audios, and there are repetitions in others. When I look at the graphs on the TensorBoard pages, they look normal.

My config values are mostly the defaults. Could anyone have a look at my configs and let me know what could be wrong? Are there any parameters I can change to remove the robotic sound from the test outputs and improve the quality of the output waves?
My configs are as below:
```
{
    'model': 'Tacotron2',
    'run_name': 'stspeech-stft_params',
    'run_description': 'tacotron2 constant stf parameters',
    'audio': {
        'num_mels': 80,
        'num_freq': 1025,
        'sample_rate': 16000,
        'win_length': 1024,
        'hop_length': 256,
        'frame_length_ms': null,
        'frame_shift_ms': null,
        'preemphasis': 0.98,
        'min_level_db': -100,
        'ref_level_db': 20,
        'power': 1.5,
        'griffin_lim_iters': 30,
        'signal_norm': true,
        'symmetric_norm': true,
        'max_norm': 4.0,
        'clip_norm': true,
        'mel_fmin': 0.0,
        'mel_fmax': 8000.0,
        'do_trim_silence': true,
        'trim_db': 60
    },
    'characters': {
        'pad': '_',
        'eos': '~',
        'bos': '^',
        'characters': 'ABCDEFGHIJKLMNOPQRSTUVWXYZÇĞİÖŞÜabcdefghijklmnopqrstuvwxyzçğıöşü!\'(),-.:;? ',
        'punctuations': '!\'(),-.:;? ',
        'phonemes': 'iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ'
    },
    'distributed': {
        'backend': 'nccl',
        'url': 'tcp://localhost:54321'
    },
    'reinit_layers': [],
    'batch_size': 32,
    'eval_batch_size': 16,
    'r': 7,
    'gradual_training': [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]],
    'loss_masking': true,
    'run_eval': true,
    'test_delay_epochs': 5,
    'test_sentences_file': 'tr_sentences.txt',
    'noam_schedule': false,
    'grad_clip': 1.0,
    'epochs': 1000,
    'lr': 0.00001,
    'wd': 0.000001,
    'warmup_steps': 4000,
    'seq_len_norm': false,
    'memory_size': -1,
    'prenet_type': 'original',
    'prenet_dropout': true,
    'attention_type': 'original',
    'attention_heads': 4,
    'attention_norm': 'sigmoid',
    'windowing': false,
    'use_forward_attn': false,
    'forward_attn_mask': false,
    'transition_agent': false,
    'location_attn': false,
    'bidirectional_decoder': false,
    'stopnet': true,
    'separate_stopnet': true,
    'print_step': 5,
    'save_step': 5000,
    'checkpoint': true,
    'tb_model_param_stats': false,
    'text_cleaner': 'phoneme_cleaners',
    'enable_eos_bos_chars': false,
    'num_loader_workers': 1,
    'num_val_loader_workers': 1,
    'batch_group_size': 0,
    'min_seq_len': 6,
    'max_seq_len': 150,
    'output_path': 'train_logs/',
    'phoneme_cache_path': 'mozilla_tr_phonemes_2_1',
    'use_phonemes': true,
    'phoneme_language': 'tr',
    'use_speaker_embedding': false,
    'style_wav_for_test': null,
    'use_gst': false,
    'datasets': [
        {
            'name': 'stspeech',
            'path': 'STS-22K/',
            'meta_file_train': 'metadata_train.csv',
            'meta_file_val': 'metadata_test.csv'
        }
    ]
}
```
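As a quick sanity check of the values above, here is a minimal sketch in plain Python. The frame-timing math follows directly from 'win_length', 'hop_length', and the sample rate; the schedule decoding assumes the Mozilla TTS convention that each 'gradual_training' entry is [start_step, r, batch_size], so verify that against your TTS version:

```python
# Minimal sanity-check sketch for the config above (plain Python, no TTS imports).
# Assumes each 'gradual_training' entry means [start_step, r, batch_size].

SAMPLE_RATE = 16000
WIN_LENGTH = 1024   # samples
HOP_LENGTH = 256    # samples

GRADUAL_TRAINING = [[0, 7, 64], [1, 5, 64], [50000, 3, 32],
                    [130000, 2, 32], [290000, 1, 32]]

def schedule_at(step, schedule):
    """Return the (r, batch_size) pair active at a given global step."""
    r, batch_size = schedule[0][1], schedule[0][2]
    for start_step, sched_r, sched_bs in schedule:
        if step >= start_step:
            r, batch_size = sched_r, sched_bs
    return r, batch_size

# STFT timing: 1024 samples at 16 kHz is a 64 ms window, 256 samples a 16 ms hop.
print(f"window: {1000 * WIN_LENGTH / SAMPLE_RATE:.1f} ms, "
      f"hop: {1000 * HOP_LENGTH / SAMPLE_RATE:.1f} ms")

for step in (0, 60000, 150000, 400000):
    r, bs = schedule_at(step, GRADUAL_TRAINING)
    print(f"step {step:>6}: r={r}, batch_size={bs}")
```

Note that this schedule has long since reached r=1 by 400k steps, so the reduction factor itself is not changing any more at that point.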
Four config parameters differ from the default values:
1: the sample rate.
2: 'griffin_lim_iters': I reduced it to 30; the default was 60. I did this to reduce the training time.
3: I reduced the number of loader workers to 1; the defaults were 4. I thought it had something to do with the number of GPUs, and since I have just one GPU, I thought I needed to set them to 1.
4: the min and max sequence length parameters. I actually forgot to set them according to my data's lengths (see the sketch after this list for one way to measure them). How much effect does this have on the quality?
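For point 4, here is a minimal sketch of one way to measure the transcript lengths. It assumes the metadata files are LJSpeech-style, one wav_name|transcript pair per line (the actual column layout of metadata_train.csv may differ), and that 'min_seq_len'/'max_seq_len' are counted in transcript characters, as in Mozilla TTS's dataset filter:

```python
# Sketch: measure transcript lengths to choose 'min_seq_len' / 'max_seq_len'.
# Assumes LJSpeech-style metadata, "wav_name|transcript" per line;
# adapt the split below if your CSV layout differs.

def text_lengths(meta_path):
    lengths = []
    with open(meta_path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("|")
            if len(parts) >= 2:
                lengths.append(len(parts[-1]))  # character count of transcript
    return sorted(lengths)

lengths = text_lengths("STS-22K/metadata_train.csv")
if lengths:
    n = len(lengths)
    print(f"samples: {n}")
    print(f"min / median / max chars: {lengths[0]} / {lengths[n // 2]} / {lengths[-1]}")
    # A common heuristic: set max_seq_len near the 95th-99th percentile
    # so rare outliers are dropped instead of dominating batch padding.
    print(f"95th percentile: {lengths[int(0.95 * (n - 1))]}")
```

Items outside the [min_seq_len, max_seq_len] range are silently skipped by the loader, so with 10-second clips some transcripts may exceed 150 characters and never be seen in training.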
I appreciate any insights, comments, or suggestions about what could be wrong with my training.
Many thanks in advance.
[This is an archived TTS discussion thread from discourse.mozilla.org/t/mozilla-tts-output-voice-still-sounds-robotic-after-almost-400k]