Trying to train telugu audio dataset from scratch on vits #3112

naveed81 · 2023-10-27T07:17:51Z

My dataset is a multi-speaker one. Below is my training script:
import os

from trainer import Trainer, TrainerArgs

from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs, VitsAudioConfig
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.bin.compute_embeddings import compute_embeddings
from TTS.tts.utils.data import get_length_balancer_weights
from TTS.tts.utils.languages import LanguageManager, get_language_balancer_weights
from TTS.tts.utils.speakers import SpeakerManager, get_speaker_balancer_weights, get_speaker_manager

output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(
    formatter="vctk", meta_file_train="", phonemizer="espeak", language="te", path=os.path.join(output_path, "te_male")
)

audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)

vitsArgs = VitsArgs(
    use_speaker_embedding=True,
)

config = VitsConfig(
    model_args=vitsArgs,
    audio=audio_config,
    run_name="tel_vits",
    batch_size=32,
    eval_batch_size=16,
    batch_group_size=5,
    num_loader_workers=0,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="multilingual_cleaners",
    use_phonemes=True,
    phoneme_language="te",
    # phonemizer="espeak",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache/tel"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    max_text_len=325,  # change this if you have a larger VRAM than 16GB
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
    test_sentences=[
        [
            "నమస్తే ఖాలిద్ గారు, ఎలా ఉన్నారు?",
            "VCTK_tem_00682",
            None,
            "te",
        ],
        [
            "నమస్కారము వెంకటరామణ గారు, వెంకటేశ్వర్లు గారు",
            "VCTK_tem_00682",
            None,
            "te",
        ],
        [
            "నవీద్ అహ్మద్ గారు, ఎలా ఉన్నారు?.",
            "VCTK_tem_00682",
            None,
            "te",
        ],
        [
            "శుభోదయం ఘట్టమనేని సూర్యప్రకాశ్ గారు",
            "VCTK_tem_00682",
            None,
            "te",
        ],
    ],
)

# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of `[text, audio_file_path, speaker_name]`
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to the `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# init speaker manager for multi-speaker training
# it maps speaker-id to speaker-name in the model and data-loader
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

# init model
model = Vits(config, ap, tokenizer, speaker_manager)

# init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()

So far I have completed 7000 steps, and the audio I am getting in TensorBoard is gibberish; it doesn't sound like Telugu at all. What am I doing wrong? Please correct me.

The audio samples after 7000 steps of training are here:
https://drive.google.com/drive/folders/1k3OMxE5SpFgV1KQpDpDjMJYKmJhZaAoM?usp=drive_link
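
One sanity check that may be worth running in a multi-speaker setup like this is to confirm that the speaker name used in `test_sentences` (`VCTK_tem_00682` in the script above) really occurs among the loaded samples, and to see how many clips each speaker contributes. The sketch below is only an illustration: it assumes each sample is a dictionary with a `speaker_name` key (as implied by `parse_key="speaker_name"` in the script), the helper `count_speakers` is hypothetical, and the sample entries, including the second speaker name, are made up.

```python
from collections import Counter


def count_speakers(samples, key="speaker_name"):
    """Return how many samples each speaker contributes."""
    return Counter(sample[key] for sample in samples)


# Made-up stand-ins for the samples returned by the dataset loader.
samples = [
    {"text": "...", "audio_file": "a.wav", "speaker_name": "VCTK_tem_00682"},
    {"text": "...", "audio_file": "b.wav", "speaker_name": "VCTK_tem_00682"},
    {"text": "...", "audio_file": "c.wav", "speaker_name": "VCTK_tem_00101"},
]
test_speakers = ["VCTK_tem_00682"]  # speaker names referenced in test_sentences

counts = count_speakers(samples)
missing = [name for name in test_speakers if name not in counts]
print(counts.most_common())
print("missing test speakers:", missing)
```

With real data, `samples` would be `train_samples + eval_samples` from the script; an empty `missing` list only rules out one possible mistake and says nothing about audio quality.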

This is the config.json file as saved in the run directory:
{
"output_path": "/home/ubuntu/TTS",
"logger_uri": null,
"run_name": "tel_vits",
"project_name": null,
"run_description": "\ud83d\udc38Coqui trainer run.",
"print_step": 25,
"plot_step": 100,
"model_param_stats": false,
"wandb_entity": null,
"dashboard_logger": "tensorboard",
"save_on_interrupt": true,
"log_model_step": 10000,
"save_step": 10000,
"save_n_checkpoints": 5,
"save_checkpoints": true,
"save_all_best": false,
"save_best_after": 10000,
"target_loss": null,
"print_eval": false,
"test_delay_epochs": -1,
"run_eval": true,
"run_eval_steps": null,
"distributed_backend": "nccl",
"distributed_url": "tcp://localhost:54321",
"mixed_precision": true,
"precision": "fp16",
"epochs": 1000,
"batch_size": 32,
"eval_batch_size": 16,
"grad_clip": [
1000,
1000
],
"scheduler_after_epoch": true,
"lr": 0.001,
"optimizer": "AdamW",
"optimizer_params": {
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"weight_decay": 0.01
},
"lr_scheduler": null,
"lr_scheduler_params": {},
"use_grad_scaler": false,
"allow_tf32": false,
"cudnn_enable": true,
"cudnn_deterministic": false,
"cudnn_benchmark": false,
"training_seed": 54321,
"model": "vits",
"num_loader_workers": 0,
"num_eval_loader_workers": 4,
"use_noise_augment": false,
"audio": {
"fft_size": 1024,
"sample_rate": 22050,
"win_length": 1024,
"hop_length": 256,
"num_mels": 80,
"mel_fmin": 0,
"mel_fmax": null
},
"use_phonemes": true,
"phonemizer": "espeak",
"phoneme_language": "te",
"compute_input_seq_cache": true,
"text_cleaner": "multilingual_cleaners",
"enable_eos_bos_chars": false,
"test_sentences_file": "",
"phoneme_cache_path": "/home/ubuntu/TTS/phoneme_cache/tel",
"characters": {
"characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
"vocab_dict": null,
"pad": "",
"eos": "",
"bos": "",
"blank": "",
"characters": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u02b2\u025a\u02de\u026b",
"punctuations": "!'(),-.:;? ",
"phonemes": null,
"is_unique": false,
"is_sorted": true
},
"add_blank": true,
"batch_group_size": 5,
"loss_masking": null,
"min_audio_len": 1,
"max_audio_len": Infinity,
"min_text_len": 1,
"max_text_len": 325,
"compute_f0": false,
"compute_energy": false,
"compute_linear_spec": true,
"precompute_num_workers": 0,
"start_by_longest": false,
"shuffle": false,
"drop_last": false,
"datasets": [
{
"formatter": "vctk",
"dataset_name": "",
"path": "/home/ubuntu/TTS/te_male",
"meta_file_train": "",
"ignored_speakers": null,
"language": "te",
"phonemizer": "espeak",
"meta_file_val": "",
"meta_file_attn_mask": ""
}
],
"test_sentences": [
[
"\u0c28\u0c2e\u0c38\u0c4d\u0c24\u0c47 \u0c16\u0c3e\u0c32\u0c3f\u0c26\u0c4d \u0c17\u0c3e\u0c30\u0c41, \u0c0e\u0c32\u0c3e \u0c09\u0c28\u0c4d\u0c28\u0c3e\u0c30\u0c41?",
"VCTK_tem_00682",
null,
"te"
],
[
"\u0c28\u0c2e\u0c38\u0c4d\u0c15\u0c3e\u0c30\u0c2e\u0c41 \u0c35\u0c46\u0c02\u0c15\u0c1f\u0c30\u0c3e\u0c2e\u0c23 \u0c17\u0c3e\u0c30\u0c41, \u0c35\u0c46\u0c02\u0c15\u0c1f\u0c47\u0c36\u0c4d\u0c35\u0c30\u0c4d\u0c32\u0c41 \u0c17\u0c3e\u0c30\u0c41",
"VCTK_tem_00682",
null,
"te"
],
[
"\u0c28\u0c35\u0c40\u0c26\u0c4d \u0c05\u0c39\u0c4d\u0c2e\u0c26\u0c4d \u0c17\u0c3e\u0c30\u0c41, \u0c0e\u0c32\u0c3e \u0c09\u0c28\u0c4d\u0c28\u0c3e\u0c30\u0c41? 2024 \u0c32\u0c4b \u0c15\u0c3e\u0c02\u0c17\u0c4d\u0c30\u0c46\u0c38\u0c4d\u0c15\u0c3f \u0c35\u0c4b\u0c1f\u0c41 \u0c35\u0c47\u0c2f\u0c02\u0c21\u0c3f.",
"VCTK_tem_00682",
null,
"te"
],
[
"\u0c36\u0c41\u0c2d\u0c4b\u0c26\u0c2f\u0c02 \u0c18\u0c1f\u0c4d\u0c1f\u0c2e\u0c28\u0c47\u0c28\u0c3f \u0c38\u0c42\u0c30\u0c4d\u0c2f\u0c2a\u0c4d\u0c30\u0c15\u0c3e\u0c36\u0c4d \u0c17\u0c3e\u0c30\u0c41",
"VCTK_tem_00682",
null,
"te"
]
],
"eval_split_max_size": null,
"eval_split_size": 0.01,
"use_speaker_weighted_sampler": false,
"speaker_weighted_sampler_alpha": 1.0,
"use_language_weighted_sampler": false,
"language_weighted_sampler_alpha": 1.0,
"use_length_weighted_sampler": false,
"length_weighted_sampler_alpha": 1.0,
"model_args": {
"num_chars": 131,
"out_channels": 513,
"spec_segment_size": 32,
"hidden_channels": 192,
"hidden_channels_ffn_text_encoder": 768,
"num_heads_text_encoder": 2,
"num_layers_text_encoder": 6,
"kernel_size_text_encoder": 3,
"dropout_p_text_encoder": 0.1,
"dropout_p_duration_predictor": 0.5,
"kernel_size_posterior_encoder": 5,
"dilation_rate_posterior_encoder": 1,
"num_layers_posterior_encoder": 16,
"kernel_size_flow": 5,
"dilation_rate_flow": 1,
"num_layers_flow": 4,
"resblock_type_decoder": "1",
"resblock_kernel_sizes_decoder": [
3,
7,
11
],
"resblock_dilation_sizes_decoder": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates_decoder": [
8,
8,
2,
2
],
"upsample_initial_channel_decoder": 512,
"upsample_kernel_sizes_decoder": [
16,
16,
4,
4
],
"periods_multi_period_discriminator": [
2,
3,
5,
7,
11
],
"use_sdp": true,
"noise_scale": 1.0,
"inference_noise_scale": 0.667,
"length_scale": 1,
"noise_scale_dp": 1.0,
"inference_noise_scale_dp": 1.0,
"max_inference_len": null,
"init_discriminator": true,
"use_spectral_norm_disriminator": false,
"use_speaker_embedding": true,
"num_speakers": 23,
"speakers_file": "/home/ubuntu/TTS/tel_vits-October-27-2023_03+51AM-99635193/speakers.pth",
"d_vector_file": null,
"speaker_embedding_channels": 256,
"use_d_vector_file": false,
"d_vector_dim": 0,
"detach_dp_input": true,
"use_language_embedding": false,
"embedded_language_dim": 4,
"num_languages": 0,
"language_ids_file": null,
"use_speaker_encoder_as_loss": false,
"speaker_encoder_config_path": "",
"speaker_encoder_model_path": "",
"condition_dp_on_speaker": true,
"freeze_encoder": false,
"freeze_DP": false,
"freeze_PE": false,
"freeze_flow_decoder": false,
"freeze_waveform_decoder": false,
"encoder_sample_rate": null,
"interpolate_z": true,
"reinit_DP": false,
"reinit_text_encoder": false
},
"lr_gen": 0.0002,
"lr_disc": 0.0002,
"lr_scheduler_gen": "ExponentialLR",
"lr_scheduler_gen_params": {
"gamma": 0.999875,
"last_epoch": -1
},
"lr_scheduler_disc": "ExponentialLR",
"lr_scheduler_disc_params": {
"gamma": 0.999875,
"last_epoch": -1
},
"kl_loss_alpha": 1.0,
"disc_loss_alpha": 1.0,
"gen_loss_alpha": 1.0,
"feat_loss_alpha": 1.0,
"mel_loss_alpha": 45.0,
"dur_loss_alpha": 1.0,
"speaker_encoder_loss_alpha": 1.0,
"return_wav": true,
"use_weighted_sampler": false,
"weighted_sampler_attrs": {},
"weighted_sampler_multipliers": {},
"r": 1,
"num_speakers": 0,
"use_speaker_embedding": true,
"speakers_file": "/home/ubuntu/TTS/tel_vits-October-27-2023_03+51AM-99635193/speakers.pth",
"speaker_embedding_channels": 256,
"language_ids_file": null,
"use_language_embedding": false,
"use_d_vector_file": false,
"d_vector_file": null,
"d_vector_dim": 0
}
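
The `characters` block in this config defines the symbol set the tokenizer knows about. As a rough diagnostic, one could compare the text that reaches the tokenizer against that set; symbols outside it are, to my understanding, dropped during tokenization. The sketch below uses only the standard library. The helper `out_of_vocab` is hypothetical, and the inline JSON is a shortened stand-in for the real `config.json`, not a copy of it. Whether this was the cause of the problem in this thread is not stated anywhere here.

```python
import json


def out_of_vocab(text, characters, punctuations):
    """Return the sorted symbols in `text` that the vocabulary does not cover."""
    vocab = set(characters) | set(punctuations)
    return sorted({ch for ch in text if ch not in vocab})


# Shortened stand-in for the "characters" section of config.json (assumption).
config_text = '{"characters": {"characters": "abn\\u0259", "punctuations": "!?, "}}'
chars_cfg = json.loads(config_text)["characters"]

# A string mixing covered IPA-like symbols with one Telugu letter.
sample_text = "nab\u0259, న?"
uncovered = out_of_vocab(sample_text, chars_cfg["characters"], chars_cfg["punctuations"])
print(uncovered)
```

For a real run, `config_text` would be replaced by the contents of the saved `config.json`, and `sample_text` by the phonemized (or raw) training text.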

naveed81 · 2023-10-29T08:02:00Z

Author

I fixed it by not using phonemes. If anyone wants a detailed explanation, drop a msg.
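
The exact changes are not given in this thread. As a rough guess at what "not using phonemes" could involve, training on raw characters would mean setting `use_phonemes=False` and supplying a character set that covers the Telugu script instead of the IPA defaults. The sketch below collects such a set from transcripts using only the standard library. `build_character_set` is a hypothetical helper, and feeding its result into the `characters` and `punctuations` fields of a characters config is an assumption about how one would wire it up, not something confirmed by the author.

```python
import unicodedata


def build_character_set(transcripts):
    """Collect the unique characters and punctuation marks used in the transcripts."""
    chars, puncts = set(), set()
    for line in transcripts:
        # Normalize so that equivalent code point sequences are counted once.
        for ch in unicodedata.normalize("NFC", line):
            if ch.isspace():
                continue
            if unicodedata.category(ch).startswith("P"):
                puncts.add(ch)
            else:
                chars.add(ch)
    return {
        "characters": "".join(sorted(chars)),
        # Keep the space with the punctuation, mirroring the config shown earlier.
        "punctuations": "".join(sorted(puncts)) + " ",
    }


# Two of the test sentences from the original post, used as sample transcripts.
transcripts = [
    "నమస్తే ఖాలిద్ గారు, ఎలా ఉన్నారు?",
    "శుభోదయం ఘట్టమనేని సూర్యప్రకాశ్ గారు",
]
vocab = build_character_set(transcripts)
print(len(vocab["characters"]), repr(vocab["punctuations"]))
```

On a full dataset, `transcripts` would be every line of text in the metadata, so that no character seen in training is missing from the set.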

3 replies

Rakshitha-Ummadisetti Sep 10, 2024

Can you please provide a detailed explanation?
Thanks in advance.

naveed81 Sep 13, 2024
Author

Email me at [email protected] with what you have achieved so far and where you are stuck.

pschakravarthi Jan 29, 2025

> I fixed it by not using phonemes. If anyone wants a detailed explanation, drop a msg.

Hi, I am a newbie trying to get my hands dirty by training a new model for Telugu, in case nothing is ready yet. Can you advise how to get Telugu working in TTS?

naveed81 · 2025-04-05T10:48:49Z

Author

Could you figure it out?

pschakravarthi · 2025-04-05T11:06:45Z

No, still trying to get something working.

naveed81 · 2025-04-05T11:11:23Z

Author

What's your machine and its configuration?

1 reply

pschakravarthi Apr 5, 2025

I mean, I am able to run TTS in Docker with English, but I am trying to see if I can get a model for Telugu.

Rakshitha-Ummadisetti · 2025-04-05T18:22:30Z

I trained VITS later on. It worked fine for me.

1 reply

pschakravarthi Apr 6, 2025

Can you share the model details and how to use it?

naveed81 · 2025-04-06T10:16:41Z

Author

I don't have access to it now
