Trying to train telugu audio dataset from scratch on vits #3112
Unanswered
naveed81
asked this question in
General Q&A
Replies: 6 comments 5 replies
-
I fixed it by not using phonemes. If anyone wants a detailed explanation, drop me a message.
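For anyone trying the same route: training on raw characters instead of phonemes usually means the tokenizer needs a character vocabulary that covers the Telugu script. The exact settings used for the fix are not given in this thread, so the following is only a minimal sketch, assuming the vocabulary is collected from the dataset transcripts. The helper name `build_char_vocab` and the sample transcripts are hypothetical; the resulting string could then be supplied as the `characters` entry of the config, with `use_phonemes` set to false.

```python
import unicodedata


def build_char_vocab(transcripts, punctuations="!'(),-.:;? "):
    """Collect the distinct non-punctuation characters used in the transcripts."""
    chars = set()
    for text in transcripts:
        # Normalize to NFC so the same text always yields the same code points.
        for ch in unicodedata.normalize("NFC", text):
            if ch not in punctuations:
                chars.add(ch)
    # Sorted so the vocabulary order is stable between runs.
    return "".join(sorted(chars))


# Hypothetical transcripts; in practice these would be read from the dataset.
transcripts = ["నమస్తే గారు, ఎలా ఉన్నారు?", "శుభోదయం"]
vocab = build_char_vocab(transcripts)
print(len(vocab), repr(vocab))
```

The punctuation string here is the one from the saved config.json; punctuation is kept out of `characters` because the config lists it separately.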
3 replies
-
Could you figure it out?
On Wed, Jan 29, 2025 at 1:37 PM pschakravarthi wrote:
> I fixed it by not using phonemes. If anyone wants a detailed explanation, drop a msg.
Hi, I am a newbie trying to get my hands dirty by training a new model for Telugu, if one is not already available. Can you advise how to get Telugu working in TTS?
0 replies
-
No, I'm still trying to get something working.
0 replies
-
What's your machine and its configuration?
1 reply
-
I've trained VITS later on.
It worked fine for me
On Sat, 5 Apr, 2025, 23:41 pschakravarthi wrote:
> I mean, I am able to run TTS in Docker with English, but I am trying to see if I can get a model for Telugu.
1 reply
-
I don't have access to it now
On Sun, 6 Apr, 2025, 15:45 pschakravarthi wrote:
> Can you share the model details and how to use it?
0 replies
-
My dataset is a multi-speaker one. Below is my training script:
import os
from trainer import Trainer, TrainerArgs
from TTS.tts.configs.shared_configs import BaseDatasetConfig
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits, VitsArgs, VitsAudioConfig
from TTS.tts.utils.speakers import SpeakerManager
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor
from TTS.bin.compute_embeddings import compute_embeddings
from TTS.tts.utils.data import get_length_balancer_weights
from TTS.tts.utils.languages import LanguageManager, get_language_balancer_weights
from TTS.tts.utils.speakers import SpeakerManager, get_speaker_balancer_weights, get_speaker_manager
output_path = os.path.dirname(os.path.abspath(__file__))
dataset_config = BaseDatasetConfig(
    formatter="vctk", meta_file_train="", phonemizer="espeak", language="te", path=os.path.join(output_path, "te_male")
)
audio_config = VitsAudioConfig(
    sample_rate=22050, win_length=1024, hop_length=256, num_mels=80, mel_fmin=0, mel_fmax=None
)
vitsArgs = VitsArgs(
    use_speaker_embedding=True,
)
config = VitsConfig(
    model_args=vitsArgs,
    audio=audio_config,
    run_name="tel_vits",
    batch_size=32,
    eval_batch_size=16,
    batch_group_size=5,
    num_loader_workers=0,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1000,
    text_cleaner="multilingual_cleaners",
    use_phonemes=True,
    phoneme_language="te",
    # phonemizer="espeak",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache/tel"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=False,
    mixed_precision=True,
    max_text_len=325,  # change this if you have a larger VRAM than 16GB
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
    # Each test sentence is [text, speaker_name, style_wav, language_name].
    test_sentences=[
        [
            "నమస్తే ఖాలిద్ గారు, ఎలా ఉన్నారు?",
            "VCTK_tem_00682",
            None,
            "te",
        ],
        [
            "నమస్కారము వెంకటరామణ గారు, వెంకటేశ్వర్లు గారు",
            "VCTK_tem_00682",
            None,
            "te",
        ],
        [
            "నవీద్ అహ్మద్ గారు, ఎలా ఉన్నారు?.",
            "VCTK_tem_00682",
            None,
            "te",
        ],
        [
            "శుభోదయం ఘట్టమనేని సూర్యప్రకాశ్ గారు",
            "VCTK_tem_00682",
            None,
            "te",
        ],
    ],
)
# INITIALIZE THE AUDIO PROCESSOR
# Audio processor is used for feature extraction and audio I/O.
# It mainly serves to the dataloader and the training loggers.
ap = AudioProcessor.init_from_config(config)

# INITIALIZE THE TOKENIZER
# Tokenizer is used to convert text to sequences of token IDs.
# config is updated with the default characters if not defined in the config.
tokenizer, config = TTSTokenizer.init_from_config(config)

# LOAD DATA SAMPLES
# Each sample is a list of [text, audio_file_path, speaker_name].
# You can define your custom sample loader returning the list of samples.
# Or define your custom formatter and pass it to `load_tts_samples`.
# Check `TTS.tts.datasets.load_tts_samples` for more details.
train_samples, eval_samples = load_tts_samples(
    dataset_config,
    eval_split=True,
    eval_split_max_size=config.eval_split_max_size,
    eval_split_size=config.eval_split_size,
)

# init speaker manager for multi-speaker training
# it maps speaker-id to speaker-name in the model and data-loader
speaker_manager = SpeakerManager()
speaker_manager.set_ids_from_data(train_samples + eval_samples, parse_key="speaker_name")
config.model_args.num_speakers = speaker_manager.num_speakers

# init model
model = Vits(config, ap, tokenizer, speaker_manager)

# init the trainer and 🚀
trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
So far I have completed 7000 steps, and the audio I am getting in TensorBoard is gibberish; it doesn't sound like Telugu at all. What am I doing wrong? Please correct me.
The audio samples after 7000 steps of training are here:
https://drive.google.com/drive/folders/1k3OMxE5SpFgV1KQpDpDjMJYKmJhZaAoM?usp=drive_link
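The thread does not establish why the phoneme-based run produced gibberish. One check that may help narrow it down is whether the text that reaches the tokenizer is actually covered by the configured vocabulary. The sketch below is an assumption about how such a check could look, not part of the TTS library: the helper `find_unknown_symbols`, the shortened `characters` string, and both input strings are hypothetical. In a real run the inputs would be the phonemizer's output for a few training transcripts, and the vocabulary would be the full `characters` and `punctuations` strings from config.json.

```python
def find_unknown_symbols(text, characters, punctuations):
    """Return the symbols in `text` that are missing from the tokenizer vocabulary."""
    vocab = set(characters) | set(punctuations)
    return sorted(set(text) - vocab)


# Hypothetical, shortened vocabulary; the real one is the "characters" string in config.json.
characters = "aeiouktmnsɾ"
punctuations = "!'(),-.:;? "

# Hypothetical inputs: one in Latin/IPA symbols, one still in Telugu script.
print(find_unknown_symbols("namaste", characters, punctuations))
print(find_unknown_symbols("నమస్తే", characters, punctuations))
```

If the second kind of input shows up, that is, Telugu script arriving where IPA symbols are expected, most of the text would be unknown to an IPA-only vocabulary.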
This is the config.json file as saved in the run directory:
{
"output_path": "/home/ubuntu/TTS",
"logger_uri": null,
"run_name": "tel_vits",
"project_name": null,
"run_description": "\ud83d\udc38Coqui trainer run.",
"print_step": 25,
"plot_step": 100,
"model_param_stats": false,
"wandb_entity": null,
"dashboard_logger": "tensorboard",
"save_on_interrupt": true,
"log_model_step": 10000,
"save_step": 10000,
"save_n_checkpoints": 5,
"save_checkpoints": true,
"save_all_best": false,
"save_best_after": 10000,
"target_loss": null,
"print_eval": false,
"test_delay_epochs": -1,
"run_eval": true,
"run_eval_steps": null,
"distributed_backend": "nccl",
"distributed_url": "tcp://localhost:54321",
"mixed_precision": true,
"precision": "fp16",
"epochs": 1000,
"batch_size": 32,
"eval_batch_size": 16,
"grad_clip": [
1000,
1000
],
"scheduler_after_epoch": true,
"lr": 0.001,
"optimizer": "AdamW",
"optimizer_params": {
"betas": [
0.8,
0.99
],
"eps": 1e-09,
"weight_decay": 0.01
},
"lr_scheduler": null,
"lr_scheduler_params": {},
"use_grad_scaler": false,
"allow_tf32": false,
"cudnn_enable": true,
"cudnn_deterministic": false,
"cudnn_benchmark": false,
"training_seed": 54321,
"model": "vits",
"num_loader_workers": 0,
"num_eval_loader_workers": 4,
"use_noise_augment": false,
"audio": {
"fft_size": 1024,
"sample_rate": 22050,
"win_length": 1024,
"hop_length": 256,
"num_mels": 80,
"mel_fmin": 0,
"mel_fmax": null
},
"use_phonemes": true,
"phonemizer": "espeak",
"phoneme_language": "te",
"compute_input_seq_cache": true,
"text_cleaner": "multilingual_cleaners",
"enable_eos_bos_chars": false,
"test_sentences_file": "",
"phoneme_cache_path": "/home/ubuntu/TTS/phoneme_cache/tel",
"characters": {
"characters_class": "TTS.tts.utils.text.characters.IPAPhonemes",
"vocab_dict": null,
"pad": "",
"eos": "",
"bos": "",
"blank": "",
"characters": "iy\u0268\u0289\u026fu\u026a\u028f\u028ae\u00f8\u0258\u0259\u0275\u0264o\u025b\u0153\u025c\u025e\u028c\u0254\u00e6\u0250a\u0276\u0251\u0252\u1d7b\u0298\u0253\u01c0\u0257\u01c3\u0284\u01c2\u0260\u01c1\u029bpbtd\u0288\u0256c\u025fk\u0261q\u0262\u0294\u0274\u014b\u0272\u0273n\u0271m\u0299r\u0280\u2c71\u027e\u027d\u0278\u03b2fv\u03b8\u00f0sz\u0283\u0292\u0282\u0290\u00e7\u029dx\u0263\u03c7\u0281\u0127\u0295h\u0266\u026c\u026e\u028b\u0279\u027bj\u0270l\u026d\u028e\u029f\u02c8\u02cc\u02d0\u02d1\u028dw\u0265\u029c\u02a2\u02a1\u0255\u0291\u027a\u0267\u02b2\u025a\u02de\u026b",
"punctuations": "!'(),-.:;? ",
"phonemes": null,
"is_unique": false,
"is_sorted": true
},
"add_blank": true,
"batch_group_size": 5,
"loss_masking": null,
"min_audio_len": 1,
"max_audio_len": Infinity,
"min_text_len": 1,
"max_text_len": 325,
"compute_f0": false,
"compute_energy": false,
"compute_linear_spec": true,
"precompute_num_workers": 0,
"start_by_longest": false,
"shuffle": false,
"drop_last": false,
"datasets": [
{
"formatter": "vctk",
"dataset_name": "",
"path": "/home/ubuntu/TTS/te_male",
"meta_file_train": "",
"ignored_speakers": null,
"language": "te",
"phonemizer": "espeak",
"meta_file_val": "",
"meta_file_attn_mask": ""
}
],
"test_sentences": [
[
"\u0c28\u0c2e\u0c38\u0c4d\u0c24\u0c47 \u0c16\u0c3e\u0c32\u0c3f\u0c26\u0c4d \u0c17\u0c3e\u0c30\u0c41, \u0c0e\u0c32\u0c3e \u0c09\u0c28\u0c4d\u0c28\u0c3e\u0c30\u0c41?",
"VCTK_tem_00682",
null,
"te"
],
[
"\u0c28\u0c2e\u0c38\u0c4d\u0c15\u0c3e\u0c30\u0c2e\u0c41 \u0c35\u0c46\u0c02\u0c15\u0c1f\u0c30\u0c3e\u0c2e\u0c23 \u0c17\u0c3e\u0c30\u0c41, \u0c35\u0c46\u0c02\u0c15\u0c1f\u0c47\u0c36\u0c4d\u0c35\u0c30\u0c4d\u0c32\u0c41 \u0c17\u0c3e\u0c30\u0c41",
"VCTK_tem_00682",
null,
"te"
],
[
"\u0c28\u0c35\u0c40\u0c26\u0c4d \u0c05\u0c39\u0c4d\u0c2e\u0c26\u0c4d \u0c17\u0c3e\u0c30\u0c41, \u0c0e\u0c32\u0c3e \u0c09\u0c28\u0c4d\u0c28\u0c3e\u0c30\u0c41? 2024 \u0c32\u0c4b \u0c15\u0c3e\u0c02\u0c17\u0c4d\u0c30\u0c46\u0c38\u0c4d\u0c15\u0c3f \u0c35\u0c4b\u0c1f\u0c41 \u0c35\u0c47\u0c2f\u0c02\u0c21\u0c3f.",
"VCTK_tem_00682",
null,
"te"
],
[
"\u0c36\u0c41\u0c2d\u0c4b\u0c26\u0c2f\u0c02 \u0c18\u0c1f\u0c4d\u0c1f\u0c2e\u0c28\u0c47\u0c28\u0c3f \u0c38\u0c42\u0c30\u0c4d\u0c2f\u0c2a\u0c4d\u0c30\u0c15\u0c3e\u0c36\u0c4d \u0c17\u0c3e\u0c30\u0c41",
"VCTK_tem_00682",
null,
"te"
]
],
"eval_split_max_size": null,
"eval_split_size": 0.01,
"use_speaker_weighted_sampler": false,
"speaker_weighted_sampler_alpha": 1.0,
"use_language_weighted_sampler": false,
"language_weighted_sampler_alpha": 1.0,
"use_length_weighted_sampler": false,
"length_weighted_sampler_alpha": 1.0,
"model_args": {
"num_chars": 131,
"out_channels": 513,
"spec_segment_size": 32,
"hidden_channels": 192,
"hidden_channels_ffn_text_encoder": 768,
"num_heads_text_encoder": 2,
"num_layers_text_encoder": 6,
"kernel_size_text_encoder": 3,
"dropout_p_text_encoder": 0.1,
"dropout_p_duration_predictor": 0.5,
"kernel_size_posterior_encoder": 5,
"dilation_rate_posterior_encoder": 1,
"num_layers_posterior_encoder": 16,
"kernel_size_flow": 5,
"dilation_rate_flow": 1,
"num_layers_flow": 4,
"resblock_type_decoder": "1",
"resblock_kernel_sizes_decoder": [
3,
7,
11
],
"resblock_dilation_sizes_decoder": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"upsample_rates_decoder": [
8,
8,
2,
2
],
"upsample_initial_channel_decoder": 512,
"upsample_kernel_sizes_decoder": [
16,
16,
4,
4
],
"periods_multi_period_discriminator": [
2,
3,
5,
7,
11
],
"use_sdp": true,
"noise_scale": 1.0,
"inference_noise_scale": 0.667,
"length_scale": 1,
"noise_scale_dp": 1.0,
"inference_noise_scale_dp": 1.0,
"max_inference_len": null,
"init_discriminator": true,
"use_spectral_norm_disriminator": false,
"use_speaker_embedding": true,
"num_speakers": 23,
"speakers_file": "/home/ubuntu/TTS/tel_vits-October-27-2023_03+51AM-99635193/speakers.pth",
"d_vector_file": null,
"speaker_embedding_channels": 256,
"use_d_vector_file": false,
"d_vector_dim": 0,
"detach_dp_input": true,
"use_language_embedding": false,
"embedded_language_dim": 4,
"num_languages": 0,
"language_ids_file": null,
"use_speaker_encoder_as_loss": false,
"speaker_encoder_config_path": "",
"speaker_encoder_model_path": "",
"condition_dp_on_speaker": true,
"freeze_encoder": false,
"freeze_DP": false,
"freeze_PE": false,
"freeze_flow_decoder": false,
"freeze_waveform_decoder": false,
"encoder_sample_rate": null,
"interpolate_z": true,
"reinit_DP": false,
"reinit_text_encoder": false
},
"lr_gen": 0.0002,
"lr_disc": 0.0002,
"lr_scheduler_gen": "ExponentialLR",
"lr_scheduler_gen_params": {
"gamma": 0.999875,
"last_epoch": -1
},
"lr_scheduler_disc": "ExponentialLR",
"lr_scheduler_disc_params": {
"gamma": 0.999875,
"last_epoch": -1
},
"kl_loss_alpha": 1.0,
"disc_loss_alpha": 1.0,
"gen_loss_alpha": 1.0,
"feat_loss_alpha": 1.0,
"mel_loss_alpha": 45.0,
"dur_loss_alpha": 1.0,
"speaker_encoder_loss_alpha": 1.0,
"return_wav": true,
"use_weighted_sampler": false,
"weighted_sampler_attrs": {},
"weighted_sampler_multipliers": {},
"r": 1,
"num_speakers": 0,
"use_speaker_embedding": true,
"speakers_file": "/home/ubuntu/TTS/tel_vits-October-27-2023_03+51AM-99635193/speakers.pth",
"speaker_embedding_channels": 256,
"language_ids_file": null,
"use_language_embedding": false,
"use_d_vector_file": false,
"d_vector_file": null,
"d_vector_dim": 0
}
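Since the saved config is long, it can help to pull out only the fields that control text processing when comparing runs (for example a phoneme-based run against a character-based one). This is a minimal sketch using only the standard library; the helper name `text_frontend_summary` is hypothetical, and the embedded JSON is a short excerpt of the file above rather than the full config.

```python
import json


def text_frontend_summary(config):
    """Pick out the config fields that decide how text becomes token IDs."""
    keys = ("use_phonemes", "phonemizer", "phoneme_language", "text_cleaner", "add_blank")
    summary = {key: config.get(key) for key in keys}
    summary["characters_class"] = (config.get("characters") or {}).get("characters_class")
    summary["num_chars"] = (config.get("model_args") or {}).get("num_chars")
    return summary


# Excerpt of the config.json shown above.
raw = """
{
  "use_phonemes": true,
  "phonemizer": "espeak",
  "phoneme_language": "te",
  "text_cleaner": "multilingual_cleaners",
  "add_blank": true,
  "max_audio_len": Infinity,
  "characters": {"characters_class": "TTS.tts.utils.text.characters.IPAPhonemes"},
  "model_args": {"num_chars": 131}
}
"""

# Python's json module accepts the non-standard Infinity literal by default.
config = json.loads(raw)
summary = text_frontend_summary(config)
for key, value in summary.items():
    print(f"{key}: {value}")
```

To run it against the real file, replace the excerpt with the contents of config.json from the run directory.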