Compatibility with TTS systems #47

wetdog · 2024-02-19T00:10:39Z

Various Text-to-Speech (TTS) implementations (Grad-TTS, Matcha-TTS, P-Flow) rely on the mel spectrogram feature extractor code found in hifi-gan.

This PR modifies the feature extractor so that Vocos works seamlessly with the outputs generated by those TTS systems.

To achieve this, the parameters of torchaudio.transforms.MelSpectrogram were adjusted to match the features generated by the hifi-gan codebase. Specifically, the changes affect the frequency limits and the mel scale.
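To illustrate why the mel scale matters here, the following is a minimal standard-library sketch of the two common Hz-to-mel conventions, usually called "htk" and "slaney". The function names, the 80-band count, and the 0 to 8000 Hz range are assumptions chosen for illustration, not values taken from this PR's diff. It only shows that the same band count and frequency limits place filter centers at different frequencies under each scale, which is why a vocoder and an acoustic model need to agree on it.

```python
import math

# Slaney-style scale: linear below 1 kHz, logarithmic above.
_F_SP = 200.0 / 3.0                 # Hz per mel in the linear region
_MIN_LOG_HZ = 1000.0                # start of the logarithmic region
_MIN_LOG_MEL = _MIN_LOG_HZ / _F_SP  # mel value at 1 kHz
_LOGSTEP = math.log(6.4) / 27.0     # step size in the logarithmic region


def hz_to_mel_htk(freq_hz: float) -> float:
    """HTK-style conversion: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz_htk(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def hz_to_mel_slaney(freq_hz: float) -> float:
    if freq_hz >= _MIN_LOG_HZ:
        return _MIN_LOG_MEL + math.log(freq_hz / _MIN_LOG_HZ) / _LOGSTEP
    return freq_hz / _F_SP


def mel_to_hz_slaney(mel: float) -> float:
    if mel >= _MIN_LOG_MEL:
        return _MIN_LOG_HZ * math.exp(_LOGSTEP * (mel - _MIN_LOG_MEL))
    return mel * _F_SP


def band_centers(n_mels, f_min, f_max, to_mel, to_hz):
    """Center frequencies (Hz) of n_mels bands equally spaced on the mel axis."""
    lo, hi = to_mel(f_min), to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [to_hz(lo + step * i) for i in range(1, n_mels + 1)]


# Assumed illustrative settings: 80 bands between 0 Hz and 8000 Hz.
htk_centers = band_centers(80, 0.0, 8000.0, hz_to_mel_htk, mel_to_hz_htk)
slaney_centers = band_centers(80, 0.0, 8000.0, hz_to_mel_slaney, mel_to_hz_slaney)

if __name__ == "__main__":
    for i in (0, 39, 79):
        print(f"band {i}: htk={htk_centers[i]:.1f} Hz, slaney={slaney_centers[i]:.1f} Hz")
```

In torchaudio these choices correspond to the `f_min`, `f_max`, `norm`, and `mel_scale` arguments of `torchaudio.transforms.MelSpectrogram`.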

We trained Vocos for 400k steps with these changes and were able to obtain reasonably good audio quality from the output of Matcha-TTS.

Closes #39

polarathene · 2025-10-27T00:16:37Z

vocos/loss.py

+                sample_rate: int = 22050,
+                n_fft: int = 1024,
+                hop_length: int = 256,
+                n_mels: int = 80,


Is n_mels in loss.py meant to have its default changed to 80 here? In feature_extractors.py it remains at 100. Presumably the default in loss.py was also meant to stay at 100 and only be adjusted by vocos-matcha.yaml?

wetdog · 2025-10-28

You're right, we should keep n_mels at 100 in loss.py. Also, in feature_extractors.py the defaults should be

f_max=None, norm=None, mel_scale="htk"
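As a rough sketch of that arrangement (library defaults left untouched, with a per-model config supplying the TTS-specific values), the snippet below uses a hypothetical `MelDefaults` dataclass and `apply_overrides` helper; neither exists in Vocos. The defaults mirror the ones named above, and `n_mels=80` comes from this thread. The remaining override values are assumptions for illustration and are not taken from vocos-matcha.yaml.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class MelDefaults:
    """Hypothetical container for the feature-extractor defaults discussed above."""
    n_mels: int = 100
    f_max: Optional[float] = None
    norm: Optional[str] = None
    mel_scale: str = "htk"


def apply_overrides(defaults: MelDefaults, overrides: dict) -> MelDefaults:
    """Return a copy of `defaults` with config values applied; reject unknown keys."""
    known = {f.name for f in fields(defaults)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown mel parameters: {sorted(unknown)}")
    return replace(defaults, **overrides)


# Stand-in for values a model-specific YAML file might provide.
# n_mels=80 is from the discussion; the other three values are assumed.
matcha_overrides = {
    "n_mels": 80,
    "f_max": 8000.0,
    "norm": "slaney",
    "mel_scale": "slaney",
}

base = MelDefaults()
matcha = apply_overrides(base, matcha_overrides)

if __name__ == "__main__":
    print("defaults:", base)
    print("matcha  :", matcha)
```

Keeping the overrides in the config rather than in the function signatures means existing checkpoints trained with the original defaults continue to see the same features.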

polarathene · 2025-10-29

Would you happen to have any reference for the choice between 80 and 100 n_mels?

I understand that 80 is quite common, so many models are trained with it, but I'm curious how the value was originally chosen.

Is 80 intended to be sufficient for speech specifically?

I came across a paper recently that cited 96 as a minimum for covering not only speech, but also music and general sound effects.

Both 80 and 96 are multiples of 8, which I understand tends to be preferable for compute (at least traditionally, much like texture sizes in games, although those tend to be powers of 2, so 64 vs 128). Perhaps Vocos just rounded that up to 100 🤔 I'm not sure if that would actually regress somewhere vs 96 😅

wetdog added 3 commits February 17, 2024 14:23

Update torchaudio mel spectrogram paramters

10316e2

update reconstruction loss with new mel features

342276d

Create new config with matcha parameters

734bc2f

polarathene reviewed Oct 27, 2025


polarathene mentioned this pull request Oct 27, 2025

Quality wetdog/wavenext_pytorch#1

Closed
