@wetdog wetdog commented Feb 19, 2024

Various Text-to-Speech (TTS) implementations (Grad-TTS, Matcha-TTS, P-Flow) rely on the mel spectrogram feature extractor code found in hifi-gan.

This PR modifies the feature extractor so that Vocos works seamlessly with the outputs generated by those TTS systems.

To achieve this, the parameters of torchaudio.transforms.MelSpectrogram were adjusted to match the features generated in the hifi-gan codebase. Specifically, the changes affect the frequency limits and the mel scale.
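As a rough illustration of why the mel scale setting matters, the sketch below compares the two Hz-to-mel conversions commonly referred to as "htk" and "slaney". The function names are made up for this example, and the formulas follow the widely used definitions of the two scales. It is not code from this PR.

```python
import math


def hz_to_mel_htk(freq_hz: float) -> float:
    """HTK-style mel scale: logarithmic over the whole frequency range."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def hz_to_mel_slaney(freq_hz: float) -> float:
    """Slaney-style mel scale: linear below 1 kHz, logarithmic above."""
    f_sp = 200.0 / 3.0               # Hz per mel in the linear region
    min_log_hz = 1000.0              # where the logarithmic region starts
    min_log_mel = min_log_hz / f_sp  # 15.0 mels at 1 kHz
    logstep = math.log(6.4) / 27.0   # step size in the logarithmic region
    if freq_hz < min_log_hz:
        return freq_hz / f_sp
    return min_log_mel + math.log(freq_hz / min_log_hz) / logstep


# Print both conversions side by side for a few frequencies.
for f in (250.0, 1000.0, 4000.0, 8000.0):
    print(f"{f:7.0f} Hz  htk={hz_to_mel_htk(f):8.2f}  slaney={hz_to_mel_slaney(f):6.2f}")
```

Since the two scales space the filter centres differently, features computed with one scale would not be expected to line up with a model trained on the other.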

image

We trained Vocos for 400k steps with these changes and were able to obtain reasonably good audio quality from the output of Matcha-TTS.

Closes #39

sample_rate: int = 22050,
n_fft: int = 1024,
hop_length: int = 256,
n_mels: int = 80,

Is the default for n_mels in loss.py meant to change to 80 here? In feature_extractors.py it remains at 100, so presumably the default in loss.py was also meant to stay at 100 and only be overridden by vocos-matcha.yaml?

Author

You're right, we should keep n_mels at 100 in loss.py. Also, in feature_extractors.py the defaults should be:

f_max=None,
norm=None,
mel_scale="htk",
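One way to picture the "keep the defaults, override per config" idea discussed here is the minimal sketch below. The names `DEFAULTS`, `HIFIGAN_STYLE_OVERRIDES`, and `resolve_mel_kwargs` are hypothetical. The default values come from this thread, and so does n_mels=80, but the other override values (f_min, f_max, norm, mel_scale) are assumptions chosen for illustration and are not taken from vocos-matcha.yaml.

```python
# Defaults named in this thread for the feature extractor.
DEFAULTS = {
    "n_mels": 100,
    "f_max": None,
    "norm": None,
    "mel_scale": "htk",
}

# ASSUMPTION: illustrative overrides for hifi-gan-style features.
# n_mels=80 appears in this PR; the remaining values are guesses
# and should be checked against the actual config file.
HIFIGAN_STYLE_OVERRIDES = {
    "n_mels": 80,
    "f_min": 0,
    "f_max": 8000,
    "norm": "slaney",
    "mel_scale": "slaney",
}


def resolve_mel_kwargs(overrides=None):
    """Return a copy of the defaults updated with per-config overrides."""
    kwargs = dict(DEFAULTS)          # copy so DEFAULTS is never mutated
    kwargs.update(overrides or {})
    return kwargs


print(resolve_mel_kwargs())                         # library defaults
print(resolve_mel_kwargs(HIFIGAN_STYLE_OVERRIDES))  # per-config values
```

With this layout, existing users keep the original behaviour, and only a config that opts in gets the hifi-gan-style features.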

Would you happen to have any reference for the choice between 80 and 100 n_mels?

I understand 80 has been quite common, so many models are trained with it as a result, but I am curious about the reasoning behind the original decision.

  • Is 80 intended to be sufficient for speech specifically?
  • I came across a paper recently that cited 96 as a minimum for covering not only speech, but also music and general sound effects.

Both 80 and 96 are multiples of 8, which I understand is traditionally preferable for compute (similar to texture sizes in games, although those tend to be powers of 2, so 64 vs 128). Perhaps Vocos just rounded that up to 100 🤔 I'm not sure whether that would actually regress somewhere compared with 96 😅

Linked issue: Compatibility with Matcha TTS