Fine-grained emotional control for Text-to-Speech enables generation of speech with varying emotional intensities. This repository implements a ranking model that learns inter- and intra-class emotion strength, together with a FastSpeech2-based TTS system conditioned on those intensities. Preprocessing converts raw audio into features, aligns transcripts with the Montreal Forced Aligner (MFA), and splits the data for training. The EmoV-DB dataset is used; it contains multiple speakers, each recorded in several emotions. Example scripts are provided for preparing data, training the models, and performing inference.
For the original work, please refer to the ICASSP 2023 paper.
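As a rough illustration of the ranking idea, the sketch below scores utterance-level embeddings and trains them with a pairwise margin ranking loss so that emotional utterances score higher than neutral ones. This is only a minimal PyTorch sketch; the actual rank model in `rank_model/` differs in its architecture, pairing strategy, and loss.

```python
import torch
import torch.nn as nn

class IntensityExtractor(nn.Module):
    """Toy intensity extractor: maps an utterance-level embedding to a scalar score."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # (batch,) intensity scores

# Pairwise ranking: an emotional utterance should score higher than a neutral one.
model = IntensityExtractor()
rank_loss = nn.MarginRankingLoss(margin=1.0)

emotional = torch.randn(8, 256)   # placeholder emotional-utterance embeddings
neutral = torch.randn(8, 256)     # placeholder neutral-utterance embeddings
target = torch.ones(8)            # +1 means "first input should rank higher"

loss = rank_loss(model(emotional), model(neutral), target)
loss.backward()
```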
- Docker image: `pytorch/pytorch:2.2.0-cuda11.8-cudnn8-devel`
- GPU: NVIDIA RTX 4060 (8GB VRAM)
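If you use the Docker image above, one possible way to launch it is shown below; the mount point and working directory are assumptions, so adjust them to your setup.

```bash
# Start the container with GPU access and the project mounted at /workspace (paths are assumptions)
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace \
  -w /workspace \
  pytorch/pytorch:2.2.0-cuda11.8-cudnn8-devel
```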
- Clone this repository and install the Python requirements:
  ```bash
  pip install -r requirements.txt
  ```
- Download the EmoV-DB dataset and place it under `/workspace/data/EmoV-DB` (the path can be changed in `parameter.yaml`; an illustrative snippet follows this list).
- Download the pretrained HiFi-GAN vocoder for LibriTTS (16 kHz) to `/workspace/pretrained_models/tts-hifigan-libritts-16kHz`.
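The exact keys in `parameter.yaml` are defined by this repository; the snippet below only illustrates the kind of path entries referred to above, and the key names are assumptions.

```yaml
# Illustrative only; the real key names in parameter.yaml may differ.
data_dir: /workspace/data/EmoV-DB
vocoder_dir: /workspace/pretrained_models/tts-hifigan-libritts-16kHz
mfa_dir: /workspace/montreal_forced_aligner
```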
- Prepare the MFA corpus:
  ```bash
  python rank_model/prepare_mfa.py
  ```
- Install the Montreal Forced Aligner, then validate and align the corpus:
  ```bash
  # Create and activate environment
  conda create -n aligner -c conda-forge montreal-forced-aligner -y
  conda activate aligner

  # Download models and dictionary
  mfa model download acoustic english_us_arpa
  wget -O /workspace/montreal_forced_aligner/librispeech-lexicon.txt \
    https://openslr.org/resources/11/librispeech-lexicon.txt

  # Validate and align
  mfa validate /workspace/montreal_forced_aligner/corpus \
    /workspace/montreal_forced_aligner/librispeech-lexicon.txt english_us_arpa
  mfa align /workspace/montreal_forced_aligner/corpus \
    /workspace/montreal_forced_aligner/librispeech-lexicon.txt english_us_arpa \
    /workspace/montreal_forced_aligner/aligned

  # Return to base environment
  conda activate base
  ```
- Feature extraction (a sketch of typical mel-spectrogram computation follows this list):
  ```bash
  python rank_model/preprocess.py
  ```
- Prepare the FastSpeech2 dataset splits:
  ```bash
  python fastspeech2/preprocess.py
  ```
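For orientation, the kind of mel-spectrogram feature used by FastSpeech2-style models can be computed as in the sketch below; the actual STFT and mel settings used by `rank_model/preprocess.py` may differ.

```python
import librosa
import numpy as np

def extract_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Compute a log-mel spectrogram; the parameter values here are assumptions."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(np.clip(mel, 1e-5, None))  # shape: (n_mels, frames)

# Usage: mel = extract_mel("path/to/utterance.wav")
```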
Train the rank model and the FastSpeech2 model sequentially:
```bash
PYTHONPATH=. python rank_model/train.py
PYTHONPATH=. python fastspeech2/train.py
```
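Conceptually, the learned intensity conditions FastSpeech2 by being injected into the encoder hidden states. The module below is only a sketch of that idea (a scalar intensity projected and added to the hidden sequence); the real conditioning in `fastspeech2/` may use a different injection point or representation.

```python
import torch
import torch.nn as nn

class IntensityConditioner(nn.Module):
    """Project a scalar emotion intensity and add it to encoder hidden states."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(1, hidden_dim)

    def forward(self, encoder_out, intensity):
        # encoder_out: (batch, time, hidden_dim); intensity: (batch,) in [0, 1]
        bias = self.proj(intensity.unsqueeze(-1)).unsqueeze(1)  # (batch, 1, hidden_dim)
        return encoder_out + bias

conditioner = IntensityConditioner()
hidden = conditioner(torch.randn(2, 50, 256), torch.tensor([0.2, 0.9]))
```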
Generate speech using the trained models:
```bash
PYTHONPATH=. python rank_model/inference.py
PYTHONPATH=. python fastspeech2/inference.py
```
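The pretrained vocoder referenced above matches the SpeechBrain HiFi-GAN LibriTTS 16 kHz checkpoint. Assuming that interface, turning a predicted mel-spectrogram into a waveform looks roughly like the sketch below; the import path and tensor shapes may differ across SpeechBrain versions.

```python
import torch
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN  # older versions: speechbrain.pretrained

# Load the LibriTTS 16 kHz HiFi-GAN from the local pretrained_models directory
hifi_gan = HIFIGAN.from_hparams(
    source="/workspace/pretrained_models/tts-hifigan-libritts-16kHz",
    savedir="/workspace/pretrained_models/tts-hifigan-libritts-16kHz",
)

mel = torch.randn(1, 80, 200)          # placeholder predicted mel (batch, n_mels, frames)
waveform = hifi_gan.decode_batch(mel)  # (batch, 1, samples)
torchaudio.save("sample.wav", waveform.squeeze(1), 16000)
```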
This plot visualizes the learned intensity representations extracted by the RankModel using t-SNE. Each point corresponds to a sentence-level representation, color-coded by its emotion label (e.g., Angry, Neutral, Amused).
From the plot, we observe that emotional utterances are well-separated in the latent space, indicating that the intensity extractor effectively captures emotion-specific characteristics. Notably, neutral and sleepiness samples form distinct clusters, supporting the model’s ability to generalize emotion intensity.
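A plot of this kind can be reproduced with scikit-learn and matplotlib from the saved sentence-level representations; the file names in the sketch below are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (N, D) sentence-level intensity representations; labels: (N,) emotion names
embeddings = np.load("intensity_embeddings.npy")            # placeholder file name
labels = np.load("emotion_labels.npy", allow_pickle=True)   # placeholder file name

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for emotion in np.unique(labels):
    mask = labels == emotion
    plt.scatter(points[mask, 0], points[mask, 1], s=5, label=emotion)
plt.legend()
plt.title("t-SNE of learned intensity representations")
plt.savefig("tsne_intensity.png", dpi=150)
```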
Below is a comparison between predicted and ground-truth mel-spectrograms for randomly sampled utterances at epoch 20.
- Top 8: Predicted Mel-Spectrograms
- Bottom 8: Ground Truth Mel-Spectrograms
We observe that the model captures the overall prosody and spectral shape well. However, subtle mismatches in pitch contour and energy levels still exist, especially in high-emotion utterances. Improvements are expected with additional fine-tuning or by incorporating emotion intensity explicitly.
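A side-by-side figure like this can be assembled with matplotlib; the sketch below assumes the predicted and ground-truth mels are lists of (n_mels, frames) NumPy arrays.

```python
import matplotlib.pyplot as plt

def plot_mel_comparison(predicted, ground_truth, path="mel_comparison.png"):
    """Plot predicted mels (top row) above their ground-truth counterparts (bottom row)."""
    n = len(predicted)
    fig, axes = plt.subplots(2, n, figsize=(3 * n, 6), squeeze=False)
    for i in range(n):
        axes[0][i].imshow(predicted[i], origin="lower", aspect="auto")
        axes[0][i].set_title(f"Predicted {i}")
        axes[1][i].imshow(ground_truth[i], origin="lower", aspect="auto")
        axes[1][i].set_title(f"Ground truth {i}")
    fig.tight_layout()
    fig.savefig(path, dpi=150)
```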
Wang, S., Guðnason, J., & Borth, D. (2023, June). Fine-grained emotional control of text-to-speech: Learning to rank inter- and intra-class emotion intensities. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.