Fine-grained emotional control for Text-to-Speech enables generation of speech with varying emotional intensities. This repository implements a ranking model that learns inter- and intra-class emotion strength, together with a FastSpeech2-based TTS system conditioned on those intensities. Preprocessing converts raw audio into features, aligns transcripts with the Montreal Forced Aligner (MFA), and splits the data for training. The EmoV-DB dataset is used; it contains multiple speakers, each recorded in several emotions. Example scripts are provided for preparing data, training the models, and performing inference.
For the original work, please refer to the ICASSP 2023 paper.
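As a rough illustration of the ranking idea, the sketch below scores utterance-level embeddings and trains them with a pairwise margin ranking loss so that emotional utterances score higher than neutral ones. This is only a minimal PyTorch sketch; the actual rank model in `rank_model/` differs in its architecture, pairing strategy, and loss.

```python
import torch
import torch.nn as nn

class IntensityExtractor(nn.Module):
    """Toy intensity extractor: maps an utterance-level embedding to a scalar score."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)  # (batch,) intensity scores

# Pairwise ranking: an emotional utterance should score higher than a neutral one.
model = IntensityExtractor()
rank_loss = nn.MarginRankingLoss(margin=1.0)

emotional = torch.randn(8, 256)   # placeholder emotional-utterance embeddings
neutral = torch.randn(8, 256)     # placeholder neutral-utterance embeddings
target = torch.ones(8)            # +1 means "first input should rank higher"

loss = rank_loss(model(emotional), model(neutral), target)
loss.backward()
```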
- Docker image: `pytorch/pytorch:2.2.0-cuda11.8-cudnn8-devel`
- GPU: NVIDIA RTX 4060 (8GB VRAM)
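If you use the Docker image above, one possible way to launch it is shown below; the mount point and working directory are assumptions, so adjust them to your setup.

```bash
# Start the container with GPU access and the project mounted at /workspace (paths are assumptions)
docker run --gpus all -it --rm \
  -v "$(pwd)":/workspace \
  -w /workspace \
  pytorch/pytorch:2.2.0-cuda11.8-cudnn8-devel
```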
- Clone this repository and install the Python requirements:
  ```bash
  pip install -r requirements.txt
  ```
- Download the EmoV-DB dataset and place it under `/workspace/data/EmoV-DB` (the path can be changed in `parameter.yaml`; an illustrative snippet follows this list).
- Download the pretrained HiFi-GAN vocoder for LibriTTS (16 kHz) to `/workspace/pretrained_models/tts-hifigan-libritts-16kHz`.
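The exact keys in `parameter.yaml` are defined by this repository; the snippet below only illustrates the kind of path entries referred to above, and the key names are assumptions.

```yaml
# Illustrative only; the real key names in parameter.yaml may differ.
data_dir: /workspace/data/EmoV-DB
vocoder_dir: /workspace/pretrained_models/tts-hifigan-libritts-16kHz
mfa_dir: /workspace/montreal_forced_aligner
```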
- Prepare the MFA corpus:
  ```bash
  python rank_model/prepare_mfa.py
  ```
- Install the Montreal Forced Aligner, then validate and align the corpus:
  ```bash
  # Create and activate environment
  conda create -n aligner -c conda-forge montreal-forced-aligner -y
  conda activate aligner

  # Download models and dictionary
  mfa model download acoustic english_us_arpa
  wget -O /workspace/montreal_forced_aligner/librispeech-lexicon.txt \
    https://openslr.org/resources/11/librispeech-lexicon.txt

  # Validate and align
  mfa validate /workspace/montreal_forced_aligner/corpus \
    /workspace/montreal_forced_aligner/librispeech-lexicon.txt english_us_arpa
  mfa align /workspace/montreal_forced_aligner/corpus \
    /workspace/montreal_forced_aligner/librispeech-lexicon.txt english_us_arpa \
    /workspace/montreal_forced_aligner/aligned

  # Return to base environment
  conda activate base
  ```
- Feature extraction (a sketch of typical mel-spectrogram computation follows this list):
  ```bash
  python rank_model/preprocess.py
  ```
- Prepare the FastSpeech2 dataset splits:
  ```bash
  python fastspeech2/preprocess.py
  ```
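For orientation, the kind of mel-spectrogram feature used by FastSpeech2-style models can be computed as in the sketch below; the actual STFT and mel settings used by `rank_model/preprocess.py` may differ.

```python
import librosa
import numpy as np

def extract_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Compute a log-mel spectrogram; the parameter values here are assumptions."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(np.clip(mel, 1e-5, None))  # shape: (n_mels, frames)

# Usage: mel = extract_mel("path/to/utterance.wav")
```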
Train the rank model and the FastSpeech2 model sequentially:
```bash
PYTHONPATH=. python rank_model/train.py
PYTHONPATH=. python fastspeech2/train.py
```
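Conceptually, the learned intensity conditions FastSpeech2 by being injected into the encoder hidden states. The module below is only a sketch of that idea (a scalar intensity projected and added to the hidden sequence); the real conditioning in `fastspeech2/` may use a different injection point or representation.

```python
import torch
import torch.nn as nn

class IntensityConditioner(nn.Module):
    """Project a scalar emotion intensity and add it to encoder hidden states."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(1, hidden_dim)

    def forward(self, encoder_out, intensity):
        # encoder_out: (batch, time, hidden_dim); intensity: (batch,) in [0, 1]
        bias = self.proj(intensity.unsqueeze(-1)).unsqueeze(1)  # (batch, 1, hidden_dim)
        return encoder_out + bias

conditioner = IntensityConditioner()
hidden = conditioner(torch.randn(2, 50, 256), torch.tensor([0.2, 0.9]))
```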
Generate speech using the trained models:
```bash
PYTHONPATH=. python rank_model/inference.py
PYTHONPATH=. python fastspeech2/inference.py
```
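The pretrained vocoder referenced above matches the SpeechBrain HiFi-GAN LibriTTS 16 kHz checkpoint. Assuming that interface, turning a predicted mel-spectrogram into a waveform looks roughly like the sketch below; the import path and tensor shapes may differ across SpeechBrain versions.

```python
import torch
import torchaudio
from speechbrain.inference.vocoders import HIFIGAN  # older versions: speechbrain.pretrained

# Load the LibriTTS 16 kHz HiFi-GAN from the local pretrained_models directory
hifi_gan = HIFIGAN.from_hparams(
    source="/workspace/pretrained_models/tts-hifigan-libritts-16kHz",
    savedir="/workspace/pretrained_models/tts-hifigan-libritts-16kHz",
)

mel = torch.randn(1, 80, 200)          # placeholder predicted mel (batch, n_mels, frames)
waveform = hifi_gan.decode_batch(mel)  # (batch, 1, samples)
torchaudio.save("sample.wav", waveform.squeeze(1), 16000)
```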
This plot visualizes the learned intensity representations extracted by the RankModel using t-SNE. Each point corresponds to a sentence-level representation, color-coded by its emotion label (e.g., Angry, Neutral, Amused).
From the plot, we observe that emotional utterances are well-separated in the latent space, indicating that the intensity extractor effectively captures emotion-specific characteristics. Notably, neutral and sleepiness samples form distinct clusters, supporting the model’s ability to generalize emotion intensity.
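A plot of this kind can be reproduced with scikit-learn and matplotlib from the saved sentence-level representations; the file names in the sketch below are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# embeddings: (N, D) sentence-level intensity representations; labels: (N,) emotion names
embeddings = np.load("intensity_embeddings.npy")            # placeholder file name
labels = np.load("emotion_labels.npy", allow_pickle=True)   # placeholder file name

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for emotion in np.unique(labels):
    mask = labels == emotion
    plt.scatter(points[mask, 0], points[mask, 1], s=5, label=emotion)
plt.legend()
plt.title("t-SNE of learned intensity representations")
plt.savefig("tsne_intensity.png", dpi=150)
```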
Below is a comparison between predicted and ground-truth mel-spectrograms for randomly sampled utterances at epoch 20.
- Top 8: Predicted Mel-Spectrograms
- Bottom 8: Ground Truth Mel-Spectrograms
We observe that the model captures the overall prosody and spectral shape well. However, subtle mismatches in pitch contour and energy levels still exist, especially in high-emotion utterances. Improvements are expected with additional fine-tuning or by incorporating emotion intensity explicitly.
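A side-by-side figure like this can be assembled with matplotlib; the sketch below assumes the predicted and ground-truth mels are lists of (n_mels, frames) NumPy arrays.

```python
import matplotlib.pyplot as plt

def plot_mel_comparison(predicted, ground_truth, path="mel_comparison.png"):
    """Plot predicted mels (top row) above their ground-truth counterparts (bottom row)."""
    n = len(predicted)
    fig, axes = plt.subplots(2, n, figsize=(3 * n, 6), squeeze=False)
    for i in range(n):
        axes[0][i].imshow(predicted[i], origin="lower", aspect="auto")
        axes[0][i].set_title(f"Predicted {i}")
        axes[1][i].imshow(ground_truth[i], origin="lower", aspect="auto")
        axes[1][i].set_title(f"Ground truth {i}")
    fig.tight_layout()
    fig.savefig(path, dpi=150)
```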
Wang, S., Guðnason, J., & Borth, D. (2023, June). Fine-grained emotional control of text-to-speech: Learning to rank inter- and intra-class emotion intensities. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.