Unofficial implementation of "Fine-grained Emotional Control of TTS (ICASSP 2023)" — combines a rank-based intensity model with FastSpeech2 to synthesize speech with controllable emotion intensity.

Fine-grained Emotional Control of Text-to-Speech

Learning to Rank Inter- and Intra-Class Emotion Intensities

Shijun Wang, Jón Guðnason, Damian Borth

ICASSP 2023


Fine-grained emotional control for text-to-speech enables the generation of speech with varying emotion intensities. This repository implements a ranking model that learns inter- and intra-class emotion intensity, together with a FastSpeech2-based TTS system conditioned on those intensities. Preprocessing converts raw audio into acoustic features, aligns transcripts with the Montreal Forced Aligner (MFA), and splits the data for training. The EmoV-DB dataset is used; it contains multiple speakers, each recorded in several emotion styles. Example scripts are provided for preparing data, training the models, and running inference.



⚠️ This is an unofficial implementation of the paper.
For the original work, please refer to the ICASSP 2023 paper.
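
How the intensity representation enters the TTS model is the key design point. Purely as an illustration (the module, tensor names, and dimensions below are assumptions, not the code in this repository), one common way to condition a FastSpeech2-style model is to project the per-utterance intensity vector and add it to the phoneme encoder outputs:

    import torch
    import torch.nn as nn

    class IntensityConditioner(nn.Module):
        """Illustrative module: projects an utterance-level intensity vector
        from the rank model and adds it to FastSpeech2 encoder outputs.
        Names and shapes are assumptions, not this repository's API."""

        def __init__(self, intensity_dim: int = 256, hidden_dim: int = 256):
            super().__init__()
            self.proj = nn.Linear(intensity_dim, hidden_dim)

        def forward(self, phoneme_hidden, intensity, alpha: float = 1.0):
            # phoneme_hidden: (batch, phoneme_len, hidden_dim)
            # intensity:      (batch, intensity_dim)
            # alpha scales the conditioning strength (0 = neutral, 1 = full)
            cond = alpha * self.proj(intensity)           # (batch, hidden_dim)
            return phoneme_hidden + cond.unsqueeze(1)     # broadcast over phonemes

    # Toy usage with random tensors
    conditioner = IntensityConditioner()
    hidden = torch.randn(2, 37, 256)
    intensity = torch.randn(2, 256)
    out = conditioner(hidden, intensity, alpha=0.5)
    print(out.shape)  # torch.Size([2, 37, 256])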



Environment

  • Docker image: pytorch/pytorch:2.2.0-cuda11.8-cudnn8-devel
  • GPU: NVIDIA RTX 4060 (8GB VRAM)

Setup

  1. Clone this repository and install Python requirements:
    pip install -r requirements.txt
  2. Download the EmoV-DB dataset and place it under /workspace/data/EmoV-DB (path can be changed in parameter.yaml).
  3. Download the pretrained HiFi-GAN vocoder for LibriTTS (16 kHz) to /workspace/pretrained_models/tts-hifigan-libritts-16kHz (see the loading sketch below).
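
For step 3, the model name matches the SpeechBrain release of HiFi-GAN trained on LibriTTS at 16 kHz. Assuming that is the intended vocoder, it can be downloaded and smoke-tested with a few lines; the savedir simply mirrors the path above.

    import torch
    from speechbrain.pretrained import HIFIGAN  # speechbrain.inference.vocoders.HIFIGAN in SpeechBrain >= 1.0

    # Assumption: the intended checkpoint is the SpeechBrain HiFi-GAN for
    # LibriTTS at 16 kHz. from_hparams downloads it into savedir.
    hifi_gan = HIFIGAN.from_hparams(
        source="speechbrain/tts-hifigan-libritts-16kHz",
        savedir="/workspace/pretrained_models/tts-hifigan-libritts-16kHz",
    )

    # Smoke test: vocode a dummy 80-bin mel-spectrogram.
    mel = torch.randn(1, 80, 100)          # (batch, n_mels, frames)
    waveform = hifi_gan.decode_batch(mel)  # (batch, 1, samples) at 16 kHz
    print(waveform.shape)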

Preprocessing

  1. Prepare MFA corpus

    python rank_model/prepare_mfa.py
  2. Install Montreal Forced Aligner

    # Create and activate environment
    conda create -n aligner -c conda-forge montreal-forced-aligner -y
    conda activate aligner
    
    # Download models and dictionary
    mfa model download acoustic english_us_arpa
    wget -O /workspace/montreal_forced_aligner/librispeech-lexicon.txt \
        https://openslr.org/resources/11/librispeech-lexicon.txt
    
    # Validate and align
    mfa validate /workspace/montreal_forced_aligner/corpus \
                /workspace/montreal_forced_aligner/librispeech-lexicon.txt english_us_arpa
    
    mfa align /workspace/montreal_forced_aligner/corpus \
            /workspace/montreal_forced_aligner/librispeech-lexicon.txt english_us_arpa \
            /workspace/montreal_forced_aligner/aligned
    
    # Return to base environment
    conda activate base
    
  3. Feature extraction (see the sketch after this list)

    python rank_model/preprocess.py
  4. Prepare FastSpeech2 dataset splits

    python fastspeech2/preprocess.py
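
The exact features computed by rank_model/preprocess.py are defined in the script itself; as a point of reference, a typical FastSpeech2-style recipe extracts a log mel-spectrogram, frame-level F0, and frame-level energy. The sketch below shows one such recipe with librosa; the sampling rate, hop size, and FFT size are assumptions, not the repository's settings.

    # Hedged sketch of typical FastSpeech2-style feature extraction.
    # Hop/FFT sizes and the 16 kHz rate are assumptions, not the repo's settings.
    import librosa
    import numpy as np

    def extract_features(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
        y, _ = librosa.load(wav_path, sr=sr)

        # Log mel-spectrogram: (n_mels, frames)
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
        log_mel = np.log(np.clip(mel, 1e-5, None))

        # Fundamental frequency via pYIN (NaN for unvoiced frames -> 0)
        f0, _, _ = librosa.pyin(
            y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
            sr=sr, frame_length=n_fft, hop_length=hop_length)
        f0 = np.nan_to_num(f0)

        # Per-frame energy: L2 norm of the STFT magnitude
        stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
        energy = np.linalg.norm(stft, axis=0)

        return log_mel, f0, energy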

Training

Train the rank model and FastSpeech2 model sequentially:

PYTHONPATH=. python rank_model/train.py
PYTHONPATH=. python fastspeech2/train.py
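
The heart of the rank model is a learning-to-rank objective over emotion intensities. Purely as an illustration of that idea (not a transcription of rank_model/train.py), a pairwise margin ranking loss in PyTorch can express it: samples mixed with a larger emotion ratio should receive a higher predicted intensity score.

    import torch
    import torch.nn as nn

    # Illustrative only: if utterance i was mixed with a larger emotion ratio
    # than utterance j, its predicted intensity score should be higher.
    rank_loss = nn.MarginRankingLoss(margin=0.1)

    score_i = torch.randn(8, requires_grad=True)  # stand-ins for model outputs
    score_j = torch.randn(8, requires_grad=True)
    lambda_i = torch.rand(8)                      # mixing ratios used to build each pair
    lambda_j = torch.rand(8)

    target = (lambda_i > lambda_j).float() * 2 - 1  # +1 if i should rank above j, else -1
    loss = rank_loss(score_i, score_j, target)
    loss.backward()
    print(loss.item())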

Inference

Generate speech using the trained models:

PYTHONPATH=. python rank_model/inference.py
PYTHONPATH=. python fastspeech2/inference.py
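
A typical use of the trained pair is sweeping the intensity scale at synthesis time. The snippet below is a hypothetical sketch: synthesize() is a placeholder for whatever text-to-mel entry point fastspeech2/inference.py exposes (here it just returns a random mel so the sketch runs), and the vocoder is the SpeechBrain HiFi-GAN from the setup step.

    import soundfile as sf
    import torch
    from speechbrain.pretrained import HIFIGAN  # speechbrain.inference.vocoders in >= 1.0

    def synthesize(text: str, emotion: str, alpha: float) -> torch.Tensor:
        """Hypothetical placeholder for the repository's text-to-mel call.
        Returns a random (1, 80, frames) mel so the sketch runs end to end."""
        return torch.randn(1, 80, 120)

    hifi_gan = HIFIGAN.from_hparams(
        source="speechbrain/tts-hifigan-libritts-16kHz",
        savedir="/workspace/pretrained_models/tts-hifigan-libritts-16kHz",
    )

    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        mel = synthesize("I am so happy to see you.", emotion="Amused", alpha=alpha)
        wav = hifi_gan.decode_batch(mel).squeeze().cpu().numpy()
        sf.write(f"amused_alpha_{alpha:.2f}.wav", wav, 16000)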

Results

1. t-SNE Visualization of Intensity Representations

This plot shows a t-SNE projection of the intensity representations learned by the RankModel. Each point corresponds to a sentence-level representation, color-coded by emotion label (e.g., Angry, Neutral, Amused).

From the plot, we observe that emotional utterances are well separated in the latent space, indicating that the intensity extractor captures emotion-specific characteristics. Notably, the neutral and sleepiness samples form distinct clusters, supporting the model's ability to generalize emotion intensity.
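
The projection can be reproduced with scikit-learn's t-SNE once the sentence-level representations and their labels are exported; the .npy file names below are placeholders.

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.manifold import TSNE

    # Placeholders for wherever the intensity embeddings and labels are saved.
    embeddings = np.load("intensity_embeddings.npy")            # (num_utterances, dim)
    labels = np.load("emotion_labels.npy", allow_pickle=True)   # (num_utterances,)

    points = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)

    for emotion in np.unique(labels):
        mask = labels == emotion
        plt.scatter(points[mask, 0], points[mask, 1], s=8, label=str(emotion))

    plt.legend()
    plt.title("t-SNE of intensity representations")
    plt.savefig("tsne_intensity.png", dpi=150)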

2. Predicted Mel-Spectrograms of FastSpeech2 (Epoch 20, Batch Size 8)

Below is a comparison between predicted and ground-truth mel-spectrograms for randomly sampled utterances at epoch 20.

[Figure: predicted vs. ground-truth mel-spectrograms]

  • Top 8: Predicted Mel-Spectrograms
  • Bottom 8: Ground Truth Mel-Spectrograms

We observe that the model captures the overall prosody and spectral shape well. However, subtle mismatches in pitch contour and energy levels still exist, especially in high-emotion utterances. Improvements are expected with additional fine-tuning or by incorporating emotion intensity explicitly.
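
A figure like the one above can be assembled with matplotlib from saved mel arrays; the .npy file names below are placeholders.

    import matplotlib.pyplot as plt
    import numpy as np

    # Placeholders for eight predicted and eight ground-truth mels, each (80, frames).
    pred = [np.load(f"pred_mel_{i}.npy") for i in range(8)]
    gt = [np.load(f"gt_mel_{i}.npy") for i in range(8)]

    fig, axes = plt.subplots(4, 4, figsize=(16, 8))
    for i, ax in enumerate(axes.flat):
        mel = pred[i] if i < 8 else gt[i - 8]   # top 8 predicted, bottom 8 ground truth
        ax.imshow(mel, origin="lower", aspect="auto", interpolation="none")
        ax.set_title("Predicted" if i < 8 else "Ground truth", fontsize=8)
        ax.axis("off")

    plt.tight_layout()
    plt.savefig("melspectrogram.png", dpi=150)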


Reference

Wang, S., Guðnason, J., & Borth, D. (2023). Fine-grained emotional control of text-to-speech: Learning to rank inter- and intra-class emotion intensities. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1-5). IEEE.


Acknowledgements
