This repository contains code for training and inference of multi-speaker and single-speaker speech synthesis models using Prosody2Vec. The repo is conceptually based on the paper at https://arxiv.org/pdf/2212.06972 and extends it with multi-speaker prosody conversion.
- Clone the repository:

      git clone https://github.com/yourusername/Prosody2Vec.git
      cd Prosody2Vec
- Install the required dependencies:

      pip install -r requirements.txt
- Place your dataset in the `Emotion Speech Dataset` directory.
- Ensure the dataset is organized in subdirectories for each emotion and speaker.
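The exact folder layout is not spelled out above, so here is a minimal Python sketch of one plausible arrangement (one folder per speaker, with one sub-folder per emotion); the `Emotion Speech Dataset` path and the nesting order are assumptions, so adjust them to match your copy of the data:

```python
# Hypothetical layout check. Assumes:
#   Emotion Speech Dataset/<speaker>/<emotion>/*.wav
# Swap the loop order if your copy nests emotion above speaker.
from pathlib import Path

root = Path("Emotion Speech Dataset")
for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    for emotion_dir in sorted(p for p in speaker_dir.iterdir() if p.is_dir()):
        n_wavs = len(list(emotion_dir.glob("*.wav")))
        print(f"speaker={speaker_dir.name}  emotion={emotion_dir.name}  files={n_wavs}")
```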
The models use a combination of pre-trained models from https://github.com/bshall/acoustic-model/releases/tag/v0.1 and custom layers for speech synthesis. The main components include:
- Encoder: Extracts features from the input speech.
- Decoder: Generates the output speech from the encoded features.
- Fusion Layers: Combine features from different sources (e.g., emotion vectors, speaker vectors).
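The repository's actual layer definitions are not reproduced here, but as an illustration of the fusion step, the following is a minimal PyTorch sketch that concatenates frame-level content features with utterance-level emotion and speaker vectors; the class name, dimensions, and projection are illustrative assumptions, not the repo's real code:

```python
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    """Illustrative fusion block: concatenates content, emotion, and speaker
    features along the channel axis and projects back to the decoder size."""

    def __init__(self, content_dim=256, emotion_dim=128, speaker_dim=256, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(content_dim + emotion_dim + speaker_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, content, emotion, speaker):
        # content: (batch, frames, content_dim)
        # emotion / speaker: (batch, dim) utterance-level vectors, broadcast over frames
        frames = content.size(1)
        emotion = emotion.unsqueeze(1).expand(-1, frames, -1)
        speaker = speaker.unsqueeze(1).expand(-1, frames, -1)
        return self.proj(torch.cat([content, emotion, speaker], dim=-1))

# Example: fuse 100 frames of content features with utterance-level vectors.
fused = FusionLayer()(torch.randn(2, 100, 256), torch.randn(2, 128), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 100, 256])
```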
This project uses pre-trained models from the following repositories:

- https://github.com/bshall/acoustic-model
We thank the authors of these repositories for their contributions to the community.
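The v0.1 release linked above comes from the bshall/acoustic-model project. Assuming its upstream torch.hub interface carries over unchanged to this repo (it may instead load the release checkpoint directly), pulling in a pre-trained acoustic model looks roughly like this:

```python
import torch

# Assumption: the checkpoint follows the upstream bshall/acoustic-model
# torch.hub entry points ("hubert_soft" / "hubert_discrete"); treat this
# as a rough sketch rather than this repo's documented loading path.
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_soft")
acoustic.eval()

# Dummy soft speech units: (batch, frames, unit_dim) as in the upstream README.
units = torch.randn(1, 100, 256)
with torch.inference_mode():
    mel = acoustic.generate(units)  # predicted mel-spectrogram
print(mel.shape)
```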