You can install the Python dependencies with
pip3 install -r requirements.txt
The raw data can be downloaded from here (password: vvhn).
There are two ways to obtain the features of the V2C dataset: 1) directly download the precomputed features from here, or 2) process the features yourself.
Please download all the feature archives (.zip) and JSON files from here and unzip them into the folder "./preprocessed_data/MovieAnimation".
First, run
python3 prepare_align.py config/MovieAnimation/preprocess.yaml
to prepare the text and audio files for alignment.
As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments of the supported datasets are provided here. Unzip the files into "preprocessed_data/MovieAnimation/TextGrid/".
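The TextGrid files store, for each phoneme, its start and end time in seconds. Below is a minimal pure-Python sketch of how such a file can be parsed and the interval boundaries converted into mel-frame durations. It is only an illustration: the repo's preprocessing typically relies on a TextGrid library, and the sampling rate (22050 Hz) and hop length (256) are assumed values from a typical preprocess.yaml, not confirmed from this repo.

```python
import re

def parse_intervals(textgrid_str, tier="phones"):
    """Extract (xmin, xmax, label) triples from one tier of a TextGrid."""
    # Cut out the requested tier: everything from its name to the next item or EOF.
    tier_match = re.search(
        r'name = "%s".*?(?=item \[|\Z)' % tier, textgrid_str, re.S)
    if tier_match is None:
        return []
    block = tier_match.group(0)
    # Each interval is three consecutive lines: xmin, xmax, text.
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"')
    return [(float(a), float(b), t) for a, b, t in pattern.findall(block)]

def to_frames(intervals, sampling_rate=22050, hop_length=256):
    """Convert interval boundaries (seconds) to per-phoneme mel-frame counts."""
    durations = []
    for xmin, xmax, _ in intervals:
        start = round(xmin * sampling_rate / hop_length)
        end = round(xmax * sampling_rate / hop_length)
        durations.append(end - start)
    return durations
```

The frame durations produced this way are what a FastSpeech2-style duration predictor is trained against.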
After that, run the preprocessing script by
python3 preprocess.py config/MovieAnimation/preprocess.yaml
Alternatively, you can align the corpus yourself. Download the official MFA package and run
./montreal-forced-aligner/bin/mfa_align raw_data/MovieAnimation/ lexicon/librispeech-lexicon.txt english preprocessed_data/MovieAnimation
or
./montreal-forced-aligner/bin/mfa_train_and_align raw_data/MovieAnimation/ lexicon/librispeech-lexicon.txt preprocessed_data/MovieAnimation
to align the corpus and then run the preprocessing script.
python3 preprocess.py config/MovieAnimation/preprocess.yaml
Then run the speaker encoder and emotion encoder to extract the corresponding features:
python ./speaker_encoder/speaker_encoder.py
python ./emotion_encoder/video_features/emotion_encoder.py
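Both encoders reduce a variable-length sequence of per-frame features to a single utterance-level vector. A minimal sketch of the common pooling convention (mean over frames followed by L2 normalization, as in GE2E-style speaker encoders) is shown below; the actual encoders in this repo are learned networks, so this is only illustrative:

```python
import math

def pool_embedding(frame_embs):
    """Mean-pool per-frame embeddings into one utterance-level vector,
    then L2-normalize it (a common convention for speaker embeddings)."""
    n = len(frame_embs)
    dim = len(frame_embs[0])
    mean = [sum(f[i] for f in frame_embs) / n for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]
```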
Download the checkpoint 900000.pth.tar from here and put it in "./output/ckpt/MovieAnimation/".
Train your model with
python3 train.py --restore_step 900000 -p config/MovieAnimation/preprocess.yaml -m config/MovieAnimation/model.yaml -t config/MovieAnimation/train.yaml -p2 config/MovieAnimation/preprocess.yaml
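The command above passes one config file per concern (preprocessing, model, training). A minimal argparse mirror of that interface is sketched below; the long-option names are inferred from the flags and may differ from what train.py actually defines:

```python
import argparse

# Hypothetical mirror of the training command's flags
# (names inferred from the command line above, not from train.py itself).
parser = argparse.ArgumentParser()
parser.add_argument("--restore_step", type=int, default=0)
parser.add_argument("-p", "--preprocess_config", type=str, required=True)
parser.add_argument("-m", "--model_config", type=str, required=True)
parser.add_argument("-t", "--train_config", type=str, required=True)
parser.add_argument("-p2", "--preprocess_config2", type=str)

args = parser.parse_args([
    "--restore_step", "900000",
    "-p", "config/MovieAnimation/preprocess.yaml",
    "-m", "config/MovieAnimation/model.yaml",
    "-t", "config/MovieAnimation/train.yaml",
    "-p2", "config/MovieAnimation/preprocess.yaml",
])
```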
Quick evaluation: set "quick_eval = True" in evaluate.py to evaluate only 32 samples.
Full evaluation: set "quick_eval = False" in evaluate.py to evaluate all samples.
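The quick_eval switch amounts to capping how many samples the evaluation loop sees. A minimal sketch of that behavior (illustrative only; evaluate.py may select its subset differently):

```python
def select_eval_samples(dataset, quick_eval=True, quick_size=32):
    """Evaluate only the first `quick_size` samples when quick_eval is True,
    otherwise evaluate the full dataset."""
    return dataset[:quick_size] if quick_eval else dataset
```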
# TensorBoard
Use
tensorboard --logdir output/log/MovieAnimation
to serve TensorBoard on your localhost. The loss curves, MCD curves, synthesized mel-spectrograms, and audio samples are shown.
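The MCD curves track mel-cepstral distortion, a standard dB-scale distance between reference and synthesized cepstral frames. A minimal sketch of the standard formula, MCD = (10 / ln 10) * sqrt(2 * Σ_d (c_d − c'_d)²), is below; the repo's evaluate.py may differ in details such as time alignment (e.g. DTW) and which coefficients are included:

```python
import math

def mcd_frame(c_ref, c_syn):
    """MCD (dB) between two cepstral frames, excluding the 0th (energy)
    coefficient: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(c_ref[1:], c_syn[1:]))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq)

def mcd(ref_frames, syn_frames):
    """Average frame-level MCD over a (time-aligned) utterance."""
    vals = [mcd_frame(r, s) for r, s in zip(ref_frames, syn_frames)]
    return sum(vals) / len(vals)
```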