[RecurrentNN × Regression × Regularized]-based Mouth Opening Estimation via SSL
- Install PyTorch from official instructions: https://pytorch.org/get-started/locally/
- Install dependencies:

```bash
pip install -r requirements.txt
```
- Collect data using LipsSync. Directory structure:

```
2025-02-04_22-01-52/
    audio.wav
    mouth_data.csv
2025-02-04_22-43-56/
    audio.wav
    mouth_data.csv
valid.txt
```

- Prepare a seen validation set (in-distribution speakers) and an unseen validation set (out-of-distribution speakers)
- Add audio paths to `valid.txt`
- For SSL: prepare unlabeled vocal-only audio with an intact spectrum below 16 kHz (a sketch for screening files follows the preprocessing commands below)
- Run preprocessing:

```bash
# Labeled data
python recipes/mouth_opening/preprocess.py <SOURCE_DIR> <TARGET_DIR>

# Unlabeled data (SSL)
python recipes/mouth_opening/preprocess_unlabel.py <SOURCE_DIR> <TARGET_DIR>
```
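The "intact spectrum below 16 kHz" requirement rules out low-sample-rate recordings and heavily low-passed audio (e.g. from lossy re-encoding). A minimal screening sketch, assuming `soundfile` and `numpy` are installed; the 16 kHz cutoff comes from the requirement above, but the 12–16 kHz band and the energy-ratio threshold are arbitrary illustrations, not part of this repo:

```python
# screen_unlabeled.py -- rough spectral screen for SSL candidate audio (sketch, not part of this repo)
import sys
import numpy as np
import soundfile as sf

def spectrum_looks_intact(path, cutoff_hz=16000, ratio_threshold=1e-4):
    """Return True if the file is sampled high enough and still has
    energy close to the cutoff (i.e. no obvious low-pass wall)."""
    audio, sr = sf.read(path, always_2d=True)
    mono = audio.mean(axis=1)
    if sr < 2 * cutoff_hz:                      # Nyquist must reach 16 kHz
        return False
    spec = np.abs(np.fft.rfft(mono)) ** 2       # power spectrum
    freqs = np.fft.rfftfreq(len(mono), d=1.0 / sr)
    band = spec[(freqs >= 0.75 * cutoff_hz) & (freqs < cutoff_hz)]
    # compare energy just below the cutoff against the total energy
    return band.sum() / (spec.sum() + 1e-12) > ratio_threshold

if __name__ == "__main__":
    for wav in sys.argv[1:]:
        print(wav, "ok" if spectrum_looks_intact(wav) else "suspect")
```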
Run training:

```bash
python train.py --exp_name <EXP_NAME> --dataset <DATA_PATH> --gpu <GPU_ID>
```

View all options with `python train.py --help`. Variants:

- `train_r_drop.py` (R-Drop regularization)
- `train_mse.py` (MSE loss)
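For context, R-Drop (cited under the core references below) runs the same batch through the network twice with independent dropout masks and penalizes the disagreement between the two predictions in addition to the task loss. A minimal regression-flavored sketch in PyTorch; the L1 task loss, MSE consistency term, and weighting are illustrative assumptions, not the actual code in `train_r_drop.py`:

```python
import torch
import torch.nn.functional as F

def r_drop_step(model, features, targets, alpha=1.0):
    """One R-Drop-style training step for a regression model (illustrative sketch).

    The batch is forwarded twice; dropout inside `model` produces two different
    predictions, and their disagreement is penalized alongside the task loss.
    """
    pred_a = model(features)                    # first pass, dropout mask A
    pred_b = model(features)                    # second pass, dropout mask B
    task_loss = 0.5 * (F.l1_loss(pred_a, targets) + F.l1_loss(pred_b, targets))
    consistency = F.mse_loss(pred_a, pred_b)    # regression analogue of R-Drop's KL term
    return task_loss + alpha * consistency
```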
SSL training command:

```bash
python train_ssl.py --exp_name <EXP_NAME> --dataset <DATA_PATH> --unlabel_dataset <UNLABEL_PATH> --gpu <GPU_ID>
```

Prerequisites:
- Create `valid2.txt` with unseen validation paths
- `--conv_dropout` must be non-zero
- Use 10+ hours of seen data
- Prepare 50+ hours of unlabeled data
- Tested datasets:
  - Labeled: mouth opening research project
  - MultiModal: Acappella, GRID, URSing
  - Unlabeled: PopBuTFy from NeuralSVB, PopCS from DiffSinger, M4Singer, Jingju a Cappella Recordings Collection, tiny-singing-voice-database, OpenSinger, GTSinger
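The core references below (Temporal Ensembling, Mean Teacher) train on unlabeled data with a consistency objective: the student is pushed to agree with a perturbed or weight-averaged copy of itself, which relies on stochastic perturbation such as dropout and is plausibly why `--conv_dropout` must be non-zero. A minimal Mean-Teacher-style sketch in PyTorch; the EMA decay, L1 task loss, consistency weighting, and all names are illustrative assumptions, not the actual code in `train_ssl.py`:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Move the teacher's weights toward the student's (Mean Teacher EMA)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

def ssl_step(student, teacher, labeled, unlabeled, optimizer, w_cons=1.0):
    """One semi-supervised step: supervised loss on labeled data plus a
    consistency loss that pulls the student toward the EMA teacher's
    predictions on unlabeled data (illustrative sketch only)."""
    feats, targets = labeled

    sup_loss = F.l1_loss(student(feats), targets)        # supervised regression loss
    with torch.no_grad():
        teacher_pred = teacher(unlabeled)                # consistency target
    cons_loss = F.mse_loss(student(unlabeled), teacher_pred)

    loss = sup_loss + w_cons * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)                         # teacher tracks the student
    return loss.item()

# The teacher starts as a frozen copy of the student, e.g.:
#   teacher = copy.deepcopy(student)
#   for p in teacher.parameters():
#       p.requires_grad_(False)
```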
Run evaluation:

```bash
python eval.py --model <model_path> --wav <wav_path>
```
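If many recordings need to be scored, the single-file command above can simply be looped over a directory. A small sketch using only the Python standard library (the directory layout and `.wav` naming are assumptions; output handling is left to `eval.py` itself):

```python
# batch_eval.py -- run eval.py over every .wav in a directory (sketch)
import subprocess
import sys
from pathlib import Path

model_path, wav_dir = sys.argv[1], Path(sys.argv[2])

for wav in sorted(wav_dir.glob("*.wav")):
    # same interface as the single-file command above
    subprocess.run(
        ["python", "eval.py", "--model", model_path, "--wav", str(wav)],
        check=True,
    )
```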
Acknowledgments:

- Framework cloned from GeneralCurveEstimator
- Training code adapted from vocal-remover
- Early model reference: FCPE
- SSL inspiration: SOFA
- Core references:
  - R-Drop: Regularized Dropout for Neural Networks [CODE]
  - Temporal Ensembling for Semi-Supervised Learning [CODE]
  - Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results [CODE]
- Partial dataset references:
  - Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). The Grid Audio-Visual Speech Corpus (1.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3625687
  - Li, B., Wang, Y., & Duan, Z. (2021). Audiovisual singing voice separation. Transactions of the International Society for Music Information Retrieval, 4(1), pp. 195–209. DOI: http://doi.org/10.5334/tismir.108
  - Gong, R., Caro, R., Yang, Y., & Serra, X. (2022). Jingju a Cappella Recordings Collection (2.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6536490
  - Zhang, L., Li, R., Wang, S., Deng, L., Liu, J., Ren, Y., He, J., Huang, R., Zhu, J., Chen, X., & Zhao, Z. (2022). M4Singer: A multi-style, multi-singer and musical score provided Mandarin singing corpus [Data set]. Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Data collection tool: LipsSync
- Visualization tool: lips-sync-visualizer
- .ass mask tools: mask_fix_tools
- Data expansion initiative: DiffSinger Discussion
