This is the official repository of the IEEE SLT 2024 paper *Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT*.
```sh
sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.18 pip=24.0 faiss-gpu=1.11.0
conda activate py310
pip install -r requirements/requirements.txt
sh scripts/setup.sh
```
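Optionally, you can sanity-check the environment before moving on. A minimal sketch (not part of the repo) that verifies PyTorch sees a GPU and that the `faiss-gpu` build imports:

```python
import faiss
import torch

# Both checks should succeed on a CUDA machine after the setup above.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss:", faiss.__version__, "| GPUs visible to faiss:", faiss.get_num_gpus())
```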
```python
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert import S5HubertForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from the Hugging Face Hub
encoder = S5HubertForSyllableDiscovery.from_pretrained("ryota-komatsu/s5-hubert", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/s5-hubert-decoder", device_map="cuda")

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode the waveform into syllabic units
outputs = encoder(waveform.cuda())

# syllabic units, e.g. [3950, 67, ..., 503]
units = outputs[0]["units"]
units = units.unsqueeze(0) + 1  # shift by one; unit 0 is the padding index

# unit-to-speech synthesis
audio_values = decoder(units)
```
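To listen to the result, you can write the synthesized waveform to disk. A minimal sketch, assuming `audio_values` is a float waveform tensor and that the vocoder output rate is 16 kHz (both are assumptions; check the decoder's model card):

```python
import torchaudio

OUTPUT_SR = 16000  # assumed output sample rate; verify against the model card

# torchaudio.save expects a (channels, samples) tensor on the CPU.
audio = audio_values.squeeze(0).cpu()
if audio.dim() == 1:
    audio = audio.unsqueeze(0)
torchaudio.save("resynthesized.wav", audio, OUTPUT_SR)
```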
A Google Colab demo is available here.
You can download a pretrained model from Hugging Face.
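If you prefer to fetch the checkpoints ahead of time (e.g., on a machine that will later run offline), a minimal sketch using `huggingface_hub` with the same model IDs as above:

```python
from huggingface_hub import snapshot_download

# Cache both checkpoints in the local Hugging Face cache.
snapshot_download("ryota-komatsu/s5-hubert")
snapshot_download("ryota-komatsu/s5-hubert-decoder")
```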
You can download datasets under `dataset_root`.

```sh
dataset_root=data  # be consistent with dataset.root in a config file

sh scripts/download_librispeech.sh ${dataset_root}
sh scripts/download_libritts.sh ${dataset_root}
sh scripts/download_librilight.sh ${dataset_root}  # 7TB
sh scripts/download_slm21.sh  # download sWUGGY and sBLIMP
```
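Libri-Light alone takes roughly 7 TB, so it is worth confirming free space under the dataset root first. A minimal sketch (the `data` path mirrors `dataset_root=data` above):

```python
import shutil

# Report free space on the filesystem holding the dataset root.
usage = shutil.disk_usage("data")
print(f"free: {usage.free / 1e12:.2f} TB")
```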
> [!TIP]
> If you already have LibriSpeech, you can use it by editing a config file:

```yaml
dataset:
  root: "/path/to/LibriSpeech/root"  # ${dataset.root}/LibriSpeech/train-clean-100, train-clean-360, ...
```
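If you prefer to patch the config from a script, a minimal sketch assuming the configs are plain YAML files (the round-trip drops comments, so hand-editing is safer if you want to keep them):

```python
import yaml

config_path = "configs/speech2unit/default.yaml"
with open(config_path) as f:
    config = yaml.safe_load(f)

# Point dataset.root at an existing LibriSpeech root.
config["dataset"]["root"] = "/path/to/LibriSpeech/root"

with open(config_path, "w") as f:
    yaml.safe_dump(config, f)
```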
Check the directory structure:

```
${dataset.root}/  # dataset.root in a config file
└── LibriSpeech/
    ├── train-clean-100/
    ├── train-clean-360/
    ├── train-other-500/
    ├── dev-clean/
    ├── dev-other/
    ├── test-clean/
    ├── test-other/
    └── SPEAKERS.TXT
```
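A quick way to verify the layout, as a minimal sketch (`data` again stands in for `${dataset.root}`):

```python
from pathlib import Path

root = Path("data/LibriSpeech")  # ${dataset.root}/LibriSpeech
expected = [
    "train-clean-100", "train-clean-360", "train-other-500",
    "dev-clean", "dev-other", "test-clean", "test-other",
]
for name in expected:
    status = "ok" if (root / name).is_dir() else "MISSING"
    print(f"{status:7s} {root / name}")
```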
Run the speech-to-unit pipeline:

```sh
python main_speech2unit.py --config configs/speech2unit/default.yaml
```

To run only a sub-task (`train`, `syllable_segmentation`, `quantize`, or `evaluate`), specify it as an argument:

```sh
python main_speech2unit.py train --config configs/speech2unit/default.yaml
```
Train the unit-to-speech decoder (flow matching):

```sh
python main_unit2speech.py train_flow_matching --config=configs/unit2speech/default.yaml
```
Train a speech language model (the job script is submitted via `qsub`):

```sh
GROUP_NAME=
qsub -g ${GROUP_NAME} scripts/run_speechlm.bash configs/speechlm/default.yaml
```

Evaluate it on sWUGGY and sBLIMP:

```sh
python main_speechlm.py evaluate --config=configs/speechlm/default.yaml
```
```bibtex
@inproceedings{Komatsu_Self-Supervised_Syllable_Discovery_2024,
  author    = {Komatsu, Ryota and Shinozaki, Takahiro},
  title     = {Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT},
  year      = {2024},
  month     = {Dec.},
  booktitle = {IEEE Spoken Language Technology Workshop},
  pages     = {1131--1136},
  doi       = {10.1109/SLT61566.2024.10832325},
}
```