This is the official repository of the IEEE SLT 2024 paper *Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT*.
```sh
sudo apt install git-lfs  # for UTMOS

conda create -y -n py310 -c pytorch -c nvidia -c conda-forge python=3.10.18 pip=24.0 faiss-gpu=1.11.0
conda activate py310
pip install -r requirements/requirements.txt
sh scripts/setup.sh
```
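Optionally, you can sanity-check the environment before moving on. A minimal sketch (not part of the repo) that verifies PyTorch sees a GPU and that the `faiss-gpu` build imports:

```python
import faiss
import torch

# Both checks should succeed on a CUDA machine after the setup above.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("faiss:", faiss.__version__, "| GPUs visible to faiss:", faiss.get_num_gpus())
```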
```python
import torchaudio

from src.flow_matching import FlowMatchingWithBigVGan
from src.s5hubert import S5HubertForSyllableDiscovery

wav_path = "/path/to/wav"

# download pretrained models from the Hugging Face Hub
encoder = S5HubertForSyllableDiscovery.from_pretrained("ryota-komatsu/s5-hubert", device_map="cuda")
decoder = FlowMatchingWithBigVGan.from_pretrained("ryota-komatsu/s5-hubert-decoder", device_map="cuda")

# load a waveform
waveform, sr = torchaudio.load(wav_path)
waveform = torchaudio.functional.resample(waveform, sr, 16000)

# encode the waveform into syllabic units
outputs = encoder(waveform.cuda())

# syllabic units, e.g. [3950, 67, ..., 503]
units = outputs[0]["units"]
units = units.unsqueeze(0) + 1  # shift by one; unit 0 is the padding index

# unit-to-speech synthesis
audio_values = decoder(units)
```
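To listen to the result, you can write the synthesized waveform to disk. A minimal sketch, assuming `audio_values` is a float waveform tensor and that the vocoder output rate is 16 kHz (both are assumptions; check the decoder's model card):

```python
import torchaudio

OUTPUT_SR = 16000  # assumed output sample rate; verify against the model card

# torchaudio.save expects a (channels, samples) tensor on the CPU.
audio = audio_values.squeeze(0).cpu()
if audio.dim() == 1:
    audio = audio.unsqueeze(0)
torchaudio.save("resynthesized.wav", audio, OUTPUT_SR)
```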
A Google Colab demo is available here.
You can download a pretrained model from Hugging Face.
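If you prefer to fetch the checkpoints ahead of time (e.g., on a machine that will later run offline), a minimal sketch using `huggingface_hub` with the same model IDs as above:

```python
from huggingface_hub import snapshot_download

# Cache both checkpoints in the local Hugging Face cache.
snapshot_download("ryota-komatsu/s5-hubert")
snapshot_download("ryota-komatsu/s5-hubert-decoder")
```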
You can download datasets under `dataset_root`.

```sh
dataset_root=data  # be consistent with dataset.root in a config file

sh scripts/download_librispeech.sh ${dataset_root}
sh scripts/download_libritts.sh ${dataset_root}
sh scripts/download_librilight.sh ${dataset_root}  # 7TB
sh scripts/download_slm21.sh  # download sWUGGY and sBLIMP
```
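Libri-Light alone takes roughly 7 TB, so it is worth confirming free space under the dataset root first. A minimal sketch (the `data` path mirrors `dataset_root=data` above):

```python
import shutil

# Report free space on the filesystem holding the dataset root.
usage = shutil.disk_usage("data")
print(f"free: {usage.free / 1e12:.2f} TB")
```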
> [!TIP]
> If you already have LibriSpeech, you can use it by editing a config file:

```yaml
dataset:
  root: "/path/to/LibriSpeech/root"  # ${dataset.root}/LibriSpeech/train-clean-100, train-clean-360, ...
```
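If you prefer to patch the config from a script, a minimal sketch assuming the configs are plain YAML files (the round-trip drops comments, so hand-editing is safer if you want to keep them):

```python
import yaml

config_path = "configs/speech2unit/default.yaml"
with open(config_path) as f:
    config = yaml.safe_load(f)

# Point dataset.root at an existing LibriSpeech root.
config["dataset"]["root"] = "/path/to/LibriSpeech/root"

with open(config_path, "w") as f:
    yaml.safe_dump(config, f)
```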
Check the directory structure:

```
${dataset.root}/  # dataset.root in a config file
└── LibriSpeech/
    ├── train-clean-100/
    ├── train-clean-360/
    ├── train-other-500/
    ├── dev-clean/
    ├── dev-other/
    ├── test-clean/
    ├── test-other/
    └── SPEAKERS.TXT
```
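A quick way to verify the layout, as a minimal sketch (`data` again stands in for `${dataset.root}`):

```python
from pathlib import Path

root = Path("data/LibriSpeech")  # ${dataset.root}/LibriSpeech
expected = [
    "train-clean-100", "train-clean-360", "train-other-500",
    "dev-clean", "dev-other", "test-clean", "test-other",
]
for name in expected:
    status = "ok" if (root / name).is_dir() else "MISSING"
    print(f"{status:7s} {root / name}")
```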
Run the speech-to-unit pipeline:

```sh
python main_speech2unit.py --config configs/speech2unit/default.yaml
```

To run only a sub-task (`train`, `syllable_segmentation`, `quantize`, or `evaluate`), specify it as an argument:

```sh
python main_speech2unit.py train --config configs/speech2unit/default.yaml
```
Train the unit-to-speech decoder (flow matching):

```sh
python main_unit2speech.py train_flow_matching --config=configs/unit2speech/default.yaml
```
Train a speech language model (the job script is submitted via `qsub`):

```sh
GROUP_NAME=
qsub -g ${GROUP_NAME} scripts/run_speechlm.bash configs/speechlm/default.yaml
```

Evaluate it on sWUGGY and sBLIMP:

```sh
python main_speechlm.py evaluate --config=configs/speechlm/default.yaml
```
```bibtex
@inproceedings{Komatsu_Self-Supervised_Syllable_Discovery_2024,
  author    = {Komatsu, Ryota and Shinozaki, Takahiro},
  title     = {Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT},
  year      = {2024},
  month     = {Dec.},
  booktitle = {IEEE Spoken Language Technology Workshop},
  pages     = {1131--1136},
  doi       = {10.1109/SLT61566.2024.10832325},
}
```