Codecs_reconstruction_eval is an evaluation toolkit that wraps a broad collection of codec-reconstruction APIs under a single interface, letting you decode audio with one call and compute an extensive set of objective metrics, including:
- PESQ (NB/WB)
- STOI
- Speaker-embedding similarity (SIM)
- Mel-spectrogram loss
- Word-error rate (WER) on LibriSpeech-test-clean
- UMOS
- Usage and entropy
Ready-to-run scripts are provided, and you can define additional metrics in the metrics
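As an illustration of what a metric such as usage and entropy might compute, the sketch below assumes these terms refer to codebook utilization (the fraction of codebook entries that are ever selected) and the Shannon entropy of the token distribution. That interpretation, the function name `codebook_usage_and_entropy`, and its arguments are assumptions made for illustration, not the toolkit's actual implementation.

```python
import math
from collections import Counter


def codebook_usage_and_entropy(codes, codebook_size):
    """Return (usage, entropy_bits) for a flat sequence of integer code indices.

    Hypothetical helper: usage is the fraction of codebook entries that appear
    at least once; entropy is the Shannon entropy (in bits) of the empirical
    distribution over the observed codes.
    """
    if not codes:
        raise ValueError("codes must not be empty")
    counts = Counter(codes)
    total = len(codes)
    usage = len(counts) / codebook_size
    entropy_bits = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return usage, entropy_bits


# Four of eight entries are used, each equally often.
usage, entropy = codebook_usage_and_entropy([0, 1, 2, 3, 0, 1, 2, 3], codebook_size=8)
print(usage, entropy)  # 0.5 2.0
```

A fully used codebook with a uniform token distribution would reach a usage of 1.0 and an entropy of log2(codebook_size) bits.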
Supported models include:
- DAC
- EnCodec
- EnCodec of UniAudio
- WavTokenizer
- SpeechTokenizer
- Mimi
- SemantiCodec
- HiFiCodec
- StableCodec
- FACodec
- BigCodec
- XCodec
- XCodec2
You can define your own model in wrapper.py; it needs to inherit from the AudioTokenizer class and implement the load_model, get_code, and recon_wav methods.
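A minimal sketch of what such a wrapper could look like is given below. The `AudioTokenizer` class here is a hypothetical stand-in so that the example is self-contained; the real base class in wrapper.py may use different method signatures, argument types (most likely tensors rather than lists), and constructor arguments. The toy "codec" is plain uniform quantization, chosen only to show the three required methods.

```python
from abc import ABC, abstractmethod


class AudioTokenizer(ABC):
    """Hypothetical stand-in for the base class in wrapper.py (signatures assumed)."""

    @abstractmethod
    def load_model(self): ...

    @abstractmethod
    def get_code(self, wav): ...

    @abstractmethod
    def recon_wav(self, wav): ...


class ToyUniformCodec(AudioTokenizer):
    """Toy 'codec': uniform scalar quantization of samples in [-1, 1]."""

    def __init__(self, levels=256):
        self.levels = levels
        self.model = None

    def load_model(self):
        # A real wrapper would load checkpoint weights here.
        self.model = "loaded"

    def get_code(self, wav):
        # Map each sample in [-1, 1] to an integer index in [0, levels - 1].
        top = self.levels - 1
        return [min(top, max(0, round((x + 1.0) / 2.0 * top))) for x in wav]

    def recon_wav(self, wav):
        # Encode, then map the indices back to sample values.
        top = self.levels - 1
        return [code / top * 2.0 - 1.0 for code in self.get_code(wav)]


codec = ToyUniformCodec()
codec.load_model()
wav = [-1.0, -0.5, 0.0, 0.5, 1.0]
codes = codec.get_code(wav)
recon = codec.recon_wav(wav)
print(codes)
```

Keeping the three methods separate lets the evaluation scripts load a model once, extract codes for the token-level metrics, and reconstruct audio for the signal-level metrics.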
pip install -r requirements.txt
Install ViSQOL (google/visqol, a perceptual quality estimator for speech and audio) by building it from source with Bazel:
# visqol
bazel-5.3.2-installer-linux-x86_64.sh
git clone https://github.com/google/visqol.git
cd visqol
bazel build :visqol -c opt
The following error may occur:
ImportError: ~/miniconda3/envs/py310/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by ~/miniconda3/envs/py310/lib/python3.10/site-packages/visqol/visqol_lib_py.s)
Refer to the CSDN blog post "Solving the libstdc++.so.6: version 'GLIBCXX_3.4.30' not found problem".
Delete the libstdc++.so.6 in the conda environment's lib directory. To check which copy provides GLIBCXX_3.4.30:
cd ~/miniconda3/envs/py310/lib
strings libstdc++.so.6 | grep GLIBCXX_3.4.30
strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX_3.4.30
export PATH=$PATH:~/bin
First, you can run python wrapper.py to get the bitrate or latent dimension of the codec.
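For codecs built on multiple fixed-size codebooks (for example RVQ-based models), the nominal bitrate follows from the frame rate, the number of codebooks, and the codebook size. The sketch below shows that arithmetic; the function name and the example values are illustrative assumptions, not the output or implementation of wrapper.py.

```python
import math


def bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bits per second = frames per second x codebooks per frame x bits per code."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)


# Illustrative values only: 75 frames/s, 9 codebooks, 1024 entries per codebook.
print(bitrate_bps(75, 9, 1024))  # 6750.0
```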
Many works evaluate speech tokenizers on LibriSpeech test-clean, so we use this dataset as an example. Download test-clean.tar.gz from https://www.openslr.org/12, extract it, and move the test-clean folder to exp_recon/test-clean_flac:
mkdir exp_recon
mv path/to/LibriSpeech/test-clean exp_recon/test-clean_flac
Then convert the audio to WAV format:
python trans_folder_to_wav.py
We explicitly store the resampled 16 kHz audio for evaluation:
python resample_folder.py
During evaluation, each audio clip is resampled to the codec’s sampling rate for reconstruction, and afterward resampled back to 16 kHz for storage and evaluation.
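The round trip described above (16 kHz to the codec's rate, reconstruct, back to 16 kHz) can be sketched as follows. This is only an illustration: it uses naive linear interpolation in pure Python so that it runs without dependencies, whereas a real pipeline would use a band-limited resampler from an audio library. The function names and the identity "codec" are hypothetical.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (no anti-aliasing filter; illustration only)."""
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate  # fractional index into the source
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def reconstruct_at_codec_rate(wav_16k, codec_rate, codec_fn, eval_rate=16000):
    """Resample to the codec's rate, reconstruct, then resample back for evaluation."""
    wav_codec = resample_linear(wav_16k, eval_rate, codec_rate)
    recon = codec_fn(wav_codec)
    return resample_linear(recon, codec_rate, eval_rate)


clip = [i / 160 for i in range(160)]  # 10 ms ramp at 16 kHz
out = reconstruct_at_codec_rate(clip, 24000, codec_fn=lambda w: w)  # identity "codec"
print(len(clip), len(out))  # 160 160
```

Because the reconstructed audio ends up at 16 kHz again, it can be compared directly against the stored 16 kHz reference clips.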
Run recon_folder.sh (or recon_folder_multi.sh if you have multiple GPUs) to reconstruct the audio clips in exp_recon/test-clean using your codec.
You will get a folder structure as shown below:
exp_recon/
├── DAC_24k_9 # Reconstructed audio using DAC (24kHz) model with 9 RVQ codebooks
├── test-clean # Original audio clips (original sampling rate)
├── test-clean_16000 # Resampled original audio clips at 16 kHz
└── test-clean_flac # Original FLAC files from LibriSpeech
For pairwise metrics such as PESQ, STOI, and mel distance, which compare two audio folders, both folders should have the same sampling rate.
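A quick way to confirm this before computing metrics is to read the sample rate from each file's header. The sketch below uses Python's standard `wave` module, which handles PCM WAV files only; the helper names are hypothetical and not part of the toolkit.

```python
import wave
from pathlib import Path


def wav_sample_rates(folder):
    """Return the set of sample rates found among .wav files under folder."""
    rates = set()
    for path in sorted(Path(folder).rglob("*.wav")):
        with wave.open(str(path), "rb") as f:
            rates.add(f.getframerate())
    return rates


def same_sample_rate(folder_a, folder_b):
    """True only if both folders contain WAV files at one shared sample rate."""
    rates_a = wav_sample_rates(folder_a)
    rates_b = wav_sample_rates(folder_b)
    return len(rates_a) == 1 and rates_a == rates_b
```

For example, `same_sample_rate("exp_recon/test-clean_16000", "exp_recon/DAC_24k_9")` should return True if the reconstructed clips were stored at 16 kHz as described above.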
The pairwise metrics are run with:
bash run_pesq_stoi.sh
bash run_mel_stft.sh
For the other metrics, run:
bash run_usage.sh
bash run_entropy.sh
bash run_wer.sh
bash run_umos.sh
bash run_spk.sh
This toolkit reuses code cloned directly from the following projects to simplify setup: