🎧 Welcome to AudioCodecBench 🎵

AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation

AudioCodecBench provides a comprehensive assessment of codec capabilities across four dimensions: audio reconstruction metrics, codebook index (ID) stability, decoder-only Transformer perplexity, and performance on downstream probe tasks. Our results support the suitability of the proposed definitions and reveal correlations among reconstruction metrics, codebook ID stability, downstream probe tasks, and perplexity.
arXiv Paper: AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation

⛰️ Purpose

  1. Evaluate codebook quality (for language-model modeling)
  2. Collect all existing reconstruction metrics
  3. Collect all existing linear-probing metrics (music and speech)

🧭 Env Build

The following explains how to quickly create the required environment and install codec_evaluation for use.

Setup environment and dependencies

We strongly recommend using conda to manage your Python environment.

  • Create a virtual environment using conda.

     # create a virtual environment using conda
     conda create -n codec_eval python=3.10 -y	# Python 3.10 is recommended.
     conda activate codec_eval
    
  • Install codec_evaluation from source

     git clone https://github.com/wuzhiyue111/Codec-Evaluation.git
     cd Codec-Evaluation
     bash env_build.sh
    

📏 Usage

The following introduces how to run evaluations with codecs and on downstream tasks. For details, please refer to the instruction documents. [EN][ZH]

🗺️ Dataset Download

Dataset download address: AudioCodecBench-Dataset

🧰 Probe

🖊️ Probe task results

🔖 Reconstruction Metric

Speech

| Codec | PESQ | Speaker_Sim | WER_GT | WER_REC | CER_GT | CER_REC | STOI | VISQOL | Mel distance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC | 3.69 | 0.965 | 0.155 | 0.202 | 0.09 | 0.125 | 0.94 | 4.51 | 0.21 |
| Encodec | 3.21 | 0.919 | 0.155 | 0.198 | 0.09 | 0.114 | 0.93 | 4.37 | 0.31 |
| Mimi | 2.77 | 0.928 | 0.155 | 0.287 | 0.09 | 0.173 | 0.88 | 3.84 | 0.38 |
| SemantiCodec | 2.64 | 0.907 | 0.155 | 0.318 | 0.09 | 0.195 | 0.86 | 4.04 | 0.32 |
| WavTokenizer | 2.17 | 0.743 | 0.155 | 0.494 | 0.09 | 0.325 | 0.83 | 3.43 | 0.68 |
| SpeechTokenizer | 2.97 | 0.924 | 0.155 | 0.216 | 0.09 | 0.120 | 0.89 | 4.22 | 0.25 |
| XCodec | 3.23 | 0.942 | 0.155 | 0.185 | 0.09 | 0.106 | 0.91 | 4.34 | 0.24 |
| YuE | 3.17 | 0.938 | 0.155 | 0.195 | 0.09 | 0.113 | 0.90 | 4.33 | 0.25 |
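The WER/CER columns compare ASR transcripts of the ground-truth audio (`_GT`) and of the codec reconstruction (`_REC`) against the reference text. For readers who want to reproduce the metric itself, here is a minimal pure-Python sketch of WER/CER via Levenshtein edit distance (the ASR front end used by the benchmark is not shown here):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (1-D DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # diagonal cell d[i-1][j-1]
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]       # cell d[i-1][j]
            d[j] = min(d[j] + 1,          # deletion
                       d[j - 1] + 1,      # insertion
                       prev + (r != h))   # substitution (0 if tokens match)
            prev = cur
    return d[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / max(len(ref), 1)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)
```

For example, `wer("a b c", "a x c")` is 1/3 (one substitution over three reference words).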

Music

| Codec | PESQ | STOI | VISQOL | Mel distance |
| --- | --- | --- | --- | --- |
| DAC | 2.66 | 0.86 | 4.40 | 0.73 |
| Encodec | 2.27 | 0.85 | 4.25 | 0.78 |
| SemantiCodec | 1.32 | 0.60 | 4.19 | 0.98 |
| WavTokenizer | 1.14 | 0.49 | 3.84 | 1.15 |
| XCodec | 1.85 | 0.76 | 4.35 | 0.91 |
| YuE | 1.84 | 0.75 | 4.35 | 0.90 |
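The Mel distance above is a spectral reconstruction measure (lower is better). A common formulation — and only an assumption about the exact definition used in the benchmark — is the mean L1 distance between log-mel spectrograms of the original and reconstructed waveforms. A minimal NumPy-only sketch:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel(x, sr=16000, n_fft=512, hop=128, n_mels=40):
    """Log-mel spectrogram from a Hann-windowed power STFT."""
    window = np.hanning(n_fft)
    frames = np.stack([x[i:i + n_fft] * window
                       for i in range(0, len(x) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-8)

def mel_distance(ref, rec, **kw):
    """Mean L1 distance between log-mel spectrograms (one common definition)."""
    a, b = log_mel(ref, **kw), log_mel(rec, **kw)
    n = min(len(a), len(b))
    return float(np.mean(np.abs(a[:n] - b[:n])))
```

Identical waveforms give a distance of exactly 0; larger values indicate stronger spectral degradation.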

🔖 Probe Experiment

Music Probe

| Codec | Mode | emomusic A | emomusic V | GTZAN Acc | MTT AP | MTT AUCROC | NSynthI Acc | NSynthP Acc | VocalSetSinger Acc | VocalSetTech Acc | GS Acc | MTGGenre AP | MTGGenre AUCROC | MTGInstrument AP | MTGInstrument AUCROC | MTGMoodtheme AP | MTGMoodtheme AUCROC | MTGTop50 AP | MTGTop50 AUCROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC | quantized_emb | 0.470 | 0.064 | 0.575 | 0.203 | 0.785 | 0.602 | 0.468 | 0.419 | 0.376 | 0.088 | 0.0295 | 0.530 | 0.108 | 0.638 | 0.076 | 0.651 | 0.141 | 0.687 |
| Encodec | quantized_emb | 0.467 | 0.066 | 0.570 | 0.184 | 0.759 | 0.537 | 0.547 | 0.299 | 0.301 | 0.102 | 0.035 | 0.528 | 0.104 | 0.620 | 0.057 | 0.642 | 0.137 | 0.701 |
| SemantiCodec | quantized_emb | 0.507 | 0.316 | 0.703 | 0.318 | 0.877 | 0.658 | 0.764 | 0.344 | 0.451 | 0.343 | 0.035 | 0.526 | 0.149 | 0.720 | 0.099 | 0.723 | 0.230 | 0.795 |
| WavTokenizer | quantized_emb | 0.455 | 0.066 | 0.423 | 0.168 | 0.739 | 0.537 | 0.444 | 0.130 | 0.287 | 0.093 | 0.034 | 0.530 | 0.107 | 0.635 | 0.056 | 0.627 | 0.137 | 0.698 |
| Xcodec | quantized_emb | 0.553 | 0.143 | 0.664 | 0.323 | 0.873 | 0.640 | 0.905 | 0.537 | 0.570 | 0.455 | 0.034 | 0.519 | 0.164 | 0.707 | 0.101 | 0.710 | 0.216 | 0.777 |
| YuE | quantized_emb | 0.573 | 0.156 | 0.669 | 0.315 | 0.870 | 0.622 | 0.896 | 0.523 | 0.594 | 0.454 | 0.034 | 0.517 | 0.133 | 0.700 | 0.102 | 0.711 | 0.191 | 0.758 |

Music Probe codebook0

| Codec | emomusic A | emomusic V | GTZAN Acc | MTT AP | MTT AUCROC | NSynthI Acc | VocalSetSinger Acc | VocalSetTech Acc | GS Acc | MTGInstrument AP | MTGInstrument AUCROC | MTGTop50 AP | MTGTop50 AUCROC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC | 0.354 | 0.000 | 0.600 | 0.175 | 0.741 | 0.563 | 0.226 | 0.315 | 0.088 | 0.117 | 0.638 | 0.135 | 0.690 |
| Encodec | 0.465 | 0.092 | 0.543 | 0.119 | 0.681 | 0.563 | 0.086 | 0.268 | 0.088 | 0.110 | 0.630 | 0.136 | 0.701 |
| SemantiCodec | 0.456 | 0.267 | 0.629 | 0.227 | 0.825 | 0.625 | 0.134 | 0.477 | 0.229 | 0.150 | 0.724 | 0.224 | 0.793 |
| Xcodec | 0.375 | 0.461 | 0.628 | 0.261 | 0.838 | 0.611 | 0.320 | 0.488 | 0.389 | 0.140 | 0.669 | 0.191 | 0.755 |
| YuE | 0.439 | 0.085 | 0.616 | 0.249 | 0.831 | 0.623 | 0.335 | 0.475 | 0.346 | 0.133 | 0.670 | 0.191 | 0.758 |

Speech and Sound Probe

| Codec | Mode | Common_Voice WER | Common_Voice CER | Vocalsound Acc | MELD Acc | ESC50 Acc |
| --- | --- | --- | --- | --- | --- | --- |
| DAC | quantized_emb | 0.526 | 0.229 | 0.535 | 0.483 | 0.325 |
| Encodec | quantized_emb | 0.503 | 0.209 | 0.574 | 0.481 | 0.275 |
| SemantiCodec | quantized_emb | 0.490 | 0.200 | 0.723 | 0.482 | 0.620 |
| WavTokenizer | quantized_emb | 0.582 | 0.288 | 0.524 | 0.484 | 0.135 |
| Mimi | quantized_emb | 0.442 | 0.168 | 0.833 | 0.481 | 0.335 |
| SpeechTokenizer | quantized_emb | 0.469 | 0.190 | 0.776 | 0.498 | 0.670 |
| Xcodec | quantized_emb | 0.474 | 0.188 | 0.731 | 0.491 | 0.640 |
| YuE | quantized_emb | 0.472 | 0.187 | 0.782 | 0.515 | 0.640 |
| hubert | unquantized_emb | - | - | 0.877 | 0.495 | 0.525 |
| qwen2audioencoder | unquantized_emb | - | - | 0.953 | 0.590 | 0.975 |

Speech and Sound Probe codebook0

| Codec | Vocalsound Acc | MELD Acc | ESC50 Acc |
| --- | --- | --- | --- |
| DAC | 0.511 | 0.481 | 0.285 |
| Encodec | 0.479 | 0.481 | 0.230 |
| SemantiCodec | 0.646 | 0.482 | 0.465 |
| Mimi | 0.794 | 0.481 | 0.265 |
| SpeechTokenizer | 0.698 | 0.489 | 0.420 |
| Xcodec | 0.656 | 0.487 | 0.525 |
| YuE | 0.684 | 0.481 | 0.515 |
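The probe tables above follow the usual linear-probing recipe: codec embeddings are frozen and only a lightweight classifier is trained on top. As an illustration of the general technique (not the benchmark's exact probe head or training schedule), here is a minimal multinomial logistic-regression probe in NumPy:

```python
import numpy as np

def train_linear_probe(X, y, n_classes, lr=0.1, steps=500, seed=0):
    """Fit a softmax linear classifier on frozen embeddings X (N, D)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                       # one-hot labels (N, C)
    for _ in range(steps):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = (p - Y) / len(X)                    # softmax cross-entropy gradient
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def probe_accuracy(W, b, X, y):
    """Classification accuracy of the trained probe."""
    return float(np.mean((X @ W + b).argmax(axis=1) == y))
```

Because the encoder stays frozen, probe accuracy reflects how linearly separable the task labels are in the codec's embedding space.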

PPL Experiment

LibriTTS

| Codec | ppl↓ | cb1_ppl | cb2_ppl | cb3_ppl | cb4_ppl | cb5_ppl | cb6_ppl | cb7_ppl | cb8_ppl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC | 420.6 | 48.9 | 284.1 | 428.6 | 560.2 | 609.7 | 728.1 | 814.0 | 835.5 |
| Encodec | 111.4 | 28.0 | 59.1 | 93.7 | 130.5 | 153.7 | 183.3 | 202.0 | 213.5 |
| WavTokenizer | 317.1 | 317.1 | - | - | - | - | - | - | - |
| X-Codec | 56.2 | 20.6 | 24.9 | 37.8 | 57.3 | 77.3 | 92.0 | 103.0 | 126.3 |
| YuE | 52.7 | 18.3 | 29.6 | 37.7 | 52.6 | 74.3 | 89.8 | 95.3 | 90.3 |
| SpeechTokenizer | 24.2 | 2.4 | 12.4 | 24.8 | 33.6 | 40.9 | 46.0 | 50.3 | 52.8 |
| Mimi | 269.6 | 32.9 | 189.3 | 334.7 | 383.3 | 424.2 | 431.7 | 456.9 | 459.7 |
| SemantiCodec | 14.8 | 1.2 | 191.0 | - | - | - | - | - | - |

Emilia_EN(100ksteps)

| Codec | ppl↓ | cb1_ppl | cb2_ppl | cb3_ppl | cb4_ppl | cb5_ppl | cb6_ppl | cb7_ppl | cb8_ppl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC | 247 | 20.6 | 146.7 | 218 | 315.1 | 395.9 | 482.9 | 569.6 | 628.2 |
| Encodec | 75.7 | 14.8 | 33.4 | 59 | 88.7 | 111.3 | 138.4 | 158.5 | 172.6 |
| WavTokenizer | 104.7 | 104.7 | - | - | - | - | - | - | - |
| X-Codec | 30.3 | 10.0 | 12.7 | 20.2 | 30.6 | 41.9 | 50.7 | 61.6 | 71.4 |
| YuE | 29.0 | 9.3 | 16.0 | 19.9 | 29.3 | 38.7 | 51.0 | 55.2 | 54.1 |
| SpeechTokenizer | 13.5 | 1.9 | 5.5 | 12.1 | 18.3 | 22.3 | 25.1 | 28.6 | 30.8 |
| Mimi | 126.9 | 9.1 | 58.2 | 148.0 | 185.0 | 228.7 | 256.6 | 278.9 | 298.5 |
| SemantiCodec | 7.9 | 1.0 | 82.1 | - | - | - | - | - | - |

MTG-Jamendo(100ksteps)

| Codec | ppl↓ | cb1_ppl | cb2_ppl | cb3_ppl | cb4_ppl | cb5_ppl | cb6_ppl | cb7_ppl | cb8_ppl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DAC | 194 | 28.6 | 122.8 | 152.4 | 212.8 | 270.7 | 352.9 | 413.4 | 473.5 |
| Encodec | 141.3 | 17.6 | 62.5 | 110.7 | 170 | 225.9 | 287 | 336.8 | 375.6 |
| WavTokenizer | 38.2 | 38.2 | - | - | - | - | - | - | - |
| X-Codec | 47.5 | 20.4 | 19.6 | 32.4 | 51.1 | 64.5 | 74.5 | 86.8 | 100.2 |
| YuE | 46.2 | 18.3 | 28.7 | 30.4 | 48.2 | 60.0 | 74.9 | 83.0 | 76.3 |
| SemantiCodec | 15.5 | 1.0 | 272.4 | - | - | - | - | - | - |
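Each `cbN_ppl` column above reports the perplexity of a decoder-only Transformer over that codebook's token stream, i.e. PPL = exp(mean negative log-likelihood of the target token IDs). A minimal NumPy sketch of this computation from model logits (the Transformer itself is omitted):

```python
import numpy as np

def perplexity(logits, targets):
    """PPL = exp(mean NLL); logits (T, V), targets (T,) of token IDs."""
    # Log-softmax with the usual max-subtraction for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each target token, then exponentiate the mean.
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))
```

A model that is uniform over a vocabulary of size V has PPL exactly V, which is why small per-codebook vocabularies and predictable token streams (e.g. SemantiCodec's first codebook) drive PPL toward 1.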

Acknowledgement

We would like to extend special thanks to the authors of https://github.com/lucadellalib/audiocodecs and Marble. Their work has been a great source of inspiration for us.

Citation

@misc{wang2025audiocodecbenchcomprehensivebenchmarkaudio,
      title={AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation}, 
      author={Lu Wang and Hao Chen and Siyu Wu and Zhiyue Wu and Hao Zhou and Chengfeng Zhang and Ting Wang and Haodi Zhang},
      year={2025},
      eprint={2509.02349},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2509.02349}, 
}
