This repository provides an unofficial evaluation implementation for LLaDA 2.0, based on the lm-evaluation-harness.
⚠️ Disclaimer: Since the official evaluation reports for LLaDA 2.0 are not yet available, the results presented below are based on independent testing conducted on my own equipment. They may not fully represent the model's official performance capabilities.
- Hardware: NVIDIA A100 GPU
- Software:
```
torch == 2.5.1
transformers == 4.57.1
```
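As a quick, optional sanity check of the pinned versions (a minimal snippet; the last line assumes a CUDA device is visible):

```python
import torch
import transformers

# Versions used for the results reported below.
assert torch.__version__.startswith("2.5.1"), torch.__version__
assert transformers.__version__ == "4.57.1", transformers.__version__
print(torch.cuda.get_device_name(0))  # e.g. an NVIDIA A100
```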
First, run the test script to ensure the environment is set up correctly and the model can generate samples:
```
python test.py
```

Then execute the shell script to start the evaluation process:

```
bash eval_LLaDA2.sh
```

Please pay attention to the following parameter differences compared to LLaDA v1:
- `steps` parameter: In LLaDA 2.0, `steps` refers to intra-block steps (steps within a block), which differs from the definition used in LLaDA 1.0.
- `eos_early_stop`: A new parameter (default: `True`) that allows generation to stop immediately upon encountering the EOS token, improving efficiency without affecting generation quality. A toy sketch of both behaviors follows this list.
- Log Samples: You must enable the `log_samples` option, as the final metrics rely heavily on Python post-processing of these logs.
- Data Management: The post-processing script calculates the average accuracy over ALL `.jsonl` files found in the current result directory (see the second sketch below).
  - Recommendation: Before starting a new run, please delete old JSONL files or specify a new output directory to avoid mixing results from different experiments.
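For intuition about the first two parameters, here is a minimal, self-contained sketch of block-wise iterative decoding with intra-block `steps` and `eos_early_stop`. Everything in it (`dummy_model`, the token IDs, the confidence-based pacing heuristic) is a hypothetical stand-in, not the repository's actual sampler:

```python
import math
import torch

MASK_ID, EOS_ID, VOCAB = 0, 2, 16

def dummy_model(x):
    # Stand-in for the model forward pass: random logits over a toy vocab.
    return torch.randn(x.shape[0], x.shape[1], VOCAB)

@torch.no_grad()
def generate(prompt, gen_len=32, block_len=8, steps=4, eos_early_stop=True):
    x = torch.cat([prompt, torch.full((1, gen_len), MASK_ID)], dim=1)
    for start in range(prompt.shape[1], x.shape[1], block_len):
        block = slice(start, min(start + block_len, x.shape[1]))
        # LLaDA 2.0 semantics: `steps` counts denoising iterations spent
        # inside *this* block, not across the whole sequence as in v1.
        for step in range(steps):
            masked = x[0, block] == MASK_ID
            if not masked.any():
                break
            logits = dummy_model(x)
            logits[..., MASK_ID] = -math.inf  # never predict the mask token
            conf, pred = logits.softmax(-1).max(-1)
            # Reveal the most confident masked positions, paced so the
            # block is fully decoded after `steps` iterations.
            k = math.ceil(int(masked.sum()) / (steps - step))
            scores = torch.where(masked, conf[0, block], torch.tensor(-math.inf))
            for idx in scores.topk(k).indices:
                x[0, start + idx] = pred[0, start + idx]
        # eos_early_stop: skip all remaining blocks once EOS has appeared.
        if eos_early_stop and (x[0, block] == EOS_ID).any():
            break
    return x

print(generate(torch.tensor([[5, 6, 7]])))
```

With `eos_early_stop=True`, blocks after the first EOS are never decoded, which is where the efficiency gain comes from.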
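Likewise, a minimal sketch of the averaging behavior described in the Data Management item, assuming each logged sample is a JSON line with a 0/1 `acc` field (the field name and layout are assumptions; the repository's post-processing script may differ):

```python
import glob
import json
import os

def average_accuracy(result_dir):
    records = []
    # Every .jsonl under result_dir is included, which is why stale
    # files from earlier experiments would silently skew the average.
    for path in glob.glob(os.path.join(result_dir, "*.jsonl")):
        with open(path) as f:
            records.extend(json.loads(line) for line in f if line.strip())
    if not records:
        raise RuntimeError(f"no .jsonl logs found in {result_dir}")
    return sum(float(r["acc"]) for r in records) / len(records)

print(average_accuracy("results"))
```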
Observation: Accuracy at a generation length of 256 is significantly lower. This is likely because LLaDA 2.0 tends to generate longer responses; a short generation budget (256 tokens) truncates the reasoning process, leading to incomplete answers.
| Model | Len | Method | HumanEval:Acc | MBPP:Acc | GSM8K:Acc | MATH500:Acc |
|---|---|---|---|---|---|---|
| LLaDA2.0-mini-preview | 256 | baseline | 5.5 | 19.2 | 63.8 | 16.2 |
| LLaDA2.0-mini-preview | 512 | baseline | 54.3 | 54.7 | 86.5 | 44.6 |
| LLaDA2.0-mini-preview | 1024 | baseline | 74.2 | 63.2 | 87.7 | 61.2 |
This project is built upon the open-source repository daedal. Special thanks to the author for their contributions.