🔥 TL;DR: We show that scaling up the thinking tokens of English-centric reasoning language models, such as the s1 models, can improve multilingual math reasoning performance. We also analyze language-mixing patterns, the effects of different reasoning languages (controlled by our language-forcing strategies), and cross-domain generalization (from STEM to domains such as social sciences and cultural benchmarks).
We build on the modified lm-evaluation-harness from the s1 repository and further adapt it to support our evaluation setup.
### Installation (Python 3.10+)
git lfs install # make sure Git LFS is set up before cloning the repo
git clone https://github.com/BatsResearch/crosslingual-test-time-scaling.git
cd crosslingual-test-time-scaling
pip install -r requirements.txt
cd lm-evaluation-harness
pip install -e ".[math,vllm]"
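To sanity-check the installation, you can confirm that the harness and the vLLM backend both import cleanly. This is a minimal check and assumes the packages expose the standard lm_eval and vllm module names:

python -c "import lm_eval, vllm" # should exit without errors if the installation succeeded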
Here's a quick evaluation run on 50 Chinese MGSM samples using the s1.1-3B model with a maximum of 2000 thinking tokens. The command should take less than 10 minutes to complete on 4 L40S GPUs.
See the Codes and Artifacts section below for the full evaluation scripts.
cd lm-evaluation-harness/
LANG=zh
MODEL=s1.1-3B
THINKING=2000 # truncation strategy: at most 2000 thinking tokens
NGPUS=4
NSAMPLES=50
OUTPUT_FP=../outputs/${MODEL}-mgsm_direct_${LANG}_${THINKING}
lm_eval --model vllm \
  --model_args pretrained=simplescaling/${MODEL},dtype=bfloat16,tensor_parallel_size=${NGPUS} \
  --tasks mgsm_direct_${LANG} \
  --batch_size auto \
  --apply_chat_template \
  --output_path ${OUTPUT_FP} \
  --log_samples \
  --gen_kwargs max_gen_toks=32768,max_tokens_thinking=${THINKING} \
  --limit ${NSAMPLES}
# | Tasks |Version| Filter |n-shot| Metric | |Value| |Stderr|
# |--------------|------:|-----------------|-----:|-----------|---|----:|---|------|
# |mgsm_direct_zh| 2|flexible-extract | 0|exact_match|↑ | 0.78|± | N/A|
# | | |remove_whitespace| 0|exact_match|↑ | 0.00|± | N/A|
#
# The MGSM accuracy (flexible-extract filter) is 78.0% on this subset of 50 samples.
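To sweep all MGSM languages (and, optionally, several thinking-token budgets) instead of a single configuration, the same command can be wrapped in a loop. The sketch below assumes the standard mgsm_direct_* task names from lm-evaluation-harness and reuses the flags from the quick run above; the budget values are illustrative, so adjust LANGS, BUDGETS, and the output paths to your setup (the exact scripts we used live in the experiments/ folder described below).

MODEL=s1.1-3B
NGPUS=4
LANGS="bn de en es fr ja ru sw te th zh" # MGSM language codes
BUDGETS="500 1000 2000 4000 8000"        # illustrative thinking-token budgets (assumption)
for LANG in ${LANGS}; do
  for THINKING in ${BUDGETS}; do
    OUTPUT_FP=../outputs/${MODEL}-mgsm_direct_${LANG}_${THINKING}
    lm_eval --model vllm \
      --model_args pretrained=simplescaling/${MODEL},dtype=bfloat16,tensor_parallel_size=${NGPUS} \
      --tasks mgsm_direct_${LANG} \
      --batch_size auto \
      --apply_chat_template \
      --output_path ${OUTPUT_FP} \
      --log_samples \
      --gen_kwargs max_gen_toks=32768,max_tokens_thinking=${THINKING}
  done
done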
The `experiments/` folder contains our experiment code and the artifacts of model generations from our experiments. The repository is structured to follow the sections of the paper.
- `crosslingual_mgsm`: Crosslingual test-time scaling experiments (Section 4)
- `language_mixing`: Language-mixing experiments (Section 5)
- `language_forcing`: Language-forcing experiments (Section 6)
- `crossdomain`: Cross-domain experiments (Section 7)
@article{yong2025crosslingual-test-time-scaling,
  title={Crosslingual Reasoning through Test-Time Scaling},
  author={Zheng-Xin Yong and M. Farid Adilazuarda and Jonibek Mansurov and Ruochen Zhang and Niklas Muennighoff and Carsten Eickhoff and Genta Indra Winata and Julia Kreutzer and Stephen H. Bach and Alham Fikri Aji},
  year={2025},
  journal={arXiv preprint},
  eprint={2505.05408},
  archivePrefix={arXiv},
}