Crosslingual Reasoning through Test-Time Scaling

🔥 TL;DR: We show that scaling up the thinking tokens of English-centric reasoning language models, such as the s1 models, can improve multilingual math reasoning performance. We also analyze language-mixing patterns, the effects of different reasoning languages (controlled by our language-forcing strategies), and cross-domain generalization (from STEM to domains such as the social sciences and cultural benchmarks).

Crosslingual MGSM performance


Getting Started

Installation

We use the modified lm-evaluation-harness from the s1 repository, further adapted to support our evaluation setup.

```bash
# installation (Python 3.10+)
git lfs --version  # check that Git LFS is installed before cloning the repo

git clone https://github.com/BatsResearch/crosslingual-test-time-scaling.git
cd crosslingual-test-time-scaling
pip install -r requirements.txt
cd lm-evaluation-harness
pip install -e ".[math,vllm]"
```
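
To sanity-check the installation, you can confirm that the MGSM tasks are registered in the harness. This is an optional check and assumes the standard `lm_eval` CLI entry point provided by lm-evaluation-harness:

```bash
# optional sanity check: list the registered tasks and look for the MGSM variants
# (assumes the standard `lm_eval` CLI provided by lm-evaluation-harness)
lm_eval --tasks list | grep mgsm_direct
```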

Quick Start

Here's a quick eval run on 50 Chinese MGSM samples using the s1.1-3B model with a maximum of 2,000 thinking tokens. The command should take less than 10 minutes to complete on 4 L40S GPUs.

See Codes and Artifacts below for the full evaluation scripts.

```bash
cd lm-evaluation-harness/

LANG=zh
MODEL=s1.1-3B
THINKING=2000   # truncation strategy: at most 2000 thinking tokens
NGPUS=4
NSAMPLES=50

OUTPUT_FP=../outputs/${MODEL}-mgsm_direct_${LANG}_${THINKING}
lm_eval --model vllm \
    --model_args pretrained=simplescaling/${MODEL},dtype=bfloat16,tensor_parallel_size=${NGPUS} \
    --tasks mgsm_direct_${LANG} \
    --batch_size auto \
    --apply_chat_template \
    --output_path ${OUTPUT_FP} \
    --log_samples \
    --gen_kwargs max_gen_toks=32768,max_tokens_thinking=${THINKING} \
    --limit ${NSAMPLES}

# expected output:
# |    Tasks     |Version|     Filter      |n-shot|  Metric   |   |Value|   |Stderr|
# |--------------|------:|-----------------|-----:|-----------|---|----:|---|------|
# |mgsm_direct_zh|      2|flexible-extract |     0|exact_match|↑  | 0.78|±  |   N/A|
# |              |       |remove_whitespace|     0|exact_match|↑  | 0.00|±  |   N/A|
#
# The MGSM accuracy is 78.0% for this subset of 50 samples.
```
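
To scale this up across languages and thinking-token budgets, the same command can be wrapped in a loop. The sketch below reuses the settings from the quick-start command; the language codes and budget values are only illustrative examples (the MGSM task suffixes available in your harness can be checked with `lm_eval --tasks list`):

```bash
# a minimal sketch: sweep MGSM languages and thinking-token budgets using the
# same settings as the quick-start command above; the languages and budgets
# below are illustrative, not the full set used in the paper
MODEL=s1.1-3B
NGPUS=4
NSAMPLES=50

for LANG in en zh ja fr; do
  for THINKING in 500 2000 8000; do
    OUTPUT_FP=../outputs/${MODEL}-mgsm_direct_${LANG}_${THINKING}
    lm_eval --model vllm \
        --model_args pretrained=simplescaling/${MODEL},dtype=bfloat16,tensor_parallel_size=${NGPUS} \
        --tasks mgsm_direct_${LANG} \
        --batch_size auto \
        --apply_chat_template \
        --output_path ${OUTPUT_FP} \
        --log_samples \
        --gen_kwargs max_gen_toks=32768,max_tokens_thinking=${THINKING} \
        --limit ${NSAMPLES}
  done
done
```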

Codes and Artifacts

The experiments/ folder contains our experiment code and the artifacts of model generations from our experiments. The repository is structured according to the sections of the paper.

Citation

```bibtex
@article{yong2025crosslingual-test-time-scaling,
  title={Crosslingual Reasoning through Test-Time Scaling},
  author={Zheng-Xin Yong and M. Farid Adilazuarda and Jonibek Mansurov and Ruochen Zhang and Niklas Muennighoff and Carsten Eickhoff and Genta Indra Winata and Julia Kreutzer and Stephen H. Bach and Alham Fikri Aji},
  year={2025},
  journal={arXiv preprint},
  eprint={2505.05408},
}
```
