TL;DR: In vision-centric reasoning (e.g., mazes, puzzles), "Short is Long". Our research reveals that concise, coordinate-based reasoning generalizes significantly better than verbose language or visual manipulation CoT chains.
Recent advancements in Vision-Language Models (VLMs) often rely on lengthy Chain-of-Thought (CoT) or "Thinking with Images" (Visual CoT) to improve reasoning. However, does more verbose or visual reasoning actually lead to better generalization?
In this work, we systematically evaluate different CoT designs on a controlled maze-solving benchmark (and validate on real-world tasks), using an SFT-then-RL (Supervised Fine-Tuning followed by Reinforcement Learning) pipeline built on Qwen2.5-VL-7B.
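For concreteness, here is a minimal sketch of what one maze instance and its SFT training pair could look like; the grid encoding, prompt wording, and (row, col) convention are illustrative assumptions rather than the exact format used in this work, and the rendered maze image the VLM actually sees is omitted.

```python
# Illustrative only: a toy maze instance and an SFT (prompt, target) pair.
# The encoding and wording below are assumptions, not the paper's exact format.

maze = {
    "size": 6,                                        # 6x6 training maze
    "start": (0, 0),
    "goal": (5, 5),
    "walls": {((0, 0), (0, 1)), ((2, 3), (3, 3))},    # blocked edges between adjacent cells
}

prompt = (
    f"The image shows a {maze['size']}x{maze['size']} maze. "
    f"Start at {maze['start']} and reach {maze['goal']}. "
    "Answer with the path as a sequence of (row, col) coordinates."
)

# G-CoT-least supervision target: only the grounded path, no verbal reasoning.
# The RL stage then optimizes a verifiable reward on answers of this form
# (see the verifier sketch further below).
target = "(0,0) (1,0) (2,0) (2,1) (2,2) (3,2) (4,2) (5,2) (5,3) (5,4) (5,5)"
```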
- Visual CoT Accelerates, but Doesn't Elevate: Methods that "draw" on images (Visual CoT) speed up convergence during RL but do not raise the final performance ceiling compared to simpler methods.
- Concise Grounding is Better: A minimal trajectory of coordinates (Implicit CoT / G-CoT-least) outperforms verbose step-by-step language reasoning.
- The "Short is Long" Effect: The most concise formats (containing only the essential grounding/path information) achieve the best generalization across different scales (e.g., training on $6\times6$ mazes and solving $7\times7$ mazes); see the verifier sketch below.
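To illustrate why the coordinate-only format transfers across maze sizes, below is a size-agnostic path checker of the kind that could also serve as the verifiable reward in the RL stage; the answer parsing, wall encoding, and 0/1 reward are assumptions, not this work's actual implementation.

```python
import re

def parse_path(answer: str) -> list[tuple[int, int]]:
    """Extract (row, col) pairs from a coordinate-only answer string."""
    return [(int(r), int(c)) for r, c in re.findall(r"\((\d+)\s*,\s*(\d+)\)", answer)]

def path_reward(answer: str, size: int, start, goal, walls) -> float:
    """Return 1.0 for a valid start-to-goal path, else 0.0.

    The check depends only on the maze definition, so the same function
    scores 6x6 training mazes and 7x7 evaluation mazes without modification.
    """
    path = parse_path(answer)
    if not path or path[0] != start or path[-1] != goal:
        return 0.0
    for a, b in zip(path, path[1:]):
        in_bounds = all(0 <= x < size for x in (*a, *b))
        adjacent = abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
        blocked = (a, b) in walls or (b, a) in walls
        if not (in_bounds and adjacent) or blocked:
            return 0.0
    return 1.0

# A 7x7 maze with no internal walls: down the first column, then across the last row.
answer = " ".join(f"({r},0)" for r in range(7)) + " " + " ".join(f"(6,{c})" for c in range(1, 7))
print(path_reward(answer, size=7, start=(0, 0), goal=(6, 6), walls=set()))  # 1.0
```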
We isolate and compare three distinct ways of externalizing reasoning, plus a minimal grounding variant (illustrative examples follow the table):
| Method | Description |
|---|---|
| Language CoT | Pure text reasoning (e.g., "Move North, then West...") |
| Grounding CoT | Text linked with coordinates (e.g., "Move to [x,y]...") |
| Visual CoT | Interleaved image manipulation (drawing lines/marks) |
| G-CoT-least | Implicit reasoning: the coordinate path only, no verbal steps |
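To make the verbosity gap concrete, here are hypothetical answers for the same small maze in each format; the wording and the tag syntax standing in for image edits are illustrative assumptions (the actual prompts and tool calls may differ), but the relative lengths convey why the most concise format is called "least".

```python
# Hypothetical outputs for one 3x3 maze (start (0,0), goal (2,2)); illustrative only.
# Visual CoT is shown as placeholder tags standing in for interleaved image edits.
examples = {
    "Language CoT": (
        "I start in the top-left corner. The cell below is open, so I move south, "
        "then south again to reach the bottom row. From there I move east twice "
        "along the bottom wall until I arrive at the goal in the corner."
    ),
    "Grounding CoT": (
        "Start at (0,0). Move south to (1,0). Move south to (2,0). "
        "Move east to (2,1). Move east to (2,2): goal reached."
    ),
    "Visual CoT": (
        "<draw line (0,0)->(2,0)> <look at updated image> "
        "<draw line (2,0)->(2,2)> <look at updated image> Path complete."
    ),
    "G-CoT-least": "(0,0) (1,0) (2,0) (2,1) (2,2)",
}

for name, answer in examples.items():
    print(f"{name:13} ~{len(answer.split()):3d} words")
```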
Visual CoT only accelerates training; it does not raise the performance upper bound compared to Language or Grounding CoT.
Once its grounding ability is aligned with the visual environment, the VLM can reason implicitly, reaching full accuracy with faster convergence.
Properly aligned grounding ability allows the model to generalize its spatial reasoning effectively to new visual environments.
We validated our "Short is Long" hypothesis on broader vision-centric tasks:
- Visual Games: FrozenLake, Jigsaw Puzzles.
- Real-world VQA: $V^*$ Bench, HR-Bench.
- Result: G-CoT-least consistently outperformed the more verbose CoT methods across these tasks.
| Model | $V^*$ Bench Attr | $V^*$ Bench Spatial | $V^*$ Bench Overall | HR-Bench 4K FSP | HR-Bench 4K FCP | HR-Bench 4K Overall | FrozenLake | Jigsaw |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 67.83 | 78.95 | 72.25 | 88.00 | 57.00 | 72.50 | 20.00 | 0.00 |
| + V-CoT RL | 86.09 | 78.95 | 83.25 | 87.00 | 57.00 | 72.00 | - | - |
| + G-CoT-least RL | 87.83 | 82.89 | 85.86 | 90.75 | 57.50 | 74.12 | 90.33 | 75.60 |
If you find this work helpful for your research, please consider citing: