vivo Mobile Communication Co., Ltd
†Indicates Corresponding Author
- [2025-07-02]: 🚀 Paper released on Arxiv.
- [2025-06-18]: 🚀 Paper released on TechRxiv.
- [2025-06-11]: 🚀 Code released.
- We propose DiMo-GUI, a training-free framework that can be seamlessly integrated as a plug-and-play component into any GUI agent. Without requiring additional training or external data, DiMo-GUI effectively enhances grounding performance across various GUI tasks.
- DiMo-GUI introduces three key innovations:
- A divide-and-conquer strategy that separates text and icon components for targeted processing.
- A progressive zoom-in mechanism to increasingly focus on the target region.
- A dynamic halting system that enables timely decision-making and early stopping to reduce overthinking and unnecessary computational cost.
- Extensive experiments demonstrate that DiMo-GUI significantly improves the grounding performance of various GUI agents across multiple benchmarks with minimal computational overhead, showcasing the effectiveness and generality of the proposed framework.
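The progressive zoom-in and dynamic-halting ideas above can be sketched roughly as follows. Every name, signature, and default in this snippet is an illustrative assumption, not the repository's actual API; the divide-and-conquer step would simply run this loop once for the text query and once for the icon query.

```python
# Illustrative sketch of progressive zoom-in with dynamic halting.
# All names and defaults here are assumptions for exposition only.

def zoom_in(region, point, factor=0.5):
    """Crop `region` = (x0, y0, x1, y1) around `point`, shrinking each side by `factor`."""
    x0, y0, x1, y1 = region
    w, h = (x1 - x0) * factor, (y1 - y0) * factor
    cx, cy = point
    # Clamp the crop so it stays inside the original region.
    nx0 = min(max(x0, cx - w / 2), x1 - w)
    ny0 = min(max(y0, cy - h / 2), y1 - h)
    return (nx0, ny0, nx0 + w, ny0 + h)

def ground(predict, region, max_iter=3, eps=2.0):
    """Repeatedly zoom toward the model's predicted point; halt once it stabilizes."""
    prev = None
    for _ in range(max_iter):
        point = predict(region)  # model predicts an (x, y) target in the current crop
        if prev is not None and abs(point[0] - prev[0]) + abs(point[1] - prev[1]) < eps:
            break  # dynamic halting: prediction has converged, stop to avoid overthinking
        prev = point
        region = zoom_in(region, point)
    return point
```

The early-stopping check is what keeps the iteration count (and hence the extra inference cost) low when the model's prediction is already stable.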
conda create -n dimo-gui python
conda activate dimo-gui
cd DiMo-GUI
pip install -r requirements.txt
Note that `osatlas-4b` requires a different `transformers` version from the other models; to run it, first install:
pip install transformers==4.37.2
You can download the ScreenSpot-Pro dataset from this Hugging Face link, or use the command below:
huggingface-cli download --resume-download --repo-type dataset likaixin/ScreenSpot-Pro --local-dir ./data/pro
You can obtain the ScreenSpot-V2 dataset from this link, and refer to this issue.
Make sure the data is placed under `./data`, or adjust the paths in the bash scripts accordingly.
Use the shell scripts to run DiMo-GUI:
bash run_ss_pro.sh
bash run_ss_v2.sh
You can change parameters such as `models` and `max_iter` to run different experiments.
We provide the JSON files of the experimental results reported in the paper in the `results` folder.
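As a quick sanity check, the provided result files can be inspected with a few lines of Python. The per-file schema assumed here (a JSON array of per-sample records) is a guess; adjust it to match the files actually shipped in `results`.

```python
# Hypothetical helper: count the records in each JSON file under a results folder.
import json
from pathlib import Path

def summarize_results(results_dir="results"):
    """Return {filename: number of records} for every JSON file in `results_dir`."""
    summary = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path) as f:
            data = json.load(f)  # assumed to be a list of per-sample records
        summary[path.name] = len(data)
    return summary
```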
- Comparison of various models on ScreenSpot-Pro.
- Comparison of various models on ScreenSpot-V2.
- Please refer to our paper for detailed experimental results.
- Examples on ScreenSpot-Pro. On the left is the original model's prediction, where the red box marks the ground truth and the blue dot marks the predicted coordinates. On the right is the result after integrating DiMo-GUI, where the model localizes the target more accurately according to the instruction.
- Examples on ScreenSpot-V2. On the ScreenSpot-V2 benchmark, which features relatively low resolution and simple scenes, DiMo-GUI also enhances the model's localization capabilities.
If you find our project useful, please star our repo and cite our paper as follows:
@article{wu2025dimo,
title={DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning},
author={Wu, Hang and Chen, Hongkai and Cai, Yujun and Liu, Chang and Ye, Qingwen and Yang, Ming-Hsuan and Wang, Yiwei},
year={2025}
}
Our repository is based on the following projects; we sincerely thank them for their great efforts and excellent work.
- ScreenSpot-Pro: the latest GUI grounding benchmark.
- Iterative-Narrowing: iterative narrowing for GUI grounding.
- OS-Atlas, UGround: SOTA GUI agents.
This project is licensed under the terms of the Apache License 2.0. You are free to use, modify, and distribute this software under the conditions of the license. See the LICENSE file for details.