vivo Mobile Communication Co., Ltd
†Indicates Corresponding Author
- [2025-07-02]: 🚀 Paper released on Arxiv.
- [2025-06-18]: 🚀 Paper released on TechRxiv.
- [2025-06-11]: 🚀 Code released.
- We propose DiMo-GUI, a training-free framework that can be seamlessly integrated as a plug-and-play component into any GUI agent. Without requiring additional training or external data, DiMo-GUI effectively enhances grounding performance across various GUI tasks.
- DiMo-GUI introduces three key innovations:
- A divide-and-conquer strategy that separates text and icon components for targeted processing.
- A progressive zoom-in mechanism to increasingly focus on the target region.
- A dynamic halting system that enables timely decision-making and early stopping to reduce overthinking and unnecessary computational cost.
- Extensive experiments demonstrate that DiMo-GUI significantly improves the grounding performance of various GUI agents across multiple benchmarks with minimal computational overhead, showcasing the effectiveness and generality of the proposed framework.
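The progressive zoom-in and dynamic-halting ideas above can be sketched roughly as follows. Every name, signature, and default in this snippet is an illustrative assumption, not the repository's actual API; the divide-and-conquer step would simply run this loop once for the text query and once for the icon query.

```python
# Illustrative sketch of progressive zoom-in with dynamic halting.
# All names and defaults here are assumptions for exposition only.

def zoom_in(region, point, factor=0.5):
    """Crop `region` = (x0, y0, x1, y1) around `point`, shrinking each side by `factor`."""
    x0, y0, x1, y1 = region
    w, h = (x1 - x0) * factor, (y1 - y0) * factor
    cx, cy = point
    # Clamp the crop so it stays inside the original region.
    nx0 = min(max(x0, cx - w / 2), x1 - w)
    ny0 = min(max(y0, cy - h / 2), y1 - h)
    return (nx0, ny0, nx0 + w, ny0 + h)

def ground(predict, region, max_iter=3, eps=2.0):
    """Repeatedly zoom toward the model's predicted point; halt once it stabilizes."""
    prev = None
    for _ in range(max_iter):
        point = predict(region)  # model predicts an (x, y) target in the current crop
        if prev is not None and abs(point[0] - prev[0]) + abs(point[1] - prev[1]) < eps:
            break  # dynamic halting: prediction has converged, stop to avoid overthinking
        prev = point
        region = zoom_in(region, point)
    return point
```

The early-stopping check is what keeps the iteration count (and hence the extra inference cost) low when the model's prediction is already stable.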
conda create -n dimo-gui python
conda activate dimo-gui
cd DiMo-GUI
pip install -r requirements.txt
Note that `osatlas-4b` requires a different `transformers` version from the other models; to run it, first install:
pip install transformers==4.37.2
You can download the ScreenSpot-Pro dataset from this Hugging Face link, or use the command below:
huggingface-cli download --resume-download --repo-type dataset likaixin/ScreenSpot-Pro --local-dir ./data/pro
You can obtain the ScreenSpot-V2 dataset from this link, and refer to this issue.
Make sure the data is placed under `./data`, or adjust the paths in the bash scripts accordingly.
Use the shell scripts to run DiMo-GUI:
bash run_ss_pro.sh
bash run_ss_v2.sh
You can change parameters such as `models` and `max_iter` to run different experiments.
We provide the JSON files of the experimental results reported in the paper in the `results` folder.
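As a quick sanity check, the provided result files can be inspected with a few lines of Python. The per-file schema assumed here (a JSON array of per-sample records) is a guess; adjust it to match the files actually shipped in `results`.

```python
# Hypothetical helper: count the records in each JSON file under a results folder.
import json
from pathlib import Path

def summarize_results(results_dir="results"):
    """Return {filename: number of records} for every JSON file in `results_dir`."""
    summary = {}
    for path in sorted(Path(results_dir).glob("*.json")):
        with open(path) as f:
            data = json.load(f)  # assumed to be a list of per-sample records
        summary[path.name] = len(data)
    return summary
```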
- Comparison of various models on ScreenSpot-Pro.
- Comparison of various models on ScreenSpot-V2.
- Please refer to our paper for detailed experimental results.
- Examples on ScreenSpot-Pro. On the left is the original model's prediction, where the red box marks the ground truth and the blue dot marks the predicted coordinates. On the right is the result after integrating DiMo-GUI, where the model localizes the target more accurately according to the instruction.
- Examples on ScreenSpot-V2. On the ScreenSpot-V2 benchmark, which features relatively low resolution and simple scenes, DiMo-GUI also enhances the model's localization capabilities.
If you find our project useful, please star our repo and cite our paper as follows:
@article{wu2025dimo,
title={DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning},
author={Wu, Hang and Chen, Hongkai and Cai, Yujun and Liu, Chang and Ye, Qingwen and Yang, Ming-Hsuan and Wang, Yiwei},
year={2025}
}
Our repository is based on the following projects; we sincerely thank them for their great efforts and excellent work.
- ScreenSpot-Pro: the latest GUI grounding benchmark.
- Iterative-Narrowing: iterative narrowing for GUI grounding.
- OS-Atlas, UGround: SOTA GUI agents.
This project is licensed under the terms of the Apache License 2.0. You are free to use, modify, and distribute this software under the conditions of the license. See the LICENSE file for details.