
DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning

1University of California, Merced, 2The University of Queensland,
3vivo Mobile Communication Co., Ltd

Indicates Corresponding Author

🔥 Update

  • [2025-07-02]: 🚀 Paper released on arXiv.
  • [2025-06-18]: 🚀 Paper released on TechRxiv.
  • [2025-06-11]: 🚀 Code released.

🎯 Overview

teaser
  • We propose DiMo-GUI, a training-free framework that can be seamlessly integrated as a plug-and-play component into any GUI agent. Without requiring additional training or external data, DiMo-GUI effectively enhances grounding performance across various GUI tasks.
  • DiMo-GUI introduces three key innovations:
    1. A divide-and-conquer strategy that separates text and icon components for targeted processing.
    2. A progressive zoom-in mechanism to increasingly focus on the target region.
    3. A dynamic halting system that enables timely decision-making and early stopping to reduce overthinking and unnecessary computational cost.
radar
  • Extensive experiments demonstrate that DiMo-GUI significantly enhances the grounding performance of various GUI agents across multiple benchmarks with minimal computational overhead, showcasing the effectiveness and generalizability of the proposed framework.
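The three innovations above can be sketched as a single coarse-to-fine loop. The code below is a hypothetical illustration, not the repo's actual API: `predict` stands in for any grounding model, and the region math, zoom factor, and halting threshold are all assumptions chosen to show the idea.

```python
def ground_with_zoom(predict, image_size, max_iter=3, zoom=0.5, eps=5.0):
    """Illustrative coarse-to-fine grounding loop.

    predict(region) -> (x, y): a stand-in for a GUI grounding model that
    returns a click point in absolute image coordinates for the given
    (left, top, right, bottom) crop.
    """
    w, h = image_size
    region = (0, 0, w, h)  # start from the full image (coarse prediction)
    last = None
    for _ in range(max_iter):
        x, y = predict(region)
        # Dynamic halting: stop early once successive predictions agree,
        # avoiding overthinking and wasted computation.
        if last is not None and abs(x - last[0]) + abs(y - last[1]) < eps:
            break
        last = (x, y)
        # Progressive zoom-in: shrink the crop around the current prediction
        # so the next pass focuses on the candidate target region.
        rw = (region[2] - region[0]) * zoom
        rh = (region[3] - region[1]) * zoom
        region = (max(0, x - rw / 2), max(0, y - rh / 2),
                  min(w, x + rw / 2), min(h, y + rh / 2))
    return last
```

In DiMo-GUI this loop would run separately for the text branch and the icon branch (the divide-and-conquer step), with the better candidate selected afterward; the sketch shows only one branch for brevity.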

🕹️ Usage

Environment Setup

conda create -n dimo-gui
conda activate dimo-gui
cd DiMo-GUI
pip install -r requirements.txt

Note that osatlas-4b requires a different transformers version from the other models; to run osatlas-4b, first run:

pip install transformers==4.37.2
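Because the pin differs from the other models' requirement, a quick sanity check before launching osatlas-4b can save a confusing failure later. The helper below is a minimal sketch using only the standard library; the function name `check_pin` is illustrative, not part of this repo.

```python
from importlib.metadata import version, PackageNotFoundError

def check_pin(package, required):
    """Return True if `package` is installed at exactly version `required`."""
    try:
        return version(package) == required
    except PackageNotFoundError:
        return False

# Example: before running osatlas-4b you might guard with
#   assert check_pin("transformers", "4.37.2"), \
#       "run: pip install transformers==4.37.2"
```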

Data Preparation

You can download the ScreenSpot-Pro dataset from this Hugging Face link, or use the command below:

huggingface-cli download --resume-download --repo-type dataset likaixin/ScreenSpot-Pro --local-dir ./data/pro

You can obtain the ScreenSpot-V2 dataset from this link; also refer to this issue.

Make sure you put the data under the ./data directory; otherwise, you may need to edit the bash scripts.

Run DiMo-GUI

Use the shell script to run DiMo-GUI:

bash run_ss_pro.sh
bash run_ss_v2.sh

You can change parameters such as the model and max_iter to run different experiments.

We provide the JSON files of the experimental results reported in the paper in the results folder.

🏅 Experiments

  • Comparison of various models on ScreenSpot-Pro.
teaser
  • Comparison of various models on ScreenSpot-V2.
teaser
  • Please refer to our paper for detailed experimental results.

📌 Examples

teaser
  • Examples on ScreenSpot-Pro. On the left is the original model's prediction, where the red box represents the ground truth and the blue dot indicates the predicted coordinates. On the right is the result after integrating DiMo-GUI, where the model is able to localize more accurately according to the instruction.
teaser
  • Examples on ScreenSpot-V2. On the ScreenSpot benchmark, which features relatively low resolution and simple scenes, DiMo-GUI also enhances the model's localization capabilities.

📑 Citation

If you find our project useful, please star our repo and cite our paper as follows:

@article{wu2025dimo,
  title={DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning},
  author={Wu, Hang and Chen, Hongkai and Cai, Yujun and Liu, Chang and Ye, Qingwen and Yang, Ming-Hsuan and Wang, Yiwei},
  year={2025}
}

📝 Related Projects

Our repository builds on the following projects; we sincerely thank the authors for their great efforts and excellent work.

License

This project is licensed under the terms of the Apache License 2.0. You are free to use, modify, and distribute this software under the conditions of the license. See the LICENSE file for details.
