This repository contains the code for running the model from the paper TinyClick: Single-Turn Agent for Empowering GUI Automation.
We present a single-turn agent for graphical user interface (GUI) interaction tasks, built on the Vision-Language Model Florence-2-Base. The agent's main goal is to click on the desired UI element based on a screenshot and a user command. It demonstrates strong performance on Screenspot and OmniAct while maintaining a compact size of 0.27B parameters and minimal latency.
Before running, set up the environment and install the required packages:
pip install -r requirements.txt
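If you want to keep the dependencies isolated, one option is to install them inside a Python virtual environment, for example:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt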
To run an example inference with TinyClick, use this command:
python3 main.py --image-path "<PATH>" --text "<COMMAND>"
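You can also run inference programmatically. The snippet below is a minimal sketch, assuming the checkpoint is available on Hugging Face under the ID Samsung/TinyClick and that it follows the standard Florence-2 processor and generate interface; the model ID, prompt handling, and output parsing are assumptions here, so refer to main.py for the exact pipeline.

```python
# Minimal programmatic inference sketch. Assumptions (not taken from this repo):
# the checkpoint is published on Hugging Face as "Samsung/TinyClick" and uses the
# standard Florence-2 processor interface. See main.py for the exact prompt
# construction and output parsing.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Samsung/TinyClick"  # assumed model ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

image = Image.open("screenshot.png").convert("RGB")  # the GUI screenshot
command = "click on the search button"               # the user command

# Florence-2-style models take the text prompt and the image in one processor call.
inputs = processor(text=command, images=image, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# The decoded text encodes the predicted action (e.g. a click and its target location);
# the exact output format and coordinate parsing depend on the model, see main.py.
print(processor.batch_decode(output_ids, skip_special_tokens=False)[0])
```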
If you find TinyClick useful, please cite the paper:
@misc{pawlowski2024tinyclicksingleturnagentempowering,
      title={TinyClick: Single-Turn Agent for Empowering GUI Automation},
      author={Pawel Pawlowski and Krystian Zawistowski and Wojciech Lapacz and Marcin Skorupa and Adam Wiacek and Sebastien Postansque and Jakub Hoscilowicz},
      year={2024},
      eprint={2410.11871},
      archivePrefix={arXiv},
      primaryClass={cs.HC},
      url={https://arxiv.org/abs/2410.11871},
}
This code is released under the MIT license. See the LICENSE file in this repository for details.