LLMs are increasingly deployed as agents: systems that plan, reason, and dynamically call external tools. In visual reasoning, however, prior approaches remain largely constrained by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision delivers consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling lets models not just use tools but invent them, advancing toward more agentic visual reasoning.
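To make the idea concrete, here is a minimal, hypothetical sketch of the kind of on-the-fly tool the backbone MLLM might write during a visual-search turn (the function name, crop box, and scale factor are illustrative, not taken from the report):

from PIL import Image

def zoom_in(image_path: str, box: tuple[int, int, int, int], scale: int = 4) -> Image.Image:
    """Crop a region of interest and upsample it so fine details become legible."""
    img = Image.open(image_path)
    crop = img.crop(box)  # (left, upper, right, lower) in pixels
    return crop.resize((crop.width * scale, crop.height * scale), Image.LANCZOS)

# The model would choose the box itself after inspecting the full image,
# then reason over the returned crop in the next turn.
zoomed = zoom_in("./test_data/one_image_demo.png", box=(120, 80, 360, 260))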
- [2025-7-8] 🚀🚀🚀 We are excited to release PyVision, including the technical report, code, and online demo.
Set up the running environment, both for the main process and for the code-execution runtime.
git clone https://github.com/agents-x-project/PyVision.git
cd PyVision
conda create -n pyvision python=3.10
conda activate pyvision
pip install -r requirements.txt
Before running PyVision, you first need to set up an API config file containing the key and the base_url. We provide three types of clients: OpenAI, Azure, and vLLM.
# ./api_config_files/api_config_openai.json
{
"api_key": [
"sk-xxx"
],
"base_url": "xxx"
}
# ./api_config_files/api_config_azure.json
{
"azure_openai_api_key": [
"xxx"
],
"azure_openai_endpoint": "xxx"
}
# ./api_config_files/api_config_vllm.json
{
"api_key": [
"xxx"
],
"base_url": "xxx"
}
If you have set up the OpenAI API config file, you can run the run.sh file.
# openai client
python main.py \
--image_path ./test_data/one_image_demo.png \
--question "What is the color of the liquid contained in the glass on the table?" \
--api_config ./api_config_files/api_config_openai.json \
--client_type openai \
--prompt_template ./prompt_template/prompt_template_vis.json \
--prompt vistool_with_img_info_v2 \
--exe_code \
--max_tokens 10000 \
--temperature 0.6 \
--output_dir ./test_data \
--save_messages
After running run.sh, the generated message file is stored at ./test_data/test_message.json.
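If you want to inspect the saved trace locally before uploading it, the file can be read like any JSON file. A small sketch, assuming it stores an OpenAI-style list of role/content messages (the exact schema may differ):

import json

# Load the saved multi-turn trace and print a compact role/preview view of each message.
with open("./test_data/test_message.json") as f:
    messages = json.load(f)

for m in messages:
    content = m.get("content", "")
    preview = content if isinstance(content, str) else str(content)
    print(f'{m.get("role", "?"):>10}: {preview[:80]}')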
Upload the message file to our hosted Hugging Face visualization space: visualization demo.
@article{zhao2025pyvision,
title={PyVision: Agentic Vision with Dynamic Tooling},
author={Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Li, Ming and Wu, Qilong and Zhang, Kaipeng and Wei, Chen},
journal={arXiv preprint arXiv:2507.07998},
year={2025},
}