
PyVision: Agentic Vision with Dynamic Tooling


🎯Overview

LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
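The multi-turn loop described above — the model generates a Python tool, the runtime executes it, and the model refines based on the result — can be sketched as a minimal driver. This is an illustrative sketch, not the actual PyVision implementation; `model.chat` and the reply fields (`final_answer`, `code`) are hypothetical stand-ins for an MLLM call.

```python
import contextlib
import io


def run_agentic_loop(model, question, image, max_turns=5):
    """Minimal sketch of a generate-execute-refine loop.

    `model.chat` is a hypothetical stand-in for a multimodal LLM call
    that returns either a final answer or a Python code block to run.
    """
    messages = [{"role": "user", "content": question, "image": image}]
    namespace = {"image": image}  # persistent state shared across turns
    for _ in range(max_turns):
        reply = model.chat(messages)
        if reply.get("final_answer") is not None:
            return reply["final_answer"]
        # Execute the model-written tool and capture its stdout.
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(reply["code"], namespace)
            observation = buf.getvalue()
        except Exception as e:
            observation = f"Error: {e}"  # feed errors back so the model can refine
        messages.append({"role": "assistant", "content": reply["code"]})
        messages.append({"role": "user", "content": observation})
    return None
```

Keeping one `namespace` dict alive across turns is what lets a tool defined in an early turn (say, an image-cropping helper) be reused or refined in later turns.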

🚩News

📋Contents

📦Installation

Prepare the running environment for both the main process and the code-execution runtime.

git clone https://github.com/agents-x-project/PyVision.git
cd PyVision

conda create -n pyvision python=3.10
conda activate pyvision
pip install -r requirements.txt

💥Run PyVision

1. Setup API Config

Before running PyVision, you first need to set up the API config file, including the API key and the base_url. We provide three types of clients: OpenAI, Azure, and vLLM.

OpenAI Client

# ./api_config_files/api_config_openai.json
{
    "api_key": [
        "sk-xxx"
    ],
    "base_url": "xxx"
}

Azure Client

# ./api_config_files/api_config_azure.json
{
    "azure_openai_api_key": [
        "xxx"
    ],
    "azure_openai_endpoint": "xxx"
}

vLLM Client

# ./api_config_files/api_config_vllm.json
{
    "api_key": [
        "xxx"
    ],
    "base_url": "xxx"
}
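For illustration, the three config formats above could be loaded as follows. This is a sketch based only on the field names shown in the example files; the actual client construction inside PyVision (e.g. instantiating an OpenAI or Azure client) is omitted.

```python
import json


def load_client_config(path, client_type):
    """Read an API config JSON and extract the key/endpoint fields.

    Field names match the example config files; the OpenAI and vLLM
    clients share one schema, while Azure uses its own field names.
    """
    with open(path) as f:
        cfg = json.load(f)
    if client_type == "azure":
        return {
            "api_key": cfg["azure_openai_api_key"][0],  # first key in the list
            "endpoint": cfg["azure_openai_endpoint"],
        }
    # "openai" and "vllm" share the same schema
    return {"api_key": cfg["api_key"][0], "base_url": cfg["base_url"]}
```

Note that the key fields are lists, which suggests multiple keys can be supplied (e.g. for rotation); the sketch simply takes the first one.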

2. Run

If you have set up the OpenAI API config file, you can run run.sh, which invokes main.py as shown below.

# openai client

python main.py \
    --image_path ./test_data/one_image_demo.png \
    --question "What is the color of the liquid contained in the glass on the table?" \
    --api_config ./api_config_files/api_config_openai.json \
    --client_type openai \
    --prompt_template ./prompt_template/prompt_template_vis.json \
    --prompt vistool_with_img_info_v2 \
    --exe_code \
    --max_tokens 10000 \
    --temperature 0.6 \
    --output_dir ./test_data \
    --save_messages 

3. Visualization

After running run.sh, the generated message trace is stored at ./test_data/test_message.json.
Upload the message file to our hosted visualization HuggingFace Space: visualization demo.
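Before uploading, you can inspect the saved trace locally. The sketch below assumes the message file is an OpenAI-style list of `{"role": ..., "content": ...}` dicts; adjust it if the actual schema differs.

```python
import json


def summarize_messages(path):
    """Print a one-line-per-turn summary of a saved message file.

    Assumes an OpenAI-style list of {"role": ..., "content": ...}
    dicts, which is one common format for such traces.
    """
    with open(path) as f:
        messages = json.load(f)
    for i, msg in enumerate(messages):
        content = msg.get("content", "")
        if not isinstance(content, str):
            content = str(content)  # content may be a list of multimodal parts
        print(f"[{i}] {msg.get('role', '?')}: {content[:60]}")
```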

📜Citation

@article{zhao2025pyvision,
  title={PyVision: Agentic Vision with Dynamic Tooling},
  author={Zhao, Shitian and Zhang, Haoquan and Lin, Shaoheng and Li, Ming and Wu, Qilong and Zhang, Kaipeng and Wei, Chen},
  journal={arXiv preprint arXiv:2507.07998},
  year={2025}
}

About

Official implementation of "PyVision: Agentic Vision with Dynamic Tooling."
