This repository contains the code and models for the paper:
ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay
Fanbin Lu, Zhisheng Zhong, Shu Liu, Chi-Wing Fu, Jiaya Jia
CUHK, SmartMore, HKUST
[Paper] • [Project Page] • [Model on HF]
ARPO (Agentic Replay Policy Optimization) is a novel reinforcement learning framework designed to train vision-language GUI agents to complete long-horizon desktop tasks. It builds upon Group Relative Policy Optimization (GRPO) and introduces:
- Distributed Rollouts: Scalable task execution across parallel OSWorld environments backed by Docker.
- Multi-modal Input Support: Processes long histories (15 steps) of screenshots and actions end-to-end.
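For intuition, here is a minimal, illustrative sketch of one way a GRPO-style group-relative advantage can be combined with a replay buffer of successful rollouts. This is not the training code; all names (`replay_buffer`, `build_group`, the rollout dict layout) are hypothetical and simplified, and the paper describes the actual algorithm.

```python
import random
from collections import defaultdict

# Hypothetical replay buffer: task_id -> previously successful rollouts.
replay_buffer = defaultdict(list)

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize rewards within one task's rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def build_group(task_id, rollouts):
    """Cache successful rollouts; if a group contains only failures,
    swap in a replayed success so the group advantage is not all zeros."""
    for r in rollouts:
        if r["reward"] > 0:
            replay_buffer[task_id].append(r)
    rewards = [r["reward"] for r in rollouts]
    if max(rewards) == 0 and replay_buffer[task_id]:
        rollouts = rollouts[:-1] + [random.choice(replay_buffer[task_id])]
        rewards = [r["reward"] for r in rollouts]
    return rollouts, grpo_advantages(rewards)
```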
Access our model on Hugging Face and view our training logs on Weights & Biases.
| Model | 128 training tasks | OSWorld (overall) |
|---|---|---|
| UI-Tars-1.5 | 68.7% | 23.5% |
| UI-Tars-1.5 + GRPO | 72.9% | 26.0% |
| UI-Tars-1.5 + ARPO (Ours) | 83.9% | 29.9% |
All results are evaluated with a maximum of 15 steps per trajectory.
git clone --recurse-submodules https://github.com/dvlab-research/ARPO.git
cd ARPO
# Create and activate Conda environment
conda create -n arpo python=3.10
conda activate arpo
# Install Python dependencies
pip install -r requirements.txt
Follow the original OSWorld installation guide if you only want to evaluate the model. If you want to train with GRPO/ARPO, you must also install OSWorld as a Python package:
cd OSWorld
pip install -e .
cd ..
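As a quick sanity check, you can confirm that OSWorld is importable and that a Docker-backed environment can be constructed. The snippet below follows the OSWorld quickstart, but constructor arguments may differ between OSWorld versions, so treat it as a sketch rather than a fixed API.

```python
# Sanity check: OSWorld imports and a Docker-backed environment starts.
from desktop_env.desktop_env import DesktopEnv

env = DesktopEnv(provider_name="docker", action_space="pyautogui")
obs = env.reset()  # some versions expect task_config=<task dict> here
obs, reward, done, info = env.step("pyautogui.rightClick()")
print(reward, done)
env.close()
```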
💡 We strongly recommend running a full evaluation with Docker before training, so that the Docker image, Ubuntu VM data, and cache_dir required for training are already prepared.
To evaluate ARPO on the OSWorld benchmark with the released model using Docker-based virtual environments, follow these steps:
Ensure you have correctly installed OSWorld by following its Docker setup instructions. Once OSWorld is set up, start the server in the background:
nohup bash start_server.sh &
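Optionally, you can verify that the server launched by start_server.sh is reachable before starting the evaluation. The port and endpoint below are assumptions; check start_server.sh for the actual values used in your setup.

```python
# Assumes start_server.sh exposes an OpenAI-compatible endpoint on port 8000.
import requests

resp = requests.get("http://127.0.0.1:8000/v1/models", timeout=5)
print(resp.status_code, resp.text[:200])
```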
Navigate into the OSWorld directory and execute the evaluation script:
cd OSWorld
python run_multienv_uitars.py \
--headless \
--observation_type screenshot \
--max_steps 15 \
--max_trajectory_length 15 \
--temperature 0.6 \
--model ui-tars \
--action_space pyautogui \
--num_envs 8 \
--result_dir ./results/ \
--test_all_meta_path ./evaluation_examples/test_all.json \
--trial-id 0 \
--server_ip http://127.0.0.1
- --headless: Enables headless mode (no GUI rendering).
- --observation_type screenshot: Use visual observations for the agent.
- --max_steps / --max_trajectory_length: Limit per-task interaction steps.
- --temperature: Sampling temperature for model output.
- --model: Name of the model.
- --num_envs: Number of parallel environments (VMs).
- --result_dir: Directory to store evaluation results.
- --test_all_meta_path: JSON file with evaluation task metadata.
- --trial-id: ID for the evaluation trial.
- --server_ip: IP of the evaluation server (usually localhost).
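If you want a quicker run first, you can derive a smaller metadata file for --test_all_meta_path from test_all.json. In OSWorld this file typically maps each application domain to a list of task IDs; adjust the sketch below if your copy differs (the output filename test_subset.json is arbitrary).

```python
import json

# Derive a smaller evaluation set from OSWorld's test_all.json, which maps
# each application domain (e.g. "chrome", "gimp") to a list of task IDs.
with open("evaluation_examples/test_all.json") as f:
    meta = json.load(f)

subset = {domain: task_ids[:2] for domain, task_ids in meta.items()}

with open("evaluation_examples/test_subset.json", "w") as f:
    json.dump(subset, f, indent=2)

# Then pass: --test_all_meta_path ./evaluation_examples/test_subset.json
```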
After evaluation, you will find vmware_vm_data/, docker_vm_data/, and cache/ folders under the OSWorld directory.
# Link evaluation examples and cache
ln -s $(pwd)/OSWorld/evaluation_examples ./
mkdir cache_dirs/
ln -s $(pwd)/OSWorld/cache ./cache_dirs/cache_0
ln -s $(pwd)/OSWorld/vmware_vm_data ./
ln -s $(pwd)/OSWorld/docker_vm_data ./
To run Docker without sudo:
sudo usermod -aG docker $USER
newgrp docker
If you only have one node, we suggest training on a subset of OSWorld tasks with at most 16 Docker environments.
RAY_PORT=2468
RAY_HEAD_IP=<Your IP>
ray start --head --port=$RAY_PORT --resources='{"docker:'$RAY_HEAD_IP'": 128}'
bash ./examples/osworld_subset32.sh
On the Ray head node:
RAY_PORT=2468
RAY_HEAD_IP=<Your IP>
ray start --head --port=$RAY_PORT --resources='{"docker:'$RAY_HEAD_IP'": 128}'
On Ray worker nodes (with GPUs):
CURRENT_IP=<This node's IP>
ray start --address=$RAY_HEAD_IP:$RAY_PORT --num-gpus=8 --resources='{"docker:'$CURRENT_IP'": 128}'
Or (CPU only):
ray start --address=$RAY_HEAD_IP:$RAY_PORT --resources='{"docker:'$CURRENT_IP'": 128}'
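Before launching training, you can optionally confirm that the custom docker:<IP> resources are registered with the Ray cluster. This check is not required by the training scripts; it simply lists the cluster resources.

```python
import ray

# Connect to the running Ray cluster and list its resources. You should see
# the custom "docker:<node-ip>" entries (capacity 128) registered above,
# alongside the usual CPU/GPU resources.
ray.init(address="auto")
print(ray.cluster_resources())
ray.shutdown()
```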
Then run:
bash ./examples/osworld_full_arpo.sh
- OSWorld: Realistic GUI environments for multimodal agents, which we modified for GRPO training.
- EasyR1: An efficient, scalable, multi-modality RL training framework based on veRL, supporting advanced VLMs and algorithms such as GRPO.
If you find ARPO useful, please consider citing our work.