Support for VisualWebArena evaluation in OpenHands #4773

Merged · 38 commits · Jan 23, 2025

Commits
a05b707  Modified run_controller() function to allow passing image urls instea… (adityasoni9998, Sep 28, 2024)
a927bd7  Merge branch 'main' into aditya_miis (adityasoni9998, Sep 28, 2024)
1b178dd  Merge branch 'main' into aditya_miis (adityasoni9998, Sep 28, 2024)
9df8bb2  Merge branch 'main' into aditya_miis (adityasoni9998, Sep 28, 2024)
fdcf230  Merge branch 'main' into aditya_miis (xingyaoww, Sep 28, 2024)
965cee7  added gitignore (adityasoni9998, Sep 28, 2024)
f9202a4  Merge branch 'All-Hands-AI:main' into aditya_miis (adityasoni9998, Sep 29, 2024)
a3c8bcc  Merge remote-tracking branch 'upstream/main' (adityasoni9998, Oct 5, 2024)
63eca93  Modified run_controller to have a single Action object as the initial… (adityasoni9998, Oct 6, 2024)
13c4c2c  Merge branch 'main' into aditya_miis (adityasoni9998, Oct 6, 2024)
bdfae11  Added support for VWA Evaluation (work still in progress) (adityasoni9998, Oct 7, 2024)
c2505c0  Merge remote-tracking branch 'upstream/main' (adityasoni9998, Oct 7, 2024)
6aba462  Merge remote-tracking branch 'upstream/main' (adityasoni9998, Oct 12, 2024)
26fbd7d  Merge branch 'main' into aditya_vwa_eval (adityasoni9998, Oct 12, 2024)
a900380  VWA benchmark evaluation. Debugging for OSError: handle is closed. (adityasoni9998, Oct 18, 2024)
f9aa2ff  VWA Agent implementation done. TODOs: Cache prompt, decide prompt for… (adityasoni9998, Nov 5, 2024)
1fef52a  Merge remote-tracking branch 'upstream/main' (adityasoni9998, Nov 6, 2024)
ea8c14b  Minor fixes, added runtime.close() to clean up docker containers. (adityasoni9998, Nov 8, 2024)
eb681b1  Incorporate action and thought history in Browsing Agent (adityasoni9998, Nov 23, 2024)
e260dd9  minor fixes for OneStopMarket evaluation (adityasoni9998, Dec 6, 2024)
bad7ccf  Merge remote-tracking branch 'upstream/main' (adityasoni9998, Dec 6, 2024)
92aaa1c  Merge branch 'main' into aditya_vwa_eval (adityasoni9998, Dec 6, 2024)
5258fe8  Merge remote-tracking branch 'upstream/main' (adityasoni9998, Dec 6, 2024)
16594f6  Merge branch 'main' into aditya_vwa_eval (adityasoni9998, Dec 6, 2024)
c67729b  minor fixes and cleanup (adityasoni9998, Dec 6, 2024)
8a7697d  minor merging errors need to be fixed (adityasoni9998, Dec 6, 2024)
dcdb448  Merge remote-tracking branch 'upstream/main' (adityasoni9998, Dec 6, 2024)
7a77e85  Merge branch 'main' into aditya_vwa_eval (adityasoni9998, Dec 6, 2024)
3a51d4f  Completed setup of VisualBrowsingAgent. (adityasoni9998, Dec 8, 2024)
187e248  cleanup (adityasoni9998, Dec 8, 2024)
072d956  Merge remote-tracking branch 'upstream/main' (adityasoni9998, Jan 20, 2025)
ef1f188  Merge remote-tracking branch 'origin/main' into aditya_vwa_eval (adityasoni9998, Jan 20, 2025)
51b40d0  Added README for VisualWebArena benchmark. Minor changes to resolve P… (adityasoni9998, Jan 20, 2025)
b492a5a  Update Poetry lock file (adityasoni9998, Jan 20, 2025)
f4d2431  Check integration tests for VisualBrowsingAgent (adityasoni9998, Jan 20, 2025)
dbc17e8  Added integration test workflow for VisualBrowsingAgent using DeepSee… (adityasoni9998, Jan 20, 2025)
37b9420  Fix merge conflict in poetry.lock (openhands-agent, Jan 23, 2025)
5370b89  Update poetry lock (neubig, Jan 23, 2025)

36 changes: 34 additions & 2 deletions .github/workflows/integration-runner.yml
@@ -160,7 +160,6 @@ jobs:
echo "api_key = \"$LLM_API_KEY\"" >> config.toml
echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
echo "temperature = 0.0" >> config.toml

- name: Run integration test evaluation for DelegatorAgent (DeepSeek)
env:
SANDBOX_FORCE_REBUILD_RUNTIME: True
@@ -174,12 +173,42 @@ jobs:
          cat $REPORT_FILE_DELEGATOR_DEEPSEEK >> $GITHUB_ENV
          echo >> $GITHUB_ENV
          echo "EOF" >> $GITHUB_ENV
      # -------------------------------------------------------------
      # Run VisualBrowsingAgent tests for DeepSeek, limited to t05 and t06
      - name: Wait a little bit (again)
        run: sleep 5

      - name: Configure config.toml for testing VisualBrowsingAgent (DeepSeek)
        env:
          LLM_MODEL: "litellm_proxy/deepseek-chat"
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
          LLM_BASE_URL: ${{ secrets.LLM_BASE_URL }}
          MAX_ITERATIONS: 15
        run: |
          echo "[llm.eval]" > config.toml
          echo "model = \"$LLM_MODEL\"" >> config.toml
          echo "api_key = \"$LLM_API_KEY\"" >> config.toml
          echo "base_url = \"$LLM_BASE_URL\"" >> config.toml
          echo "temperature = 0.0" >> config.toml
      - name: Run integration test evaluation for VisualBrowsingAgent (DeepSeek)
        env:
          SANDBOX_FORCE_REBUILD_RUNTIME: True
        run: |
          poetry run ./evaluation/integration_tests/scripts/run_infer.sh llm.eval HEAD VisualBrowsingAgent '' 15 $N_PROCESSES "t05_simple_browsing,t06_github_pr_browsing.py" 'visualbrowsing_deepseek_run'

          # Find and export the visual browsing agent test results
          REPORT_FILE_VISUALBROWSING_DEEPSEEK=$(find evaluation/evaluation_outputs/outputs/integration_tests/VisualBrowsingAgent/deepseek*_maxiter_15_N* -name "report.md" -type f | head -n 1)
          echo "REPORT_FILE_VISUALBROWSING_DEEPSEEK: $REPORT_FILE_VISUALBROWSING_DEEPSEEK"
          echo "INTEGRATION_TEST_REPORT_VISUALBROWSING_DEEPSEEK<<EOF" >> $GITHUB_ENV
          cat $REPORT_FILE_VISUALBROWSING_DEEPSEEK >> $GITHUB_ENV
          echo >> $GITHUB_ENV
          echo "EOF" >> $GITHUB_ENV

      - name: Create archive of evaluation outputs
        run: |
          TIMESTAMP=$(date +'%y-%m-%d-%H-%M')
          cd evaluation/evaluation_outputs/outputs # Change to the outputs directory
          tar -czvf ../../../integration_tests_${TIMESTAMP}.tar.gz integration_tests/CodeActAgent/* integration_tests/DelegatorAgent/* # Only include the actual result directories
          tar -czvf ../../../integration_tests_${TIMESTAMP}.tar.gz integration_tests/CodeActAgent/* integration_tests/DelegatorAgent/* integration_tests/VisualBrowsingAgent/* # Only include the actual result directories

      - name: Upload evaluation results as artifact
        uses: actions/upload-artifact@v4
@@ -227,4 +256,7 @@ jobs:
**Integration Tests Report Delegator (DeepSeek)**
${{ env.INTEGRATION_TEST_REPORT_DELEGATOR_DEEPSEEK }}
---
**Integration Tests Report VisualBrowsing (DeepSeek)**
${{ env.INTEGRATION_TEST_REPORT_VISUALBROWSING_DEEPSEEK }}
---
Download testing outputs (includes both Haiku and DeepSeek results): [Download](${{ steps.upload_results_artifact.outputs.artifact-url }})
50 changes: 50 additions & 0 deletions evaluation/benchmarks/visualwebarena/README.md
@@ -0,0 +1,50 @@
# VisualWebArena Evaluation with OpenHands Browsing Agents

This folder contains the evaluation harness for the [VisualWebArena](https://github.com/web-arena-x/visualwebarena) benchmark, powered by [BrowserGym](https://github.com/ServiceNow/BrowserGym), which makes it easy to evaluate how well a browsing-capable agent performs on realistic web browsing tasks.

## Setup Environment and LLM Configuration

Please follow the instructions [here](../../README.md#setup) to set up your local development environment and LLM.

## Setup VisualWebArena Environment

VisualWebArena requires you to set up websites with pre-populated content that are accessible via URL from the machine running the OpenHands agents.
Follow [this document](https://github.com/web-arena-x/visualwebarena/blob/main/environment_docker/README.md) to set up your own VisualWebArena environment on local servers or AWS EC2 instances.
Take note of the base URL (`$VISUALWEBARENA_BASE_URL`) of the machine where the environment is installed.

## Test if your environment works

Open the above VisualWebArena website URLs in a browser and check that they load correctly.
If you cannot access the websites, make sure your firewall allows public access to the relevant ports on the server; check the network security policy if you are using an AWS machine.
Follow the VisualWebArena environment setup guide carefully, and make sure the URL fields are populated with the correct base URL of your server.
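A quick reachability check from the machine that will run the agents might look like the following sketch. The port numbers are assumptions taken from the defaults in the VisualWebArena Docker setup guide (classifieds, shopping, reddit, wikipedia, homepage); adjust them to your deployment.

```bash
# Sketch: probe each VisualWebArena site port and print the HTTP status code.
# Ports are assumed defaults from the VisualWebArena Docker guide; change if needed.
for port in 9980 7770 9999 8888 4399; do
  curl -s -o /dev/null -w "port ${port}: HTTP %{http_code}\n" "${VISUALWEBARENA_BASE_URL}:${port}/"
done
```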

## Run Evaluation

```bash
export VISUALWEBARENA_BASE_URL=<YOUR_SERVER_URL_HERE>
export OPENAI_API_KEY="yourkey" # this OpenAI API key is required for some VisualWebArena validators that use LLMs
export OPENAI_BASE_URL="https://api.openai.com/v1/" # base URL for OpenAI model used for VisualWebArena evaluation
bash evaluation/benchmarks/visualwebarena/scripts/run_infer.sh llm.claude HEAD VisualBrowsingAgent
```

Results will be in `evaluation/evaluation_outputs/outputs/visualwebarena/`.

To calculate the success rate, run:

```sh
poetry run python evaluation/benchmarks/visualwebarena/get_success_rate.py evaluation/evaluation_outputs/outputs/visualwebarena/SOME_AGENT/EXP_NAME/output.jsonl
```
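The script prints a short summary to stdout. With placeholder numbers (not real results), the output looks roughly like:

```
Total number of tasks:  910
Total reward:  200.0
Success Rate:  0.21978021978021978
Avg Cost:  0.42
Total Cost:  378.0
Actual number of tasks finished:  900
```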

## Submit your evaluation results

You can create your own fork of [our Hugging Face evaluation outputs](https://huggingface.co/spaces/OpenHands/evaluation) and submit a PR with your evaluation results following the guide [here](https://huggingface.co/docs/hub/en/repositories-pull-requests-discussions#pull-requests-and-discussions).

## VisualBrowsingAgent V1.0 result

Tested on VisualBrowsingAgent V1.0

VisualWebArena, 910 tasks (high cost; a single run due to the fixed task set), with a maximum of 15 steps per task. Resolve rates are:

- GPT-4o: 26.15%
- Claude-3.5 Sonnet: 25.27%
Empty file.
40 changes: 40 additions & 0 deletions evaluation/benchmarks/visualwebarena/get_success_rate.py
@@ -0,0 +1,40 @@
import argparse
import json

import browsergym.visualwebarena  # noqa F401 register visualwebarena tasks as gym environments
import gymnasium as gym

parser = argparse.ArgumentParser(description='Calculate average reward.')
parser.add_argument('output_path', type=str, help='path to output.jsonl')

args = parser.parse_args()

if __name__ == '__main__':
    # All registered VisualWebArena tasks define the denominator for the success rate.
    env_ids = [
        id
        for id in gym.envs.registry.keys()
        if id.startswith('browsergym/visualwebarena')
    ]
    total_num = len(env_ids)
    print('Total number of tasks: ', total_num)
    total_reward = 0
    total_cost = 0
    actual_num = 0
    with open(args.output_path, 'r') as f:
        for line in f:
            data = json.loads(line)
            actual_num += 1
            total_cost += data['metrics']['accumulated_cost']
            reward = data['test_result']['reward']
            if reward >= 0:
                total_reward += data['test_result']['reward']
            else:
                # A negative reward marks a task that did not finish; exclude it
                # from the count of finished tasks.
                actual_num -= 1
    avg_reward = total_reward / total_num
    print('Total reward: ', total_reward)
    print('Success Rate: ', avg_reward)

    avg_cost = total_cost / actual_num
    print('Avg Cost: ', avg_cost)
    print('Total Cost: ', total_cost)
    print('Actual number of tasks finished: ', actual_num)