This repository provides the evaluation environment and scripts for UI-Redline-bench, a benchmark for Web UI code modification based on visual instructions.
🤗 Hugging Face Dataset: https://huggingface.co/datasets/future-architect/UI-Redline-bench (The visual instruction images and metadata are hosted on Hugging Face.)
```
.
├── data/                              # Contains HTML/CSS code for base and reference sites
│   ├── news/
│   │   ├── bootstrap/
│   │   │   ├── src/                   # Original Website (Base)
│   │   │   │   ├── index.html
│   │   │   │   ├── styles.css
│   │   │   │   └── images/            # Image assets
│   │   │   ├── ref_01/                # Reference Website (Ground Truth)
│   │   │   │   ├── index.html
│   │   │   │   └── styles.css
│   │   │   └── ...
│   │   └── ...
│   └── ...
├── script/
│   ├── llm_eval.py                            # LLM-based automatic evaluation script
│   ├── llm_utils.py                           # Common utilities for LLM clients and image processing
│   ├── prediction_based_on_image_claude.py    # Inference script for Claude (Bedrock)
│   ├── prediction_based_on_image_gemini.py    # Inference script for Gemini
│   ├── prediction_based_on_image_gpt5.py      # Inference script for GPT (Azure/OpenAI)
│   ├── prediction_based_on_image_qwen.py      # Inference script for Qwen (vLLM)
│   ├── launch_vllm_server.sh                  # Launch script for vLLM server (Qwen)
│   └── setup_images.py                        # Helper script to distribute image assets
├── cpu-env/                           # Environment for API-based models & evaluation
│   ├── pyproject.toml
│   └── uv.lock
└── gpu-env/                           # Environment for local models (vLLM/Qwen)
    ├── pyproject.toml
    └── uv.lock
```
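Each `src` directory pairs with one or more sibling `ref_XX` directories holding the ground-truth variants. A minimal sketch (directory-name pattern assumed from the tree above) for enumerating those base/reference pairs:

```python
from pathlib import Path

def list_eval_pairs(data_root="data"):
    """Yield (src_dir, ref_dir) pairs following the data/<site>/<framework>/ layout."""
    pairs = []
    # Each framework directory contains one src/ and one or more ref_* siblings.
    for src in sorted(Path(data_root).glob("*/*/src")):
        framework_dir = src.parent
        for ref in sorted(framework_dir.glob("ref_*")):
            pairs.append((src, ref))
    return pairs
```

For example, `data/news/bootstrap/` with `src/` and `ref_01/` yields the single pair `(data/news/bootstrap/src, data/news/bootstrap/ref_01)`.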
This project uses uv for dependency management. Environments are split into cpu-env (for API-based models and evaluation) and gpu-env (for local models requiring CUDA).
Run the sync command for the environment you need.
For API Models & Evaluation (CPU): This environment is used for the GPT, Claude, and Gemini scripts, and for the evaluation script.

```bash
uv sync --project cpu-env
```

For Local Models (GPU): This environment is used for running Qwen (vLLM). Requires NVIDIA drivers.

```bash
uv sync --project gpu-env
```

By default, image assets are stored only in the src directories to avoid duplication. To make the ref (Reference) HTML files render correctly in a browser or during evaluation, run the following script using the cpu-env:

```bash
uv run --project cpu-env script/setup_images.py
```

Now you can open any index.html (e.g., data/news/bootstrap/ref_01/index.html) in your browser to inspect the UI.
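The core of this asset-distribution step is presumably copying each `src/images` directory into the sibling `ref_*` directories so that the references' relative image paths resolve. A hypothetical sketch of that operation (the actual `setup_images.py` may differ):

```python
import shutil
from pathlib import Path

def distribute_images(data_root="data"):
    """Copy each src/images directory into the sibling ref_* directories."""
    for images in Path(data_root).glob("*/*/src/images"):
        framework_dir = images.parent.parent  # .../<site>/<framework>/
        for ref in framework_dir.glob("ref_*"):
            # dirs_exist_ok makes the copy safe to re-run.
            shutil.copytree(images, ref / "images", dirs_exist_ok=True)
```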
We provide scripts to generate modified HTML/CSS code based on visual instructions using various VLMs.
Set the environment variables corresponding to the model you wish to use.
For GPT (Azure OpenAI / OpenAI):

```bash
# Azure OpenAI (Recommended)
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
export OPENAI_API_KEY="your-api-key"
export OPENAI_API_VERSION="2024-10-21"

# Or Standard OpenAI
export OPENAI_API_KEY="your-api-key"
```

For Claude (AWS Bedrock):
```bash
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_DEFAULT_REGION="ap-northeast-1"  # A region that supports the model
```

For Gemini (Google GenAI):

```bash
export GEMINI_API_KEY="your-api-key"
```

For Qwen (Local vLLM):
Running the local model requires a GPU environment (e.g., 4 GPUs for tensor-parallel inference with the 32B model).
Start the vLLM server using the gpu-env before running inference:
```bash
uv run --project gpu-env bash script/launch_vllm_server.sh
```

This starts an OpenAI-compatible server at http://localhost:8000 with the API key `local`.
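Because the server speaks the OpenAI chat-completions protocol, any OpenAI-style client can query it; the instruction image is typically embedded as a base64 data URL in the message content. A minimal sketch of building such a request body (the model name and prompt are placeholders, not taken from the repo's scripts):

```python
import base64

def build_chat_payload(image_path, prompt, model="local-model"):
    """Build an OpenAI-style chat completion body with an inline base64 image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,  # placeholder; use the name the vLLM server was launched with
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

POST this body to http://localhost:8000/v1/chat/completions with the header `Authorization: Bearer local`.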
Use uv run --project <env> to execute scripts in the correct environment.
Example usage (GPT-5) [CPU Env]:

```bash
uv run --project cpu-env script/prediction_based_on_image_gpt5.py \
    --html_path "data/news/bootstrap/src/index.html" \
    --css_path "data/news/bootstrap/src/styles.css" \
    --image "path/to/instruction_image.png" \
    --output "output/news/bootstrap/ref_01"
```

Example usage (Qwen via vLLM) [GPU Env]:
```bash
uv run --project gpu-env script/prediction_based_on_image_qwen.py \
    --html_path "data/news/bootstrap/src/index.html" \
    --css_path "data/news/bootstrap/src/styles.css" \
    --image "path/to/instruction_image.png" \
    --output "output/news/bootstrap/ref_01"
```

Available Scripts:

- script/prediction_based_on_image_gpt5.py (use cpu-env)
- script/prediction_based_on_image_claude.py (use cpu-env)
- script/prediction_based_on_image_gemini.py (use cpu-env)
- script/prediction_based_on_image_qwen.py (use gpu-env)
Note: Replace arguments with your actual paths. The instruction images can be retrieved from the Hugging Face dataset.
We provide an automatic LLM-based evaluation script, as described in the paper.
The evaluation script uses GPT-5. Set your OpenAI/Azure API keys as described in the Inference section.
Use script/llm_eval.py with cpu-env to evaluate predicted code against the ground truth.
```bash
uv run --project cpu-env script/llm_eval.py \
    --org_html "data/news/bootstrap/src/index.html" \
    --org_css "data/news/bootstrap/src/styles.css" \
    --ref_html "data/news/bootstrap/ref_01/index.html" \
    --ref_css "data/news/bootstrap/ref_01/styles.css" \
    --pred_html "output/news/bootstrap/ref_01/index.html" \
    --pred_css "output/news/bootstrap/ref_01/styles.css" \
    --image "path/to/instruction_image.png" \
    --output "evaluation_result.json"
```

```bibtex
@inproceedings{hiai2026uiredline,
  title={UI-Redline-bench: A Benchmark for Web UI Code Modification via Redline Instructions},
  author={Satoshi Hiai and Ryo Fujii and Yosuke Kishinami and Makoto Morishita},
  booktitle={Proceedings of the 32nd Annual Meeting of the Association for Natural Language Processing (NLP2026)},
  year={2026}
}
```