This repository contains a snapshot of the code used for the paper "Generating and Evaluating Declarative Charts Using Large Language Models".
Run the interactive Streamlit prototype locally with:
poetry run python -m streamlit run frontend/app.py
To use the code as a library, see api.py.
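As a rough illustration (the import path and function name below are placeholders, not the repository's actual API; check api.py for the real entry points), library-style usage would look something like this:

```python
# Hypothetical sketch of library-style usage; the real entry points live in api.py
# and may differ in name and signature.
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [10.0, 12.5, 11.8]})

# Placeholder names only -- adjust to whatever api.py actually exposes:
#
#   from edaplot.api import generate_chart            # placeholder import
#   spec = generate_chart("Plot sales per year", df)  # placeholder call
#   print(spec)                                       # e.g. a Vega-Lite JSON spec
```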
Download evaluation datasets:
- NLV Corpus is included in this repository
- chart-llm should be cloned into ./dataset/
Example of running the NLV Corpus benchmark:
poetry run python -m scripts.run_benchmark nlv_corpus --dataset_dir dataset/nlv_corpus --output_path out/benchmarks
Run the interactive results report with:
poetry run python -m streamlit run benchmark/reports/vega_chat_benchmark_report.py out/benchmarks
where out/benchmarks is the path to the directory containing the saved outputs.
Our custom test cases (evals) are defined as YAML files.
Each eval specifies the actions to take and the checks to perform after each action.
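For illustration only (the field names below are placeholders, not the repository's actual eval schema; consult the YAML eval files themselves for the real format), an eval could pair actions with checks roughly like this:

```python
# Illustrative only: the YAML below uses placeholder field names, not the
# repository's actual eval schema. See the real YAML eval files for the format.
import yaml  # PyYAML

HYPOTHETICAL_EVAL = """
actions:
  - prompt: "Show average price per category as a bar chart"
    checks:
      - type: mark_is          # placeholder check name
        value: bar
  - prompt: "Sort the bars in descending order"
    checks:
      - type: spec_contains    # placeholder check name
        value: sort
"""

eval_spec = yaml.safe_load(HYPOTHETICAL_EVAL)
for i, action in enumerate(eval_spec["actions"], start=1):
    print(f"action {i}: {action['prompt']} ({len(action['checks'])} checks)")
```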
Run the evals with:
poetry run python -m scripts.run_benchmark evals --output_path out/evals
Run the interactive results report with:
poetry run python -m streamlit run benchmark/reports/evals_report.py out/evals
where out/evals is the path to the directory containing the saved outputs.
Update existing results with new checks using:
poetry run python -m scripts.run_eval_checks out/evals/
Run the request analyzer benchmark with:
poetry run python -m scripts.run_request_analyzer_benchmark --dataset_dir dataset/chart-llm --take_n 180 --output_path out/request_analyzer_benchmark/ chart_llm_gold
View the results with:
poetry run python -m streamlit run benchmark/reports/request_analyzer_benchmark_report.py out/request_analyzer_benchmark/
The vision judge uses a multimodal LLM to compare the generated image to the reference image. It can be used to compare results from different plotting libraries (e.g., matplotlib and Vega-Lite).
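As a conceptual sketch only (this is not the repository's implementation; it assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment), a vision judge boils down to sending both images to a multimodal chat model and asking for a verdict:

```python
# Conceptual sketch of a vision judge, not the repository's implementation.
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # any multimodal chat model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Do these two charts show the same data in an equivalent way? "
                     "Answer yes or no, then justify briefly."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64('generated.png')}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64('reference.png')}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Because the judgment is image-based, the generated chart can come from any plotting library, which is what makes cross-library comparisons (e.g., matplotlib vs. Vega-Lite) possible.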
To run the vision judge evaluation on existing outputs, use:
poetry run python -m scripts.run_vision_judge example.jsonl
or use the --vision_judge flag together with scripts/run_benchmark.py.
To evaluate the vision judge, we use a separate benchmark.
Run it with:
poetry run python -m scripts.run_vision_judge_benchmark
View the results with:
poetry run python -m streamlit run benchmark/reports/vision_judge_benchmark_report.py out/vision_judge_benchmark/
LIDA's self-evaluation can be run with:
poetry run python -m scripts.run_lida_self_eval example.jsonl
- Install Poetry 2.1.3:
poetry self update 2.1.3
- Install dependencies:
poetry sync --no-root
- Install the pre-commit hooks:
poetry run pre-commit install
- Add your LLM providers' API keys (e.g., OPENAI_API_KEY) to environment variables.
Run tests with:
poetry run pytest tests
For some tests, you need to first download the evaluation datasets described above.
Build the image and run the container:
docker build -f frontend.Dockerfile -t edaplot .
docker run --rm -p 8501:8501 -e OPENAI_API_KEY -t edaplot