
EDAplot (VegaChat)

This repository contains a snapshot of the code used for the paper "Generating and Evaluating Declarative Charts Using Large Language Models".

Usage

Run the interactive Streamlit prototype locally with:

poetry run python -m streamlit run frontend/app.py

To use the code as a library, look into api.py.
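As a rough illustration only, here is a minimal sketch of what library-style usage could look like; the function name generate_chart, its signature, and the sample data are placeholders, and the actual entry points are the ones defined in api.py:

# Hypothetical sketch -- the real entry points live in api.py and may differ.
import pandas as pd

import api

df = pd.DataFrame({"year": [2021, 2022, 2023], "sales": [10, 15, 12]})

# Placeholder call: request a chart in natural language and receive a
# Vega-Lite spec back (assumed behavior, not a documented signature).
spec = api.generate_chart(df, "line chart of sales over year")
print(spec)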

Evaluation

Setup

Download the evaluation datasets; the benchmark commands below expect them under dataset/ (e.g., dataset/nlv_corpus and dataset/chart-llm).

Benchmarks

For example, to run the NLV Corpus benchmark:

poetry run python -m scripts.run_benchmark nlv_corpus --dataset_dir dataset/nlv_corpus --output_path out/benchmarks

Run the interactive results report with:

poetry run python -m streamlit run benchmark/reports/vega_chat_benchmark_report.py out/benchmarks

where out/benchmarks is the directory containing the saved benchmark outputs.

Evals

Our custom test cases (evals) are defined as YAML files. Each eval specifies the actions to take and the checks to perform after each action.
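As an illustration, here is a minimal sketch of how such a file could be loaded and inspected; the path and the field names actions and checks mirror the description above, but the exact schema used by the repository's eval files may differ:

# Hypothetical sketch: load an eval definition and list its actions and
# the checks attached to each action. Field names are assumptions.
import yaml

with open("evals/example_eval.yaml") as f:  # placeholder path
    eval_spec = yaml.safe_load(f)

for i, action in enumerate(eval_spec["actions"], start=1):
    print(f"Action {i}: {action}")
    for check in action.get("checks", []):
        print(f"  check: {check}")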

Run the evals with:

poetry run python -m scripts.run_benchmark evals --output_path out/evals

Run the interactive results report with:

poetry run python -m streamlit run benchmark/reports/evals_report.py out/evals

where out/evals is the directory containing the saved eval outputs.

Update existing results with new checks using:

poetry run python -m scripts.run_eval_checks out/evals/

Request Analyzer

Run the request analyzer benchmark with:

poetry run python -m scripts.run_request_analyzer_benchmark --dataset_dir dataset/chart-llm --take_n 180 --output_path out/request_analyzer_benchmark/ chart_llm_gold

View the results with:

poetry run python -m streamlit run benchmark/reports/request_analyzer_benchmark_report.py out/request_analyzer_benchmark/

LLM as a judge

Vision Judge

The vision judge uses a multimodal LLM to compare the generated image to the reference image. It can be used to compare results from different plotting libraries (e.g., matplotlib and Vega-Lite).
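As a rough sketch of the idea (not the repository's implementation), both images can be sent to a multimodal model together with a comparison prompt; the model name, prompt wording, and file names below are placeholders:

# Sketch of the vision-judge idea: send the generated and reference chart
# images to a multimodal LLM and ask it to rate how well they match.
import base64

from openai import OpenAI  # requires OPENAI_API_KEY in the environment


def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()


client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare the two charts and rate their similarity from 0 to 1."},
                {"type": "image_url", "image_url": {"url": to_data_url("generated.png")}},
                {"type": "image_url", "image_url": {"url": to_data_url("reference.png")}},
            ],
        }
    ],
)
print(response.choices[0].message.content)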

To run the vision judge evaluation on existing outputs, use:

poetry run python -m scripts.run_vision_judge example.jsonl

or use the --vision_judge flag together with scripts/run_benchmark.py.
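For example (a sketch reusing the NLV Corpus invocation shown above; the flag placement is assumed):

poetry run python -m scripts.run_benchmark nlv_corpus --dataset_dir dataset/nlv_corpus --output_path out/benchmarks --vision_judge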

Vision Judge Benchmark

To evaluate the vision judge, we use a separate benchmark.

Run it with:

poetry run python -m scripts.run_vision_judge_benchmark

View the results with:

poetry run python -m streamlit run benchmark/reports/vision_judge_benchmark_report.py out/vision_judge_benchmark/

LIDA Self-Evaluation

LIDA's self-evaluation can be run with:

poetry run python -m scripts.run_lida_self_eval example.jsonl

Configuring dev environment

  1. Install Poetry: poetry self update 2.1.3
  2. Install dependencies: poetry sync --no-root
  3. Run poetry run pre-commit install
  4. Add LLM providers' API keys to environment variables (see the example below)
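For example (OPENAI_API_KEY is the variable used by the Docker command below; variable names expected for other providers are not listed here):

export OPENAI_API_KEY=<your-api-key>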

Run tests with:

poetry run pytest tests

Some tests require the evaluation datasets (see the Evaluation Setup section above) to be downloaded first.

Docker

Build the image and run the container:

docker build -f frontend.Dockerfile -t edaplot .
docker run --rm -p 8501:8501 -e OPENAI_API_KEY -t edaplot
