This repository contains a snapshot of the code used for the paper "Generating and Evaluating Declarative Charts Using Large Language Models".
Run the interactive Streamlit prototype locally with:
poetry run python -m streamlit run frontend/app.py
To use the code as a library, see api.py.
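As a rough illustration (the import path and function name below are placeholders, not the repository's actual API; check api.py for the real entry points), library-style usage would look something like this:

```python
# Hypothetical sketch of library-style usage; the real entry points live in api.py
# and may differ in name and signature.
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [10.0, 12.5, 11.8]})

# Placeholder names only -- adjust to whatever api.py actually exposes:
#
#   from edaplot.api import generate_chart            # placeholder import
#   spec = generate_chart("Plot sales per year", df)  # placeholder call
#   print(spec)                                       # e.g. a Vega-Lite JSON spec
```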
Download evaluation datasets:
- NLV Corpus is included in this repository
- chart-llm should be cloned into ./dataset/
Example of running the NLV Corpus benchmark:
poetry run python -m scripts.run_benchmark nlv_corpus --dataset_dir dataset/nlv_corpus --output_path out/benchmarks
Run the interactive results report with:
poetry run python -m streamlit run benchmark/reports/vega_chat_benchmark_report.py out/benchmarks
where out/benchmarks is the path to the directory containing the saved outputs.
Our custom test cases (evals) are defined as YAML files.
Each eval specifies the actions to take and the checks to perform after each action.
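For illustration only (the field names below are placeholders, not the repository's actual eval schema; consult the YAML eval files themselves for the real format), an eval could pair actions with checks roughly like this:

```python
# Illustrative only: the YAML below uses placeholder field names, not the
# repository's actual eval schema. See the real YAML eval files for the format.
import yaml  # PyYAML

HYPOTHETICAL_EVAL = """
actions:
  - prompt: "Show average price per category as a bar chart"
    checks:
      - type: mark_is          # placeholder check name
        value: bar
  - prompt: "Sort the bars in descending order"
    checks:
      - type: spec_contains    # placeholder check name
        value: sort
"""

eval_spec = yaml.safe_load(HYPOTHETICAL_EVAL)
for i, action in enumerate(eval_spec["actions"], start=1):
    print(f"action {i}: {action['prompt']} ({len(action['checks'])} checks)")
```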
Run the evals with:
poetry run python -m scripts.run_benchmark evals --output_path out/evals
Run the interactive results report with:
poetry run python -m streamlit run benchmark/reports/evals_report.py out/evals
where out/evals is the path to the directory containing the saved outputs.
Update existing results with new checks using:
poetry run python -m scripts.run_eval_checks out/evals/
Run the request analyzer benchmark with:
poetry run python -m scripts.run_request_analyzer_benchmark --dataset_dir dataset/chart-llm --take_n 180 --output_path out/request_analyzer_benchmark/ chart_llm_gold
View the results with:
poetry run python -m streamlit run benchmark/reports/request_analyzer_benchmark_report.py out/request_analyzer_benchmark/
The vision judge uses a multimodal LLM to compare the generated image to the reference image. It can be used to compare results from different plotting libraries (e.g., matplotlib and Vega-Lite).
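As a conceptual sketch only (this is not the repository's implementation; it assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment), a vision judge boils down to sending both images to a multimodal chat model and asking for a verdict:

```python
# Conceptual sketch of a vision judge, not the repository's implementation.
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # any multimodal chat model works here
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Do these two charts show the same data in an equivalent way? "
                     "Answer yes or no, then justify briefly."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64('generated.png')}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64('reference.png')}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Because the judgment is image-based, the generated chart can come from any plotting library, which is what makes cross-library comparisons (e.g., matplotlib vs. Vega-Lite) possible.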
To run the vision judge evaluation on existing outputs, use:
poetry run python -m scripts.run_vision_judge example.jsonl
or use the --vision_judge flag together with scripts/run_benchmark.py.
To evaluate the vision judge, we use a separate benchmark.
Run it with:
poetry run python -m scripts.run_vision_judge_benchmark
View the results with:
poetry run python -m streamlit run benchmark/reports/vision_judge_benchmark_report.py out/vision_judge_benchmark/
LIDA's self-evaluation can be run with:
poetry run python -m scripts.run_lida_self_eval example.jsonl
- Install Poetry 2.1.3:
poetry self update 2.1.3
- Install dependencies:
poetry sync --no-root
- Install the pre-commit hooks:
poetry run pre-commit install
- Add your LLM providers' API keys (e.g., OPENAI_API_KEY) to environment variables.
Run tests with:
poetry run pytest tests
For some tests, you need to first download the evaluation datasets described above.
Build the image and run the container:
docker build -f frontend.Dockerfile -t edaplot .
docker run --rm -p 8501:8501 -e OPENAI_API_KEY -t edaplot