This repository contains the implementation of the arXiv preprint: Protecting Multimodal LLMs against misleading visualizations. The code is released under an Apache 2.0 license.
Contact person: Jonathan Tonglet
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
- We released a follow-up work, "Is this chart lying to me? Automating the detection of misleading visualizations". Check it out 🔥
Visualizations play a pivotal role in daily communication in an increasingly data-driven world. Research on multimodal large language models (MLLMs) for automated chart understanding has accelerated massively, with steady improvements on standard benchmarks. However, for MLLMs to be reliable, they must be robust to misleading visualizations, charts that distort the underlying data, leading readers to draw inaccurate conclusions that may support disinformation. Here, we uncover an important vulnerability: MLLM question-answering accuracy on misleading visualizations drops on average to the level of a random baseline. To address this, we introduce the first inference-time methods to improve performance on misleading visualizations, without compromising accuracy on non-misleading ones. The most effective method extracts the underlying data table and uses a text-only LLM to answer the question based on the table. Our findings expose a critical blind spot in current research and establish benchmark results to guide future efforts in reliable MLLMs.
- Misleading visualizations are charts that distort the underlying data table, leading readers to inaccurate interpretations that may support disinformation 📊
- Distortions include truncated and inverted axes, 3D effects, or inconsistent tick intervals
- Misleading visualizations negatively affect the QA performance of human readers. What about MLLMs?
- ⚠️ MLLMs are very vulnerable to misleading visualizations too: their QA performance drops
  - to the level of the random baseline
  - by up to 65.5 percentage points compared to the standard benchmark ChartQA
- they cannot answer questions consistently depending on whether they observe a misleading or non-misleading visualization of the same data
- We propose six inference-time correction methods to improve performance on misleading visualizations 🛠️
  - the best method extracts the underlying data table with the MLLM, then answers the question with a text-only LLM that sees the table only (a sketch follows this list)
  - this improves QA performance on misleading visualizations by up to 19.6 percentage points
  - however, it degrades performance on non-misleading visualizations
  - an alternative is to redraw the chart based on the extracted table, which yields smaller improvements
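Below is a minimal sketch of this table-then-QA method. The helpers `mllm_generate` and `llm_generate` are hypothetical placeholders for whichever MLLM and text-only LLM you use, and the prompts are illustrative, not the exact ones from the paper.

```python
# Minimal sketch of the table-then-QA correction method.
# mllm_generate and llm_generate are hypothetical placeholders, not part of this repo.

def mllm_generate(image_path: str, prompt: str) -> str:
    """Placeholder for a call to a multimodal LLM (e.g., InternVL2.5)."""
    raise NotImplementedError

def llm_generate(prompt: str) -> str:
    """Placeholder for a call to a text-only LLM."""
    raise NotImplementedError

def answer_from_table(image_path: str, question: str) -> str:
    # Step 1: transcribe the chart's underlying data table with the MLLM.
    table = mllm_generate(
        image_path,
        "Extract the underlying data table of this chart as a markdown table.",
    )
    # Step 2: answer with a text-only LLM that never sees the (possibly
    # misleading) chart, only its extracted data.
    return llm_generate(
        "Answer the question using only this data table.\n\n"
        f"Table:\n{table}\n\nQuestion: {question}\nAnswer:"
    )
```

Because the answering model never sees the rendered chart, distortions such as truncated axes or 3D effects cannot bias its reading of the data.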
Follow these instructions to recreate the environment used for all our experiments.
$ conda create --name misviz python=3.9
$ conda activate misviz
$ pip install -r requirements.txt
- **CALVI**
  - dataset introduced by Ge et al. (2023) in "CALVI: Critical Thinking Assessment for Literacy in Visualizations"
  - Ready to use
  - License: CC-BY 4.0
- **Lauer & O'Brien**
  - dataset introduced by Lauer & O'Brien (2020) in "The Deceptive Potential of Common Design Tactics Used in Data Visualizations"
  - Ready to use
- **Real-world**
  - dataset introduced in this work, based on visualizations collected by Lo et al. (2022) in "Misinformed by visualization: What do we learn from misinformative visualizations?"
  - Images should be downloaded using the script below
  - License for the QA pairs: CC-BY-SA 4.0
- **CHARTOM**
  - dataset introduced by Bharti et al. (2024) in "CHARTOM: A Visual Theory-of-Mind Benchmark for Multimodal Large Language Models"
  - Please contact the authors to get access to the dataset
  - Run the script below to process the dataset
- **VLAT**
  - dataset introduced by Lee et al. (2017) in "VLAT: Development of a Visualization Literacy Assessment Test"
  - Ready to use
The following script will prepare the datasets, including downloading the real-world images.
$ python src/dataset_preparation.py
The following code lets you evaluate the performance of MLLMs on misleading and non-misleading visualizations, with or without one of the six correction methods proposed in the paper. Some correction methods require intermediate steps, such as extracting the axes or table, or redrawing the visualization; the scripts for these steps are shown further below.
$ python src/question_answering.py --datasets calvi-chartom-real_world-vlat --model internvl2.5/8B/
The `--datasets` argument expects a string of dataset names separated by `-`, as in the example above. By default, the available datasets are `calvi`, `chartom`, `real_world`, `lauer`, and `vlat`.
The `--model` argument expects a string in the format `model_name/model_size/`. By default, the following models are available:
| Name | Available sizes | 🤗 models |
|---|---|---|
| internvl2.5 | 2B, 4B, 8B, 26B, 38B | Link |
| ovis 1.6 | 9B, 27B | Link |
| llava-v1.6-vicuna | 7B, 13B | Link |
| qwen2vl | 2B, 7B | Link |
| chartinstruction | 13B | Link |
| chartgemma | 3B | Link |
| tinychart | 3B | Link |
If you want to use TinyChart, copy this folder and place it in the root folder of this repo.
If you want to use ChartInstruction, copy this folder and place it in the root folder of this repo.
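For example, to run question answering with Qwen2-VL 7B on CALVI and VLAT only (an illustrative combination of the arguments documented above):

$ python src/question_answering.py --datasets calvi-vlat --model qwen2vl/7B/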
We also provide code to conduct experiments with GPT-4, GPT-4o, Gemini-1.5-Flash, and Gemini-1.5-Pro using the Azure OpenAI Service and Google AI Studio. You will first need to obtain API keys from both providers and store them as environment variables.
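For instance (the exact variable names expected by the scripts are an assumption here; check the source for the ones actually read):

$ export AZURE_OPENAI_API_KEY="..."   # placeholder name, value elided
$ export GOOGLE_API_KEY="..."         # placeholder name, value elided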
For correction methods that require intermediate steps, run the corresponding scripts first:
$ python src/chart2metadata.py --datasets calvi-chartom-real_world-vlat --model internvl2.5/8B/
$ python src/table2code.py --datasets calvi-chartom-real_world-vlat --model qwen2.5/7B/
Finally, evaluate the accuracy of the models:
$ python src/evaluate.py --results_folder results_qa --output_file results_qa.csv
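To take a quick look at the aggregated results, you can print the output CSV (assuming pandas is installed; the column layout depends on evaluate.py):

$ python -c "import pandas as pd; print(pd.read_csv('results_qa.csv'))"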
If you find this work relevant to your research or use this code in your work, please cite our paper as follows:
@article{tonglet2025misleadingvisualizations,
title={Protecting multimodal LLMs against misleading visualizations},
author={Tonglet, Jonathan and Tuytelaars, Tinne and Moens, Marie-Francine and Gurevych, Iryna},
journal={arXiv preprint arXiv:2502.20503},
year={2025},
url={https://arxiv.org/abs/2502.20503},
doi={10.48550/arXiv.2502.20503}
}

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.



