🌐 Website: vectorinstitute.github.io/humanibench | 📄 Paper: arxiv.org/abs/2505.11454 | 📊 Dataset: Hugging Face
As multimodal generative AI systems become increasingly integrated into human-centered applications, evaluating their alignment with human values has become critical.
HumaniBench is the first comprehensive benchmark designed to evaluate Large Multimodal Models (LMMs) on seven Human-Centered AI (HCAI) principles:
- Fairness
- Ethics
- Understanding
- Reasoning
- Language Inclusivity
- Empathy
- Robustness
This repository provides code and scripts for evaluating LMMs across seven human-aligned tasks.
- 📷 32,000+ Real-World Image–Question Pairs
- ✅ Human-Verified Ground Truth Annotations
- 🌐 Multilingual QA Support (10+ languages)
- 🧠 Open- and Closed-Ended VQA Formats
- 🧪 Visual Robustness & Bias Stress Testing
- 📑 Chain-of-Thought Reasoning + Perceptual Grounding
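A minimal sketch of loading the dataset from the Hugging Face Hub, assuming the `datasets` library is installed; the repository id, split, and field names below are assumptions, so check the dataset card linked above for the exact values:

```python
# Minimal sketch: load HumaniBench from the Hugging Face Hub.
# NOTE: the dataset id, split, and field names below are assumptions --
# consult the dataset card linked above for the exact values.
from datasets import load_dataset

ds = load_dataset("vector-institute/HumaniBench", split="test")  # hypothetical id/split
sample = ds[0]
print(sample["question"])  # hypothetical field name
```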
| Task | Focus | Folder |
|---|---|---|
| Task 1: Scene Understanding | Visual reasoning + bias/toxicity analysis in social attributes (gender, age, occupation, etc.) | code/task1_Scene_Understanding |
| Task 2: Instance Identity | Visual reasoning in culturally rich, socially grounded settings | code/task2_Instance_Identity |
| Task 3: Multiple Choice QA | Structured attribute recognition via multi-choice questions | code/task3_Multiple_Choice_VQA |
| Task 4: Multilingual Visual QA | VQA across 10+ languages, including low-resource ones | code/task4_Multilingual |
| Task 5: Visual Grounding | Bounding box localization of socially salient regions | code/task5_Visual_Grounding |
| Task 6: Empathetic Captioning | Human-style emotional captioning evaluation | code/task6_Empathetic_Captioning |
| Task 7: Image Resilience | Robustness testing via image perturbations | code/task7_Image_Resilience |
🔍 Each task folder includes a README with setup instructions, task structure, and metrics.
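While each task folder documents its own entry point, the overall evaluation pattern is shared: iterate over image–question pairs, query the model under test, and score predictions against the human-verified ground truth. The sketch below illustrates that loop; `query_model` and the field names are hypothetical stand-ins, not this repository's actual API:

```python
# Illustrative sketch of the per-task evaluation pattern: iterate over
# image-question pairs, query the model under test, and score against the
# human-verified ground truth. `query_model` is a hypothetical stand-in
# for whichever LMM interface you evaluate; field names are assumptions.
def evaluate_task(examples, query_model):
    correct = 0
    for ex in examples:
        pred = query_model(image=ex["image"], question=ex["question"])
        correct += int(pred.strip().lower() == ex["answer"].strip().lower())
    return correct / len(examples)
```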
Three-stage process:

1. Data Collection: curated from global news imagery, tagged by social attributes (age, gender, race, occupation, sport)
2. Annotation: GPT-4o–assisted labeling + human expert verification
3. Evaluation: comprehensive scoring across:
   - Accuracy
   - Fairness
   - Robustness
   - Empathy
   - Faithfulness
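As one concrete illustration of how the fairness axis can be scored, the sketch below computes per-group accuracy over a social attribute and reports the gap between the best- and worst-served groups. The field names and the gap statistic are assumptions for illustration, not the paper's exact protocol:

```python
# Hedged sketch of one way to score the fairness axis: compute accuracy per
# social-attribute group and report the gap between the best- and
# worst-served groups. Field names and the gap statistic are assumptions,
# not the paper's exact protocol.
from collections import defaultdict

def accuracy_gap(records, attribute="gender"):
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        group = r[attribute]  # hypothetical field name
        totals[group] += 1
        hits[group] += int(r["prediction"] == r["answer"])
    per_group = {g: hits[g] / totals[g] for g in totals}
    return max(per_group.values()) - min(per_group.values()), per_group
```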
Key findings:
- 🔍 Bias persists, especially across gender and race
- 🌐 Multilingual gaps affect low-resource language performance
- ❤️ Empathy and ethics vary significantly by model family
- 🧠 Chain-of-Thought reasoning improves performance but doesn’t fully mitigate bias
- 🧪 Robustness tests reveal fragility to noise, occlusion, and blur
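The robustness finding above refers to Task 7's image perturbations. Below is a generic sketch of the three perturbation families (noise, occlusion, blur); the parameters are illustrative defaults, not necessarily those used by the benchmark:

```python
# Generic implementations of the three perturbation families used in
# robustness testing (noise, occlusion, blur). Parameters are illustrative
# defaults, not necessarily those used by Task 7.
import numpy as np
from PIL import Image, ImageFilter

def gaussian_noise(img: Image.Image, sigma: float = 15.0) -> Image.Image:
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def occlude(img: Image.Image, frac: float = 0.25) -> Image.Image:
    arr = np.asarray(img).copy()
    h, w = arr.shape[:2]
    bh, bw = int(h * frac), int(w * frac)
    y = np.random.randint(0, max(1, h - bh))
    x = np.random.randint(0, max(1, w - bw))
    arr[y:y + bh, x:x + bw] = 0  # black patch over a random region
    return Image.fromarray(arr)

def blur(img: Image.Image, radius: float = 3.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))
```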
If you use HumaniBench or this evaluation suite in your work, please cite:
```bibtex
@article{raza2025humanibench,
  title={HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation},
  author={Raza, Shaina and Narayanan, Aravind and Khazaie, Vahid Reza and Vayani, Ashmal and Radwan, Ahmed Y and Chettiar, Mukund S and Singh, Amandeep and Shah, Mubarak and Pandya, Deval},
  journal={arXiv preprint arXiv:2505.11454},
  year={2025}
}
```
For questions, collaborations, or dataset access requests, please open an issue in this repository or contact the corresponding author listed in the paper at [email protected].
We invite researchers, developers, and policymakers to explore, evaluate, and extend HumaniBench. 🚀
