This repository contains the inference and evaluation scripts for the paper "Document Haystack: A Long Context Multimodal Image/Document Understanding Vision LLM Benchmark".
The proliferation of multimodal Large Language Models has significantly advanced the ability to analyze and understand complex data inputs from different modalities. However, the processing of long documents remains under-explored, largely due to a lack of suitable benchmarks. To address this, we introduce Document Haystack, a comprehensive benchmark designed to evaluate the performance of Vision Language Models (VLMs) on long, visually complex documents. Document Haystack features documents ranging from 5 to 200 pages and strategically inserts pure text or multimodal text+image "needles" at various depths within the documents to challenge VLMs' retrieval capabilities. Comprising 400 document variants and a total of 8,250 questions, it is supported by an objective, automated evaluation framework. We detail the construction and characteristics of the Document Haystack dataset, present results from prominent VLMs and discuss potential research avenues in this area.
- Document Formats: Text, Image, PDF
- Document Range: 5-200 pages
- Dataset Size: 400 document variants
- Question Pool: 8,250 evaluation questions
- Needle Types:
  - Pure text
  - Multimodal (text + image)
- Automated Evaluation Framework
- Strategic needle placement at various document depths
- Three inference and evaluation settings:
  - TextNeedlesFromParsedText
  - TextNeedlesFromDocumentImages
  - TextImagesNeedlesFromDocumentImages
.
└── src/
    ├── evaluation/                      # Evaluation and analysis scripts
    │   ├── aliases.txt                  # Alias definitions for TextImagesNeedlesFromDocumentImages evaluation
    │   ├── depth_analysis_heatmap.py    # Generates heatmap visualizations
    │   ├── full_set_evaluation_nova.sh  # Batch evaluation script
    │   ├── print_average_scores.sh      # Computes average performance
    │   └── single_doc_evaluation.py     # Single document evaluation
    └── inference/                       # Model inference scripts
        ├── full_set_inference_nova.sh   # Batch inference script
        └── single_doc_inference_nova.py # Single document inference
First, download the Document Haystack dataset. When running the scripts below, provide the path to your local copy of the dataset or, for inference only, to your Document Haystack folder on S3.
- Python 3.x
- Required Python packages for evaluation:
  - matplotlib
  - numpy
  - seaborn
- AWS Account with Bedrock access
- `boto3` package
Note: The AWS account and Bedrock access are only required if you run the scripts as provided (which use the Nova Lite model via Amazon Bedrock). If you modify the scripts to use your own models or different APIs, these AWS-specific requirements are not needed. See the Model Adaptation section for details on using different models.
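If you do run the scripts as provided, the Bedrock dependency amounts to something like the following minimal sketch, assuming `boto3`'s Converse API. The region, prompt, and inference parameters are placeholders, and the repository's scripts may structure their requests differently:

```python
import boto3

# Minimal sketch of calling Nova Lite through Amazon Bedrock's Converse API.
# Depending on your account setup, the model may need to be addressed via a
# cross-region inference profile (e.g., "us.amazon.nova-lite-v1:0").
client = boto3.client("bedrock-runtime", region_name="us-east-1")
response = client.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": [{"text": "Your question here"}]}],
    inferenceConfig={"temperature": 0.0, "topP": 0.9},
)
print(response["output"]["message"]["content"][0]["text"])
```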
- Clone the repository:
git clone $REPO_LINK
- Install required dependencies:
pip install boto3 matplotlib numpy seaborn
./src/inference/full_set_inference_nova.sh \
--setting TextNeedlesFromParsedText \
--document-haystack-path /path/to/documents \
--results-path /path/to/results \
[--region-name aws-region] \
[--bucket-owner aws-account-id] \
[--temperature temperature-value] \
[--topp topp-value] \
[--topk topk-value]
Available settings:
1. TextNeedlesFromParsedText
2. TextNeedlesFromDocumentImages
3. TextImagesNeedlesFromDocumentImages
Optional arguments:
- `--region-name`: Required if running inference from an S3 folder and/or to configure the region of your boto client
- `--bucket-owner`: Required if running inference from an S3 folder; specifies the AWS account ID
- `--temperature`: Optional; temperature value for the model.
- `--topp`: Optional; top-p value for the model.
- `--topk`: Optional; top-k value for the model.
Note: The script supports inference from S3 folders. In this case, `--document-haystack-path` should point to the S3 Document Haystack folder (e.g., s3://my-bucket/document-haystack).
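For example, a run of the second setting against an S3 folder might look like this (the bucket name, region, and account ID below are placeholders):

```bash
./src/inference/full_set_inference_nova.sh \
  --setting TextNeedlesFromDocumentImages \
  --document-haystack-path s3://my-bucket/document-haystack \
  --results-path ./results \
  --region-name us-east-1 \
  --bucket-owner 123456789012
```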
./src/evaluation/full_set_evaluation_nova.sh \
--setting TextNeedlesFromParsedText \
--document-haystack-path /path/to/documents \
--results-path /path/to/results
Available settings:
1. TextNeedlesFromParsedText
2. TextNeedlesFromDocumentImages
3. TextImagesNeedlesFromDocumentImages
./src/evaluation/print_average_scores.sh \
--results-path /path/to/results
python src/evaluation/depth_analysis_heatmap.py \
--output /path/to/heatmap.png \
--title "Custom Heatmap Title" \
--results-path /path/to/results
To use a different model instead of Amazon Bedrock's Nova model, you'll need to:
- Modify the `main()` function in `single_doc_inference_nova.py` to use your preferred API client instead of `boto3` and `bedrock-runtime`.
- Adapt the `process_nova_request()` function to match your model's API requirements (see the sketch below).
- Ensure your modified implementation maintains the same output format and structure for compatibility with the evaluation scripts.
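As a rough illustration, a replacement request function might look like the following sketch. The endpoint URL, payload shape, and function name are assumptions for illustration, not part of this repository; what matters is returning plain text that can be written into results.txt:

```python
import requests  # stand-in for your model provider's SDK

# Hypothetical replacement for process_nova_request(). The endpoint and
# payload layout are illustrative assumptions, not this repository's API.
def process_custom_request(prompt: str,
                           endpoint: str = "https://api.example.com/v1/generate") -> str:
    """Send one Document Haystack question to a custom model endpoint."""
    response = requests.post(endpoint, json={"prompt": prompt}, timeout=120)
    response.raise_for_status()
    # Return plain text so the inference script can record it after the
    # "Output:" label in results.txt, keeping evaluation compatibility.
    return response.json()["text"]
```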
Results/
├── AIG/
│   ├── AIG_5Pages/
│   │   ├── results.txt          # Raw model outputs (see format below)
│   │   └── results_scores.txt   # Evaluation results
│   ├── AIG_10Pages/
│   └── ...
├── AmericanAirlines/
│   ├── AmericanAirlines_5Pages/
│   └── ...
└── ...
Each document's inference results (results.txt) should follow this format:
#1
Prompt: [Your prompt text here]
Output: [Model response here]
#2
Prompt: [Your prompt text here]
Output: [Model response here]
...
- Maintain the exact question ID format (#1, #2, etc.)
- Include both "Prompt:" and "Output:" labels
- Preserve blank lines between entries
- Follow the Results directory structure exactly as shown
- Handle both text and image inputs according to your model's capabilities
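To make these requirements concrete, a custom inference loop could emit results.txt with a helper along these lines (a sketch; the helper name and signature are assumptions, only the on-disk format shown above is prescribed):

```python
from pathlib import Path

def write_results(path: Path, qa_pairs: list[tuple[str, str]]) -> None:
    """Write (prompt, model output) pairs in the results.txt format above."""
    blocks = []
    for i, (prompt, output) in enumerate(qa_pairs, start=1):
        # Question ID line, then the labeled prompt and output lines.
        blocks.append(f"#{i}\nPrompt: {prompt}\nOutput: {output}")
    # Separate entries with a blank line, as the evaluation scripts expect.
    path.write_text("\n\n".join(blocks) + "\n")
```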
This project is licensed under the CC-BY-NC-4.0 License - see the LICENSE file for details.
For any questions or concerns, please open an issue in the repository.
Amazon AGI
- Goeric Huybrechts
- Srikanth Ronanki
- Sai Muralidhar Jayanthi
- Jack Fitzgerald
- Srinivasan Veeravanallur