This repository presents a comprehensive study and benchmarking of PDF extraction tools, focusing on their suitability for diverse document processing workflows. The evaluation covers tools' performance across various content types and metrics critical for modern AI applications like Retrieval-Augmented Generation (RAG) and intelligent agents.
PDF extraction tools are vital for enabling AI systems to process structured and unstructured content effectively. This study evaluates six tools on their capabilities for text, table, and image extraction, OCR accuracy, Markdown conversion, and logical reading order preservation.
The views and feedback shared in this article are based on internal testing and evaluations conducted by Actualize's engineering team. This study does not intend to criticize the tools discussed, claim ownership of them, or guarantee or take responsibility for their performance or effectiveness. Our aim is to share the findings from our testing transparently and without bias, for informational purposes only.
The following tools were benchmarked:
- MinerU
- Xerox
- Docling
- Llama Parse
- Marker
- Unstructured
Each tool has a dedicated main.py file in its respective directory for running its benchmarking script.
- Text and table extraction accuracy
- Image clarity and positioning
- Markdown conversion fidelity
- OCR performance for scanned PDFs
- Logical reading order accuracy
- Resource utilization on CPU, MPS, and GPU platforms
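The README does not spell out how each criterion was scored. As a minimal sketch of one way to put a number on text extraction accuracy, the snippet below compares a tool's output against a hand-checked reference using Python's standard difflib module. The function names normalize and text_similarity, and the choice to ignore whitespace differences, are assumptions made for illustration and are not taken from this benchmark's code.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse runs of whitespace so pure layout differences are not penalized."""
    return " ".join(text.split())


def text_similarity(extracted: str, reference: str) -> float:
    """Return a similarity ratio in [0, 1] between extracted and reference text."""
    matcher = difflib.SequenceMatcher(
        None, normalize(extracted), normalize(reference), autojunk=False
    )
    return matcher.ratio()


if __name__ == "__main__":
    # Hypothetical sample strings; replace with real tool output and ground truth.
    reference = "Total revenue grew 12% in 2023."
    extracted = "Total  revenue grew 12%\nin 2023."
    print(f"similarity: {text_similarity(extracted, reference):.3f}")
```

A character-level ratio like this is easy to compute but says nothing about tables or reading order, which would need their own checks.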
Clone the repository to your local system:
git clone https://github.com/actualize-ae/Pdf-Benchmarking.git
Install the necessary dependencies for the project:
pip install -r requirements.txt
To run MinerU, follow these steps to create a separate Conda environment and set up its dependencies:
- Create and activate the Conda environment:
  conda create -n MinerU python=3.10
  conda activate MinerU
- Install the required dependencies for MinerU:
  pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com
- Download and set up the necessary model files:
  - Install the Hugging Face library:
    pip install huggingface_hub
  - Download the model setup script:
    wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
  - Run the script to download the models:
    python download_models_hf.py
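After the setup steps above, a quick check from inside the MinerU environment can confirm that the packages are importable before a long benchmark run. The sketch below is one way to do that with the standard library. It assumes that the magic-pdf distribution exposes an importable package named magic_pdf; the helper name missing_modules is hypothetical and not part of this repository.

```python
import importlib.util
import sys


def missing_modules(module_names):
    """Return the names from module_names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: magic-pdf installs a top-level package called "magic_pdf".
    required = ["magic_pdf", "huggingface_hub"]
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    missing = missing_modules(required)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("MinerU dependencies look importable.")
```

This only checks that the packages can be located; it does not verify that the model files were downloaded correctly.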
Run the main.py file for the specific tool you want to benchmark:
- For MinerU:
  python mineru/main.py
- For Xerox:
  python xerox/main.py
- For other tools, navigate to their directories and run their respective main.py scripts.
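Because every tool follows the same layout (a directory containing main.py), the per-tool commands above can also be driven from a small script. The sketch below discovers each main.py one level below the repository root, runs it with the current interpreter, and reports the exit code and wall-clock time. The helper names find_benchmark_scripts and run_benchmark are hypothetical, and the only directory names known from this README are mineru and xerox.

```python
import subprocess
import sys
import time
from pathlib import Path


def find_benchmark_scripts(repo_root):
    """Map each direct subdirectory name to its main.py, if it has one."""
    root = Path(repo_root)
    return {p.parent.name: p for p in sorted(root.glob("*/main.py"))}


def run_benchmark(script_path):
    """Run one main.py with the current interpreter; return (exit_code, seconds)."""
    start = time.perf_counter()
    result = subprocess.run([sys.executable, str(script_path)], check=False)
    return result.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # Run from the repository root so relative paths match the commands above.
    for tool, script in find_benchmark_scripts(".").items():
        code, seconds = run_benchmark(script)
        print(f"{tool}: exit code {code}, {seconds:.1f}s")
```

Since MinerU is installed in its own Conda environment, a single interpreter may not have every tool's dependencies; in that case it is simpler to run this script once per environment, or to fall back to the individual commands.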
The study highlights the need for further advancements in GPU support, table recognition, and resource optimization to enhance these tools' performance in AI-driven workflows.
For further details about each tool, refer to their official documentation and repositories: