A Google Colab script for batch processing PDF documents into multiple structured formats. This tool extracts and preserves document structure, making content easily accessible for further analysis or processing using the docling library.
- Batch processing of multiple PDF files
- Maintains document structure and formatting
- Multiple output formats:
- Plain text (
.txt) - Markdown (
.md) - Structured JSON (
.json) - Document tokens (
.doctags)
- Plain text (
- Table structure preservation
- Automatic zip file creation of all outputs
- No OCR by default (configurable)
- Google Colab account
- PDF files to process
- Python packages (automatically installed):
- docling
- poppler-utils
- google-colab
-
Open in Google Colab:
- Create a new notebook
- Copy the script into three separate cells as marked in the comments
-
Run the cells in order:
# First cell: Package installation # Second cell: Function definitions # Third cell: Main execution
-
When prompted, upload your PDF files
- You can select multiple files at once
- Files will be processed sequentially
-
Results:
- View processed files in the scratch directory
- See text and markdown output in the console
- Download all outputs as a zip file
- Text (.txt): Clean, plain text extraction
- Markdown (.md): Preserves headers, lists, and basic formatting
- JSON (.json): Structured representation of the document
- DocTags (.doctags): Document tokens for NLP/analysis
Modify PdfPipelineOptions in the main function to customize processing:
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False # Enable/disable OCR
pipeline_options.do_table_structure = True # Table detection
pipeline_options.table_structure_options.do_cell_matching = True.
├── tests/
│ └── data/ # Uploaded PDF files
├── scratch/ # Processed outputs
└── output_files.zip # Final compressed outputs
- Create necessary directories
- Upload PDF files
- Process each file:
- Convert to multiple formats
- Save outputs to scratch directory
- Display results
- Create and download zip file
- Runs only in Google Colab environment
- OCR disabled by default
- Memory limitations based on Colab's resources
- Processing time varies with document size and complexity
Feel free to open issues or submit pull requests with improvements.
This project is open source and available under the MIT License.