A comprehensive document analysis system that processes Azure Document Intelligence JSON files to extract personal names and classify sensitive content using advanced AI models.
- Document Analysis: Processes Azure DI JSON files with advanced segmentation
- AI-Powered Classification: Uses Gemini 2.5 Pro for intelligent content analysis
- Page Number Tracking: Tracks original page numbers for each classification
- CV/Resume Merging: Automatically merges multi-page CVs into single classifications
- Professional Reports: Generates formatted Word documents and Markdown reports
- Caching Support: Implements prompt caching for cost optimization
- Batch Processing: Handles multiple documents efficiently
The system classifies content into these categories:
- 1.1 Personal Information
- 1.2 Governors'/Executive Directors' Communications
- 1.3 Ethics Committee Materials
- 1.4 Attorney–Client Privilege
- 1.5 Security & Safety Information
- 1.6 Restricted Investigative Info
- 1.7 Confidential Third-Party Information
- 1.8 Corporate Administrative Matters
- 1.9 Financial Information
- 2.1 CV or Resume Content
- 2.2 Derogatory or Offensive Language
- 3.1 Documents from Specific Entities (IFC, MIGA, INT, IMF)
- 3.2 Joint WBG Documents
- 3.3 Security-Marked Documents
- 3.4 Procurement Content
- Python 3.8+
- Google Cloud credentials for Vertex AI
- Required Python packages (see requirements below)
- Clone the repository:

```bash
git clone https://github.com/Minaekramnia/Azure-DI-document-parser.git
cd Azure-DI-document-parser
```

- Install dependencies:

```bash
pip install google-genai python-docx python-dotenv
```

- Set up your environment variables:

```bash
# Create .env file with your Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/credentials.json"
```

The primary script for document analysis:

```bash
python Azure_DI_output_parser_WORKING_PageNumbers.py
```

Features:
- Processes Azure DI JSON files
- Extracts personal names
- Classifies sensitive content
- Includes page number tracking
- Merges multi-page CVs
- Outputs structured JSON results
The system generates JSON files with this structure:
```json
{
  "document_path": "path/to/document.pdf.json",
  "total_pages": 5,
  "total_segments": 2,
  "segments": [
    {
      "segment_id": "segment_1",
      "pages": [1, 2],
      "page_range": "1-2",
      "extracted_names": ["John Doe", "Jane Smith"],
      "classifications": [
        {
          "category": "2.1 CV or Resume Content",
          "text": "Professional experience and education...",
          "bounding_box": [0, 0, 10, 12],
          "page_number": 1,
          "confidence_score": 0.95,
          "reason": "Contains complete CV information"
        }
      ]
    }
  ]
}
```
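As a quick sanity check on this schema, a few lines of Python can tally classifications per category (the input file name below is illustrative):

```python
import json
from collections import Counter

# File name is illustrative; the parser writes *_WORKING_PageNumbers_analysis.json.
with open("document_WORKING_PageNumbers_analysis.json") as f:
    result = json.load(f)

# Count classifications per category across all segments.
counts = Counter(
    c["category"]
    for segment in result["segments"]
    for c in segment["classifications"]
)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```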
Generate professional Word reports:

```bash
python convert_to_word_fixed.py
```

Features:
- Formal document styling
- Cambria font throughout
- Professional formatting
- Summary statistics
- Classification breakdown
Create clean versions without extracted names:
```bash
python remove_names_from_word_FIXED.py
```

Apply final formatting to reports:

```bash
python format_final_word_documents_FINAL.py
```

The system creates organized output folders:
```
PI/
├── markdown_reports/                      # Markdown versions of reports
├── word_reports/                          # Original Word documents
├── final_word_reports/                    # Cleaned Word documents (no names)
└── *_WORKING_PageNumbers_analysis.json    # Analysis results
```
The system uses Executive_Prompt.md for AI analysis instructions. This file contains:
- Detailed classification rules
- Output format requirements
- Sensitivity guidelines
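As an illustration, wiring that prompt into a Gemini call through the google-genai SDK might look like the following sketch (project, location, and model ID are placeholder assumptions; the actual script's wiring may differ):

```python
from google import genai
from google.genai import types

# Load the classification instructions shipped with the repository.
with open("Executive_Prompt.md", encoding="utf-8") as f:
    system_prompt = f.read()

# Project/location are placeholders for your Vertex AI setup.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

document_text = "..."  # text assembled from an Azure DI JSON file

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=document_text,
    config=types.GenerateContentConfig(system_instruction=system_prompt),
)
print(response.text)
```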
The system implements intelligent caching:
- First file: Full prompt sent (~2000+ tokens)
- Subsequent files: Cached prompt (~100 tokens)
- Cost savings: Significant reduction in API costs
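The README doesn't spell out the mechanism, but one way to get this behavior is the google-genai SDK's explicit context caching, sketched below (client settings, model ID, and TTL are assumptions; note that Vertex AI enforces a minimum token count for explicit caches):

```python
from google import genai
from google.genai import types

# Client settings are placeholders.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Cache the large classification prompt once; the first file pays the full
# token cost, later files reference the cache instead of resending it.
with open("Executive_Prompt.md", encoding="utf-8") as f:
    cache = client.caches.create(
        model="gemini-2.5-pro",
        config=types.CreateCachedContentConfig(
            system_instruction=f.read(),
            ttl="3600s",  # keep the cache alive for an hour
        ),
    )

extracted_documents = []  # texts pulled from the Azure DI JSON files
for doc_text in extracted_documents:
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=doc_text,
        config=types.GenerateContentConfig(cached_content=cache.name),
    )
    print(response.text)
```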
The system segments multi-document files using three complementary heuristics (a simplified sketch follows this list):
- Size-based: Detects document boundaries by page size changes
- Title-based: Identifies new documents by title presence
- Page sequence: Uses page numbering to merge related pages
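The heuristics above combine roughly as follows. This is a hypothetical sketch, not the shipped implementation, and the page dict fields (`text`, `printed_page_number`) are assumptions about the parser's internal representation:

```python
def find_document_boundaries(pages):
    """Return indices where a new document likely starts within `pages`."""
    boundaries = [0]
    for i in range(1, len(pages)):
        prev_text, curr_text = pages[i - 1]["text"], pages[i]["text"]
        first_line = curr_text.splitlines()[0].strip() if curr_text else ""
        starts_new_document = (
            # Size-based: a sharp change in page content size.
            abs(len(curr_text) - len(prev_text)) > 0.5 * max(len(prev_text), 1)
            # Title-based: the page opens with a short, title-like line.
            or (0 < len(first_line) < 80 and first_line.isupper())
            # Page sequence: printed page numbering restarts at 1.
            or pages[i].get("printed_page_number") == 1
        )
        if starts_new_document:
            boundaries.append(i)
    return boundaries
```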
The CV merging step (sketched below):
- Automatically detects multi-page CVs
- Merges related CV sections into a single classification
- Maintains page number tracking across merged content
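A minimal sketch of such a merge, assuming segment dictionaries follow the output schema shown earlier (the helper name and in-place mutation are illustrative, not the script's actual code):

```python
def merge_cv_segments(segments):
    """Collapse consecutive CV-only segments into one, preserving page tracking."""
    CV = "2.1 CV or Resume Content"

    def is_cv(seg):
        cls = seg["classifications"]
        return bool(cls) and all(c["category"] == CV for c in cls)

    merged = []
    for seg in segments:
        if merged and is_cv(seg) and is_cv(merged[-1]):
            prev = merged[-1]
            prev["pages"].extend(seg["pages"])  # page numbers survive the merge
            prev["page_range"] = f"{prev['pages'][0]}-{prev['pages'][-1]}"
            prev["extracted_names"].extend(
                n for n in seg["extracted_names"] if n not in prev["extracted_names"]
            )
            prev["classifications"].extend(seg["classifications"])
        else:
            merged.append(seg)
    return merged
```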
Error handling includes:
- Graceful fallback for caching failures
- Comprehensive logging and validation
- Robust JSON parsing with error recovery (sketched below)
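The recovery logic isn't documented here; a minimal sketch of one tolerant parsing strategy for model responses might look like:

```python
import json
import re

def parse_model_json(raw: str):
    """Try progressively looser extractions of JSON from a model response."""
    candidates = [
        raw,
        # Strip Markdown code fences the model sometimes wraps around JSON.
        re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip()),
        # Fall back to the outermost {...} span in the text.
        raw[raw.find("{"): raw.rfind("}") + 1],
    ]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None  # caller can log the failure and skip the segment
```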
Performance characteristics:
- Batch processing: Handles multiple documents efficiently
- Caching optimization: Reduces API costs by ~90%
- Memory efficient: Processes large documents without memory issues
- Fast execution: Optimized for production use
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For support and questions:
- Create an issue in the GitHub repository
- Check the documentation in the `/docs` folder
- Review the example outputs in the repository
This system is designed for:
- Compliance teams processing sensitive documents
- Legal departments reviewing confidential materials
- HR teams analyzing CVs and personal information
- Security teams classifying sensitive content
- Research organizations processing large document collections
- ✅ Added page number tracking for all classifications
- ✅ Implemented CV merging for multi-page documents
- ✅ Enhanced Word document formatting
- ✅ Added comprehensive error handling
- ✅ Optimized caching for cost reduction
- ✅ Created professional report templates
Built with ❤️ for efficient document analysis and compliance