A comprehensive document analysis system that processes Azure Document Intelligence JSON files to extract personal names and classify sensitive content using advanced AI models.
- Document Analysis: Processes Azure DI JSON files with advanced segmentation
- AI-Powered Classification: Uses Gemini 2.5 Pro for intelligent content analysis
- Page Number Tracking: Tracks original page numbers for each classification
- CV/Resume Merging: Automatically merges multi-page CVs into single classifications
- Professional Reports: Generates formatted Word documents and Markdown reports
- Caching Support: Implements prompt caching for cost optimization
- Batch Processing: Handles multiple documents efficiently
The system classifies content into these categories:
- 1.1 Personal Information
- 1.2 Governors'/Executive Directors' Communications
- 1.3 Ethics Committee Materials
- 1.4 Attorney–Client Privilege
- 1.5 Security & Safety Information
- 1.6 Restricted Investigative Info
- 1.7 Confidential Third-Party Information
- 1.8 Corporate Administrative Matters
- 1.9 Financial Information
- 2.1 CV or Resume Content
- 2.2 Derogatory or Offensive Language
- 3.1 Documents from Specific Entities (IFC, MIGA, INT, IMF)
- 3.2 Joint WBG Documents
- 3.3 Security-Marked Documents
- 3.4 Procurement Content
- Python 3.8+
- Google Cloud credentials for Vertex AI
- Required Python packages (see requirements below)
- Clone the repository:

```bash
git clone https://github.com/Minaekramnia/Azure-DI-document-parser.git
cd Azure-DI-document-parser
```

- Install dependencies:

```bash
pip install google-genai python-docx python-dotenv
```

- Set up your environment variables:

```bash
# Create .env file with your Google Cloud credentials
export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/credentials.json"
```

The primary script for document analysis:

```bash
python Azure_DI_output_parser_WORKING_PageNumbers.py
```

Features:
- Processes Azure DI JSON files
- Extracts personal names
- Classifies sensitive content
- Includes page number tracking
- Merges multi-page CVs
- Outputs structured JSON results
The system generates JSON files with this structure:
```json
{
  "document_path": "path/to/document.pdf.json",
  "total_pages": 5,
  "total_segments": 2,
  "segments": [
    {
      "segment_id": "segment_1",
      "pages": [1, 2],
      "page_range": "1-2",
      "extracted_names": ["John Doe", "Jane Smith"],
      "classifications": [
        {
          "category": "2.1 CV or Resume Content",
          "text": "Professional experience and education...",
          "bounding_box": [0, 0, 10, 12],
          "page_number": 1,
          "confidence_score": 0.95,
          "reason": "Contains complete CV information"
        }
      ]
    }
  ]
}
```
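As a quick sanity check on this schema, a few lines of Python can tally classifications per category (the input file name below is illustrative):

```python
import json
from collections import Counter

# File name is illustrative; the parser writes *_WORKING_PageNumbers_analysis.json.
with open("document_WORKING_PageNumbers_analysis.json") as f:
    result = json.load(f)

# Count classifications per category across all segments.
counts = Counter(
    c["category"]
    for segment in result["segments"]
    for c in segment["classifications"]
)
for category, n in counts.most_common():
    print(f"{category}: {n}")
```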
Generate professional Word reports:

```bash
python convert_to_word_fixed.py
```

Features:
- Formal document styling
- Cambria font throughout
- Professional formatting
- Summary statistics
- Classification breakdown
Create clean versions without extracted names:
```bash
python remove_names_from_word_FIXED.py
```

Apply final formatting to reports:

```bash
python format_final_word_documents_FINAL.py
```

The system creates organized output folders:
```
PI/
├── markdown_reports/                      # Markdown versions of reports
├── word_reports/                          # Original Word documents
├── final_word_reports/                    # Cleaned Word documents (no names)
└── *_WORKING_PageNumbers_analysis.json    # Analysis results
```
The system uses Executive_Prompt.md for AI analysis instructions. This file contains:
- Detailed classification rules
- Output format requirements
- Sensitivity guidelines
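As an illustration, wiring that prompt into a Gemini call through the google-genai SDK might look like the following sketch (project, location, and model ID are placeholder assumptions; the actual script's wiring may differ):

```python
from google import genai
from google.genai import types

# Load the classification instructions shipped with the repository.
with open("Executive_Prompt.md", encoding="utf-8") as f:
    system_prompt = f.read()

# Project/location are placeholders for your Vertex AI setup.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

document_text = "..."  # text assembled from an Azure DI JSON file

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=document_text,
    config=types.GenerateContentConfig(system_instruction=system_prompt),
)
print(response.text)
```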
The system implements intelligent caching:
- First file: Full prompt sent (~2000+ tokens)
- Subsequent files: Cached prompt (~100 tokens)
- Cost savings: Significant reduction in API costs
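The README doesn't spell out the mechanism, but one way to get this behavior is the google-genai SDK's explicit context caching, sketched below (client settings, model ID, and TTL are assumptions; note that Vertex AI enforces a minimum token count for explicit caches):

```python
from google import genai
from google.genai import types

# Client settings are placeholders.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Cache the large classification prompt once; the first file pays the full
# token cost, later files reference the cache instead of resending it.
with open("Executive_Prompt.md", encoding="utf-8") as f:
    cache = client.caches.create(
        model="gemini-2.5-pro",
        config=types.CreateCachedContentConfig(
            system_instruction=f.read(),
            ttl="3600s",  # keep the cache alive for an hour
        ),
    )

extracted_documents = []  # texts pulled from the Azure DI JSON files
for doc_text in extracted_documents:
    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=doc_text,
        config=types.GenerateContentConfig(cached_content=cache.name),
    )
    print(response.text)
```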
The system segments multi-document files using three complementary heuristics (a simplified sketch follows this list):
- Size-based: Detects document boundaries by page size changes
- Title-based: Identifies new documents by title presence
- Page sequence: Uses page numbering to merge related pages
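The heuristics above combine roughly as follows. This is a hypothetical sketch, not the shipped implementation, and the page dict fields (`text`, `printed_page_number`) are assumptions about the parser's internal representation:

```python
def find_document_boundaries(pages):
    """Return indices where a new document likely starts within `pages`."""
    boundaries = [0]
    for i in range(1, len(pages)):
        prev_text, curr_text = pages[i - 1]["text"], pages[i]["text"]
        first_line = curr_text.splitlines()[0].strip() if curr_text else ""
        starts_new_document = (
            # Size-based: a sharp change in page content size.
            abs(len(curr_text) - len(prev_text)) > 0.5 * max(len(prev_text), 1)
            # Title-based: the page opens with a short, title-like line.
            or (0 < len(first_line) < 80 and first_line.isupper())
            # Page sequence: printed page numbering restarts at 1.
            or pages[i].get("printed_page_number") == 1
        )
        if starts_new_document:
            boundaries.append(i)
    return boundaries
```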
The CV merging step (sketched below):
- Automatically detects multi-page CVs
- Merges related CV sections into a single classification
- Maintains page number tracking across merged content
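A minimal sketch of such a merge, assuming segment dictionaries follow the output schema shown earlier (the helper name and in-place mutation are illustrative, not the script's actual code):

```python
def merge_cv_segments(segments):
    """Collapse consecutive CV-only segments into one, preserving page tracking."""
    CV = "2.1 CV or Resume Content"

    def is_cv(seg):
        cls = seg["classifications"]
        return bool(cls) and all(c["category"] == CV for c in cls)

    merged = []
    for seg in segments:
        if merged and is_cv(seg) and is_cv(merged[-1]):
            prev = merged[-1]
            prev["pages"].extend(seg["pages"])  # page numbers survive the merge
            prev["page_range"] = f"{prev['pages'][0]}-{prev['pages'][-1]}"
            prev["extracted_names"].extend(
                n for n in seg["extracted_names"] if n not in prev["extracted_names"]
            )
            prev["classifications"].extend(seg["classifications"])
        else:
            merged.append(seg)
    return merged
```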
Error handling includes:
- Graceful fallback for caching failures
- Comprehensive logging and validation
- Robust JSON parsing with error recovery (sketched below)
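The recovery logic isn't documented here; a minimal sketch of one tolerant parsing strategy for model responses might look like:

```python
import json
import re

def parse_model_json(raw: str):
    """Try progressively looser extractions of JSON from a model response."""
    candidates = [
        raw,
        # Strip Markdown code fences the model sometimes wraps around JSON.
        re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip()),
        # Fall back to the outermost {...} span in the text.
        raw[raw.find("{"): raw.rfind("}") + 1],
    ]
    for candidate in candidates:
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None  # caller can log the failure and skip the segment
```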
Performance characteristics:
- Batch processing: Handles multiple documents efficiently
- Caching optimization: Reduces API costs by ~90%
- Memory efficient: Processes large documents without memory issues
- Fast execution: Optimized for production use
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
For support and questions:
- Create an issue in the GitHub repository
- Check the documentation in the `/docs` folder
- Review the example outputs in the repository
This system is designed for:
- Compliance teams processing sensitive documents
- Legal departments reviewing confidential materials
- HR teams analyzing CVs and personal information
- Security teams classifying sensitive content
- Research organizations processing large document collections
- ✅ Added page number tracking for all classifications
- ✅ Implemented CV merging for multi-page documents
- ✅ Enhanced Word document formatting
- ✅ Added comprehensive error handling
- ✅ Optimized caching for cost reduction
- ✅ Created professional report templates
Built with ❤️ for efficient document analysis and compliance