Prerequisites: This application requires Ollama to be installed and running for AI-powered analysis features.
An advanced document comparison tool that leverages semantic analysis and AI to help identify changes between policy documents, government memos, and other official documents.
- Document Upload & Processing: Support for PDF, TXT, and HTML documents
- Semantic Document Comparison: Advanced algorithms to match and compare document sections
- AI-Powered Analysis: LLM integration for intelligent change detection and summarization
- Structured Data Extraction: Automatically extract definitions, requirements, actions, and deadlines (see the sketch after this list)
- Visual Diff Generation: HTML-based diff views for easy change identification
- Change Impact Classification: Categorize changes by impact level and type
- Fallback Mechanisms: Robust error handling with simplified analysis when AI services are unavailable
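The structured extraction step can often start with plain pattern matching before any model is involved. Below is a minimal, hypothetical sketch of deadline extraction; the function name and patterns are illustrative, not the actual `structured_parser.py` implementation:

```python
import re

# Illustrative pattern: sentences containing a deadline-like phrase
DEADLINE_PATTERN = re.compile(
    r"[^.]*\b(?:no later than|within \d+\s+days?|by (?:January|February|March|April|May|June|"
    r"July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4})\b[^.]*\.",
    re.IGNORECASE,
)

def extract_deadlines(text: str) -> list[str]:
    """Hypothetical helper: return sentences that mention a deadline."""
    return [m.group(0).strip() for m in DEADLINE_PATTERN.finditer(text)]

print(extract_deadlines(
    "Agencies must submit implementation plans no later than 90 days after issuance. "
    "All reports are due by March 1, 2026."
))
```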
- Backend: Flask (Python web framework)
- Database: SQLAlchemy with SQLite
- Document Processing: PyPDF2, pdfplumber, BeautifulSoup4
- AI/ML: Ollama for local LLM analysis
- Semantic Matching: Sentence transformers and scikit-learn for document similarity (see the sketch after this list)
- Frontend: Bootstrap 5 with vanilla JavaScript
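As an illustration of how the semantic matching layer might pair sections, here is a minimal sketch using sentence-transformers and scikit-learn; the model name and function are assumptions, not the code in `semantic_matcher.py`:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def best_match(old_section: str, new_sections: list[str]) -> tuple[int, float]:
    """Return (index, similarity) of the new section closest to old_section."""
    embeddings = model.encode([old_section] + new_sections)
    scores = cosine_similarity(embeddings[:1], embeddings[1:])[0]
    idx = int(scores.argmax())
    return idx, float(scores[idx])

# Example: pair an old "Definitions" section with its best candidate
idx, score = best_match(
    "Definitions. 'Agency' means any executive department.",
    ["Purpose. This memo establishes reporting duties.",
     "Definitions. 'Agency' means any executive department or office."],
)
```

Matching on sentence embeddings rather than raw text lets reordered or reworded sections still pair up, which a plain line diff would miss.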
- Python 3.11+
- Ollama installed and running
Install dependencies: `pip install -r requirements.txt`

Run the application: `python main.py`

Or run with Gunicorn for production: `gunicorn --bind 0.0.0.0:5000 --reuse-port --reload main:app`
- Navigate to the home page
- Click "Choose File" and select a PDF, TXT, or HTML document
- Enter a descriptive title for the document
- Click "Upload Document"
- Upload at least two documents
- Navigate to the "Compare Documents" page
- Select two documents from the dropdown menus
- Click "Compare Documents"
- Review the detailed comparison results, including:
  - Section-by-section changes
  - Added/removed content
  - Modified sections with detailed analysis
  - Overall summary of changes
The comparison results include:
- Matched Sections: Sections that exist in both documents with similarity analysis
- Added Sections: Content that appears only in the newer document
- Removed Sections: Content that was present in the original but removed
- Change Statistics: Quantitative analysis of document changes
- Impact Classification: AI-powered categorization of change significance
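For orientation, here is a hypothetical sketch of how such a result might be represented in Python; the key names are assumptions and may not match the actual output of `diff_generator.py`:

```python
# Hypothetical shape of a comparison result; actual keys may differ.
comparison_result = {
    "matched_sections": [
        {
            "heading": "Section 2. Definitions",
            "similarity": 0.91,
            "analysis": "Definition of 'covered entity' broadened to include contractors.",
        }
    ],
    "added_sections": ["Section 7. Reporting Requirements"],
    "removed_sections": ["Appendix B"],
    "statistics": {"matched": 12, "added": 1, "removed": 1},
    "impact": "moderate",
}
```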
The application uses SQLite and automatically creates `instance/diffpolicy.db`.
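A minimal sketch of the kind of Flask-SQLAlchemy setup that produces this file, assuming the standard application-factory pattern; the actual `app.py` may differ:

```python
from flask import Flask
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

def create_app() -> Flask:
    app = Flask(__name__)
    # With Flask-SQLAlchemy 3.x a relative SQLite URL resolves to the
    # instance/ folder, which is where diffpolicy.db ends up.
    app.config["SQLALCHEMY_DATABASE_URI"] = "sqlite:///diffpolicy.db"
    db.init_app(app)
    with app.app_context():
        db.create_all()  # create the database file and tables if missing
    return app
```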
- Maximum file size: 16MB
- Supported formats: PDF, TXT, HTML, HTM
- Files are stored in the `uploads/` directory
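These limits map onto standard Flask settings. A sketch of how they might be wired up (the variable and function names are assumptions):

```python
import os
from flask import Flask

ALLOWED_EXTENSIONS = {"pdf", "txt", "html", "htm"}

app = Flask(__name__)
app.config["MAX_CONTENT_LENGTH"] = 16 * 1024 * 1024  # Flask rejects larger uploads with HTTP 413
app.config["UPLOAD_FOLDER"] = "uploads"
os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)

def allowed_file(filename: str) -> bool:
    """Accept only the extensions listed above."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS
```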
The platform uses Ollama for AI-powered document analysis and includes fallback mechanisms when services are unavailable.
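A minimal sketch of that pattern, calling Ollama's local HTTP API and falling back to a canned summary when the service is unreachable; the prompt, model name, and function name are illustrative:

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def analyze_change(old_text: str, new_text: str, model: str = "llama3") -> str:
    """Summarize a change with a local Ollama model, with a non-AI fallback."""
    prompt = (
        "Summarize the substantive differences between these two policy sections.\n\n"
        f"OLD:\n{old_text}\n\nNEW:\n{new_text}"
    )
    try:
        resp = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["response"]
    except requests.RequestException:
        # Fallback: Ollama not available, return a minimal placeholder summary.
        return "AI analysis unavailable; section text differs between versions."
```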
- `GET /`: Main upload page
- `POST /upload`: Document upload handler
- `GET /compare`: Document comparison form
- `POST /compare`: Process document comparison
- `GET /document/<id>`: View individual document
- `POST /analyze_section`: API endpoint for section analysis
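A hypothetical client-side example of exercising the upload and comparison endpoints with `requests`; the form field names (`file`, `title`, `doc1`, `doc2`) are assumptions and may not match the actual templates:

```python
import requests

BASE = "http://localhost:5000"

# Upload a document (field names are assumptions)
with open("policy_v2.pdf", "rb") as fh:
    requests.post(f"{BASE}/upload", files={"file": fh}, data={"title": "Policy v2"})

# Request a comparison between two previously uploaded documents
resp = requests.post(f"{BASE}/compare", data={"doc1": 1, "doc2": 2})
print(resp.status_code)
```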
├── app.py # Flask application factory
├── main.py # Application entry point
├── models.py # Database models
├── routes.py # URL routes and handlers
├── document_processor.py # Document parsing and extraction
├── semantic_matcher.py # Section matching algorithms
├── diff_generator.py # Comparison result generation
├── llm_analyzer.py # LLM integration for analysis
├── simple_analyzer.py # Fallback analysis without LLM
├── structured_parser.py # Structured data extraction
├── templates/ # HTML templates
├── static/ # CSS and JavaScript files
├── uploads/ # Uploaded document storage
└── instance/ # Database and instance files
To run in development mode:

- `export FLASK_ENV=development`
- `python main.py`
To support additional document formats:
- Update `ALLOWED_EXTENSIONS` in `routes.py`
- Add processing logic in `document_processor.py` (a sketch follows this list)
- Update the upload form validation in the templates
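For example, adding DOCX support might look like the following sketch, assuming the `python-docx` package; the function name and dispatch point are assumptions:

```python
from docx import Document  # python-docx, an assumed extra dependency

# In routes.py: add the new extension (illustrative)
ALLOWED_EXTENSIONS = {"pdf", "txt", "html", "htm", "docx"}

# In document_processor.py: extract text for the new format
def extract_docx_text(path: str) -> str:
    """Return the plain text of a .docx file, one paragraph per line."""
    return "\n".join(p.text for p in Document(path).paragraphs)
```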
To add new analysis features:
- Extend the `LLMAnalyzer` class in `llm_analyzer.py`
- Update the fallback logic in `simple_analyzer.py` (a sketch follows this list)
- Modify the comparison results structure in `diff_generator.py`
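As an example of the fallback step, the non-LLM path could gain a heuristic impact rating based on how much of the section text changed. This difflib-based sketch is illustrative, not the existing `simple_analyzer.py` code:

```python
import difflib

def heuristic_impact(old_text: str, new_text: str) -> str:
    """Rate a change without an LLM, based on how much of the text survived."""
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    if ratio > 0.95:
        return "minor"      # near-identical wording
    if ratio > 0.75:
        return "moderate"   # noticeable edits within the section
    return "major"          # section substantially rewritten
```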
- Import Errors: Ensure all dependencies are installed via `pip install -r requirements.txt`
- Database Errors: Check database permissions and connection string
- File Upload Errors: Verify the `uploads/` directory exists and is writable
- Memory Issues: Large documents may require increased memory allocation
- For large documents, consider implementing pagination
- Use database indexing for frequently queried fields
- Cache semantic embeddings for repeated comparisons (see the sketch after this list)
- Consider using a message queue for long-running analysis tasks
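Caching embeddings, as suggested above, could be as simple as keying on a hash of the section text; a sketch assuming a sentence-transformers model object:

```python
import hashlib

_embedding_cache: dict[str, list[float]] = {}

def cached_embedding(text: str, model) -> list[float]:
    """Embed a section once and reuse the vector across comparisons.

    `model` is assumed to be a sentence-transformers model.
    """
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = model.encode(text).tolist()
    return _embedding_cache[key]
```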
- Fork the repository
- Create a feature branch
- Make your changes with appropriate tests
- Submit a pull request with a clear description