A Python-based document analysis system that combines vector search and graph-based retrieval for intelligent document processing and question answering.
- Hybrid Document Retrieval: Combines vector similarity search with graph-based entity relationships
- Smart Text Processing: Semantic chunking and intelligent summarization
- Multi-Database Integration: Uses Qdrant for vector storage and Neo4j for graph relationships
- Adaptive Summarization: Map-reduce approach for handling large documents
- Entity Recognition: Built-in named entity recognition and relationship extraction
- Error Handling: Comprehensive error handling and fallback mechanisms
- Progress Logging: Detailed logging for monitoring and debugging
- Document parsing and text extraction
- Semantic chunking of text
- Vector embeddings generation
- Storage in Qdrant vector database
- Entity extraction and relationship mapping
- Storage in Neo4j graph database
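The chunking step in the pipeline above can be sketched as a sentence-aligned splitter under a token budget. This is a minimal illustration, not the project's actual implementation, which uses semantic similarity and Tika-extracted text; the word-count budget here is a crude stand-in for real token counting:

```python
import re


def chunk_text(text: str, max_words: int = 50) -> list[str]:
    """Split text into sentence-aligned chunks under a rough word budget."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        words = len(sentence.split())
        # Flush the current chunk before it would exceed the budget.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Keeping chunk boundaries on sentence boundaries preserves local coherence, which matters for both embedding quality and downstream summarization.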
- Hybrid search combining vector and graph approaches
- Entity-aware search capabilities
- Contextual and filtered searches
- Support for parent-child document relationships
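One common way to merge vector and graph result lists is reciprocal rank fusion (RRF). The project's actual fusion strategy is not documented here, so treat this as an illustrative sketch of the hybrid ranking idea, with made-up function and variable names:

```python
def rrf_merge(vector_ids: list[str], graph_ids: list[str], k: int = 60) -> list[str]:
    """Merge two ranked id lists by reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in (vector_ids, graph_ids):
        for rank, doc_id in enumerate(ranking):
            # Documents near the top of either ranking get a larger boost;
            # documents in both rankings accumulate score from each.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

A document surfaced by both the vector index and the graph traversal accumulates score from both lists, so it outranks documents found by only one retriever.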
- Automatic token counting and limit handling
- Direct and map-reduce summarization strategies
- Rate limiting and error handling
- Progress monitoring and logging
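The choice between the direct and map-reduce strategies can be sketched as follows. `summarize` stands in for the real LLM call, and the word-based token estimate is a deliberately crude placeholder for actual tokenizer-based counting:

```python
def map_reduce_summarize(chunks: list[str], summarize, max_tokens: int = 4096) -> str:
    """Summarize directly if the text fits the budget, else map-reduce."""
    def est_tokens(text: str) -> int:
        return int(len(text.split()) * 1.3)  # crude heuristic, not a real tokenizer

    combined = "\n".join(chunks)
    if est_tokens(combined) <= max_tokens:
        return summarize(combined)                 # direct strategy
    partials = [summarize(c) for c in chunks]      # map step: summarize each chunk
    return summarize("\n".join(partials))          # reduce step: summarize the summaries
```

The direct path avoids an extra round of LLM calls for small documents; the map-reduce path keeps every individual call under the context limit for large ones.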
- Unified question-answering interface
- Document summarization capabilities
- Flexible configuration options
- OpenAI API for embeddings and completions
- Qdrant for vector storage
- Neo4j for graph database
- Sentence Transformers for local embeddings
- spaCy for NLP tasks
- Transformers for entity recognition
- Apache Tika for document parsing
Required environment variables:
- `TIKA_SERVER_URL`: URL for the Apache Tika server
- `QDRANT_HOST`: Qdrant server host
- `QDRANT_PORT`: Qdrant server port
- `BASE_URL`: OpenAI API base URL
- `API_KEY`: OpenAI API key
- `NEO4J_URI`: Neo4j database URI
- `NEO4J_AUTH`: Neo4j authentication credentials
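A possible way to validate this configuration at startup is to fail fast on any missing variable. The variable names match the list above; the loader itself is illustrative, not part of the project:

```python
import os

REQUIRED_VARS = [
    "TIKA_SERVER_URL", "QDRANT_HOST", "QDRANT_PORT",
    "BASE_URL", "API_KEY", "NEO4J_URI", "NEO4J_AUTH",
]


def load_config() -> dict:
    """Read required environment variables, failing fast on any missing one."""
    missing = [v for v in REQUIRED_VARS if not os.environ.get(v)]
    if missing:
        raise EnvironmentError(f"Missing environment variables: {missing}")
    return {v: os.environ[v] for v in REQUIRED_VARS}
```

Failing at startup with a list of every missing variable beats discovering a bad configuration mid-ingestion.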
- Document Ingestion:

```python
intake = DataIntake(collection_name="your_collection", file_path="path/to/document")
intake.organize_intake()
```
- Question Answering:

```python
answer = Answering(collection_name="your_collection")
result = await answer.answer(
    question="Your question?",
    use_type="retriever",
    max_tokens=4096,
    top_k=10,
    use_graph=True,
)
```
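Because `answer()` is a coroutine, a plain synchronous script has to drive it with an event loop, e.g. via `asyncio.run`. A minimal sketch of the pattern, using a stand-in class (`FakeAnswering` is illustrative only, not part of this project):

```python
import asyncio


class FakeAnswering:
    """Stand-in for the project's Answering class, just to show the pattern."""

    def __init__(self, collection_name: str):
        self.collection_name = collection_name

    async def answer(self, question: str, **kwargs) -> str:
        return f"answer to {question!r}"


async def main() -> str:
    answering = FakeAnswering(collection_name="your_collection")
    # In a real script this await would hit the retriever and the LLM.
    return await answering.answer(question="Your question?")


result = asyncio.run(main())
```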
- Document Summarization:

```python
summarizer = QdrantSummarizer(collection_name="your_collection")
texts = summarizer.retrieve_all_texts()
summary = summarizer.summarize_texts(texts, max_tokens=4096)
```
The system includes comprehensive error handling for:
- Rate limiting
- Context length exceeded
- API errors
- Database connection issues
- Token limit violations
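Rate limiting and transient API errors are typically handled with retry-and-backoff. The wrapper below is an illustrative sketch of that pattern, not the project's actual implementation:

```python
import time


def with_retries(fn, attempts: int = 5, base_delay: float = 0.5):
    """Call fn(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            # Sleep 0.5s, 1s, 2s, ... before the next attempt.
            time.sleep(base_delay * (2 ** attempt))
```

A production version would catch only retriable exception types (e.g. rate-limit and timeout errors) rather than bare `Exception`.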
- Uses batching for large document processing
- Implements fallback mechanisms for embedding generation
- Supports both local and API-based models
- Includes retry logic for API calls
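The batching pattern mentioned above amounts to embedding chunks in fixed-size groups rather than one call per chunk. A minimal sketch, where `embed_batch` is a placeholder for the real embedding call:

```python
def batched(items: list, batch_size: int = 32):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


def embed_all(chunks: list, embed_batch) -> list:
    """Embed every chunk, issuing one embedding call per batch."""
    vectors = []
    for batch in batched(chunks, batch_size=32):
        vectors.extend(embed_batch(batch))  # one API call per batch of 32
    return vectors
```

Batching cuts per-request overhead and makes it easier to stay under API rate limits when processing large documents.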
The code is licensed under AGPLv3. You may read the full license in LICENSE.md.