Teddy edited this page Jul 17, 2024 · 10 revisions

ERAG Wiki

Table of Contents

  1. Overview
  2. Key Features
  3. System Architecture
  4. Installation
  5. Usage
  6. Configuration
  7. Customization
  8. Troubleshooting
  9. Advanced Topics

Overview

ERAG is a retrieval-augmented generation (RAG) system that combines lexical, semantic, text, and knowledge graph searches with conversation context to produce accurate, contextually relevant responses. It ingests various document types, creates embeddings, builds knowledge graphs, and uses this information to answer user queries. The system is designed to enhance document understanding and question-answering capabilities through retrieval-augmented generation.

Key Features

  1. Multi-modal Search: Combines lexical, semantic, text, and knowledge graph searches with customizable weighting.
  2. Conversation Context Management: Maintains context across interactions for coherent conversations.
  3. Document Processing: Handles DOCX, JSON, PDF, and plain text with configurable chunking.
  4. Embedding Generation: Creates and manages embeddings using state-of-the-art sentence transformer models.
  5. Knowledge Graph Creation: Builds and utilizes a graph for enhanced information retrieval.
  6. Web Content Processing: Implements real-time web crawling, content extraction, and summarization.
  7. Knol Creation: Generates comprehensive knowledge entries on specific subjects.
  8. Retrieval-Augmented Generation (RAG): Combines retrieved context with language model capabilities.
  9. Adaptive Context Retrieval: Dynamically adjusts context based on query complexity.
  10. Multi-stage Summarization: Summarizes individual web pages, then synthesizes them into a single comprehensive summary.
  11. Entity Extraction and Linking: Enhances the knowledge graph with extracted entities.
  12. Modular Architecture: Allows easy extension and customization of system capabilities.
  13. Debug and Logging: Provides comprehensive logging and debug information.
  14. User-friendly Interfaces: Offers both CLI and GUI interfaces with color-coded output.
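
The "customizable weighting" in feature 1 can be illustrated with a small sketch. The function name `fuse_scores` and the weight values are hypothetical, not taken from the ERAG codebase; the real weights live in the Settings tab:

```python
def fuse_scores(results_by_method, weights):
    """Combine per-method relevance scores into one ranking.

    results_by_method: {"lexical": {doc_id: score}, "semantic": {...}, ...}
    weights: {"lexical": 0.4, "semantic": 0.6, ...} (illustrative values)
    """
    combined = {}
    for method, scores in results_by_method.items():
        w = weights.get(method, 0.0)
        for doc_id, score in scores.items():
            # Accumulate the weighted contribution of each search method
            combined[doc_id] = combined.get(doc_id, 0.0) + w * score
    # Highest combined score first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

Raising the weight of one method shifts the final ranking toward documents that method favors, which is how the weight settings steer retrieval.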

System Architecture

The ERAG system consists of several interconnected components:

  1. Document Processing (file_processing.py): Handles ingestion and preprocessing of documents.
  2. Embedding Utils (embeddings_utils.py): Manages creation, storage, and retrieval of embeddings.
  3. Knowledge Graph Creation (create_graph.py): Creates a graph representation of document content.
  4. Settings Management (settings.py): Manages system-wide configuration settings.
  5. Main Application (main.py): Implements the GUI and orchestrates system components.
  6. Talk2Doc (talk2doc.py): Core RAG system for document interaction.
  7. Web RAG (web_rag.py): Extends RAG capabilities to web content.
  8. Web Sum (web_sum.py): Provides web content summarization.
  9. Knol Creator (create_knol.py): Generates comprehensive knowledge entries.
  10. Search Utils (search_utils.py): Implements various search methods for context retrieval.
  11. Talk2Model (talk2model.py): Enables direct interaction with language models.
  12. Talk2URL (talk2url.py): Facilitates interaction with web content.
  13. Talk2Git (talk2git.py): Analyzes and summarizes GitHub repositories.

Installation

  1. Clone the repository:

    git clone https://github.com/EdwardDali/erag.git && cd erag
    
  2. Install required dependencies:

    pip install -r requirements.txt
    
  3. Download required models:

    python -m spacy download en_core_web_sm
    python -m nltk.downloader punkt
    
  4. Install Ollama (optional) if you want to serve models locally.

  5. Set up environment variables in a .env file:

    GROQ_API_KEY=your_groq_api_key
    GITHUB_TOKEN=your_github_token
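
If you prefer not to add another dependency for reading the .env file, a minimal loader looks like this. This is an illustrative sketch, not ERAG's actual startup code, which may handle the file differently:

```python
import os

def load_env_file(path=".env"):
    """Minimal .env loader: KEY=VALUE lines; blank lines and '#' comments ignored.
    Existing environment variables are not overwritten."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```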
    

Usage

  1. Start the ERAG GUI:

    python main.py

  2. Use the GUI to:

    • Upload and process documents
    • Generate embeddings
    • Create knowledge graphs
    • Configure settings
    • Run RAG operations
  3. For CLI interactions, use specific modules:

    • Document RAG: python src/talk2doc.py <api_type>
    • Web RAG: python src/web_rag.py <api_type>
    • Web Summarization: python src/web_sum.py <api_type>
    • Model Interaction: python src/talk2model.py <api_type> <model>
    • URL Interaction: python src/talk2url.py <api_type>
    • GitHub Analysis: python src/talk2git.py <api_type>
    • Query Routing: python src/route_query.py <api_type>

Configuration

Customize ERAG through the Settings tab in the GUI or by modifying settings.py. Key settings include:

  • Chunk size and overlap for document processing
  • Embedding generation parameters
  • Knowledge graph creation settings
  • RAG system parameters
  • Search method weights and thresholds
  • Web crawling and summarization settings
  • GitHub analysis parameters

Customization

The system allows for extensive customization:

  • Modify embedding models in embeddings_utils.py
  • Adjust NLP models for entity extraction in create_graph.py
  • Fine-tune search method weights and thresholds
  • Customize knowledge graph parameters

Refer to the Settings tab in the GUI for all customization options.

Troubleshooting

  • Ensure all dependencies are correctly installed
  • Check console output for error messages
  • Verify API keys and tokens in the .env file
  • For performance issues, adjust chunk sizes or batch processing parameters
  • If using a local llama.cpp server, ensure the correct model files and configuration are in place

Advanced Topics

Workflow

  1. Document upload and processing
  2. Embedding generation
  3. Knowledge graph creation
  4. RAG system initialization
  5. User interaction and query processing
  6. Web content integration (optional)
  7. Knol creation (optional)
  8. Continuous learning and embedding updates

Performance Optimization

  • Use GPU for faster processing, especially with large language models
  • Adjust batch sizes for embedding generation based on available memory
  • Optimize knowledge graph parameters for balance between detail and performance
  • Use query routing for efficient handling of different types of queries
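
Adjusting batch sizes amounts to trading memory for throughput: larger batches make better use of the GPU but risk out-of-memory errors. A generic batching helper (the name `batched` is illustrative, not from the codebase) looks like:

```python
def batched(items, batch_size):
    """Yield fixed-size batches from a list; the last batch may be smaller.
    Feeding each batch to the embedding model amortizes per-call overhead."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```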

Module Details

1. Document Processing (file_processing.py)

This module handles the ingestion and preprocessing of various document types.

Key Features:

  • Supports multiple file formats: DOCX, PDF, Text, and JSON
  • Implements configurable text chunking with overlap
  • Provides functions for uploading and processing different file types

Main Functions:

  • upload_docx(), upload_pdf(), upload_txt(), upload_json(): Handle file uploads for respective formats
  • handle_text_chunking(text): Splits text into chunks with configurable size and overlap
  • process_file(file_type): Processes files based on their type
  • append_to_db(chunks, db_file): Appends processed chunks to the database file
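
Chunking with overlap keeps context that straddles a chunk boundary from being lost, since each chunk repeats the tail of the previous one. A minimal sketch of the idea (the function below is illustrative; it is not the exact body of handle_text_chunking, and the default sizes are assumptions):

```python
def chunk_text(text, chunk_size=500, overlap=200):
    """Split text into character chunks where consecutive chunks
    share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```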

2. Embedding Utils (embeddings_utils.py)

This module manages the creation, storage, and retrieval of document embeddings.

Key Features:

  • Utilizes sentence transformers for embedding generation
  • Supports batch processing for efficient embedding computation
  • Provides functions for loading and saving embeddings

Main Functions:

  • compute_and_save_embeddings(model, save_path, content): Computes and saves embeddings for given content
  • load_embeddings_and_data(embeddings_file): Loads previously saved embeddings and associated data
  • load_or_compute_embeddings(model, db_file, embeddings_file): Loads existing embeddings or computes new ones if necessary
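
The load-or-compute pattern is a simple cache: reuse saved embeddings when they exist, otherwise compute and persist them. The sketch below illustrates the control flow only; `compute_fn` is a stand-in for the real sentence-transformer call, and the real module's file format may differ:

```python
import os
import pickle

def load_or_compute_embeddings(compute_fn, db_file, embeddings_file):
    """Return cached embeddings if present, else compute and cache them."""
    if os.path.exists(embeddings_file):
        # Cache hit: skip the expensive computation entirely
        with open(embeddings_file, "rb") as f:
            return pickle.load(f)
    with open(db_file) as f:
        content = f.read().splitlines()
    embeddings = [compute_fn(line) for line in content]
    with open(embeddings_file, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings
```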

3. Knowledge Graph Creation (create_graph.py)

This module is responsible for creating a knowledge graph from processed documents.

Key Features:

  • Uses spaCy for named entity recognition and natural language processing
  • Creates a NetworkX graph representing document structure and entity relationships
  • Supports semantic edge creation based on document similarity

Main Functions:

  • extract_entities_with_confidence(text): Extracts named entities from text with confidence scores
  • create_networkx_graph(data, embeddings): Creates a knowledge graph from document data and embeddings
  • create_knowledge_graph(): Main function to create and save the knowledge graph
  • create_knowledge_graph_from_raw(raw_file_path): Creates a knowledge graph from a raw text file
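
At its core, graph construction links entities that appear in the same document. The real module uses spaCy for entity extraction and NetworkX for the graph; the sketch below substitutes plain dictionaries to show the co-occurrence idea in a self-contained way (function and argument names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(docs_entities):
    """Build an undirected graph as {(entity_a, entity_b): weight},
    where the weight counts documents mentioning both entities."""
    edges = defaultdict(int)
    for entities in docs_entities:
        # sorted() gives each pair a canonical orientation,
        # so (a, b) and (b, a) map to the same edge
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return dict(edges)
```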

4. Settings Management (settings.py)

This module manages the configuration settings for the entire ERAG system.

Key Features:

  • Implements a singleton pattern for global access to settings
  • Provides methods for loading, saving, and resetting settings
  • Stores various configuration parameters for different components of the system

Main Functions:

  • load_settings(): Loads settings from a JSON file
  • save_settings(): Saves current settings to a JSON file
  • update_setting(key, value): Updates a specific setting
  • reset_to_defaults(): Resets all settings to their default values
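
The singleton pattern mentioned above ensures every component reads and writes the same settings object. A minimal sketch of how such a class might be structured (the default values shown are assumptions, not ERAG's actual defaults):

```python
import json

class Settings:
    """Singleton-style settings store: every instantiation
    returns the same shared object."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._values = {"chunk_size": 500, "overlap": 200}
        return cls._instance

    def update_setting(self, key, value):
        self._values[key] = value

    def save_settings(self, path):
        with open(path, "w") as f:
            json.dump(self._values, f)

    def load_settings(self, path):
        with open(path) as f:
            self._values.update(json.load(f))
```

Because every module calls Settings() and gets the same object, a change made in the GUI's Settings tab is immediately visible everywhere.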

5. Main Application (main.py)

This is the entry point of the ERAG system, implementing the graphical user interface and orchestrating the various components.

Key Features:

  • Implements a tkinter-based GUI with multiple tabs for different functionalities
  • Manages the interaction between user inputs and the underlying ERAG components
  • Provides buttons and interfaces for document upload, embedding creation, knowledge graph generation, and RAG operations

Main Classes:

  • ERAGGUI: The main GUI class that sets up the interface and handles user interactions

Key Functions:

  • create_widgets(): Sets up the main GUI components
  • upload_and_chunk(file_type): Handles file upload and processing
  • execute_embeddings(): Triggers the embedding computation process
  • create_knowledge_graph(): Initiates the knowledge graph creation process
  • run_model(): Starts the RAG system for interaction

6. Talk2Doc Module (talk2doc.py)

This module implements the core Retrieval-Augmented Generation (RAG) system for document interaction.

Key Components:

  • RAGSystem class: Manages the RAG process, including API configuration, embedding loading, and conversation handling.
  • Supports multiple API types (ollama, llama).
  • Implements a colored console interface for user interaction.

Main Functions:

  • configure_api(api_type): Sets up the API client based on the specified type.
  • load_embeddings(): Loads pre-computed embeddings for the document database.
  • load_knowledge_graph(): Loads the knowledge graph for enhanced context retrieval.
  • ollama_chat(user_input, system_message): Generates responses using the configured API.
  • run(): Main loop for user interaction with the RAG system.

7. Web RAG Module (web_rag.py)

This module extends the RAG system to work with web content, allowing for real-time information retrieval and processing.

Key Components:

  • WebRAG class: Manages web content retrieval, processing, and RAG-based question answering.
  • Implements web crawling, content chunking, and embedding generation for web pages.
  • Supports iterative searching and processing of web content.

Main Functions:

  • search_and_process(query): Performs web search and processes relevant URLs.
  • generate_qa(query): Generates answers based on processed web content.
  • process_next_urls(): Processes additional URLs to expand the knowledge base.
  • run(): Main loop for user interaction with the Web RAG system.

8. Web Sum Module (web_sum.py)

This module focuses on creating summaries of web content based on user queries.

Key Components:

  • WebSum class: Manages web content retrieval, summarization, and final summary generation.
  • Implements web search, content relevance filtering, and multi-stage summarization.

Main Functions:

  • search_and_process(query): Performs web search, filters relevant content, and generates summaries.
  • create_summary(content, query, index): Creates a summary for a single web page.
  • create_final_summary(summaries, query): Generates a comprehensive final summary from individual page summaries.
  • run(): Main loop for user interaction with the Web Sum system.

9. Knol Creator Module (create_knol.py)

This module is responsible for creating comprehensive knowledge entries (knols) on specific subjects.

Key Components:

  • KnolCreator class: Manages the process of creating, improving, and finalizing knols.
  • Implements a multi-stage process including initial creation, improvement, question generation, and answering.

Main Functions:

  • create_knol(subject): Creates an initial structured knowledge entry.
  • improve_knol(knol, subject): Enhances and expands the initial knol.
  • generate_questions(knol, subject): Generates relevant questions based on the knol content.
  • answer_questions(questions, subject, knol): Answers generated questions using the RAG system.
  • create_final_knol(subject): Combines improved knol and Q&A to create the final knowledge entry.
  • run_knol_creator(): Main loop for user interaction with the Knol Creation system.

10. Search Utils Module (search_utils.py)

This module provides various search utilities to enhance the retrieval capabilities of the ERAG system.

Key Components:

  • SearchUtils class: Implements different search methods including lexical, semantic, graph-based, and text search.

Main Functions:

  • lexical_search(query): Performs lexical (keyword-based) search on the document content.
  • semantic_search(query): Conducts semantic search using document embeddings.
  • get_graph_context(query): Retrieves context from the knowledge graph based on the query.
  • text_search(query): Performs basic text search on the document content.
  • get_relevant_context(user_input, conversation_context): Combines different search methods to retrieve the most relevant context.
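
To make the lexical method concrete, the sketch below ranks documents by keyword overlap with the query. It is an illustration of what a lexical search can look like, not the actual body of lexical_search, and the `top_k` parameter is an assumption:

```python
def lexical_search(query, documents, top_k=3):
    """Rank documents by how many query terms they share,
    returning the top_k best matches."""
    query_terms = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_terms & set(doc.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    # Most shared terms first
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

get_relevant_context can then merge rankings like this one with the semantic and graph results using the configured method weights.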

11. Talk2Model Module (talk2model.py)

This module enables direct interaction with various language models.

Key Features:

  • Supports multiple API types for language model interaction
  • Provides a simple interface for chatting with the selected model

Main Functions:

  • run(): Starts the interactive session with the selected model
  • get_model_response(user_prompt): Generates a response from the model based on user input

12. Talk2URL Module (talk2url.py)

This module facilitates interaction with web content, allowing users to ask questions about specific URLs.

Key Features:

  • Crawls and processes web pages
  • Generates responses based on the content of specified URLs
  • Supports conversation history and context management

Main Functions:

  • crawl_page(url): Retrieves and processes content from a given URL
  • generate_response(user_input): Generates a response based on the crawled web content
  • run(): Main loop for user interaction with the Talk2URL system

13. Talk2Git Module (talk2git.py)

This module analyzes and summarizes GitHub repositories, providing various insights into the codebase.

Key Features:

  • Clones and analyzes GitHub repositories
  • Performs static code analysis
  • Generates project summaries and dependency analyses
  • Detects code smells and suggests improvements

Main Functions:

  • process_repo(repo_url): Clones and processes a GitHub repository
  • static_code_analysis(): Performs static analysis on the repository's code
  • summarize_project(): Generates a summary of the project and its files
  • analyze_dependencies(): Analyzes the project's dependencies
  • detect_code_smells(): Identifies potential code smells in the repository
  • run(): Main loop for user interaction with the Talk2Git system