# Groq-RAG

A real-time Retrieval-Augmented Generation (RAG) application built with Streamlit and Groq. It lets users upload documents (PDF, DOCX, PPTX, TXT) and add website URLs to build a knowledge base, which can then be queried in natural language.

## Features
- Support for multiple document formats (PDF, DOCX, PPTX, TXT)
- Web crawling capability for adding online content
- Real-time document processing with queuing system
- Vector similarity search using FAISS
- Integration with Groq's LLM API
- Persistent storage of knowledge base
- Interactive web interface built with Streamlit
## Prerequisites

- Python 3.8 or higher
- A Groq API key
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/sidharthsajith/Groq-RAG.git
  cd groq-rag
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  # On Windows
  venv\Scripts\activate
  # On macOS/Linux
  source venv/bin/activate
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file in the project root and add your Groq API key:

  ```
  GROQ_API_KEY=your_api_key_here
  ```
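How the key is read at startup depends on `app.py`; here is a minimal sketch, assuming the app loads it with `python-dotenv`:

```python
# Minimal sketch: load the Groq API key from .env (assumes python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root into the environment
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
if not GROQ_API_KEY:
    raise RuntimeError("GROQ_API_KEY is not set; add it to your .env file")
```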
## Usage

- Start the application:

  ```bash
  streamlit run app.py
  ```

- Open your web browser and navigate to the provided URL (typically `http://localhost:8501`).
- Use the sidebar to:
  - Upload documents (PDF, DOCX, PPTX, TXT)
  - Add URLs for web crawling
  - Process documents
  - Clear the knowledge base if needed
- Ask questions in the main interface to query your knowledge base.
## How It Works

- **Document Upload and URL Processing**
  - Documents are uploaded through the Streamlit interface
  - URLs are crawled to extract their text content
  - All content is queued for processing, with a 30-second delay between documents (a sketch of this pattern follows)
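A minimal sketch of this kind of delayed background queue, using Python's standard `queue` and `threading` modules; the names `doc_queue`, `worker`, and `process_document` are illustrative assumptions, not the app's actual internals:

```python
# Illustrative sketch of a delayed processing queue (names are assumptions)
import queue
import threading
import time

doc_queue: queue.Queue = queue.Queue()

def process_document(doc) -> None:
    ...  # extract text, embed, and index it (see the following steps)

def worker() -> None:
    while True:
        doc = doc_queue.get()   # blocks until a document is queued
        process_document(doc)
        doc_queue.task_done()
        time.sleep(30)          # 30-second delay between documents

threading.Thread(target=worker, daemon=True).start()
```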
- **Text Extraction** (sketched below)
  - PDFs: Uses `pypdf` to extract text from each page
  - DOCX: Uses `python-docx` to extract text from paragraphs
  - PPTX: Uses `python-pptx` to extract text from slides
  - TXT: Direct text extraction
  - Web pages: Uses `BeautifulSoup` to extract cleaned text content
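A minimal sketch of these per-format extractors, using the libraries named above (the function names are illustrative):

```python
# Sketch of per-format text extraction (function names are assumptions)
import requests
from bs4 import BeautifulSoup
from docx import Document
from pptx import Presentation
from pypdf import PdfReader

def extract_pdf(path: str) -> str:
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_docx(path: str) -> str:
    return "\n".join(p.text for p in Document(path).paragraphs)

def extract_pptx(path: str) -> str:
    prs = Presentation(path)
    return "\n".join(
        shape.text for slide in prs.slides
        for shape in slide.shapes if hasattr(shape, "text")
    )

def extract_url(url: str) -> str:
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return soup.get_text(separator=" ", strip=True)
```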
- **Text Processing** (sketched below)
  - Content is split into chunks of approximately 1000 words
  - Each chunk is converted into a vector embedding using the `all-MiniLM-L6-v2` model
  - Embeddings are normalized and added to a FAISS index
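A minimal sketch of this step, assuming `sentence-transformers` and an inner-product FAISS index (with normalized embeddings, inner product equals cosine similarity); function names are illustrative:

```python
# Sketch: chunk text, embed it, and add the vectors to a FAISS index
from typing import List

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.IndexFlatIP(384)  # 384 = embedding size of all-MiniLM-L6-v2
chunks: List[str] = []          # text chunks, kept in index order

def chunk_words(text: str, size: int = 1000) -> List[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def add_to_index(text: str) -> None:
    new_chunks = chunk_words(text)
    # normalize_embeddings=True makes inner product equal cosine similarity
    vecs = model.encode(new_chunks, normalize_embeddings=True)
    index.add(np.asarray(vecs, dtype="float32"))
    chunks.extend(new_chunks)
```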
- **Knowledge Base Management** (sketched below)
  - Vector index and text chunks are saved to disk
  - Can be reloaded on application restart
  - Clearable through the interface
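A minimal sketch of the save/load step, assuming FAISS's file I/O plus `pickle` for the chunk list (the file names are assumptions):

```python
# Sketch: persist and reload the knowledge base (paths are assumptions)
import os
import pickle

import faiss

def save_kb(index, chunks, index_path="kb.index", chunks_path="chunks.pkl"):
    faiss.write_index(index, index_path)
    with open(chunks_path, "wb") as f:
        pickle.dump(chunks, f)

def load_kb(index_path="kb.index", chunks_path="chunks.pkl"):
    if not (os.path.exists(index_path) and os.path.exists(chunks_path)):
        return None, []  # nothing saved yet
    with open(chunks_path, "rb") as f:
        chunks = pickle.load(f)
    return faiss.read_index(index_path), chunks
```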
- **Question Input**
  - User enters a natural language question
  - The question is converted to a vector embedding
- **Retrieval** (a sketch covering this and the previous step follows)
  - The FAISS index is searched for the most similar text chunks
  - The top 3 most relevant chunks are retrieved
  - Source information is preserved
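A minimal sketch of question embedding plus top-3 retrieval, reusing the `model`, `index`, and `chunks` names from the sketches above:

```python
# Sketch: embed the question and retrieve the top-3 most similar chunks
from typing import List

import numpy as np

def retrieve(question: str, k: int = 3) -> List[str]:
    q = model.encode([question], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [chunks[i] for i in ids[0] if i != -1]  # -1 marks empty slots
```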
- **Response Generation** (sketched below)
  - Retrieved chunks are combined into a context block
  - The context and question are sent to Groq's LLM
  - The response is generated and displayed with source references
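A minimal sketch of the Groq call, using the official `groq` Python client and reusing `GROQ_API_KEY` and `retrieve` from the earlier sketches; the model name and prompt wording are assumptions, not necessarily what `app.py` uses:

```python
# Sketch: send retrieved context plus the question to Groq's chat API
from groq import Groq

client = Groq(api_key=GROQ_API_KEY)

def answer(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumption: any Groq-hosted model works
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```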
## Key Components

- `DocumentProcessor`: Main class handling all document processing and querying
- FAISS: High-performance similarity search
- Sentence Transformers: Document and query embedding
- Groq Integration: LLM-based response generation
- Threading: Background processing of documents
- Streamlit: Web interface and user interaction
## Architecture

The application follows a modular architecture:
- Frontend: Streamlit web interface
- Processing Layer: Document handling and embedding generation
- Storage Layer: FAISS index and pickle storage
- API Layer: Groq LLM integration
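As an illustrative outline only (the method names are assumptions, not the actual source), the layers could map onto the `DocumentProcessor` class roughly like this:

```python
# Hypothetical outline of how the layers map onto DocumentProcessor
class DocumentProcessor:
    def add_document(self, file) -> None: ...   # Processing layer: extract + embed
    def add_url(self, url: str) -> None: ...    # Processing layer: crawl + embed
    def save(self) -> None: ...                 # Storage layer: FAISS index + pickle
    def load(self) -> None: ...                 # Storage layer: restore from disk
    def query(self, question: str) -> str: ...  # API layer: retrieve + Groq LLM
```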
## Limitations

- Processing large documents may take significant time
- Web crawling is rate-limited and basic
- Knowledge base size is limited by available memory
- Requires stable internet connection for Groq API access
## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License - see the LICENSE file for details.