This project implements a Retrieval-Augmented Generation (RAG) system for intelligent document search and retrieval. It enables users to extract, preprocess, store, and query text from PDFs, CSVs, voice files, web links and Youtube videos, leveraging a vector database and LLMs for accurate responses.
[ User ] → [ Frontend UI ] → [ API (FastAPI) ] → [ Vector DB (FAISS/Pinecone) ] → [ Embedding Model ] → [ LLM (Hugging Face) ]
✅ Multi-source data ingestion: PDFs, CSVs, voice files, web links
✅ Embeddings-based retrieval: Efficient search via FAISS/Pinecone
✅ LLM-powered responses: Uses transformers for intelligent answers
✅ Transparent retrieval: Returns top-k chunks with similarity scores
✅ Automatic database updates: Watches for document changes
✅ User-friendly UI: Built with Streamlit/Gradio
✅ REST API support: Query system programmatically using FastAPI
📦 RAG-System
┣ 📂 data_ingestion
┃ ┣ 📜 extract_pdfs.py
┃ ┣ 📜 extract_csvs.py
┃ ┣ 📜 extract_voice.py
┃ ┣ 📜 extract_web.py
┃ ┗ 📜 extract_yt_videos.py
┣ 📂 vector_store
┃ ┣ 📜 faiss_store.py
┃ ┗ 📜 pinecone_store.py
┣ 📂 rag_engine
┃ ┣ 📜 retriever.py
┃ ┣ 📜 generator.py
┃ ┗ 📜 query_pipeline.py
┣ 📂 ui
┃ ┗ 📜 app.py
┣ 📜 requirements.txt
┣ 📜 README.md
┗ 📜 main.py
| Component | Technology |
|---|---|
| Backend | Python |
| Data Parsing | PyPDF2, pandas, BeautifulSoup, Whisper |
| Vector DB | FAISS / Pinecone |
| Embeddings | Sentence-Transformers (all-MiniLM-L6-v2) |
| LLM | Hugging Face Transformers (facebook/bart-large) |
| Frontend | Streamlit / Gradio |
| API | FastAPI |
# Clone the repository
git clone https://github.com/butterpaneermasala/hacko-tech
cd rag-system
# Install dependencies
pip install -r requirements.txt
# run
streamlit run app.pypython main.pystreamlit run ui/app.py| Method | Endpoint | Description |
|---|---|---|
| POST | /ingest |
Upload and process documents |
| GET | /query?text=your_query |
Search documents and get responses |
| GET | /collections |
List available document collections |
✔ Input validation for document uploads
✔ Confidence thresholds for LLM responses
✔ Metadata and access controls for stored data
🔹 Implement cross-encoder for better re-ranking
🔹 Support additional file formats (JSON, DOCX)
🔹 Optimize embeddings with knowledge distillation
📢 Contributions are welcome! Feel free to open an issue or submit a PR. 🙌