A lightweight RAG pipeline that automatically builds an evaluation dataset and scores it, using LangChain, RAGAS, Giskard, Gemini, and LangSmith.
Author: Daniel Puente Viejo
This repository demonstrates how to quickly evaluate RAG systems without the need to manually create a large dataset.
We use a sample use case: answering questions about popular TV series like Breaking Bad and La Casa de Papel.
The pipeline is built entirely with off-the-shelf tools:
- LangChain – a framework for building LLM applications (retrieval, prompting, chains)
- Gemini – Google's LLM, used as the generation model
- RAGAS – for evaluating RAG responses
- Giskard – to generate the test dataset and detect hallucinations, bias, and robustness issues
- LangSmith – to monitor, debug, and evaluate LLM usage at scale
We simulate a real-world scenario:
A user asks detailed questions about a TV show, such as character arcs, plot developments, or ethical decisions.
The system retrieves summaries of episodes and returns a relevant, accurate response.
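The core retrieval flow looks roughly like the sketch below. This is a minimal illustration, not the repository's exact code: the prompt wording and the model name (`gemini-1.5-flash`) are assumptions, and `vectorstore` is assumed to have been built from the episode summaries as described in the chunking step further down.

```python
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Assumes `vectorstore` was built from the episode summaries (see the chunking step below).
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate the retrieved episode summaries into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("Why does Walter White start cooking meth in Breaking Bad?"))
```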
| Tool | Role |
|---|---|
| LangChain | Build the RAG pipeline (retriever + LLM) |
| Gemini | LLM used to generate the answers |
| RAGAS | Automatically evaluate generated answers |
| Giskard | Generate the test dataset; test outputs for hallucinations, bias, robustness |
| LangSmith | Monitor and log RAG chains and metrics at runtime |
We eliminate the need to create a labeled dataset from scratch by:
- Generating realistic questions and answers using Giskard
- Using RAGAS to compute evaluation metrics (see the sketch after this list):
  - Context Precision
  - Context Recall
  - Faithfulness
  - Answer Similarity
  - Answer Relevancy
- Tracking all generations and context chunks using LangSmith
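As a rough illustration of the scoring step, the sketch below uses the classic `ragas.evaluate` API with the five metrics listed above. The column names (`question`, `contexts`, `answer`, `ground_truth`) follow RAGAS conventions, the example row is invented, and exact imports may differ between RAGAS versions.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_similarity,
    answer_relevancy,
)

# One row per evaluated question; `contexts` are the retrieved chunks,
# `ground_truth` is the reference answer generated by Giskard.
eval_dataset = Dataset.from_dict({
    "question": ["Why does Walter White start cooking meth?"],
    "contexts": [["Walter, a chemistry teacher diagnosed with cancer, ..."]],
    "answer": ["He wants to secure his family's finances after his diagnosis."],
    "ground_truth": ["He starts cooking meth to pay for treatment and provide for his family."],
})

# `evaluate` also accepts `llm=` / `embeddings=` arguments if you want the
# judge model to be something other than the RAGAS default.
scores = evaluate(
    eval_dataset,
    metrics=[context_precision, context_recall, faithfulness,
             answer_similarity, answer_relevancy],
)
print(scores)
```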
1. 🔧 Setup
- Install and import dependencies
- Set environment variables (Gemini, LangSmith)
- Initialize clients (sketch below)
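A minimal setup sketch, assuming API keys are provided via environment variables. The LangSmith variables (`LANGCHAIN_TRACING_V2`, `LANGCHAIN_API_KEY`, `LANGCHAIN_PROJECT`) and the Gemini key (`GOOGLE_API_KEY`) follow the libraries' documented conventions; the project name is a placeholder.

```python
import os

# LangSmith tracing: every chain run and its retrieved chunks get logged to this project.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "rag-tv-series-eval"  # hypothetical project name

# Gemini: langchain-google-genai reads the key from GOOGLE_API_KEY.
os.environ["GOOGLE_API_KEY"] = "<your-gemini-api-key>"

from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
```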
2. 📦 Chunking & vector database creation
- Split the source documents into chunks and build the vector database (sketch below)
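A sketch of the chunking and indexing step, assuming the episode summaries live in plain-text files and using FAISS as the vector store. The directory name, chunk sizes, and store choice are illustrative, not the repository's exact configuration.

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Load the episode summaries (hypothetical path).
loader = DirectoryLoader("data/episode_summaries", glob="*.txt", loader_cls=TextLoader)
documents = loader.load()

# Split into overlapping chunks so each retrieved passage fits comfortably in the prompt.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Embed the chunks with Gemini embeddings and index them in FAISS.
vectorstore = FAISS.from_documents(chunks, embeddings)  # `embeddings` from the setup step
vectorstore.save_local("faiss_index")
```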
3. ⚙️ Create dataset
- Use Giskard to generate the evaluation dataset (sketch below)
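The dataset generation step can look roughly like this, using Giskard's RAG Evaluation Toolkit (`giskard.rag`). The number of questions, the agent description, and the output filename are placeholders, and the generator LLM is configured separately in Giskard's settings.

```python
import pandas as pd
from giskard.rag import KnowledgeBase, generate_testset

# Build a knowledge base from the same chunks that feed the vector store.
kb_df = pd.DataFrame({"text": [chunk.page_content for chunk in chunks]})
knowledge_base = KnowledgeBase.from_pandas(kb_df, columns=["text"])

# Generate question / reference-answer pairs grounded in the knowledge base.
testset = generate_testset(
    knowledge_base,
    num_questions=30,
    agent_description="A chatbot answering questions about TV series episodes",
)

testset_df = testset.to_pandas()  # includes the questions and reference answers
testset.save("tv_series_testset.jsonl")
```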
4. 🔄 Retrieve examples & Evaluate
- Use RAGAS to evaluate retrieval quality (Context Precision, Context Recall)
5. 🎯 Answer questions & Evaluate
- Use RAGAS to evaluate the generated answers (Faithfulness, Answer Similarity, Answer Relevancy) (sketch below)
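Putting the pieces together, the answering-and-scoring loop can be sketched as below: run each generated question through the RAG chain, keep the retrieved chunks as `contexts`, and score everything with RAGAS. The names `rag_chain`, `retriever`, and `testset_df` come from the earlier sketches, and the `reference_answer` column name follows Giskard's testset output; adjust if your versions differ.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_similarity, answer_relevancy

rows = {"question": [], "contexts": [], "answer": [], "ground_truth": []}

for _, row in testset_df.iterrows():
    question = row["question"]
    docs = retriever.invoke(question)     # retrieved episode chunks
    answer = rag_chain.invoke(question)   # Gemini-generated answer (traced in LangSmith)

    rows["question"].append(question)
    rows["contexts"].append([doc.page_content for doc in docs])
    rows["answer"].append(answer)
    rows["ground_truth"].append(row["reference_answer"])  # Giskard's reference answer column

results = evaluate(
    Dataset.from_dict(rows),
    metrics=[faithfulness, answer_similarity, answer_relevancy],
)
print(results.to_pandas().mean(numeric_only=True))
```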