Accelerating domain-specific LLM evaluation through strategic subsampling and real-time analysis
Traditional LLM leaderboards often fail to predict performance in specialized domains, while conventional adaptation methods like fine-tuning demand excessive computational resources. To address this, we present In-Situ Evaluator, a proof of concept for running real-time evaluations on your own dataset, with a configurable interface for selecting the model, hyperparameters, and RAG technique. Specifically, we employ:
- Dataset subsampling for rapid domain-specific benchmarking (sketched below).
- An API interface for choosing between LLM providers and models.
- Custom RAG pipelines for three popular RAG architectures.
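Subsampling here simply means drawing a small random slice of a dataset instead of evaluating on the full benchmark. A minimal sketch of the idea (illustrative only; the actual subsampling logic in the repository may differ):

```python
import json
import random

def subsample_dataset(path, k=50, seed=42):
    """Load a QA dataset (a JSON list of {"Question", "Context", "Response"}
    records) and return a reproducible random subsample of at most k items."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    random.seed(seed)
    return random.sample(data, k=min(k, len(data)))

# Example: benchmark on a 50-question slice of the bundled SQuAD proxy set.
subset = subsample_dataset("Backend/data/squad.json", k=50)
```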
By following the examples in this repository, you can:
- Load your custom dataset
- Choose the LLM provider and model for evals (e.g. Groq (Llama-2, Mixtral), OpenAI (GPT-3.5, GPT-4))
- Customize model hyperparameters to your liking (e.g. Temperature, Top P)
- Choose between RAG techniques (e.g. Vanilla RAG, Graph RAG, RAPTOR)
- Configure hyperparameters for RAG (e.g. Chunk size, Chunk Overlap)
- Run real-time evaluations of LLMs, hyperparameters, and RAG configurations (see the configuration sketch after this list)
- Compare with metrics (BLEU, ROUGE, RoBERTa-NLI, etc.)
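Putting these choices together, a single evaluation run can be thought of as one configuration object. The sketch below is purely illustrative: the field names are hypothetical and not the exact payload the frontend sends to the backend.

```python
# Illustrative evaluation configuration; field names are hypothetical,
# not the exact structure the In-Situ Evaluator backend expects.
eval_config = {
    "dataset": "my_dataset.json",        # or a proxy set: squad / trivia_qa / wiki_qa
    "provider": "openai",                # "groq" or "openai"
    "model": "gpt-3.5-turbo",
    "model_hyperparameters": {
        "temperature": 0.2,
        "top_p": 0.9,
        "stop": None,
        "stream": False,
    },
    "rag": {
        "technique": "vanilla",          # "vanilla", "graph", or "raptor"
        "chunk_size": 512,
        "chunk_overlap": 64,
        "top_k": 4,
    },
    "metrics": ["bleu", "rougeL", "meteor", "roberta_nli", "cosine_similarity"],
}
```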
Paper preprint coming soon. This repository contains the proof-of-concept implementation.
Component | Supported Options |
---|---|
LLM Providers | Groq, OpenAI |
RAG Techniques | Vanilla RAG, Graph RAG, RAPTOR |
Model Hyperparameters | Temperature, Top P, Stop Sequence, Stream |
RAG Hyperparameters | Chunk Size, Chunk Overlap, Top K |
Proxy Datasets | SQuAD (easy), TriviaQA (medium), WikiQA (hard) |
Metrics | BLEU, ROUGE-L, METEOR, RoBERTa-NLI, Cosine Similarity |
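The metrics above are standard reference-based scores. As an illustration of how a couple of them can be computed, here is a sketch using the rouge-score and sentence-transformers packages and the all-MiniLM-L6-v2 embedding model; the repository's Evaluations/evaluations.py may rely on different libraries and models.

```python
from rouge_score import rouge_scorer                          # pip install rouge-score
from sentence_transformers import SentenceTransformer, util   # pip install sentence-transformers

def score_pair(prediction, reference):
    """Compute ROUGE-L F1 and embedding cosine similarity for one QA pair."""
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = embedder.encode([prediction, reference], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()

    return {"rougeL": rouge_l, "cosine_similarity": cosine}

print(score_pair("Venus is the hottest planet.",
                 "Venus is the hottest planet in our solar system."))
```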
- Python 3.8+ (for backend)
- ReactJS (for frontend)
- GROQ/OpenAI API keys (for LLM calls). You can obtain these API keys from the respective providers.
- (Optional) Before uploading a custom dataset, please ensure it is a `json` file in the following format:

```json
[
  {
    "Question": "This is a sample",
    "Context": "This is the context related to the question.",
    "Response": "This is the ground truth answer"
  },
  {
    "Question": "What is the hottest planet in our solar system?",
    "Context": "The planets in our solar system vary in temperature due to their distance from the Sun, atmospheric composition, and other factors.",
    "Response": "Venus is the hottest planet in our solar system, with surface temperatures reaching up to 462°C (864°F), due to its thick atmosphere and runaway greenhouse effect."
  }
]
```
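A quick way to check that a custom file matches this schema before uploading it (an illustrative helper, not part of the repository):

```python
import json

REQUIRED_KEYS = {"Question", "Context", "Response"}

def validate_dataset(path):
    """Raise ValueError if the file is not a JSON list of QA records
    containing the keys Question, Context, and Response."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list):
        raise ValueError("Dataset must be a JSON array of records.")
    for i, record in enumerate(data):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"Record {i} is missing keys: {sorted(missing)}")

validate_dataset("my_dataset.json")
```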
Clone the git repository:

```bash
git clone https://github.com/Ritvik-G/in-situ_eval.git
cd in-situ_eval
```
To set up the frontend, follow these steps:
- Navigate to the Frontend Directory
  First, change the directory to the frontend folder: `cd frontend`
- Install Dependencies
  Use npm to install all the required dependencies: `npm install`
- Start the Frontend
  Finally, start the frontend server: `npm start`
The structure of the backend is as follows:
```
Backend/
├── data/                    # proxy datasets
│   ├── squad.json
│   ├── trivia_qa.json
│   └── wiki_qa.json
├── RAG/
│   ├── rag.py
│   ├── raptor.py
│   ├── graphrag.py
│   └── model_config.py      # LLM caller function
├── Benchmarks/              # benchmarker that calls data
│   └── benchmarks.py
├── Evaluations/
│   ├── evaluations.py
│   └── consolidate_metrics.py
├── app.py
└── requirements.txt
```
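Here, model_config.py is the LLM caller function. As a rough idea of what such a caller can look like using the official openai and groq Python SDKs (the function name and signature below are hypothetical and the repository's actual implementation may differ):

```python
import os
from openai import OpenAI   # pip install openai
from groq import Groq       # pip install groq

def call_llm(provider, model, prompt, temperature=0.2, top_p=0.9):
    """Send a single-turn prompt to Groq or OpenAI and return the completion text.
    Reads GROQ_API_KEY / OPENAI_API_KEY from the environment."""
    if provider == "groq":
        client = Groq(api_key=os.environ["GROQ_API_KEY"])
    else:
        client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        top_p=top_p,
    )
    return response.choices[0].message.content

# Example: answer a question given retrieved context.
# answer = call_llm("openai", "gpt-3.5-turbo", "Context: ...\nQuestion: ...")
```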
To set up the backend, follow these steps:
- Navigate to the Backend Directory
  First, change the directory to the backend folder: `cd backend`
- Install Dependencies
  Use pip to install all the required dependencies: `pip install -r requirements.txt`
- Run the Backend
  Run the backend server: `python app.py`. By default, it runs on `http://localhost:5000/`.
Once both the frontend and backend servers are running, you can access the application via the frontend URL `http://localhost:3000/api` and interact with it.