Factual Context Validation and Simplification: A Scalable Method to Enhance GPT Trustworthiness and Efficiency
ICLR 2025 Blogpost Track
This work has been accepted at the ICLR 2025 Blogpost Track. View the full blog.
As the deployment of Large Language Models (LLMs) like GPT expands across domains, mitigating their susceptibility to factual inaccuracies or hallucinations becomes crucial for ensuring reliable performance. This blog post introduces two novel frameworks that enhance retrieval-augmented generation (RAG): one uses summarization to achieve up to 57.7% storage reduction, while the other preserves critical information through statement-level extraction. Leveraging DBSCAN clustering, vectorized fact storage, and LLM-driven fact-checking, the pipelines deliver higher overall performance across benchmarks such as PubMedQA, SQuAD, and HotpotQA. By optimizing efficiency and accuracy, these frameworks advance trustworthy AI for impactful real-world applications.
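As an illustration of the clustering step mentioned above, here is a minimal sketch of grouping fact embeddings with DBSCAN (assuming scikit-learn and OpenAI embeddings; the embedding model and `eps`/`min_samples` values are illustrative, not the repository's exact code):

```python
# Minimal sketch: cluster semantically similar facts with DBSCAN.
# Assumes scikit-learn and the OpenAI SDK; eps/min_samples are illustrative.
from openai import OpenAI
from sklearn.cluster import DBSCAN

client = OpenAI()  # reads OPENAI_API_KEY from the environment

facts = [
    "Mitochondria generate most of the cell's ATP.",
    "The mitochondrion is the powerhouse of the cell.",
    "Paris is the capital of France.",
]

# Embed each fact (text-embedding-3-small produces 1536-dimensional vectors).
embeddings = [
    item.embedding
    for item in client.embeddings.create(model="text-embedding-3-small", input=facts).data
]

# Group near-duplicate facts; noise points get the label -1.
labels = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit_predict(embeddings)
print(dict(zip(facts, labels)))
```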
This project provides a framework for:
- Benchmarking and evaluating the effectiveness of the different pipelines (raw, summarization, and validity checking)
- Comparing performance metrics across different benchmarks
- Optimizing storage efficiency through context summarization
- Enhancing factual validation through statement-level extraction
The system currently supports the following datasets:
- HotpotQA: A dataset requiring multi-hop reasoning across multiple documents
- PubMedQA: A dataset containing biomedical questions with yes/no/maybe answers
- SQuAD: Stanford Question Answering Dataset with factoid-based questions
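For reference, these datasets can be pulled from the Hugging Face Hub; the sketch below uses the `datasets` library with commonly used dataset IDs and splits (these are assumptions, and the repository's `helpers/data.py` may load the data differently):

```python
# Sketch: load the three benchmarks from the Hugging Face Hub (assumed dataset IDs/splits).
from datasets import load_dataset

hotpot = load_dataset("hotpot_qa", "distractor", split="validation")  # multi-hop reasoning
pubmed = load_dataset("pubmed_qa", "pqa_labeled", split="train")      # yes/no/maybe answers
squad = load_dataset("squad", split="validation")                     # factoid questions

print(hotpot[0]["question"])
print(pubmed[0]["final_decision"])
print(squad[0]["answers"]["text"])
```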
The repository is organized as follows:

src/
├── benchmarks/ # Benchmark implementation modules
│ ├── hotpot/ # HotpotQA dataset benchmarks
│ │ ├── raw.py # Basic pipeline benchmark
│ │ ├── summarizer.py # Summarization pipeline benchmark
│ │ └── validity.py # Validity checking benchmark
│ ├── pubmed/ # PubMedQA dataset benchmarks
│ │ ├── raw.py # Basic pipeline benchmark
│ │ ├── summarizer.py # Summarization pipeline benchmark
│ │ └── validity.py # Validity checking benchmark
│ └── squad/ # SQuAD dataset benchmarks
│ ├── raw.py # Basic pipeline benchmark
│ ├── summarizer.py # Summarization pipeline benchmark
│ └── validity.py # Validity checking benchmark
│
├── database/ # Database modules for vector storage
│ ├── hotpot/ # HotpotQA database operations
│ ├── pubmed/ # PubMedQA database operations
│ └── squad/ # SQuAD database operations
│
├── helpers/ # Utility modules
│ ├── data.py # Data handling utilities
│ ├── oai.py # OpenAI API wrapper
│ ├── pc.py # Pinecone integration
│ ├── progress.py # Progress tracking tools
│ └── variables.py # Common variables
│
├── pipelines/ # Pipeline implementations
│ ├── raw.py # Raw question-answering pipeline
│ ├── validity.py # Validity checking pipeline
│ └── baseline.py # Baseline pipeline implementation
│
├── results/ # Benchmark results in JSON format
│ └── rag/ # RAG-specific benchmark results
│
└── temp/ # Temporary files and caches
To run this project, you will need:
- Python 3.8+
- Pinecone account (for vector database)
- OpenAI API key
- Clone the repository:
git clone https://github.com/Tonyhrule/Facts.git
cd Factual-Validation-Simplification
- Install dependencies:
pip install -r requirements.txt
- Create a `.env` file in the root directory with the following variables:
OPENAI_API_KEY=your_openai_api_key
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_HOST=your_pinecone_host_url
- Sign up for a Pinecone account
- Create a new index with the following settings:
- Dimensions: 1536 (for OpenAI embeddings)
- Metric: Cosine
- Pod Type: Starter (for testing) or Standard (for production)
- Copy your API key and host URL to the `.env` file
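For reference, the same setup can also be done programmatically. The sketch below loads the `.env` credentials with python-dotenv and creates a 1536-dimension cosine index using the current Pinecone Python SDK with a serverless spec (the index name and spec are illustrative assumptions; the console steps above work just as well):

```python
# Sketch: load .env credentials and create a 1536-dimensional cosine index.
# Assumes python-dotenv and the pinecone SDK (v3+); index name and serverless spec are illustrative.
import os
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec

load_dotenv()  # reads OPENAI_API_KEY, PINECONE_API_KEY, PINECONE_HOST from .env

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

index_name = "facts"  # hypothetical index name
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,   # matches the OpenAI embedding size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(host=os.environ["PINECONE_HOST"])  # connect via the host URL from .env
```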
Before running benchmarks, you need to prepare the vector database:
python src/run_database.py
This will prompt you to select which dataset to load into the vector database.
To run benchmarks on the prepared datasets:
python src/run_benchmark.py
This will prompt you to select which benchmark to run. Results will be saved in the src/results/ directory.
You can test a single query through a specific pipeline:
python src/run_pipeline.py
Note: You may need to modify the query in run_pipeline.py before running.
Raw pipeline: Basic question-answering using retrieval-augmented generation without additional processing.
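A minimal sketch of such a pipeline, assuming OpenAI embedding/chat models and an already populated Pinecone index (the model names, `top_k`, and the `text` metadata field are illustrative assumptions):

```python
# Sketch of the raw RAG pipeline: embed the question, retrieve similar contexts, answer from them.
# Assumes the index stores a "text" metadata field; model names and top_k are illustrative.
import os
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index(host=os.environ["PINECONE_HOST"])

def answer(question: str) -> str:
    # Embed the question and fetch the most similar stored contexts.
    query_vec = client.embeddings.create(model="text-embedding-3-small", input=question).data[0].embedding
    matches = index.query(vector=query_vec, top_k=5, include_metadata=True)["matches"]
    context = "\n".join((m["metadata"] or {}).get("text", "") for m in matches)

    # Answer strictly from the retrieved context.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```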
Summarization pipeline: Generates summaries of retrieved contexts before answering questions, which can improve accuracy for complex topics while reducing storage space.
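The extra summarization step can be sketched as a single LLM call that condenses retrieved context before it is stored or passed to the answering prompt (the prompt and model name are illustrative):

```python
# Sketch: condense retrieved context with one extra LLM call before storage or answering.
from openai import OpenAI

client = OpenAI()

def summarize_context(context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Summarize the context concisely, preserving every concrete fact."},
            {"role": "user", "content": context},
        ],
    )
    return response.choices[0].message.content
```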
Validity checking pipeline: Includes an additional verification step that checks the factual accuracy of answers against the provided context, leveraging LLM-driven fact-checking to validate individual statements extracted from responses.
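A sketch of this statement-level validation step, which splits a draft answer into atomic statements and asks the model to verify each one against the retrieved context (the prompts and model name are illustrative assumptions):

```python
# Sketch: extract atomic statements from a draft answer and fact-check each one
# against the retrieved context. Prompts and model name are illustrative.
from typing import List, Tuple
from openai import OpenAI

client = OpenAI()

def validate_answer(draft_answer: str, context: str) -> List[Tuple[str, bool]]:
    # 1) Split the draft answer into individual factual statements.
    extraction = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "List each factual statement in the text on its own line."},
            {"role": "user", "content": draft_answer},
        ],
    )
    statements = [
        line.strip("- ").strip()
        for line in extraction.choices[0].message.content.splitlines()
        if line.strip()
    ]

    # 2) Ask the model whether each statement is supported by the context.
    results = []
    for statement in statements:
        verdict = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Reply with exactly SUPPORTED or UNSUPPORTED."},
                {"role": "user", "content": f"Context:\n{context}\n\nStatement: {statement}"},
            ],
        )
        supported = verdict.choices[0].message.content.strip().upper().startswith("SUPPORTED")
        results.append((statement, supported))
    return results
```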
This project is licensed under the MIT License – see the LICENSE file for details.