This repository contains a RAG (Retrieval-Augmented Generation) system for code-related tasks, specifically focused on code completion and bug localization.
The RAG system enhances language models by retrieving relevant context from a codebase before generating completions or localizing bugs. This approach improves the quality and relevance of model outputs by providing task-specific context.
In this project, we benchmark different approaches to RAG for code to recommend the best approach for various scenarios. We evaluate different chunking strategies, scoring methods, and context composition techniques to determine the most effective combinations.
The project consists of two main components:
- Code Completion: Enhances code completion by retrieving relevant context from the codebase
- Bug Localization: Identifies files likely to contain bugs based on issue descriptions
The code completion component uses RAG to improve code suggestions by:
- Chunking repository files into manageable pieces
- Scoring chunks based on relevance to the current coding context
- Using the most relevant chunks as context for code completion
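The scoring and composition steps above can be sketched as follows. This is a minimal illustration with hypothetical function names (`iou_score`, `compose_context`), not the project's actual API, using word-set IoU as a stand-in relevance score:

```python
def iou_score(chunk: str, context: str) -> float:
    """Lexical relevance: intersection-over-union of word sets."""
    a, b = set(chunk.split()), set(context.split())
    return len(a & b) / len(a | b) if a or b else 0.0

def compose_context(chunks: list[str], context: str, top_k: int = 3) -> str:
    """Prepend the top-k most relevant chunks to the current coding context."""
    ranked = sorted(chunks, key=lambda c: iou_score(c, context), reverse=True)
    return "\n\n".join(ranked[:top_k] + [context])
```

The composed string is then passed to the completion model as its prompt.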
Configuration options:
- Chunking strategies: `full_file`, `fixed_line`, `langchain`
- Scoring methods: `BM25`, `IOU`, dense embeddings
- Adjustable context sizes and composition strategies
Performance is evaluated using Exact Match (EM).
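Exact Match can be computed as below. This follows one common convention (whitespace-stripped string equality); the project's exact normalization may differ:

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions identical to the reference
    after stripping surrounding whitespace."""
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```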
The bug localization component helps identify files likely to contain bugs by:
- Taking issue descriptions as input
- Chunking repository files
- Scoring chunks based on relevance to the issue description
- Aggregating scores at the file level to identify the most likely locations of bugs
Performance is evaluated using metrics like F1 score and NDCG.
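The file-level aggregation step can be sketched as pooling chunk scores per file. Max pooling is an assumption here for illustration; other aggregation strategies (mean, sum) are possible:

```python
from collections import defaultdict

def rank_files(chunk_scores: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """chunk_scores: (file_path, relevance) pairs, one per chunk.
    Returns files sorted by their best-scoring chunk, highest first."""
    best: dict[str, float] = defaultdict(float)
    for path, score in chunk_scores:
        best[path] = max(best[path], score)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

The top-ranked files are then reported as the most likely bug locations.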
The chunking pipeline is a critical component of our RAG system, responsible for breaking down repository files into manageable pieces that can be efficiently processed and retrieved.
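As a minimal illustration, a fixed-size line chunker (in the spirit of the `fixed_line` strategy above) can be written as a sliding window over file lines. The function name and the `overlap` parameter are hypothetical; the project's actual chunkers live in `rag_engine/`:

```python
def fixed_line_chunks(text: str, size: int = 32, overlap: int = 0) -> list[str]:
    """Split a file into chunks of `size` lines; consecutive chunks
    share `overlap` lines when overlap > 0."""
    lines = text.splitlines()
    step = max(size - overlap, 1)
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + size]
        if window:
            chunks.append("\n".join(window))
        if start + size >= len(lines):
            break
    return chunks
```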
- Python 3.9+
- Poetry (for dependency management)

1. Clone the repository:

   ```shell
   git clone https://github.com/JetBrains-Research/project-adaptation-experiments.git
   cd project-adaptation-experiments
   ```

2. Install dependencies using Poetry:

   ```shell
   poetry install
   ```
To run code completion experiments:

1. Configure the experiment in `rag/configs/plcc.yaml`:
   - Set the model, language, and context composer
   - Configure context sizes and completion categories
   - Specify output paths

2. Run the evaluation:

   ```shell
   python -m rag.eval_plcc
   ```
To run bug localization experiments:

1. Configure the experiment in `rag/configs/bug_localization.yaml` and `rag/configs/rag.yaml`:
   - Set the chunker, scorer, and other parameters
   - Specify output paths

2. Run the evaluation:

   ```shell
   python -m rag.bug_localization
   ```
The system can be configured through YAML files in the `rag/configs` directory:

- `rag.yaml`: general RAG configuration (chunkers, scorers, models)
- `bug_localization.yaml`: bug-localization-specific settings
- `plcc.yaml`: code-completion-specific settings
Our benchmarking experiments have yielded several important insights:

- The larger the generation model's context, the larger the chunks you should use; the minimal useful context-chunk size is 32 lines.
- `IoU` + `line_splitter` performs well on short contexts (<= 2000) and is very fast.
- The `word_splitter` and the tokenizer perform equally well; the `word_splitter` is much faster.
- There is no need to include non-code files in this task, which saves plenty of compute.
This plot shows the comparison of the explored context composer strategies:

- No Context (baseline)
- Path Distance (baseline)
- optimal `full_file` composer configuration
- optimal `fixed_line` composer configuration
| Parameter | Values |
|---|---|
| Chunkers | `full_file`, `fixed_line` |
| Scorer | `bm25` |
| Splitter | `word_splitter` |
| File extensions | `[py]` |
| Context chunk size | 32 lines (for `fixed_line`) |
| Completion chunk size | 32 lines (for `fixed_line`) |
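The `bm25` scorer used in this benchmark configuration can be sketched as Okapi BM25 over word-split chunks. This is a minimal reference sketch, not the project's implementation; parameter defaults (`k1`, `b`) are the commonly used values:

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()  # document frequency of each term
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[term] * (k1 + 1) / norm
        scores.append(s)
    return scores
```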
To plot the results of the experiments, we used `rag/plot_analysis/all_py_kt_plots.ipynb`.
- `rag/`: main project directory
  - `bug_localization/`: bug localization components
  - `configs/`: configuration files
  - `context_composers/`: context composition strategies
  - `draco/`: data flow analysis components
  - `metrics/`: evaluation metrics
  - `plot_analysis/`: visualization tools
  - `rag_engine/`: core RAG functionality (chunkers, scorers, splitters)
  - `utils/`: utility functions
Legacy code and experiments have been moved to the `archive/` directory.