An accurate Retrieval-Augmented Generation (RAG) system that analyzes multi-language codebases using Tree-sitter, builds comprehensive knowledge graphs, and enables natural language querying of codebase structure and relationships.
- **Multi-Language Support**: Supports Python, JavaScript, TypeScript, Rust, Go, Scala, and Java codebases
- **Tree-sitter Parsing**: Uses Tree-sitter for robust, language-agnostic AST parsing
- **Knowledge Graph Storage**: Uses Memgraph to store codebase structure as an interconnected graph
- **Natural Language Querying**: Ask questions about your codebase in plain English
- **AI-Powered Cypher Generation**: Supports both cloud models (Google Gemini) and local models (Ollama) for natural-language-to-Cypher translation
- **Code Snippet Retrieval**: Retrieves the actual source code of matched functions and methods
- **Dependency Analysis**: Parses `pyproject.toml` to understand external dependencies
- **Nested Function Support**: Handles complex nested functions and class hierarchies
- **Language-Agnostic Design**: Unified graph schema across all supported languages
The system consists of two main components:
- **Multi-language Parser**: Tree-sitter based parsing system that analyzes codebases and ingests data into Memgraph
- **RAG System** (`codebase_rag/`): Interactive CLI for querying the stored knowledge graph
- **Tree-sitter Integration**: Language-agnostic parsing using Tree-sitter grammars
- **Graph Database**: Memgraph for storing code structure as nodes and relationships
- **LLM Integration**: Supports Google Gemini (cloud) and Ollama (local) for natural language processing
- **Code Analysis**: Advanced AST traversal for extracting code elements across languages
- **Query Tools**: Specialized tools for graph querying and code retrieval
- **Language Configuration**: Configurable mappings for different programming languages
- Python 3.12+
- Docker & Docker Compose (for Memgraph)
- For cloud models: Google Gemini API key
- For local models: Ollama installed and running
- `uv` package manager
- Clone the repository:

```bash
git clone https://github.com/vitali87/code-graph-rag.git
cd code-graph-rag
```
- Install dependencies:

For basic Python support:
```bash
uv sync
```

For full multi-language support:
```bash
uv sync --extra treesitter-full
```
This installs Tree-sitter grammars for:
- Python (`.py`)
- JavaScript (`.js`, `.jsx`)
- TypeScript (`.ts`, `.tsx`)
- Rust (`.rs`)
- Go (`.go`)
- Scala (`.scala`, `.sc`)
- Java (`.java`)
- Set up environment variables:

```bash
cp .env.example .env
# Edit .env with your configuration (see options below)
```
For cloud models (Gemini):
```
# .env file
LLM_PROVIDER=gemini
GEMINI_API_KEY=your_gemini_api_key_here
```
Get your free API key from Google AI Studio.
For local models (Ollama):
```
# .env file
LLM_PROVIDER=local
LOCAL_MODEL_ENDPOINT=http://localhost:11434/v1
LOCAL_ORCHESTRATOR_MODEL_ID=llama3
LOCAL_CYPHER_MODEL_ID=llama3
LOCAL_MODEL_API_KEY=ollama
```
Install and run Ollama:

```bash
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull required models
ollama pull llama3

# Or try other models like:
# ollama pull llama3.1
# ollama pull mistral
# ollama pull codellama

# Ollama will automatically start serving on localhost:11434
```
Note: Local models provide privacy and no API costs, but may have lower accuracy compared to cloud models like Gemini.
- Start Memgraph database:

```bash
docker-compose up -d
```
Parse and ingest a multi-language repository into the knowledge graph:
For the first repository (clean start):
```bash
python -m codebase_rag.main --repo-path /path/to/repo1 --update-graph --clean
```

For additional repositories (preserve existing data):
```bash
python -m codebase_rag.main --repo-path /path/to/repo2 --update-graph
python -m codebase_rag.main --repo-path /path/to/repo3 --update-graph
```
**Supported Languages**: The system automatically detects and processes files based on their extensions (a minimal detection sketch follows the list):
- Python: `.py` files
- JavaScript: `.js`, `.jsx` files
- TypeScript: `.ts`, `.tsx` files
- Rust: `.rs` files
- Go: `.go` files
- Scala: `.scala`, `.sc` files
- Java: `.java` files
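The sketch below illustrates this extension-based routing; the mapping is assembled from the list above for illustration only, while the repository's actual mapping lives in `codebase_rag/language_config.py`:

```python
from pathlib import Path

# Extension -> language routing, mirroring the list above (illustrative).
EXTENSION_TO_LANGUAGE: dict[str, str] = {
    ".py": "python",
    ".js": "javascript", ".jsx": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
    ".rs": "rust",
    ".go": "go",
    ".scala": "scala", ".sc": "scala",
    ".java": "java",
}

def detect_language(path: str) -> str | None:
    """Return the language for a source file, or None if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix)

assert detect_language("src/main.rs") == "rust"
assert detect_language("README.md") is None
```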
Start the interactive RAG CLI:

```bash
python -m codebase_rag.main --repo-path /path/to/your/repo
```
You can switch between cloud and local models at runtime using CLI arguments:
Use Local Models:
```bash
python -m codebase_rag.main --repo-path /path/to/your/repo --llm-provider local
```

Use Cloud Models:
```bash
python -m codebase_rag.main --repo-path /path/to/your/repo --llm-provider gemini
```

Specify Custom Models:
```bash
# Use specific local models
python -m codebase_rag.main --repo-path /path/to/your/repo \
  --llm-provider local \
  --orchestrator-model llama3.1 \
  --cypher-model codellama

# Use specific Gemini models
python -m codebase_rag.main --repo-path /path/to/your/repo \
  --llm-provider gemini \
  --orchestrator-model gemini-2.0-flash-thinking-exp-01-21 \
  --cypher-model gemini-2.5-flash-lite-preview-06-17
```
Available CLI Arguments:
- `--llm-provider`: Choose `gemini` or `local`
- `--orchestrator-model`: Specify the model for main RAG orchestration
- `--cypher-model`: Specify the model for Cypher query generation
Example queries (these work across all supported languages):
- "Show me all classes that contain 'user' in their name"
- "Find functions related to database operations"
- "What methods does the User class have?"
- "Show me functions that handle authentication"
- "List all TypeScript components"
- "Find Rust structs and their methods"
- "Show me Go interfaces and implementations"
The knowledge graph uses the following node types and relationships:
- **Project**: Root node representing the entire repository
- **Package**: Language packages (Python: `__init__.py`, etc.)
- **Module**: Individual source code files (`.py`, `.js`, `.jsx`, `.ts`, `.tsx`, `.rs`, `.go`, `.scala`, `.sc`, `.java`)
- **Class**: Class/struct/enum definitions across all languages
- **Function**: Module-level functions and standalone functions
- **Method**: Class methods and associated functions
- **Folder**: Regular directories
- **File**: All files (source code and others)
- **ExternalPackage**: External dependencies
Functions and classes are identified per language by the following Tree-sitter AST node types (a parsing sketch follows the list):
- Python: `function_definition`, `class_definition`
- JavaScript/TypeScript: `function_declaration`, `arrow_function`, `class_declaration`
- Rust: `function_item`, `struct_item`, `enum_item`, `impl_item`
- Go: `function_declaration`, `method_declaration`, `type_declaration`
- Scala: `function_definition`, `class_definition`, `object_definition`, `trait_definition`
- Java: `method_declaration`, `class_declaration`, `interface_declaration`, `enum_declaration`
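As a minimal, self-contained illustration (not code from this repository; py-tree-sitter's binding API shifts slightly between versions), parsing a small Python snippet surfaces exactly these node types:

```python
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

# Build a Python parser (py-tree-sitter 0.22+ style; older versions
# use parser.set_language() instead).
parser = Parser(Language(tspython.language()))

source = b'class User:\n    def greet(self):\n        return "hi"\n'
tree = parser.parse(source)

def walk(node):
    """Print the definition nodes the graph builder keys on."""
    if node.type in ("class_definition", "function_definition"):
        name = node.child_by_field_name("name")
        print(node.type, "->", name.text.decode())
    for child in node.children:
        walk(child)

walk(tree.root_node)
# class_definition -> User
# function_definition -> greet
```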
Relationships between nodes:
- `CONTAINS_PACKAGE`/`CONTAINS_MODULE`/`CONTAINS_FILE`/`CONTAINS_FOLDER`: Hierarchical containment
- `DEFINES`: Module defines classes/functions
- `DEFINES_METHOD`: Class defines methods
- `DEPENDS_ON_EXTERNAL`: Project depends on external packages
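Since the graph lives in Memgraph, you can also query this schema directly. Here is a minimal sketch using `pymgclient` (a project dependency); the `name` property on `Class` and `Method` nodes is an assumption, so adjust it if the ingested graph uses different property names:

```python
import mgclient  # pymgclient

# Connect to the Memgraph instance started via docker-compose.
conn = mgclient.connect(host="localhost", port=7687)
conn.autocommit = True  # read-only queries, no explicit transactions
cursor = conn.cursor()

# "What methods does the User class have?" expressed as Cypher,
# following the Class -[:DEFINES_METHOD]-> Method relationship above.
cursor.execute(
    """
    MATCH (c:Class)-[:DEFINES_METHOD]->(m:Method)
    WHERE toLower(c.name) CONTAINS 'user'
    RETURN c.name AS class_name, m.name AS method_name
    """
)
for class_name, method_name in cursor.fetchall():
    print(f"{class_name}.{method_name}")

conn.close()
```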
Configuration is managed through environment variables in the `.env` file:

- `LLM_PROVIDER`: Set to `"gemini"` for cloud models or `"local"` for local models

Gemini (cloud) settings:
- `GEMINI_API_KEY`: Required when `LLM_PROVIDER=gemini`
- `GEMINI_MODEL_ID`: Main model for orchestration (default: `gemini-2.5-pro-preview-06-05`)
- `MODEL_CYPHER_ID`: Model for Cypher generation (default: `gemini-2.5-flash-lite-preview-06-17`)

Local (Ollama) settings:
- `LOCAL_MODEL_ENDPOINT`: Ollama endpoint (default: `http://localhost:11434/v1`)
- `LOCAL_ORCHESTRATOR_MODEL_ID`: Model for main RAG orchestration (default: `llama3`)
- `LOCAL_CYPHER_MODEL_ID`: Model for Cypher query generation (default: `llama3`)
- `LOCAL_MODEL_API_KEY`: API key for local models (default: `ollama`)

Database and repository settings:
- `MEMGRAPH_HOST`: Memgraph hostname (default: `localhost`)
- `MEMGRAPH_PORT`: Memgraph port (default: `7687`)
- `TARGET_REPO_PATH`: Default repository path (default: `.`)
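As a sketch of how these variables are typically consumed (the project's actual `config.py` may differ), `python-dotenv` loads the file and the documented defaults serve as fallbacks:

```python
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

MEMGRAPH_HOST = os.getenv("MEMGRAPH_HOST", "localhost")
MEMGRAPH_PORT = int(os.getenv("MEMGRAPH_PORT", "7687"))
TARGET_REPO_PATH = os.getenv("TARGET_REPO_PATH", ".")
LLM_PROVIDER = os.getenv("LLM_PROVIDER", "gemini")  # fallback value is an assumption
```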
```
code-graph-rag/
├── codebase_rag/              # RAG system package
│   ├── main.py                # CLI entry point
│   ├── config.py              # Configuration management
│   ├── graph_updater.py       # Tree-sitter based multi-language parser
│   ├── language_config.py     # Language-specific configurations
│   ├── prompts.py             # LLM prompts and schemas
│   ├── schemas.py             # Pydantic models
│   ├── services/              # Core services
│   │   └── llm.py             # Gemini LLM integration
│   └── tools/                 # RAG tools
│       ├── codebase_query.py  # Graph querying tool
│       ├── code_retrieval.py  # Code snippet retrieval
│       └── file_reader.py     # File content reading
├── docker-compose.yaml        # Memgraph setup
├── pyproject.toml             # Project dependencies & language extras
└── README.md                  # This file
```
- **tree-sitter**: Core Tree-sitter library for language-agnostic parsing
- **tree-sitter-{language}**: Language-specific grammars (Python, JS, TS, Rust, Go, Scala, Java)
- **pydantic-ai**: AI agent framework for RAG orchestration
- **pymgclient**: Memgraph Python client for graph database operations
- **loguru**: Advanced logging with structured output
- **python-dotenv**: Environment variable management
| Language | Extensions | Functions | Classes/Structs | Modules | Package Detection |
|---|---|---|---|---|---|
| Python | `.py` | ✅ | ✅ | ✅ | `__init__.py` |
| JavaScript | `.js`, `.jsx` | ✅ | ✅ | ✅ | - |
| TypeScript | `.ts`, `.tsx` | ✅ | ✅ | ✅ | - |
| Rust | `.rs` | ✅ | ✅ (structs/enums) | ✅ | - |
| Go | `.go` | ✅ | ✅ (structs) | ✅ | - |
| Scala | `.scala`, `.sc` | ✅ | ✅ (classes/objects/traits) | ✅ | package declarations |
| Java | `.java` | ✅ | ✅ (classes/interfaces/enums) | ✅ | package declarations |
- Python: Full support including nested functions, methods, classes, and package structure
- JavaScript/TypeScript: Functions, arrow functions, classes, and method definitions
- Rust: Functions, structs, enums, impl blocks, and associated functions
- Go: Functions, methods, type declarations, and struct definitions
- Scala: Functions, methods, classes, objects, traits, case classes, and Scala 3 syntax
- Java: Methods, constructors, classes, interfaces, enums, and annotation types
```bash
# Basic Python-only support
uv sync

# Full multi-language support
uv sync --extra treesitter-full

# Individual language support (if needed)
uv add tree-sitter-python tree-sitter-javascript tree-sitter-typescript tree-sitter-rust tree-sitter-go tree-sitter-scala tree-sitter-java
```
The system uses a configuration-driven approach for language support. Each language is defined in `codebase_rag/language_config.py` with:
- File extensions: Which files to process
- AST node types: How to identify functions, classes, etc.
- Module structure: How modules/packages are organized
- Name extraction: How to extract names from AST nodes
Adding support for new languages requires only configuration changes, no code modifications.
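To make that concrete, here is a hypothetical sketch of what one entry could look like; the names and structure are illustrative, not the actual contents of `codebase_rag/language_config.py`:

```python
from dataclasses import dataclass, field

@dataclass
class LanguageConfig:
    """Hypothetical shape of one language entry in language_config.py."""
    name: str
    file_extensions: list[str]      # which files to process
    function_node_types: list[str]  # AST node types treated as functions
    class_node_types: list[str]     # AST node types treated as classes/structs
    package_indicators: list[str] = field(default_factory=list)  # e.g. __init__.py

# Registering a new language would then be pure configuration,
# e.g. a hypothetical Kotlin entry:
KOTLIN = LanguageConfig(
    name="kotlin",
    file_extensions=[".kt", ".kts"],
    function_node_types=["function_declaration"],
    class_node_types=["class_declaration", "object_declaration"],
)
```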
- Check Memgraph connection:
  - Ensure Docker containers are running: `docker-compose ps`
  - Verify Memgraph is accessible on port 7687
- View the database in Memgraph Lab:
  - Open http://localhost:3000
  - Connect to `memgraph:7687`
- For local models:
  - Verify Ollama is running: `ollama list`
  - Check that models are downloaded: `ollama pull llama3`
  - Test the Ollama API: `curl http://localhost:11434/v1/models`
  - Check the Ollama server logs
- Follow the established code structure
- Keep files under 100 lines
- Use type annotations
- Follow conventional commit messages
- Use DRY principles
For issues or questions:
- Check the logs for error details
- Verify Memgraph connection
- Ensure all environment variables are set
- Verify that the graph schema matches your expectations