Mirza-Samad-Ahmed-Baig/code-graph-rag

Graph-Code: A Multi-Language Graph-Based RAG System

An accurate Retrieval-Augmented Generation (RAG) system that analyzes multi-language codebases using Tree-sitter, builds comprehensive knowledge graphs, and enables natural language querying of codebase structure and relationships.

🚀 Features

  • 🌍 Multi-Language Support: Supports Python, JavaScript, TypeScript, Rust, Go, Scala, and Java codebases
  • 🌳 Tree-sitter Parsing: Uses Tree-sitter for robust, language-agnostic AST parsing
  • 📊 Knowledge Graph Storage: Uses Memgraph to store codebase structure as an interconnected graph
  • 🗣️ Natural Language Querying: Ask questions about your codebase in plain English
  • 🤖 AI-Powered Cypher Generation: Supports both cloud models (Google Gemini) and local models (Ollama) for natural language to Cypher translation
  • 📝 Code Snippet Retrieval: Retrieves actual source code snippets for found functions/methods
  • 🔗 Dependency Analysis: Parses pyproject.toml to understand external dependencies
  • 🎯 Nested Function Support: Handles complex nested functions and class hierarchies
  • 🔄 Language-Agnostic Design: Unified graph schema across all supported languages

🏗️ Architecture

The system consists of two main components:

  1. Multi-language Parser: Tree-sitter based parsing system that analyzes codebases and ingests data into Memgraph
  2. RAG System (codebase_rag/): Interactive CLI for querying the stored knowledge graph

Core Components

  • 🌳 Tree-sitter Integration: Language-agnostic parsing using Tree-sitter grammars
  • 📊 Graph Database: Memgraph for storing code structure as nodes and relationships
  • 🤖 LLM Integration: Supports Google Gemini (cloud) and Ollama (local) for natural language processing
  • 🔍 Code Analysis: Advanced AST traversal for extracting code elements across languages
  • 🛠️ Query Tools: Specialized tools for graph querying and code retrieval
  • ⚙️ Language Configuration: Configurable mappings for different programming languages
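
To make the AST traversal idea concrete, here is a small sketch that collects classes, methods, and nested functions with qualified names. Note the substitution: it uses Python's built-in `ast` module as a stand-in for Tree-sitter so that it runs without extra packages, and it handles Python source only. The function name `qualified_definitions` and the dotted naming scheme are assumptions for illustration.

```python
import ast


def qualified_definitions(source: str) -> list[tuple[str, str]]:
    """Return (label, qualified_name) pairs for classes, functions and methods."""
    found: list[tuple[str, str]] = []

    def visit(node: ast.AST, prefix: str, in_class: bool) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                name = f"{prefix}{child.name}"
                found.append(("Class", name))
                visit(child, f"{name}.", True)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                name = f"{prefix}{child.name}"
                # A function defined directly in a class body is a method.
                found.append(("Method" if in_class else "Function", name))
                visit(child, f"{name}.", False)
            else:
                visit(child, prefix, in_class)

    visit(ast.parse(source), "", False)
    return found


SAMPLE = """
class User:
    def save(self):
        def validate():
            pass

def main():
    pass
"""
```

Calling `qualified_definitions(SAMPLE)` yields one entry per definition, with the nested function reported under its enclosing method.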

📋 Prerequisites

  • Python 3.12+
  • Docker & Docker Compose (for Memgraph)
  • For cloud models: Google Gemini API key
  • For local models: Ollama installed and running
  • uv package manager

🛠️ Installation

  1. Clone the repository:
git clone https://github.com/vitali87/code-graph-rag.git
cd code-graph-rag
  2. Install dependencies:

For basic Python support:

uv sync

For full multi-language support:

uv sync --extra treesitter-full

This installs Tree-sitter grammars for:

  • Python (.py)
  • JavaScript (.js, .jsx)
  • TypeScript (.ts, .tsx)
  • Rust (.rs)
  • Go (.go)
  • Scala (.scala, .sc)
  • Java (.java)
  3. Set up environment variables:
cp .env.example .env
# Edit .env with your configuration (see options below)

Configuration Options

Option 1: Cloud Models (Gemini)

# .env file
LLM_PROVIDER=gemini
GEMINI_API_KEY=your_gemini_api_key_here

Get your free API key from Google AI Studio.

Option 2: Local Models (Ollama)

# .env file
LLM_PROVIDER=local
LOCAL_MODEL_ENDPOINT=http://localhost:11434/v1
LOCAL_ORCHESTRATOR_MODEL_ID=llama3
LOCAL_CYPHER_MODEL_ID=llama3
LOCAL_MODEL_API_KEY=ollama

Install and run Ollama:

# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull required models
ollama pull llama3
# Or try other models like:
# ollama pull llama3.1
# ollama pull mistral
# ollama pull codellama

# Ollama will automatically start serving on localhost:11434

Note: Local models keep your code on your own machine and incur no API costs, but they may be less accurate than cloud models such as Gemini.

  4. Start Memgraph database:
docker-compose up -d

🎯 Usage

Step 1: Parse a Repository

Parse and ingest a multi-language repository into the knowledge graph:

For the first repository (clean start):

python -m codebase_rag.main --repo-path /path/to/repo1 --update-graph --clean

For additional repositories (preserve existing data):

python -m codebase_rag.main --repo-path /path/to/repo2 --update-graph
python -m codebase_rag.main --repo-path /path/to/repo3 --update-graph
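
When several repositories need to be ingested, the clean-first, append-after pattern above can be scripted. The sketch below only builds the argument lists, using the flags shown in this section; the helper name `ingest_commands` is hypothetical.

```python
import sys


def ingest_commands(repo_paths: list[str]) -> list[list[str]]:
    """Build one ingest command per repository; only the first one wipes the graph."""
    commands = []
    for index, repo_path in enumerate(repo_paths):
        command = [
            sys.executable, "-m", "codebase_rag.main",
            "--repo-path", repo_path,
            "--update-graph",
        ]
        if index == 0:
            # Clean start for the first repository only.
            command.append("--clean")
        commands.append(command)
    return commands
```

Each list can be passed to `subprocess.run(command, check=True)` from the project root once Memgraph is running.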

Supported Languages: The system automatically detects and processes files based on extensions:

  • Python: .py files
  • JavaScript: .js, .jsx files
  • TypeScript: .ts, .tsx files
  • Rust: .rs files
  • Go: .go files
  • Scala: .scala, .sc files
  • Java: .java files

Step 2: Query the Codebase

Start the interactive RAG CLI:

python -m codebase_rag.main --repo-path /path/to/your/repo

Runtime Model Switching

You can switch between cloud and local models at runtime using CLI arguments:

Use Local Models:

python -m codebase_rag.main --repo-path /path/to/your/repo --llm-provider local

Use Cloud Models:

python -m codebase_rag.main --repo-path /path/to/your/repo --llm-provider gemini

Specify Custom Models:

# Use specific local models
python -m codebase_rag.main --repo-path /path/to/your/repo \
  --llm-provider local \
  --orchestrator-model llama3.1 \
  --cypher-model codellama

# Use specific Gemini models
python -m codebase_rag.main --repo-path /path/to/your/repo \
  --llm-provider gemini \
  --orchestrator-model gemini-2.0-flash-thinking-exp-01-21 \
  --cypher-model gemini-2.5-flash-lite-preview-06-17

Available CLI Arguments:

  • --llm-provider: Choose gemini or local
  • --orchestrator-model: Specify model for main RAG orchestration
  • --cypher-model: Specify model for Cypher query generation
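
For readers who want to see how these flags fit together, here is a sketch of an equivalent argument parser built with the standard library's argparse. It is not the project's actual CLI code, which may use a different library; only the flag names and the two provider choices come from this README, and the defaults are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags documented in this README (illustrative only)."""
    parser = argparse.ArgumentParser(prog="codebase_rag.main")
    parser.add_argument("--repo-path", default=".")
    parser.add_argument("--update-graph", action="store_true")
    parser.add_argument("--clean", action="store_true")
    parser.add_argument("--llm-provider", choices=["gemini", "local"])
    parser.add_argument("--orchestrator-model")
    parser.add_argument("--cypher-model")
    return parser


args = build_parser().parse_args(
    ["--repo-path", "/path/to/your/repo", "--llm-provider", "local", "--cypher-model", "codellama"]
)
```

In this sketch, flags that are omitted stay `None` or `False`, so a caller could fall back to the values from the .env file.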

Example queries (these work across all supported languages):

  • "Show me all classes that contain 'user' in their name"
  • "Find functions related to database operations"
  • "What methods does the User class have?"
  • "Show me functions that handle authentication"
  • "List all TypeScript components"
  • "Find Rust structs and their methods"
  • "Show me Go interfaces and implementations"

📊 Graph Schema

The knowledge graph uses the following node types and relationships:

Node Types

  • Project: Root node representing the entire repository
  • Package: Language packages (Python: __init__.py, etc.)
  • Module: Individual source code files (.py, .js, .jsx, .ts, .tsx, .rs, .go, .scala, .sc, .java)
  • Class: Class/Struct/Enum definitions across all languages
  • Function: Module-level functions and standalone functions
  • Method: Class methods and associated functions
  • Folder: Regular directories
  • File: All files (source code and others)
  • ExternalPackage: External dependencies
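
The Package versus Folder distinction for Python can be sketched as a check for `__init__.py`. This is an assumption-laden illustration: the helper name `label_directories` is hypothetical, and the rules for other languages are not covered.

```python
from pathlib import Path


def label_directories(root: Path) -> dict[str, str]:
    """Map each subdirectory (relative POSIX path) to 'Package' or 'Folder'."""
    labels = {}
    for directory in sorted(p for p in root.rglob("*") if p.is_dir()):
        # A directory containing __init__.py is treated as a Python package.
        kind = "Package" if (directory / "__init__.py").is_file() else "Folder"
        labels[directory.relative_to(root).as_posix()] = kind
    return labels
```

Running it on a repository root gives a quick preview of which directories would be treated as Package nodes and which as Folder nodes under this rule.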

Language-Specific Mappings

  • Python: function_definition, class_definition
  • JavaScript/TypeScript: function_declaration, arrow_function, class_declaration
  • Rust: function_item, struct_item, enum_item, impl_item
  • Go: function_declaration, method_declaration, type_declaration
  • Scala: function_definition, class_definition, object_definition, trait_definition
  • Java: method_declaration, class_declaration, interface_declaration, enum_declaration

Relationships

  • CONTAINS_PACKAGE/MODULE/FILE/FOLDER: Hierarchical containment
  • DEFINES: Module defines classes/functions
  • DEFINES_METHOD: Class defines methods
  • DEPENDS_ON_EXTERNAL: Project depends on external packages
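
Given this schema, a question such as "What methods does the User class have?" could translate into a Cypher query along the following lines. The labels and the DEFINES_METHOD relationship come from the schema above; the `name` property and the helper `methods_of_class_query` are assumptions, since node properties are not documented here.

```python
def methods_of_class_query(class_name: str) -> tuple[str, dict[str, str]]:
    """Return a parameterised Cypher query listing the methods of one class."""
    query = (
        "MATCH (c:Class {name: $class_name})-[:DEFINES_METHOD]->(m:Method) "
        "RETURN m.name AS method ORDER BY method"
    )
    # The class name travels as a parameter instead of being spliced into the text.
    return query, {"class_name": class_name}


query, params = methods_of_class_query("User")
```

Passing the class name as a query parameter avoids quoting problems and keeps the query text identical across calls.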

🔧 Configuration

Configuration is managed through environment variables in the .env file:

Required Settings

  • LLM_PROVIDER: Set to "gemini" for cloud models or "local" for local models

Gemini (Cloud) Configuration

  • GEMINI_API_KEY: Required when LLM_PROVIDER=gemini
  • GEMINI_MODEL_ID: Main model for orchestration (default: gemini-2.5-pro-preview-06-05)
  • MODEL_CYPHER_ID: Model for Cypher generation (default: gemini-2.5-flash-lite-preview-06-17)

Local Models Configuration

  • LOCAL_MODEL_ENDPOINT: Ollama endpoint (default: http://localhost:11434/v1)
  • LOCAL_ORCHESTRATOR_MODEL_ID: Model for main RAG orchestration (default: llama3)
  • LOCAL_CYPHER_MODEL_ID: Model for Cypher query generation (default: llama3)
  • LOCAL_MODEL_API_KEY: API key for local models (default: ollama)

Other Settings

  • MEMGRAPH_HOST: Memgraph hostname (default: localhost)
  • MEMGRAPH_PORT: Memgraph port (default: 7687)
  • TARGET_REPO_PATH: Default repository path (default: .)
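
The settings and defaults listed above can be summarised in a short loader. This is a sketch rather than the contents of config.py: the `Settings` dataclass, its field names, and the validation behaviour are assumptions, while the variable names and default values are the ones documented in this section.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    llm_provider: str
    orchestrator_model: str
    cypher_model: str
    memgraph_host: str
    memgraph_port: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Resolve settings from environment variables, applying the documented defaults."""
    provider = env.get("LLM_PROVIDER", "")
    if provider == "gemini":
        if not env.get("GEMINI_API_KEY"):
            raise ValueError("GEMINI_API_KEY is required when LLM_PROVIDER=gemini")
        orchestrator = env.get("GEMINI_MODEL_ID", "gemini-2.5-pro-preview-06-05")
        cypher = env.get("MODEL_CYPHER_ID", "gemini-2.5-flash-lite-preview-06-17")
    elif provider == "local":
        orchestrator = env.get("LOCAL_ORCHESTRATOR_MODEL_ID", "llama3")
        cypher = env.get("LOCAL_CYPHER_MODEL_ID", "llama3")
    else:
        raise ValueError("LLM_PROVIDER must be 'gemini' or 'local'")
    return Settings(
        llm_provider=provider,
        orchestrator_model=orchestrator,
        cypher_model=cypher,
        memgraph_host=env.get("MEMGRAPH_HOST", "localhost"),
        memgraph_port=int(env.get("MEMGRAPH_PORT", "7687")),
    )
```

Accepting any mapping instead of reading `os.environ` directly makes the loader easy to exercise with a plain dictionary.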

πŸƒβ€β™‚οΈ Development

Project Structure

code-graph-rag/
├── codebase_rag/              # RAG system package
│   ├── main.py                # CLI entry point
│   ├── config.py              # Configuration management
│   ├── graph_updater.py       # Tree-sitter based multi-language parser
│   ├── language_config.py     # Language-specific configurations
│   ├── prompts.py             # LLM prompts and schemas
│   ├── schemas.py             # Pydantic models
│   ├── services/              # Core services
│   │   └── llm.py             # Gemini LLM integration
│   └── tools/                 # RAG tools
│       ├── codebase_query.py  # Graph querying tool
│       ├── code_retrieval.py  # Code snippet retrieval
│       └── file_reader.py     # File content reading
├── docker-compose.yaml        # Memgraph setup
├── pyproject.toml             # Project dependencies & language extras
└── README.md                  # This file

Key Dependencies

  • tree-sitter: Core Tree-sitter library for language-agnostic parsing
  • tree-sitter-{language}: Language-specific grammars (Python, JS, TS, Rust, Go, Scala, Java)
  • pydantic-ai: AI agent framework for RAG orchestration
  • pymgclient: Memgraph Python client for graph database operations
  • loguru: Advanced logging with structured output
  • python-dotenv: Environment variable management

🌍 Multi-Language Support

Supported Languages & Features

| Language   | Extensions  | Functions | Classes/Structs                 | Modules | Package Detection    |
|------------|-------------|-----------|---------------------------------|---------|----------------------|
| Python     | .py         | ✅        | ✅                              | ✅      | __init__.py          |
| JavaScript | .js, .jsx   | ✅        | ✅                              | ✅      | -                    |
| TypeScript | .ts, .tsx   | ✅        | ✅                              | ✅      | -                    |
| Rust       | .rs         | ✅        | ✅ (structs/enums)              | ✅      | -                    |
| Go         | .go         | ✅        | ✅ (structs)                    | ✅      | -                    |
| Scala      | .scala, .sc | ✅        | ✅ (classes/objects/traits)     | ✅      | package declarations |
| Java       | .java       | ✅        | ✅ (classes/interfaces/enums)   | ✅      | package declarations |

Language-Specific Features

  • Python: Full support including nested functions, methods, classes, and package structure
  • JavaScript/TypeScript: Functions, arrow functions, classes, and method definitions
  • Rust: Functions, structs, enums, impl blocks, and associated functions
  • Go: Functions, methods, type declarations, and struct definitions
  • Scala: Functions, methods, classes, objects, traits, case classes, and Scala 3 syntax
  • Java: Methods, constructors, classes, interfaces, enums, and annotation types

Installation Options

# Basic Python-only support
uv sync

# Full multi-language support  
uv sync --extra treesitter-full

# Individual language support (if needed)
uv add tree-sitter-python tree-sitter-javascript tree-sitter-typescript tree-sitter-rust tree-sitter-go tree-sitter-scala tree-sitter-java

Language Configuration

The system uses a configuration-driven approach for language support. Each language is defined in codebase_rag/language_config.py with:

  • File extensions: Which files to process
  • AST node types: How to identify functions, classes, etc.
  • Module structure: How modules/packages are organized
  • Name extraction: How to extract names from AST nodes

Adding support for new languages requires only configuration changes, no code modifications.

🐛 Debugging

  1. Check Memgraph connection:

    • Ensure Docker containers are running: docker-compose ps
    • Verify Memgraph is accessible on port 7687
  2. View database in Memgraph Lab:

  3. For local models:

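    • Verify Ollama is running: ollama list
    • Check that the model is downloaded (it should appear in the ollama list output); if it is missing, pull it: ollama pull llama3
    • Test Ollama API: curl http://localhost:11434/v1/models
    • Check Ollama logs: journalctl -u ollama (Linux with systemd) or ~/.ollama/logs/server.log (macOS)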
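
The Memgraph port check from the first item can also be done from Python. The sketch below only tests whether a TCP connection can be opened; it does not speak the Bolt protocol, so a success means the port is open, not that Memgraph is healthy. The function name `memgraph_reachable` is an assumption.

```python
import socket


def memgraph_reachable(host: str = "localhost", port: int = 7687, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

If this returns `False` while `docker-compose ps` shows the container as running, the port mapping in docker-compose.yaml is a reasonable next thing to inspect.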

🤝 Contributing

  1. Follow the established code structure
  2. Keep files under 100 lines (as per user rules)
  3. Use type annotations
  4. Follow conventional commit messages
  5. Use DRY principles

🙋‍♂️ Support

For issues or questions:

  1. Check the logs for error details
  2. Verify Memgraph connection
  3. Ensure all environment variables are set
  4. Verify that the graph schema matches your expectations

