
Commit 9114edc

feat: add scarb docs (#16)
Co-authored-by: alvinouille <[email protected]>
1 parent 4abfbbf commit 9114edc


9 files changed (+235, −22 lines)


CLAUDE.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Project Overview
+
+Cairo Coder is an open-source Cairo language code generation service using Retrieval-Augmented Generation (RAG) to transform natural language requests into functional Cairo smart contracts and programs. It was adapted from the Starknet Agent project.
+
+## Essential Commands
+
+### Development
+
+- `pnpm install` - Install dependencies (requires Node.js 20+ and pnpm 9+)
+- `pnpm dev` - Start all services in development mode with hot reload
+- `pnpm build` - Build all packages for production
+- `pnpm clean` - Clean package build files
+- `pnpm clean:all` - Clean all build files and node_modules
+
+### Testing
+
+- `pnpm test` - Run all tests across packages
+- `pnpm --filter @cairo-coder/agents test` - Run tests for a specific package
+- `pnpm --filter @cairo-coder/agents test -- -t "test name"` - Run a single test by name
+- `pnpm --filter @cairo-coder/backend check-types` - Type-check a specific package
+
+### Documentation Ingestion
+
+- `pnpm generate-embeddings` - Interactive ingestion of documentation sources
+- `pnpm generate-embeddings:yes` - Non-interactive ingestion (for CI/CD)
+
+### Docker Operations
+
+- `docker compose up postgres backend` - Start the main services
+- `docker compose up ingester` - Run documentation ingestion
+
+## High-Level Architecture
+
+### Monorepo Structure
+
+- **packages/agents**: Core RAG pipeline orchestrating query processing, document retrieval, and code generation
+- **packages/backend**: Express API server providing OpenAI-compatible endpoints
+- **packages/ingester**: Documentation processing system using the template method pattern
+- **packages/typescript-config**: Shared TypeScript configuration
+
+### Key Design Patterns
+
+1. **RAG Pipeline** (packages/agents/src/core/pipeline/):
+
+   - `QueryProcessor`: Reformulates user queries for better retrieval
+   - `DocumentRetriever`: Searches the pgvector database using similarity measures
+   - `AnswerGenerator`: Generates Cairo code from retrieved documents
+   - `McpPipeline`: Special mode returning raw documents without generation
+
+2. **Ingester System** (packages/ingester/src/ingesters/):
+
+   - `BaseIngester`: Abstract class implementing the template method pattern
+   - Source-specific ingesters extend the base class for each documentation source
+   - A factory (`IngesterFactory`) creates the appropriate ingester instances
+
+3. **Multi-Provider LLM Support**:
+   - Configurable providers: OpenAI, Anthropic, Google Gemini
+   - Provider abstraction in the agents package handles model differences
+   - Streaming and non-streaming response modes
+
+### Configuration
+
+- Copy `packages/agents/sample.config.toml` to `config.toml`
+- Required configuration:
+  - LLM provider API keys (OPENAI, GEMINI, ANTHROPIC)
+  - Database connection in the [VECTOR_DB] section
+  - Model selection in the [PROVIDERS] section
+- Environment variables:
+  - Root `.env`: PostgreSQL initialization (POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB)
+  - `packages/backend/.env`: Optional LangSmith tracing configuration
+
+### Database Architecture
+
+- PostgreSQL with the pgvector extension for vector similarity search
+- Embedding storage for documentation chunks
+- Configurable similarity measures (cosine, dot product, Euclidean)
+
+## Development Guidelines
+
+### Code Organization
+
+- Follow existing patterns in neighboring files
+- Use dependency injection for testability
+- Mock external dependencies (LLMs, databases) in tests
+- Prefer editing existing files over creating new ones
+- Follow the template method pattern for new ingesters
+
+### Testing Approach
+
+- Jest for all testing
+- Test files live in `__tests__/` directories
+- Mock LLM calls and database operations
+- Test each ingester implementation separately
+- Use descriptive test names that explain the expected behavior
+
+### Adding New Documentation Sources
+
+1. Create a new ingester extending `BaseIngester` in packages/ingester/src/ingesters/
+2. Implement the required abstract methods
+3. Register it in `IngesterFactory`
+4. Update the configuration if needed
+
+### MCP (Model Context Protocol) Mode
+
+- Special mode activated via the `x-mcp-mode: true` header
+- Returns raw documentation chunks without LLM generation
+- Useful for integrating with other tools that need Cairo documentation
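MCP mode is driven entirely by a request header. As an illustrative sketch only (the endpoint path, port, and body shape below are assumptions; the repository documents just the `x-mcp-mode: true` header on its OpenAI-compatible API), a request could be assembled like this:

```typescript
// Sketch of an MCP-mode request to the backend's OpenAI-compatible API.
// ASSUMPTIONS: URL and body shape are hypothetical; only the
// 'x-mcp-mode': 'true' header comes from the repository docs.
interface McpRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}

function buildMcpRequest(query: string): McpRequest {
  return {
    url: 'http://localhost:3001/v1/chat/completions', // assumed local backend address
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-mcp-mode': 'true', // return raw documentation chunks, skip LLM generation
    },
    body: JSON.stringify({ messages: [{ role: 'user', content: query }] }),
  };
}

// Usage (Node 18+):
//   const { url, ...init } = buildMcpRequest('How do I emit an event in Cairo?');
//   const res = await fetch(url, init); // body is JSON-serialized document chunks
```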

packages/agents/src/config/agent.ts

Lines changed: 4 additions & 3 deletions
@@ -4,9 +4,7 @@ import { basicTestTemplate } from './templates/testTemplate';
 import { VectorStore } from '../db/postgresVectorStore';
 import { DocumentSource, RagSearchConfig } from '../types';
 
-export const getAgentConfig = (
-  vectorStore: VectorStore,
-): RagSearchConfig => {
+export const getAgentConfig = (vectorStore: VectorStore): RagSearchConfig => {
   return {
     name: 'Cairo Coder',
     prompts: cairoCoderPrompts,
@@ -19,6 +17,9 @@ export const getAgentConfig = (
       DocumentSource.CAIRO_BOOK,
      DocumentSource.CAIRO_BY_EXAMPLE,
      DocumentSource.STARKNET_FOUNDRY,
+      DocumentSource.CORELIB_DOCS,
+      DocumentSource.OPENZEPPELIN_DOCS,
+      DocumentSource.SCARB_DOCS,
     ],
   };
 };

packages/agents/src/config/prompts/cairoCoderPrompts.ts

Lines changed: 1 addition & 0 deletions
@@ -29,6 +29,7 @@ You will be given a conversation history and a follow-up question. Your primary
 * **cairo_by_example:** Cairo by Example Documentation. Provides practical Cairo code snippets for specific language features or common patterns. Useful for "how-to" syntax questions.
 * **openzeppelin_docs:** OpenZeppelin Cairo Contracts Documentation. For using the OZ library: standard implementations (ERC20, ERC721), access control, security patterns, contract upgradeability. Crucial for building standard-compliant contracts.
 * **corelib_docs:** Cairo Core Library Documentation. For using the Cairo core library: basic types, stdlib functions, stdlib structs, macros, and other core concepts. Essential for Cairo programming questions.
+* **scarb_docs:** Scarb Documentation. For using the Scarb package manager: building, compiling, generating compilation artifacts, managing dependencies, and configuring Scarb.toml.
 
 **Examples:**
 

packages/agents/src/core/pipeline/documentRetriever.ts

Lines changed: 5 additions & 1 deletion
@@ -54,7 +54,11 @@ export class DocumentRetriever {
     ].map(
       (content) => results.flat().find((doc) => doc.pageContent === content)!,
     );
-    logger.debug('Retrieved documents:', { count: uniqueDocs.length });
+    const sourceSet = new Set(uniqueDocs.map((doc) => doc.metadata.source));
+    logger.debug('Retrieved documents:', {
+      count: uniqueDocs.length,
+      sources: Array.from(sourceSet),
+    });
     return uniqueDocs;
   }
 

packages/agents/src/core/pipeline/mcpPipeline.ts

Lines changed: 35 additions & 14 deletions
@@ -1,5 +1,5 @@
 import { RagPipeline } from './ragPipeline';
-import { RagInput, StreamHandler } from '../../types';
+import { RagInput, RetrievedDocuments, StreamHandler } from '../../types';
 import { logger, TokenTracker } from '../../utils';
 
 /**
@@ -14,7 +14,7 @@ export class McpPipeline extends RagPipeline {
     try {
       // Reset token counters at the start of each pipeline run
       TokenTracker.resetSessionCounters();
-
+
       logger.info('Starting MCP pipeline', { query: input.query });

       // Step 1: Process the query
@@ -30,33 +30,54 @@
 
       // Step 3: Return raw documents without answer generation
       logger.info('MCP mode - returning raw documents');
-
-      const rawDocuments = retrieved.documents.map(doc => ({
-        pageContent: doc.pageContent,
-        metadata: doc.metadata
-      }));
+
+      const context = this.assembleDocuments(retrieved);
 
       handler.emitResponse({
-        content: JSON.stringify(rawDocuments, null, 2),
+        content: JSON.stringify(context, null, 2),
       } as any);
 
       logger.debug('MCP pipeline ended');
-
+
       // Log final token usage
       const tokenUsage = TokenTracker.getSessionTokenUsage();
-      logger.info('MCP Pipeline completed', {
+      logger.info('MCP Pipeline completed', {
         query: input.query,
         tokenUsage: {
           promptTokens: tokenUsage.promptTokens,
           responseTokens: tokenUsage.responseTokens,
-          totalTokens: tokenUsage.totalTokens
-        }
+          totalTokens: tokenUsage.totalTokens,
+        },
       });
-
+
       handler.emitEnd();
     } catch (error) {
       logger.error('MCP Pipeline error:', error);
       handler.emitError('An error occurred while processing your request');
     }
   }
-}
+
+  public assembleDocuments(retrieved: RetrievedDocuments): string {
+    const docs = retrieved.documents;
+    if (!docs.length) {
+      return (
+        this.config.prompts.noSourceFoundPrompt ||
+        'No relevant information found.'
+      );
+    }
+
+    // Concatenate all document content into a single string
+    let context = docs.map((doc) => doc.pageContent).join('\n\n');
+
+    // Add contract and test templates at the end if applicable
+    const { isContractRelated, isTestRelated } = retrieved.processedQuery;
+    if (isContractRelated && this.config.contractTemplate) {
+      context += '\n\n' + this.config.contractTemplate;
+    }
+    if (isTestRelated && this.config.testTemplate) {
+      context += '\n\n' + this.config.testTemplate;
+    }
+
+    return context;
+  }
+}

packages/agents/src/types/index.ts

Lines changed: 1 addition & 0 deletions
@@ -103,6 +103,7 @@ export enum DocumentSource {
   CAIRO_BY_EXAMPLE = 'cairo_by_example',
   OPENZEPPELIN_DOCS = 'openzeppelin_docs',
   CORELIB_DOCS = 'corelib_docs',
+  SCARB_DOCS = 'scarb_docs',
 }
 
 export type BookChunk = {

packages/ingester/src/IngesterFactory.ts

Lines changed: 5 additions & 0 deletions
@@ -54,6 +54,10 @@ export class IngesterFactory {
         } = require('./ingesters/CoreLibDocsIngester');
         return new CoreLibDocsIngester();
 
+      case 'scarb_docs':
+        const { ScarbDocsIngester } = require('./ingesters/ScarbDocsIngester');
+        return new ScarbDocsIngester();
+
       default:
         throw new Error(`Unsupported source: ${source}`);
     }
@@ -72,6 +76,7 @@
       DocumentSource.CAIRO_BY_EXAMPLE,
       DocumentSource.OPENZEPPELIN_DOCS,
       DocumentSource.CORELIB_DOCS,
+      DocumentSource.SCARB_DOCS,
     ];
   }
 }
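The factory switch and source registration above follow the template-method-plus-factory structure described in CLAUDE.md. A minimal, self-contained sketch of that structure (class names, method names, and the chunking step here are simplified stand-ins, not the repository's actual `BaseIngester` API):

```typescript
// Simplified stand-in for the ingester pattern: an abstract base fixes the
// pipeline, subclasses supply only the download step, and a factory maps
// source ids to instances. All names here are illustrative.
interface Page {
  name: string;
  content: string;
}

abstract class SketchIngester {
  constructor(protected readonly source: string) {}

  // Template method: the overall pipeline is fixed here...
  async ingest(): Promise<Page[]> {
    const pages = await this.downloadAndExtractDocs();
    // Stand-in for chunking/embedding: drop empty pages
    return pages.filter((p) => p.content.length > 0);
  }

  // ...while each source-specific subclass implements only this step.
  protected abstract downloadAndExtractDocs(): Promise<Page[]>;
}

class FakeDocsIngester extends SketchIngester {
  constructor() {
    super('fake_docs');
  }
  protected async downloadAndExtractDocs(): Promise<Page[]> {
    return [
      { name: 'intro', content: '# Intro' },
      { name: 'empty', content: '' },
    ];
  }
}

// Factory mirroring IngesterFactory's switch on the source id
function createIngester(source: string): SketchIngester {
  switch (source) {
    case 'fake_docs':
      return new FakeDocsIngester();
    default:
      throw new Error(`Unsupported source: ${source}`);
  }
}
```

The design keeps the shared pipeline (download, process, chunk, store) in one place, so adding a source like `scarb_docs` touches only a new subclass and one factory case.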

packages/ingester/src/generateEmbeddings.ts

Lines changed: 1 addition & 4 deletions
@@ -6,7 +6,6 @@ import { loadOpenAIEmbeddingsModels } from '@cairo-coder/backend/config/provider
 import { DocumentSource } from '@cairo-coder/agents/types/index';
 import { IngesterFactory } from './IngesterFactory';
 
-
 /**
  * Global vector store instance
  */
@@ -138,9 +137,7 @@ async function main() {
   if (target === 'Everything') {
     // Ingest all sources
     const sources = IngesterFactory.getAvailableSources();
-    for (const source of sources) {
-      await ingestSource(source);
-    }
+    await Promise.all(sources.map((source) => ingestSource(source)));
   } else {
     // Ingest specific source
     await ingestSource(target);
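This change swaps the sequential `for...of`/`await` loop for `Promise.all`, so all sources are ingested concurrently and total wall time is roughly that of the slowest source rather than the sum. A small self-contained sketch of the behavior (the delayed `work` helper is a stand-in for `ingestSource`):

```typescript
// Illustrative only: `work` stands in for ingestSource. With Promise.all,
// every promise is created (and starts running) up front, and results come
// back in input order regardless of completion order.
const work = (label: string, ms: number): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(label), ms));

async function runConcurrently(): Promise<string[]> {
  // Total wall time ≈ max(30, 10, 20) ms, not the 60 ms a sequential loop takes
  return Promise.all([work('a', 30), work('b', 10), work('c', 20)]);
}
```

One trade-off the diff does not address: `Promise.all` rejects as soon as any single ingestion fails, abandoning the batch, whereas the old loop stopped at the first failing source in order; `Promise.allSettled` is the usual alternative when partial progress should survive a failure.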
packages/ingester/src/ingesters/ScarbDocsIngester.ts

Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+import * as path from 'path';
+import { DocumentSource } from '@cairo-coder/agents/types/index';
+import { BookConfig, BookPageDto } from '../utils/types';
+import { processDocFiles } from '../utils/fileUtils';
+import { logger } from '@cairo-coder/agents/utils/index';
+import { exec as execCallback } from 'child_process';
+import { promisify } from 'util';
+import { MarkdownIngester } from './MarkdownIngester';
+
+/**
+ * Ingester for the Scarb documentation
+ *
+ * This ingester downloads the Scarb documentation from the GitHub repository,
+ * processes the markdown files from the website/docs directory, and creates chunks for the vector store.
+ */
+export class ScarbDocsIngester extends MarkdownIngester {
+  /**
+   * Constructor for the Scarb docs ingester
+   */
+  constructor() {
+    // Define the configuration for the Scarb documentation
+    const config: BookConfig = {
+      repoOwner: 'software-mansion',
+      repoName: 'scarb',
+      fileExtension: '.md',
+      chunkSize: 4096,
+      chunkOverlap: 512,
+    };
+
+    super(config, DocumentSource.SCARB_DOCS);
+  }
+
+  /**
+   * Get the directory path for extracting files
+   *
+   * @returns string - Path to the extract directory
+   */
+  protected getExtractDir(): string {
+    return path.join(__dirname, '..', '..', 'temp', 'scarb-docs');
+  }
+
+  /**
+   * Download and extract the repository
+   *
+   * @returns Promise<BookPageDto[]> - Array of book pages
+   */
+  protected async downloadAndExtractDocs(): Promise<BookPageDto[]> {
+    const extractDir = this.getExtractDir();
+    const repoUrl = `https://github.com/${this.config.repoOwner}/${this.config.repoName}.git`;
+
+    logger.info(`Cloning repository from ${repoUrl}`);
+
+    // Clone the repository
+    const exec = promisify(execCallback);
+    try {
+      await exec(`git clone ${repoUrl} ${extractDir}`);
+    } catch (error) {
+      logger.error('Error cloning repository:', error);
+      throw new Error('Failed to clone repository');
+    }
+
+    logger.info('Repository cloned successfully.');
+
+    // Process the markdown files from website/docs directory
+    const docsDir = path.join(extractDir, 'website', 'docs');
+    const pages = await processDocFiles(this.config, docsDir);
+
+    logger.info(`Processed ${pages.length} documentation pages from Scarb`);
+
+    return pages;
+  }
+}
