Commit b23b9bc

feat: add starknet_js as documentation source (#59)
1 parent b19d022 commit b23b9bc

15 files changed, with 260 additions and 38 deletions

AGENTS.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+# Agents Working Protocol
+
+This file documents conventions and checklists for making changes that affect the Cairo Coder agent system. Its scope applies to the entire repository.
+
+## Adding a Documentation Source
+
+When adding a new documentation source (e.g., a new docs site or SDK) make sure to complete all of the following steps:
+
+1. TypeScript ingestion (packages/ingester)
+
+   - Create an ingester class extending `BaseIngester` or `MarkdownIngester` under `packages/ingester/src/ingesters/`.
+   - Register it in `packages/ingester/src/IngesterFactory.ts`.
+   - Ensure chunks carry correct metadata: `uniqueId`, `contentHash`, `sourceLink`, and `source`.
+   - Run `pnpm generate-embeddings` (or `generate-embeddings:yes`) to populate/update the vector store.
+
+2. Agents (TS)
+
+   - Add the new enum value to `packages/agents/src/types/index.ts` under `DocumentSource`.
+   - Verify Postgres vector store accepts the new `source` and filters on it (`packages/agents/src/db/postgresVectorStore.ts`).
+
+3. Retrieval Pipeline (Python)
+
+   - Add the new enum value to `python/src/cairo_coder/core/types.py` under `DocumentSource`.
+   - Ensure filtering by `metadata->>'source'` works with the new value in `python/src/cairo_coder/dspy/document_retriever.py`.
+   - Update the query processor resource descriptions in `python/src/cairo_coder/dspy/query_processor.py` (`RESOURCE_DESCRIPTIONS`).
+
+4. Optimized Program Files (Python) — required
+
+   - If the query processor or retrieval prompts are optimized via compiled DSPy programs, you must also update the optimized program artifacts so they reflect the new resource.
+   - Specifically, review and update: `python/optimizers/results/optimized_retrieval_program.json` (and any other relevant optimized files, e.g., `optimized_rag.json`, `optimized_mcp_program.json`).
+   - Regenerate these artifacts if your change affects prompt instructions, available resource lists, or selection logic.
+
+5. API and Docs
+
+   - Ensure the new source appears where appropriate (e.g., `/v1/agents` output and documentation tables):
+     - `API_DOCUMENTATION.md`
+     - `packages/ingester/README.md`
+     - Any user-facing lists of supported sources
+
+6. Quick Sanity Check
+   - Ingest a small subset (or run a dry-run) and verify: rows exist in the vector DB with the new `source`, links open correctly, and retrieval can filter by the new source.
+
+## Notes
+
+- Keep changes minimal and consistent with existing style.
+- Do not commit credentials or large artifacts; optimized program JSONs are small and versioned.
+- If you add new files that define agent behavior, document them here.
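
The metadata requirement in step 1 and the sanity check in step 6 can be sketched as a small validation helper. This is only an illustration: `ChunkMetadata`, `REQUIRED_FIELDS`, and `missingMetadataFields` are hypothetical names that do not appear in the repository, and the sample values are made up.

```typescript
// Hypothetical shape: only the four fields named in the checklist.
interface ChunkMetadata {
  uniqueId?: string;
  contentHash?: string;
  sourceLink?: string;
  source?: string;
}

// Fields the checklist says every chunk must carry.
const REQUIRED_FIELDS = ['uniqueId', 'contentHash', 'sourceLink', 'source'] as const;

// Returns the names of required fields that are missing or blank.
function missingMetadataFields(metadata: ChunkMetadata): string[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = metadata[field];
    return typeof value !== 'string' || value.trim() === '';
  });
}

// Illustrative sample chunk metadata (values are assumptions, not real data).
const complete: ChunkMetadata = {
  uniqueId: 'intro-0',
  contentHash: 'abc123',
  sourceLink: 'https://example.com/docs/intro',
  source: 'starknet_js',
};

console.log(missingMetadataFields(complete));
console.log(missingMetadataFields({ uniqueId: 'intro-0', source: 'starknet_js' }));
```

A check like this could run over a small ingested subset before embeddings are generated, so that incomplete chunks are caught before they reach the vector store.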

API_DOCUMENTATION.md

Lines changed: 3 additions & 1 deletion
@@ -57,7 +57,8 @@ Lists every agent registered in Cairo Coder.
         "cairo_by_example",
         "openzeppelin_docs",
         "corelib_docs",
-        "scarb_docs"
+        "scarb_docs",
+        "starknet_js"
       ]
     },
     {
@@ -80,6 +81,7 @@ Lists every agent registered in Cairo Coder.
 | `openzeppelin_docs` | OpenZeppelin Cairo contracts documentation |
 | `corelib_docs` | Cairo core library docs |
 | `scarb_docs` | Scarb package manager documentation |
+| `starknet_js` | StarknetJS guides and SDK documentation |

 ## Chat Completions

packages/agents/src/types/index.ts

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ export enum DocumentSource {
   OPENZEPPELIN_DOCS = 'openzeppelin_docs',
   CORELIB_DOCS = 'corelib_docs',
   SCARB_DOCS = 'scarb_docs',
+  STARKNET_JS = 'starknet_js',
 }

 export type BookChunk = {

packages/ingester/README.md

Lines changed: 3 additions & 0 deletions
@@ -29,6 +29,9 @@ The ingester currently supports the following documentation sources:
 3. **Starknet Foundry** (`starknet_foundry`): Documentation for the Starknet Foundry testing framework
 4. **Cairo By Example** (`cairo_by_example`): Examples of Cairo programming
 5. **OpenZeppelin Docs** (`openzeppelin_docs`): OpenZeppelin documentation for Starknet
+6. **Core Library Docs** (`corelib_docs`): Cairo core library documentation
+7. **Scarb Docs** (`scarb_docs`): Scarb package manager documentation
+8. **StarknetJS Guides** (`starknet_js`): StarknetJS guides and tutorials

 ## Architecture

packages/ingester/__tests__/IngesterFactory.test.ts

Lines changed: 1 addition & 0 deletions
@@ -85,6 +85,7 @@ describe('IngesterFactory', () => {
         'openzeppelin_docs',
         'corelib_docs',
         'scarb_docs',
+        'starknet_js',
       ]);
     });
   });

packages/ingester/src/IngesterFactory.ts

Lines changed: 8 additions & 9 deletions
@@ -58,6 +58,12 @@ export class IngesterFactory {
         const { ScarbDocsIngester } = require('./ingesters/ScarbDocsIngester');
         return new ScarbDocsIngester();

+      case 'starknet_js':
+        const {
+          StarknetJSIngester,
+        } = require('./ingesters/StarknetJSIngester');
+        return new StarknetJSIngester();
+
       default:
         throw new Error(`Unsupported source: ${source}`);
     }
@@ -69,14 +75,7 @@ export class IngesterFactory {
    * @returns Array of available document sources
    */
   public static getAvailableSources(): DocumentSource[] {
-    return [
-      DocumentSource.CAIRO_BOOK,
-      DocumentSource.STARKNET_DOCS,
-      DocumentSource.STARKNET_FOUNDRY,
-      DocumentSource.CAIRO_BY_EXAMPLE,
-      DocumentSource.OPENZEPPELIN_DOCS,
-      DocumentSource.CORELIB_DOCS,
-      DocumentSource.SCARB_DOCS,
-    ];
+    const sources = Object.values(DocumentSource);
+    return sources;
   }
 }
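
The second hunk replaces a hand-maintained list with `Object.values(DocumentSource)`, so a newly added enum member is picked up automatically. The sketch below illustrates why this works for a string enum. It uses a reduced copy of the enum containing only the members visible in this commit, and the `Priority` enum and `numericEnumValues` helper are hypothetical, included only for contrast.

```typescript
// Reduced copy of the enum: only members shown in this commit's diff.
enum DocumentSource {
  OPENZEPPELIN_DOCS = 'openzeppelin_docs',
  CORELIB_DOCS = 'corelib_docs',
  SCARB_DOCS = 'scarb_docs',
  STARKNET_JS = 'starknet_js',
}

// String enums compile to a plain key -> value object, so Object.values
// yields exactly one entry per member.
function getAvailableSources(): DocumentSource[] {
  return Object.values(DocumentSource);
}

// Hypothetical numeric enum, for contrast: numeric enums also get reverse
// mappings (value -> name), so Object.values returns names and numbers.
enum Priority {
  LOW,
  HIGH,
}

function numericEnumValues(): (string | number)[] {
  return Object.values(Priority);
}

console.log(getAvailableSources());
console.log(numericEnumValues());
```

The approach is safe here because every `DocumentSource` member has a string value. If a numeric member were ever added, the returned array would also contain member names, and the explicit list (or a filter) would be needed again.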
Lines changed: 160 additions & 0 deletions
@@ -0,0 +1,160 @@
+import * as path from 'path';
+import { exec as execCallback } from 'child_process';
+import { promisify } from 'util';
+import * as fs from 'fs/promises';
+import { BookConfig, BookPageDto, ParsedSection } from '../utils/types';
+import { MarkdownIngester } from './MarkdownIngester';
+import { DocumentSource, logger } from '@cairo-coder/agents';
+import { Document } from '@langchain/core/documents';
+import { BookChunk } from '@cairo-coder/agents/types/index';
+import { calculateHash } from '../utils/contentUtils';
+
+export class StarknetJSIngester extends MarkdownIngester {
+  private static readonly SKIPPED_DIRECTORIES = ['pictures', 'doc_scripts'];
+
+  constructor() {
+    const config: BookConfig = {
+      repoOwner: 'starknet-io',
+      repoName: 'starknet.js',
+      fileExtension: '.md',
+      chunkSize: 4096,
+      chunkOverlap: 512,
+    };
+
+    super(config, DocumentSource.STARKNET_JS);
+  }
+
+  protected getExtractDir(): string {
+    return path.join(__dirname, '..', '..', 'temp', 'starknet-js-guides');
+  }
+
+  protected async downloadAndExtractDocs(): Promise<BookPageDto[]> {
+    const extractDir = this.getExtractDir();
+    const repoUrl = `https://github.com/${this.config.repoOwner}/${this.config.repoName}.git`;
+    const exec = promisify(execCallback);
+
+    try {
+      // Clone the repository
+      // TODO: Consider sparse clone optimization for efficiency:
+      // git clone --depth 1 --filter=blob:none --sparse ${repoUrl} ${extractDir}
+      // cd ${extractDir} && git sparse-checkout set www/docs/guides
+      logger.info(`Cloning repository from ${repoUrl}...`);
+      await exec(`git clone ${repoUrl} ${extractDir}`);
+      logger.info('Repository cloned successfully');
+
+      // Navigate to the guides directory
+      const docsDir = path.join(extractDir, 'www', 'docs', 'guides');
+
+      // Process markdown files from the guides directory
+      const pages: BookPageDto[] = [];
+      await this.processDirectory(docsDir, docsDir, pages);
+
+      logger.info(
+        `Processed ${pages.length} markdown files from StarknetJS guides`,
+      );
+      return pages;
+    } catch (error) {
+      logger.error('Error downloading StarknetJS guides:', error);
+      throw new Error('Failed to download and extract StarknetJS guides');
+    }
+  }
+
+  private async processDirectory(
+    dir: string,
+    baseDir: string,
+    pages: BookPageDto[],
+  ): Promise<void> {
+    const entries = await fs.readdir(dir, { withFileTypes: true });
+
+    for (const entry of entries) {
+      const fullPath = path.join(dir, entry.name);
+
+      if (entry.isDirectory()) {
+        // Skip configured directories
+        if (StarknetJSIngester.SKIPPED_DIRECTORIES.includes(entry.name)) {
+          logger.debug(`Skipping directory: ${entry.name}`);
+          continue;
+        }
+        // Recursively process subdirectories
+        await this.processDirectory(fullPath, baseDir, pages);
+      } else if (entry.isFile() && entry.name.endsWith('.md')) {
+        // Read the markdown file
+        const content = await fs.readFile(fullPath, 'utf-8');
+
+        // Create relative path without extension for the name
+        const relativePath = path.relative(baseDir, fullPath);
+        const name = relativePath.replace(/\.md$/, '');
+
+        pages.push({
+          name,
+          content,
+        });
+
+        logger.debug(`Processed file: ${name}`);
+      }
+    }
+  }
+
+  protected parsePage(
+    content: string,
+    split: boolean = false,
+  ): ParsedSection[] {
+    // Strip frontmatter before parsing
+    const strippedContent = this.stripFrontmatter(content);
+    return super.parsePage(strippedContent, split);
+  }
+
+  public stripFrontmatter(content: string): string {
+    // Remove YAML frontmatter if present (delimited by --- at start and end)
+    const frontmatterRegex = /^---\n[\s\S]*?\n---\n?/;
+    return content.replace(frontmatterRegex, '').trimStart();
+  }
+
+  /**
+   * Create chunks from a single page with a proper source link to GitHub
+   * This overrides the default to attach a meaningful URL for UI display.
+   */
+  protected createChunkFromPage(
+    page_name: string,
+    page_content: string,
+  ): Document<BookChunk>[] {
+    const baseUrl =
+      'https://github.com/starknet-io/starknet.js/blob/main/www/docs/guides';
+    const pageUrl = `${baseUrl}/${page_name}.md`;
+
+    const localChunks: Document<BookChunk>[] = [];
+    const sanitizedContent = this.sanitizeCodeBlocks(
+      this.stripFrontmatter(page_content),
+    );
+
+    const sections = this.parsePage(sanitizedContent, true);
+
+    sections.forEach((section: ParsedSection, index: number) => {
+      // Reuse hashing and metadata shape from parent implementation by constructing Document directly
+      // Importantly, attach a stable page-level sourceLink for the UI
+      const content = section.content;
+      const title = section.title;
+      const uniqueId = `${page_name}-${index}`;
+
+      // Lightweight hash to keep parity with other ingesters without duplicating util impl
+      const contentHash = calculateHash(content);
+
+      localChunks.push(
+        new Document<BookChunk>({
+          pageContent: content,
+          metadata: {
+            name: page_name,
+            title,
+            chunkNumber: index,
+            contentHash,
+            uniqueId,
+            sourceLink: pageUrl,
+            source: this.source,
+          },
+        }),
+      );
+    });
+
+    return localChunks;
+  }
+}
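
The behavior of `stripFrontmatter` can be checked in isolation. The sketch below copies the regex from the ingester into a standalone function and adds a variant, `stripFrontmatterCrlf`, that also tolerates Windows line endings. The variant is a hypothetical illustration and is not part of the commit; the sample page text is made up.

```typescript
// Standalone copy of the regex used by the ingester (LF line endings only).
function stripFrontmatter(content: string): string {
  const frontmatterRegex = /^---\n[\s\S]*?\n---\n?/;
  return content.replace(frontmatterRegex, '').trimStart();
}

// Hypothetical variant (not in the commit): also accepts CRLF line endings.
function stripFrontmatterCrlf(content: string): string {
  const frontmatterRegex = /^---\r?\n[\s\S]*?\r?\n---\r?\n?/;
  return content.replace(frontmatterRegex, '').trimStart();
}

// Illustrative page with a YAML frontmatter block.
const page = '---\nsidebar_position: 1\n---\n\n# Getting started\n';

// A horizontal rule later in the document is left alone, because the
// pattern is anchored to the start of the content.
const noFrontmatter = '# Title\n\n---\n\nBody text';

// With CRLF endings the original pattern does not match at all.
const crlfPage = '---\r\nsidebar_position: 1\r\n---\r\n# Getting started';

console.log(JSON.stringify(stripFrontmatter(page)));
console.log(JSON.stringify(stripFrontmatter(noFrontmatter)));
console.log(JSON.stringify(stripFrontmatter(crlfPage)));
console.log(JSON.stringify(stripFrontmatterCrlf(crlfPage)));
```

Whether CRLF handling matters depends on how the repository is checked out. On a clone with LF line endings, the regex in the commit should be sufficient.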

python/MAINTAINER_GUIDE.md

Lines changed: 1 addition & 1 deletion
@@ -72,7 +72,7 @@ graph TD
 Cairo Coder's goal is to democratize Cairo development by providing an intelligent code generation service that:

 - Understands natural language queries (e.g., "Create an ERC20 token with minting").
-- Retrieves relevant documentation from sources like Cairo Book, Starknet Docs, Scarb, OpenZeppelin.
+- Retrieves relevant documentation from sources like Cairo Book, Starknet Docs, Scarb, OpenZeppelin, and StarknetJS.
 - Generates compilable Cairo code with explanations, following best practices.
 - Supports specialized agents (e.g., for Scarb config, Starknet deployment).
 - Is optimizable to improve accuracy over time using datasets like Starklings exercises.
