
Commit f7aea2f

feat: embeddings for 2025 starknet blog
1 parent 21bcf7c commit f7aea2f

File tree

16 files changed (+4844, -26 lines)

ingesters/README.md

Lines changed: 24 additions & 0 deletions
@@ -86,6 +86,30 @@ The package includes several utility modules:
 - **vectorStoreUtils.ts**: Functions for vector store operations
 - **types.ts**: Common types and interfaces
 
+### Chunking: RecursiveMarkdownSplitter
+
+The `RecursiveMarkdownSplitter` splits markdown content into semantic chunks with metadata (title, unique ID, character offsets, source link). It supports two modes:
+
+- Default mode (size-aware):
+  - Recursively splits by headers (configurable levels), paragraphs, and lines to target `maxChars`.
+  - Merges tiny segments when below `minChars` and applies backward `overlap` between chunks.
+  - Respects fenced code blocks and avoids splitting inside non-breakable blocks when possible.
+
+Example usage:
+
+```ts
+import { RecursiveMarkdownSplitter } from './src/utils/RecursiveMarkdownSplitter';
+
+// Default mode
+const splitter = new RecursiveMarkdownSplitter({
+  maxChars: 2048,
+  minChars: 500,
+  overlap: 256,
+  headerLevels: [1, 2, 3],
+});
+const chunks = splitter.splitMarkdownToChunks(markdown);
+```
+
 ## Usage
 
 To use the ingester package, run the `generateEmbeddings.ts` script:
ingesters/__tests__/IngesterFactory.test.ts

Lines changed: 1 addition & 0 deletions
@@ -75,6 +75,7 @@ describe('IngesterFactory', () => {
       DocumentSource.CORELIB_DOCS,
       DocumentSource.SCARB_DOCS,
       DocumentSource.STARKNET_JS,
+      DocumentSource.STARKNET_BLOG,
     ]);
   });
 });

ingesters/src/IngesterFactory.ts

Lines changed: 4 additions & 0 deletions
@@ -8,6 +8,7 @@ import { OpenZeppelinDocsIngester } from './ingesters/OpenZeppelinDocsIngester';
 import { CoreLibDocsIngester } from './ingesters/CoreLibDocsIngester';
 import { ScarbDocsIngester } from './ingesters/ScarbDocsIngester';
 import { StarknetJSIngester } from './ingesters/StarknetJSIngester';
+import { StarknetBlogIngester } from './ingesters/StarknetBlogIngester';
 
 /**
  * Factory class for creating ingesters
@@ -50,6 +51,9 @@ export class IngesterFactory {
       case 'starknet_js':
         return new StarknetJSIngester();
 
+      case 'starknet_blog':
+        return new StarknetBlogIngester();
+
       default:
         throw new Error(`Unsupported source: ${source}`);
     }
ingesters/src/ingesters/StarknetBlogIngester.ts

Lines changed: 153 additions & 0 deletions

@@ -0,0 +1,153 @@
+import { type BookConfig } from '../utils/types';
+import { MarkdownIngester } from './MarkdownIngester';
+import { type BookChunk, DocumentSource } from '../types';
+import { Document } from '@langchain/core/documents';
+import { VectorStore } from '../db/postgresVectorStore';
+import { logger } from '../utils/logger';
+import * as fs from 'fs/promises';
+import { calculateHash } from '../utils/contentUtils';
+import {
+  RecursiveMarkdownSplitter,
+  type SplitOptions,
+} from '../utils/RecursiveMarkdownSplitter';
+import { getPythonPath, getTempDir } from '../utils/paths';
+
+/**
+ * Ingester for Starknet blog posts documentation
+ *
+ * This ingester processes pre-summarized Starknet blog posts from the generated
+ * summary file, chunks them using the RecursiveMarkdownSplitter, and stores them
+ * in the vector database for retrieval.
+ */
+export class StarknetBlogIngester extends MarkdownIngester {
+  /**
+   * Constructor for the Starknet Blog ingester
+   */
+  constructor() {
+    // Define the configuration for the Starknet Blog
+    const config: BookConfig = {
+      repoOwner: 'starknet',
+      repoName: 'starknet-blog',
+      fileExtension: '.md',
+      chunkSize: 4096,
+      chunkOverlap: 512,
+      baseUrl: 'https://www.starknet.io/blog',
+      urlSuffix: '',
+      useUrlMapping: false,
+    };
+
+    super(config, DocumentSource.STARKNET_BLOG);
+  }
+
+  /**
+   * Read the pre-summarized Starknet blog documentation file
+   */
+  async readSummaryFile(): Promise<string> {
+    const summaryPath = getPythonPath(
+      'src',
+      'scripts',
+      'summarizer',
+      'generated',
+      'blog_summary.md',
+    );
+
+    logger.info(`Reading Starknet blog summary from ${summaryPath}`);
+    const text = await fs.readFile(summaryPath, 'utf-8');
+    return text;
+  }
+
+  /**
+   * Chunk the blog summary file using RecursiveMarkdownSplitter
+   *
+   * This function takes the markdown content and splits it using a recursive
+   * strategy that respects headers and code blocks and maintains overlap
+   * between chunks.
+   *
+   * @param text - The markdown content to chunk
+   * @returns Promise<Document<BookChunk>[]> - Array of document chunks
+   */
+  async chunkSummaryFile(text: string): Promise<Document<BookChunk>[]> {
+    // Configure the splitter with appropriate settings
+    const splitOptions: SplitOptions = {
+      maxChars: 2048,
+      minChars: 500,
+      overlap: 256,
+      headerLevels: [1, 2, 3], // Split on H1/H2/H3 (title uses deepest)
+      preserveCodeBlocks: true,
+      idPrefix: 'starknet-blog',
+      trim: true,
+    };
+
+    // Create the splitter and split the content
+    const splitter = new RecursiveMarkdownSplitter(splitOptions);
+    const chunks = splitter.splitMarkdownToChunks(text);
+
+    logger.info(
+      `Created ${chunks.length} chunks using RecursiveMarkdownSplitter`,
+    );
+
+    // Convert chunks to Document<BookChunk> format
+    const localChunks: Document<BookChunk>[] = chunks.map((chunk) => {
+      const contentHash = calculateHash(chunk.content);
+
+      return new Document<BookChunk>({
+        pageContent: chunk.content,
+        metadata: {
+          name: chunk.meta.title,
+          title: chunk.meta.title,
+          chunkNumber: chunk.meta.chunkNumber, // Already 0-based
+          contentHash: contentHash,
+          uniqueId: chunk.meta.uniqueId,
+          sourceLink: chunk.meta.sourceLink || this.config.baseUrl,
+          source: this.source,
+        },
+      });
+    });
+
+    return localChunks;
+  }
+
+  /**
+   * Starknet Blog specific processing based on the pre-summarized markdown file
+   * @param vectorStore
+   */
+  public override async process(vectorStore: VectorStore): Promise<void> {
+    try {
+      // 1. Read the pre-summarized documentation
+      const text = await this.readSummaryFile();
+
+      // 2. Create chunks from the documentation
+      const chunks = await this.chunkSummaryFile(text);
+
+      logger.info(
+        `Created ${chunks.length} chunks from Starknet blog documentation`,
+      );
+
+      // 3. Update the vector store with the chunks
+      await this.updateVectorStore(vectorStore, chunks);
+
+      // 4. Clean up any temporary files (none in this case)
+      await this.cleanupDownloadedFiles();
+    } catch (error) {
+      this.handleError(error);
+    }
+  }
+
+  /**
+   * Get the directory path for extracting files
+   *
+   * @returns string - Path to the extract directory
+   */
+  protected getExtractDir(): string {
+    return getTempDir('starknet-blog');
+  }
+
+  /**
+   * Override cleanupDownloadedFiles since we don't download anything
+   */
+  protected override async cleanupDownloadedFiles(): Promise<void> {
+    // No cleanup needed as we're reading from a local file
+    logger.info('No cleanup needed - using local summary file');
+  }
+}
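Each chunk's metadata above carries a `contentHash` computed by `calculateHash`, which is what lets `updateVectorStore` detect unchanged chunks on re-ingestion. A minimal sketch of such a fingerprint, assuming an MD5 hex digest (the actual algorithm behind `calculateHash` is not shown in this diff):

```ts
import { createHash } from 'crypto';

// Sketch (assumption): a stable fingerprint of a chunk's text so an
// updated vector store can skip unchanged content. MD5 is chosen here
// only for illustration; any stable digest works for change detection.
function contentHash(content: string): string {
  return createHash('md5').update(content, 'utf8').digest('hex');
}
```

Because the hash depends only on `chunk.content`, re-running the ingester over an unchanged summary file yields identical hashes, so only chunks whose text actually changed need re-embedding.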

ingesters/src/types/index.ts

Lines changed: 1 addition & 0 deletions
@@ -16,6 +16,7 @@ export enum DocumentSource {
   CORELIB_DOCS = 'corelib_docs',
   SCARB_DOCS = 'scarb_docs',
   STARKNET_JS = 'starknet_js',
+  STARKNET_BLOG = 'starknet_blog',
 }
 
 export type BookChunk = {

ingesters/src/utils/RecursiveMarkdownSplitter.ts

Lines changed: 24 additions & 5 deletions
@@ -1,6 +1,15 @@
 import { logger } from './logger';
 
 // Public API interfaces
+/**
+ * Options controlling how markdown is split into chunks. Two high-level modes exist:
+ *
+ * - Default mode (splitFullPage: false):
+ *   Recursively splits by headers (per headerLevels), paragraphs, and lines to respect
+ *   maxChars. Applies minChars-based merging and backward overlap. Avoids splitting
+ *   inside non-breakable code fences when possible.
+ *
+ */
 export interface SplitOptions {
   /** Maximum characters per chunk (UTF-16 .length), not counting overlap. Default: 2048 */
   maxChars?: number;
@@ -72,6 +81,13 @@ interface Tokens {
   sourceRanges: Array<{ start: number; end: number; url: string }>;
 }
 
+/**
+ * Splits markdown into semantic chunks with metadata.
+ *
+ * Modes
+ * - Default: recursive splitting by headers/paragraphs/lines to satisfy maxChars, with overlap and
+ *   minChars-based merging, while respecting code blocks.
+ */
 export class RecursiveMarkdownSplitter {
   private readonly options: Required<SplitOptions>;
 
@@ -124,7 +140,7 @@ export class RecursiveMarkdownSplitter {
   }
 
   /**
-   * Main entry point to split markdown into chunks
+   * Split markdown into chunks
    */
   public splitMarkdownToChunks(markdown: string): Chunk[] {
     // Handle empty input
@@ -209,15 +225,18 @@ export class RecursiveMarkdownSplitter {
   }
 
   /**
-   * Parse special formatted Sources blocks and compute active source ranges
-   * A block looks like:
+   * Parse Sources blocks and compute active source ranges used for meta.sourceLink.
+   *
+   * Format:
    *   ---\n
    *   Sources:\n
    *   - https://example.com/a\n
    *   - https://example.com/b\n
    *   ---
-   * Active source becomes the first URL and applies from the end of the block
-   * until the start of the next Sources block (or end of document).
+   *
+   * The active source becomes the first URL in the list and applies from the end of the closing
+   * '---' until the start of the next Sources block (or EOF). This mapping is used during metadata
+   * attachment to set the chunk's sourceLink.
    */
   private parseSourceRanges(
     markdown: string,
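The Sources-block format documented above can be exercised with a small standalone parser. This regex-based sketch is illustrative only; the class's actual `parseSourceRanges` additionally tracks character ranges so each chunk can be mapped to the source active at its offset:

```ts
// Illustrative parser (assumption, not the repo's code) for the
// documented Sources block format:
//   ---
//   Sources:
//   - https://example.com/a
//   - https://example.com/b
//   ---
// Returns the first URL of each block, i.e. the "active source" that
// applies until the next block. Range tracking is omitted for brevity.
function firstSourceUrls(markdown: string): string[] {
  const blockRe = /^---\r?\nSources:\r?\n((?:- \S+\r?\n)+)---$/gm;
  const urls: string[] = [];
  for (const m of markdown.matchAll(blockRe)) {
    const firstLine = m[1].split(/\r?\n/)[0]; // e.g. "- https://example.com/a"
    urls.push(firstLine.replace(/^- /, ''));
  }
  return urls;
}
```
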

python/optimizers/datasets/user_queries.json

Lines changed: 0 additions & 3 deletions
@@ -466,7 +466,6 @@
   "PS C:\\Users\\kased\\kaseddie-cairo-foundations> tree /F /A\nFolder PATH listing\nVolume serial number is C809-043D\nC:.\n\\---cairo-contracts\n | Scarb.lock\n | Scarb.toml\n |\n +---src\n | lib.cairo\n |\n +---target\n | | CACHEDIR.TAG\n | |\n | \\---dev\n | | kaseddie_balance_contract.sierra.json\n | | kaseddie_balance_contract_integrationtest.test.json\n | | kaseddie_balance_contract_integrationtest.test.sierra.json\n | | kaseddie_balance_contract_integrationtest.test.starknet_artifacts.json\n | | kaseddie_balance_contract_integrationtest_UserVault.test.contract_class.json\n | | kaseddie_balance_contract_unittest.test.json\n | | kaseddie_balance_contract_unittest.test.sierra.json\n | | kaseddie_balance_contract_unittest.test.starknet_artifacts.json\n | | kaseddie_balance_contract_unittest_UserVault.test.contract_class.json\n | | kaseddie_cairo_foundations_unittest.test.json\n | | kaseddie_cairo_foundations_unittest.test.sierra.json\n | | kaseddie_cairo_foundations_unittest.test.starknet_artifacts.json\n | | kaseddie_cairo_foundations_unittest_UserVault.test.contract_class.json\n | |\n | +---.fingerprint\n | | +---core-o8ctti9fe3p52\n | | | core\n | | |\n | | +---core-sc59she7p1k9k\n | | | core\n | | |\n | | +---kaseddie_balance_contract-g7l5vl2d6tbts\n | | | kaseddie_balance_contract\n | | |\n | | +---kaseddie_balance_contract-sfovo0kjo4j24\n | | | kaseddie_balance_contract\n | | |\n | | +---kaseddie_balance_contract_integrationtest-ston3v8tncj0c\n | | | kaseddie_balance_contract_integrationtest\n | | |\n | | +---kaseddie_balance_contract_unittest-95sc4uqcckhdo\n | | | kaseddie_balance_contract_unittest\n | | |\n | | +---kaseddie_balance_contract_unittest-ir7jeflt0lpls\n | | | kaseddie_balance_contract_unittest\n | | |\n | | \\---kaseddie_cairo_foundations_unittest-tvrbv3hnqi4ui\n | | kaseddie_cairo_foundations_unittest\n | |\n | \\---incremental\n | core-o8ctti9fe3p52.bin\n | core-sc59she7p1k9k.bin\n | kaseddie_balance_contract-g7l5vl2d6tbts.bin\n | kaseddie_balance_contract-sfovo0kjo4j24.bin\n | kaseddie_balance_contract_integrationtest-ston3v8tncj0c.bin\n | kaseddie_balance_contract_unittest-95sc4uqcckhdo.bin\n | kaseddie_balance_contract_unittest-ir7jeflt0lpls.bin\n | kaseddie_cairo_foundations_unittest-tvrbv3hnqi4ui.bin\n |\n \\---tests\n uservault_test.cairo\n\nPS C:\\Users\\kased\\kaseddie-cairo-foundations>",
   "que es fn?",
   "que mensaje recomiendas para el assert ?\n\n    fn add_user(ref self: ContractState, user: ContractAddress) {\n        let caller = get_caller_address();\n\n        let mut is_dao: bool = false;\n        let mut i: u16 = 0;\n\n        while i != self.dao_counter.read() {\n            if self.daos.read(i).dao_address == caller {\n                is_dao = true;\n                return;\n            }\n            i += 1;\n        }\n\n        assert!(is_dao, \"User is not a DAO\");\n        _add_user(ref self, user);\n    }",
-  "que tipo de preguntas puedo hacerte?",
   "quiero asignarles roles de mint a contratos de mi proyecto, como me aseguro que estos addres pertenecen a mi proyecto y no son addres de usuarios u otros contratos externos malisiosos?",
   "read files in tests",
   "read files in tests\n\n",
@@ -668,12 +667,10 @@
   "смотри мы пользуемся старкнет девнет для локальной сети, верно? А если я хочу не пустую сеть, а форк текущей",
   "уяви що сьогодні 01.01.2026. Які криптомонети виросли найбільше в ціні?",
   "چطوری داداش",
-  "스타크넷의 회사 위치와 사진을 보여줘",
   "今天天氣如何",
   "介紹一下STRK",
   "介绍一下 Starknet 上的 Paymaster",
   "但實際上不是遠遠超過了預估的費用嗎",
-  "你現在是C#程式語言高手",
   "你能给我一个最新版本的 cairo 项目的配置吗?",
   "你能给我一个正确的 starknet",
   "关于 Cairo 编写的 Starknet 合约中的 Storage 与 State,下列说法哪些是正确的?(多选)\n\nA. 合约状态变量的存储是通过隐式的 Merkle Patricia Tree 实现的\n\nB. 每个 @storage_var 声明会创建一个对应的 getter 函数\n\nC. Cairo 中不能在合约外部直接读取存储变量\n\nD. 合约的存储数据按 Slot 和 Offset 编码组织\n\nE. Storage layout 是编译时静态生成的,不能动态调整",
