Open
Description
The NewMarkdownTextSplitter produces significantly more chunks than expected based on WithChunkSize, and it alters the markdown structure, particularly when dealing with lists containing nested content (code blocks, nested lists).
Expected Behavior (please correct me if I'm wrong)
- Chunks should be closer to the specified chunkSize (within reasonable bounds)
- Markdown structure should be preserved, especially list items with nested content
- When WithCodeBlocks(true) or WithJoinTableRows(true) is enabled, chunks may exceed chunkSize to keep code blocks/tables intact, but should still respect it when possible
Actual Behavior
- Many chunks are produced that are much smaller than chunkSize
- List item content gets converted to list items
- The splitter appears to split on every markdown element rather than respecting the chunk size settings
Minimal Reproduction
See the attached langchaingo_issue_reproduction.go file for a complete, runnable example that clearly demonstrates all issues.
To run:
go run langchaingo_issue_reproduction.go
The script will:
- Show the configuration (chunkSize=500, expecting ~1-2 chunks)
- Display all produced chunks with their estimated token counts
- Clearly identify each issue with numbered error messages
- Provide a summary of all problems found
langchaingo_issue_reproduction.go
package main

import (
	"fmt"
	"strings"

	"github.com/tmc/langchaingo/textsplitter"
)

func main() {
	// Example markdown with nested lists and code blocks
	markdown := "# Getting Started Guide\n\n" +
		"## Prerequisites\n\n" +
		"- You need to have Node.js installed on your system.\n" +
		"- You should have a text editor ready.\n" +
		"- Make sure you have internet connectivity.\n\n" +
		"## Installation Steps\n\n" +
		"To install the application:\n\n" +
		"1. Download the installer from the official website.\n" +
		"2. Run the installation script:\n" +
		" \\`\\`\\`bash\n" +
		" ./install.sh\n" +
		" chmod +x install.sh\n" +
		" \\`\\`\\`\n" +
		"3. Configure your settings in the config file.\n\n" +
		" For example, the following configuration sets up a basic server:\n\n" +
		" \\`\\`\\`json\n" +
		" {\n" +
		" \"host\": \"localhost\",\n" +
		" \"port\": 8080,\n" +
		" \"timeout\": 30\n" +
		" }\n" +
		" \\`\\`\\`\n" +
		"4. Start the application and verify it's running.\n\n" +
		"## Configuration Options\n\n" +
		"The application supports several configuration options:\n\n" +
		"- **Host**: The server hostname or IP address\n" +
		"- **Port**: The port number to listen on\n" +
		"- **Timeout**: Connection timeout in seconds\n\n" +
		"### Advanced Settings\n\n" +
		"For advanced users, you can configure:\n\n" +
		"1. Database connection settings\n" +
		"2. Cache configuration\n" +
		"3. Logging preferences\n"

	fmt.Println("=" + strings.Repeat("=", 78) + "=")
	fmt.Println("LANGCHAINGO MarkdownTextSplitter Issue Reproduction")
	fmt.Println("=" + strings.Repeat("=", 78) + "=")
	fmt.Println()

	// Configuration
	chunkSize := 500
	chunkOverlap := 100
	fmt.Printf("Configuration:\n")
	fmt.Printf(" - chunkSize: %d tokens\n", chunkSize)
	fmt.Printf(" - chunkOverlap: %d tokens\n", chunkOverlap)
	fmt.Printf(" - Input size: %d characters (~%d tokens)\n", len(markdown), len(markdown)/4)
	fmt.Println()

	// Create splitter
	splitter := textsplitter.NewMarkdownTextSplitter(
		textsplitter.WithChunkSize(chunkSize),
		textsplitter.WithChunkOverlap(chunkOverlap),
		textsplitter.WithModelName("text-embedding-3-large"),
		textsplitter.WithReferenceLinks(false),
		textsplitter.WithSeparators([]string{"\n\n", "\n", " ", ""}),
		textsplitter.WithKeepSeparator(true),
		textsplitter.WithHeadingHierarchy(true),
		textsplitter.WithCodeBlocks(true),
		textsplitter.WithJoinTableRows(true),
	)

	chunks, err := splitter.SplitText(markdown)
	if err != nil {
		fmt.Printf("ERROR: %v\n", err)
		return
	}

	fmt.Println("=" + strings.Repeat("=", 78) + "=")
	fmt.Printf("RESULTS: %d chunks produced (expected: ~1-2 chunks)\n", len(chunks))
	fmt.Println("=" + strings.Repeat("=", 78) + "=")
	fmt.Println()

	// Display each chunk
	for i, chunk := range chunks {
		estimatedTokens := len(chunk) / 4 // rough approximation
		fmt.Printf("--- Chunk %d (estimated ~%d tokens, expected ~%d) ---\n", i+1, estimatedTokens, chunkSize)
		fmt.Println(chunk)
		fmt.Println()
	}
}
Output
================================================================================
LANGCHAINGO MarkdownTextSplitter Issue Reproduction
================================================================================
Configuration:
- chunkSize: 500 tokens
- chunkOverlap: 100 tokens
- Input size: 1003 characters (~250 tokens)
================================================================================
RESULTS: 5 chunks produced (expected: ~1-2 chunks)
================================================================================
--- Chunk 1 (estimated ~5 tokens, expected ~500) ---
# Getting Started Guide
--- Chunk 2 (estimated ~44 tokens, expected ~500) ---
# Getting Started Guide
## Prerequisites
- You need to have Node.js installed on your system.
- You should have a text editor ready.
- Make sure you have internet connectivity.
--- Chunk 3 (estimated ~118 tokens, expected ~500) ---
# Getting Started Guide
## Installation Steps
To install the application:
1. Download the installer from the official website.
2. Run the installation script:
\`\`\`bash
./install.sh
chmod +x [install.sh](http://install.sh)
\`\`\`
3. Configure your settings in the config file.
4. For example, the following configuration sets up a basic server:
5. \`\`\`json
{
“host”: “localhost”,
“port”: 8080,
“timeout”: 30
}
\`\`\`
6. Start the application and verify it’s running.
--- Chunk 4 (estimated ~59 tokens, expected ~500) ---
# Getting Started Guide
## Configuration Options
The application supports several configuration options:
- **Host**: The server hostname or IP address
- **Port**: The port number to listen on
- **Timeout**: Connection timeout in seconds
--- Chunk 5 (estimated ~46 tokens, expected ~500) ---
# Getting Started Guide
## Configuration Options
### Advanced Settings
For advanced users, you can configure:
1. Database connection settings
2. Cache configuration
3. Logging preferences
Observations
- The fence of the nested code block has to be escaped (e.g. \`\`\`). Otherwise it gets stripped out.
- The nested content "For example, the following configuration sets up a basic server:" gets converted into a list item, along with the code block that follows it.