MarkdownTextSplitter produces too many chunks and alters markdown structure #1439

@pavel-blagodov

Description

The NewMarkdownTextSplitter produces significantly more chunks than expected based on WithChunkSize, and it alters the markdown structure, particularly when dealing with lists containing nested content (code blocks, nested lists).

Expected Behavior (please correct me if I'm wrong)

  1. Chunks should be closer to the specified chunkSize (within reasonable bounds)
  2. Markdown structure should be preserved, especially list items with nested content
  3. When WithCodeBlocks(true) or WithJoinTableRows(true) is enabled, chunks may exceed chunkSize to keep code blocks/tables intact, but should still respect it when possible

Actual Behavior

  1. Many chunks are produced that are much smaller than chunkSize
  2. Content nested inside a list item (a paragraph or a code block) gets converted into separate list items
  3. The splitter appears to split on every markdown element rather than respecting the chunk size settings
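
The list-item problem can be detected mechanically by comparing the number of top-level ordered list items before and after splitting. The following is a minimal sketch under stated assumptions: countTopLevelOrderedItems is a hypothetical helper (not a langchaingo function), the sample strings are shortened stand-ins for the real input and output, and the line-based check does not skip fenced code blocks, so a code line starting with "1. " would be miscounted.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// orderedItem matches an ordered list marker at column 0, e.g. "3. ".
var orderedItem = regexp.MustCompile(`^\d+\.\s`)

// countTopLevelOrderedItems counts lines that start an unindented ordered
// list item. Indented continuation content is deliberately not counted.
func countTopLevelOrderedItems(md string) int {
	n := 0
	for _, line := range strings.Split(md, "\n") {
		if orderedItem.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	// Input: the paragraph is indented, so it belongs to item 3.
	before := "1. Download the installer.\n" +
		"2. Run the installation script.\n" +
		"3. Configure your settings.\n\n" +
		"   For example, the following configuration sets up a basic server:\n\n" +
		"4. Start the application.\n"

	// Splitter-style output: the paragraph became its own numbered item.
	after := "1. Download the installer.\n" +
		"2. Run the installation script.\n" +
		"3. Configure your settings.\n" +
		"4. For example, the following configuration sets up a basic server:\n" +
		"5. Start the application.\n"

	fmt.Printf("before=%d after=%d\n",
		countTopLevelOrderedItems(before),
		countTopLevelOrderedItems(after)) // prints before=4 after=5
}
```

A mismatch between the two counts indicates that nested content was promoted to list items.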

Minimal Reproduction

See the attached langchaingo_issue_reproduction.go file for a complete, runnable example that clearly demonstrates all issues.

To run:

go run langchaingo_issue_reproduction.go

The script will:

  1. Show the configuration (chunkSize=500, expecting ~1-2 chunks)
  2. Display all produced chunks with their estimated token counts
  3. Clearly identify each issue with numbered error messages
  4. Provide a summary of all problems found

langchaingo_issue_reproduction.go

package main

import (
	"fmt"
	"strings"

	"github.com/tmc/langchaingo/textsplitter"
)

func main() {
	// Example markdown with nested lists and code blocks
	markdown := "# Getting Started Guide\n\n" +
		"## Prerequisites\n\n" +
		"- You need to have Node.js installed on your system.\n" +
		"- You should have a text editor ready.\n" +
		"- Make sure you have internet connectivity.\n\n" +
		"## Installation Steps\n\n" +
		"To install the application:\n\n" +
		"1. Download the installer from the official website.\n" +
		"2. Run the installation script:\n" +
		"   \\`\\`\\`bash\n" +
		"   ./install.sh\n" +
		"   chmod +x install.sh\n" +
		"   \\`\\`\\`\n" +
		"3. Configure your settings in the config file.\n\n" +
		"   For example, the following configuration sets up a basic server:\n\n" +
		"   \\`\\`\\`json\n" +
		"   {\n" +
		"     \"host\": \"localhost\",\n" +
		"     \"port\": 8080,\n" +
		"     \"timeout\": 30\n" +
		"   }\n" +
		"   \\`\\`\\`\n" +
		"4. Start the application and verify it's running.\n\n" +
		"## Configuration Options\n\n" +
		"The application supports several configuration options:\n\n" +
		"- **Host**: The server hostname or IP address\n" +
		"- **Port**: The port number to listen on\n" +
		"- **Timeout**: Connection timeout in seconds\n\n" +
		"### Advanced Settings\n\n" +
		"For advanced users, you can configure:\n\n" +
		"1. Database connection settings\n" +
		"2. Cache configuration\n" +
		"3. Logging preferences\n"

	fmt.Println("=" + strings.Repeat("=", 78) + "=")
	fmt.Println("LANGCHAINGO MarkdownTextSplitter Issue Reproduction")
	fmt.Println("=" + strings.Repeat("=", 78) + "=")
	fmt.Println()

	// Configuration
	chunkSize := 500
	chunkOverlap := 100

	fmt.Printf("Configuration:\n")
	fmt.Printf("  - chunkSize: %d tokens\n", chunkSize)
	fmt.Printf("  - chunkOverlap: %d tokens\n", chunkOverlap)
	fmt.Printf("  - Input size: %d characters (~%d tokens)\n", len(markdown), len(markdown)/4)
	fmt.Println()

	// Create splitter
	splitter := textsplitter.NewMarkdownTextSplitter(
		textsplitter.WithChunkSize(chunkSize),
		textsplitter.WithChunkOverlap(chunkOverlap),
		textsplitter.WithModelName("text-embedding-3-large"),
		textsplitter.WithReferenceLinks(false),
		textsplitter.WithSeparators([]string{"\n\n", "\n", " ", ""}),
		textsplitter.WithKeepSeparator(true),
		textsplitter.WithHeadingHierarchy(true),
		textsplitter.WithCodeBlocks(true),
		textsplitter.WithJoinTableRows(true),
	)

	chunks, err := splitter.SplitText(markdown)
	if err != nil {
		fmt.Printf("ERROR: %v\n", err)
		return
	}

	fmt.Println("=" + strings.Repeat("=", 78) + "=")
	fmt.Printf("RESULTS: %d chunks produced (expected: ~1-2 chunks)\n", len(chunks))
	fmt.Println("=" + strings.Repeat("=", 78) + "=")
	fmt.Println()

	// Display each chunk
	for i, chunk := range chunks {
		estimatedTokens := len(chunk) / 4 // rough approximation
		fmt.Printf("--- Chunk %d (estimated ~%d tokens, expected ~%d) ---\n", i+1, estimatedTokens, chunkSize)
		fmt.Println(chunk)
		fmt.Println()
	}
}

Output

================================================================================
LANGCHAINGO MarkdownTextSplitter Issue Reproduction
================================================================================

Configuration:
  - chunkSize: 500 tokens
  - chunkOverlap: 100 tokens
  - Input size: 1003 characters (~250 tokens)

================================================================================
RESULTS: 5 chunks produced (expected: ~1-2 chunks)
================================================================================

--- Chunk 1 (estimated ~5 tokens, expected ~500) ---
# Getting Started Guide

--- Chunk 2 (estimated ~44 tokens, expected ~500) ---
# Getting Started Guide
## Prerequisites
- You need to have Node.js installed on your system.
- You should have a text editor ready.
- Make sure you have internet connectivity.

--- Chunk 3 (estimated ~118 tokens, expected ~500) ---
# Getting Started Guide
## Installation Steps
To install the application:
1. Download the installer from the official website.
2. Run the installation script:
\`\`\`bash
./install.sh
chmod +x install.sh
\`\`\`
3. Configure your settings in the config file.
4. For example, the following configuration sets up a basic server:
5. \`\`\`json
{
"host": "localhost",
"port": 8080,
"timeout": 30
}
\`\`\`
6. Start the application and verify it's running.

--- Chunk 4 (estimated ~59 tokens, expected ~500) ---
# Getting Started Guide
## Configuration Options
The application supports several configuration options:
- **Host**: The server hostname or IP address
- **Port**: The port number to listen on
- **Timeout**: Connection timeout in seconds

--- Chunk 5 (estimated ~46 tokens, expected ~500) ---
# Getting Started Guide
## Configuration Options
### Advanced Settings
For advanced users, you can configure:
1. Database connection settings
2. Cache configuration
3. Logging preferences

Observations

  • The nested code block has to be escaped, e.g. \`\`\`. Otherwise it gets stripped out.

  • The nested content "For example, the following configuration sets up a basic server:" gets converted into a list item, along with the code block that follows it.
