More intelligent chunking strategy #327
srtab started this conversation in Show and tell
The related MR: #328
DAIV Chunking System
Note
Summary:
DAIV’s new chunking system improves code structure preservation by using MarkdownHeaderTextSplitter for markdown, Chonkie for code, and RecursiveCharacterTextSplitter as a fallback, solving previous boundary and fragmentation issues.
Overview
The DAIV chunking system is designed to preprocess code for embedding and retrieval tasks by intelligently segmenting content to respect model context limits and optimize cost-efficiency.
It is based on two core strategies:
LanguageParser (from LangChain).
RecursiveCharacterTextSplitter (from LangChain).
Design Goals
This process is intended to:
Current Limitations
Although the current implementation shows a degree of sophistication, it has not yet achieved the performance goals established for DAIV.
Identified issues include:
Specific examples:
These issues result in the LLM often receiving incomplete context, negatively affecting performance.
Ongoing Improvements
After further research and experimentation, several solutions to improve chunking precision have been identified.
Qodo has developed a solution that fits DAIV’s needs particularly well. They use language-specific static analysis to recursively divide nodes into smaller chunks and perform retroactive processing to re-add any critical context that was removed during chunking.
This approach allows the system to create chunks that respect the code structure, keeping related elements together.
However, building such a system requires considerable effort, especially to provide support for multiple programming languages.
To minimize the effort needed to build a more advanced chunker, we searched for existing solutions similar to the one implemented by Qodo. So far, we have identified two promising candidates: Chonkie and CintraAI.
We ran some initial tests and ultimately chose Chonkie, since CintraAI does not currently offer an installable package on PyPI.
Although the choice was partly forced, Chonkie proved to be a solid solution that respects code structure and offers multi-language support, aligning well with DAIV’s requirements. Chonkie does not provide retroactive processing, which would be a valuable addition for keeping even more of the important context together.
Conclusion
After evaluating available options and conducting internal tests, we finalized the following approach for DAIV's chunking system:
MarkdownHeaderTextSplitter is used for Markdown files, for better structural segmentation.
Chonkie is used for code, so that chunks respect code structure.
RecursiveCharacterTextSplitter is used when the language cannot be detected.
This strategy improves chunk quality, preserves code structure, and provides a robust fallback, positioning DAIV for more reliable embedding and retrieval workflows.
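The routing described in this conclusion can be sketched as a small dispatcher. This is an illustration under stated assumptions, not DAIV's actual code: the `Strategy` enum, the `choose_strategy` function, and the extension tables are hypothetical, and real language detection may rely on more than the file extension.

```python
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple


class Strategy(Enum):
    MARKDOWN_HEADERS = "MarkdownHeaderTextSplitter"
    CODE = "Chonkie"
    FALLBACK = "RecursiveCharacterTextSplitter"


# Hypothetical extension tables, kept short for illustration.
MARKDOWN_EXTENSIONS = {".md", ".markdown"}
CODE_LANGUAGES = {
    ".py": "python",
    ".js": "javascript",
    ".ts": "typescript",
    ".go": "go",
    ".rs": "rust",
}


def choose_strategy(path: str) -> Tuple[Strategy, Optional[str]]:
    """Pick a chunking strategy and the detected language for a file path."""
    suffix = Path(path).suffix.lower()
    if suffix in MARKDOWN_EXTENSIONS:
        return Strategy.MARKDOWN_HEADERS, "markdown"
    language = CODE_LANGUAGES.get(suffix)
    if language is not None:
        return Strategy.CODE, language
    # Language could not be detected: fall back to plain recursive splitting.
    return Strategy.FALLBACK, None


for name in ("README.md", "daiv/indexer.py", "Dockerfile"):
    strategy, language = choose_strategy(name)
    print(name, strategy.value, language)
```

Keeping the decision in one function means the fallback path is always reachable: any file whose language is not recognized still gets chunked, just without structural awareness.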