-
The current notebook covers splitting oversized chunks so that they fit within a max number of tokens. If one also wants to merge "undersized" chunks so that they better fit/approach that limit, the most straightforward case to consider is merging consecutive chunks of the same context, i.e. the same "headings" and "captions" metadata in our example case. Wrapping any such implementation as a …
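A minimal sketch of that merging idea, in case it helps discussion. All names here (`Chunk`, `count_tokens`, `merge_undersized`) are hypothetical, not Docling API; it just greedily merges neighbors that share the same context while staying under the token budget:

```python
# Hypothetical sketch, not Docling API: greedily merge consecutive chunks
# that share the same "headings"/"captions" metadata while the combined
# text stays within the token budget.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chunk:  # illustrative stand-in for a real chunk type
    text: str
    headings: tuple[str, ...]
    captions: tuple[str, ...]

def merge_undersized(
    chunks: list[Chunk],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> list[Chunk]:
    merged: list[Chunk] = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.headings == chunk.headings
            and prev.captions == chunk.captions
            and count_tokens(prev.text + "\n" + chunk.text) <= max_tokens
        ):
            # Same context and still under budget: extend the previous chunk.
            merged[-1] = Chunk(prev.text + "\n" + chunk.text, chunk.headings, chunk.captions)
        else:
            merged.append(chunk)
    return merged

# e.g.: merge_undersized(chunks, count_tokens=lambda t: len(t.split()), max_tokens=512)
```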
-
I made a PR with an alternative notebook that does address the issue of merging within sections. As noted in the PR description, the major differences in the version I produced are: …
Of these, I think 1, 2, and 4 are probably improvements over the version that @vagenas proposed. However, 3 should probably be undone, since the titles will be in the headers list soon. And 5 is clearly a way in which the version I am proposing is worse than the one @vagenas proposed, but I am not sure whether it is important enough to address. There are also lots of other minor technical differences (e.g., I have my own subclasses of BaseChunk and BaseMeta because I couldn't find a way to construct instances of the ones in the product; see the sketch below) that would be good to get resolved.
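Roughly, the stand-in subclasses I mean are shaped like this (illustrative pydantic models with assumed field names, not docling-core's actual classes):

```python
# Illustrative stand-ins only, not docling-core's shipped classes:
# a chunk-with-metadata pair as plain pydantic models.
from typing import Optional

from pydantic import BaseModel

class MyChunkMeta(BaseModel):
    headings: Optional[list[str]] = None
    captions: Optional[list[str]] = None

class MyChunk(BaseModel):
    text: str
    meta: MyChunkMeta
```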
-
To illustrate my point 2 above, I went ahead and updated the notebook in my PR with new … You can see the latest draft here. The latest draft was rebased on the one that @vagenas put in his PR, and it includes that addition. Also, I dropped the pip install of …
-
I removed the use of DoclingDocument.name (which I was assuming to be the title) from my version of the notebook. As discussed above, that didn't turn out to be a good way to get a document title after all.
-
@jwm4 (cc @ceberam) Regarding Semchunk: it looks like too thin a layer to warrant an additional dependency. Perhaps we can figure out how to incorporate some of its basic ideas instead? Regarding the final outcome: …
-
We can leverage the frameworks, since they have implementations of semantic chunking using embedding models.
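For example, a sketch with LangChain's experimental SemanticChunker (assuming the langchain-experimental and langchain-openai packages and an OpenAI API key; any other embeddings class would work the same way):

```python
# Semantic chunking sketch: split where embedding similarity between
# consecutive sentences drops below a percentile threshold.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # assumes OPENAI_API_KEY is set

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
)
docs = splitter.create_documents([long_text])  # `long_text` defined elsewhere
```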
-
In terms of chunking approaches, there are various options one can consider, e.g. fixed-size chunking, document-based chunking, and others (example outline here).
Docling currently provides the HierarchicalChunker, which follows a document-based approach, i.e. it splits as dictated by the upstream document format. At the same time, it exposes various metadata that the user can include as additional context for the embedding or generation model, and also use as a source of grounding.
The exact metadata to be included in the final text passed to the (embedding or generation) model is application-dependent and is therefore not prescribed by the HierarchicalChunker.
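A minimal usage sketch, assuming the docling / docling-core APIs at the time of writing (the input file name is a placeholder):

```python
# Convert a document and chunk it hierarchically; each chunk carries
# metadata (e.g. headings) alongside its text.
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

doc = DocumentConverter().convert("report.pdf").document  # placeholder input
for chunk in HierarchicalChunker().chunk(dl_doc=doc):
    print(chunk.meta.headings, "->", chunk.text[:60])
```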
As an illustrative example of post-processing steps such as: …
we have prepared an example that shows how to introduce a max-token limit and split chunks that exceed it:
👉 notebook here
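The gist of that post-processing, as a sketch (not the notebook's exact code; the tokenizer model is an arbitrary choice here):

```python
# Split any chunk whose tokenized length exceeds MAX_TOKENS into
# token-window slices.
from transformers import AutoTokenizer

MAX_TOKENS = 512
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def split_oversized(text: str) -> list[str]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= MAX_TOKENS:
        return [text]
    return [
        tokenizer.decode(ids[i : i + MAX_TOKENS])
        for i in range(0, len(ids), MAX_TOKENS)
    ]
```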
Of course, if the user is already using an LLM application framework like LlamaIndex or LangChain, they can also tap into the wide range of node parsers/splitters and postprocessing components already available in those libraries, as already showcased in our examples (LlamaIndex here & LangChain here).
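For instance, a token-aware splitter from LlamaIndex is one import away (a sketch; `documents` is assumed to be loaded elsewhere):

```python
# Token-aware splitting via LlamaIndex's SentenceSplitter.
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
```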