-
The current notebook covers splitting oversized chunks so that they fit within a max number of tokens. If one also wants to merge "undersized" chunks so that they better fit/approach that limit, the most straightforward case to consider is merging consecutive chunks of the same context, i.e. the same "headings" and "captions" metadata in our example case. Wrapping any such implementation as a …
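A minimal sketch of that merging idea, in case it helps discussion. All names here (`Chunk`, `count_tokens`, `merge_undersized`) are hypothetical, not Docling API; it just greedily merges neighbors that share the same context while staying under the token budget:

```python
# Hypothetical sketch, not Docling API: greedily merge consecutive chunks
# that share the same "headings"/"captions" metadata while the combined
# text stays within the token budget.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chunk:  # illustrative stand-in for a real chunk type
    text: str
    headings: tuple[str, ...]
    captions: tuple[str, ...]

def merge_undersized(
    chunks: list[Chunk],
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> list[Chunk]:
    merged: list[Chunk] = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.headings == chunk.headings
            and prev.captions == chunk.captions
            and count_tokens(prev.text + "\n" + chunk.text) <= max_tokens
        ):
            # Same context and still under budget: extend the previous chunk.
            merged[-1] = Chunk(prev.text + "\n" + chunk.text, chunk.headings, chunk.captions)
        else:
            merged.append(chunk)
    return merged

# e.g.: merge_undersized(chunks, count_tokens=lambda t: len(t.split()), max_tokens=512)
```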
-
I made a PR with an alternative notebook that does address the issue of merging within sections. As noted in the PR description, the major differences in the version I produced are: …
Of these, I think 1, 2, and 4 are probably improvements over the version that @vagenas proposed. However, 3 should probably be undone, since the titles will be in the headers list soon. And 5 is clearly a way in which the version I am proposing is worse than the one @vagenas proposed, but I am not sure whether it is important enough to address. There are also lots of other minor technical differences (e.g., I have my own subclasses of BaseChunk and BaseMeta because I couldn't find a way to construct instances of the ones in the product; see the sketch below) that would be good to get resolved.
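Roughly, the stand-in subclasses I mean are shaped like this (illustrative pydantic models with assumed field names, not docling-core's actual classes):

```python
# Illustrative stand-ins only, not docling-core's shipped classes:
# a chunk-with-metadata pair as plain pydantic models.
from typing import Optional

from pydantic import BaseModel

class MyChunkMeta(BaseModel):
    headings: Optional[list[str]] = None
    captions: Optional[list[str]] = None

class MyChunk(BaseModel):
    text: str
    meta: MyChunkMeta
```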
-
To illustrate my point 2 above, I went ahead and updated the notebook in my PR with new … You can see the latest draft here. The latest draft was rebased on the one that @vagenas put in his PR, and it includes that addition. Also, I dropped the pip install of …
-
I removed the use of DoclingDocument.name (which I was assuming to be the title) from my version of the notebook. As discussed above, that didn't turn out to be a good way to get a document title after all.
-
@jwm4 (cc @ceberam) Regarding Semchunk: it looks like too thin a layer to warrant an additional dependency. Perhaps we can figure out how to incorporate some of its basic ideas instead? Regarding the final outcome: …
-
We can leverage the frameworks, since they have implementations of semantic chunking using embedding models.
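For example, a sketch with LangChain's experimental SemanticChunker (assuming the langchain-experimental and langchain-openai packages and an OpenAI API key; any other embeddings class would work the same way):

```python
# Semantic chunking sketch: split where embedding similarity between
# consecutive sentences drops below a percentile threshold.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings  # assumes OPENAI_API_KEY is set

splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
)
docs = splitter.create_documents([long_text])  # `long_text` defined elsewhere
```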
-
In terms of chunking approaches, there are various options one can consider, e.g. fixed-size chunking, document-based chunking, and others (example outline here).
Docling currently provides the HierarchicalChunker, which follows a document-based approach, i.e. it splits as dictated by the upstream document format. At the same time, it exposes various metadata that the user can include as additional context for the embedding or generation model, and also use as a source of grounding.
The exact metadata to be included in the final text passed to the (embedding or generation) model is application-dependent and is therefore not prescribed by the HierarchicalChunker.
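A minimal usage sketch, assuming the docling / docling-core APIs at the time of writing (the input file name is a placeholder):

```python
# Convert a document and chunk it hierarchically; each chunk carries
# metadata (e.g. headings) alongside its text.
from docling.document_converter import DocumentConverter
from docling_core.transforms.chunker import HierarchicalChunker

doc = DocumentConverter().convert("report.pdf").document  # placeholder input
for chunk in HierarchicalChunker().chunk(dl_doc=doc):
    print(chunk.meta.headings, "->", chunk.text[:60])
```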
As an illustrative example of post-processing steps such as: …
we have prepared an example that shows how to introduce a max-token limit and split chunks that exceed it:
👉 notebook here
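The gist of that post-processing, as a sketch (not the notebook's exact code; the tokenizer model is an arbitrary choice here):

```python
# Split any chunk whose tokenized length exceeds MAX_TOKENS into
# token-window slices.
from transformers import AutoTokenizer

MAX_TOKENS = 512
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def split_oversized(text: str) -> list[str]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    if len(ids) <= MAX_TOKENS:
        return [text]
    return [
        tokenizer.decode(ids[i : i + MAX_TOKENS])
        for i in range(0, len(ids), MAX_TOKENS)
    ]
```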
Of course, if the user is already using an LLM application framework like LlamaIndex or LangChain, they can also tap into the wide range of node parsers/splitters and postprocessing components already available in those libraries, as already showcased in our examples (LlamaIndex here & LangChain here).
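For instance, a token-aware splitter from LlamaIndex is one import away (a sketch; `documents` is assumed to be loaded elsewhere):

```python
# Token-aware splitting via LlamaIndex's SentenceSplitter.
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)
```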