feat: add hybrid chunker #68

vagenas · 2024-11-19T23:37:35Z

Known points to fix & improve

Planned for this PR

- The last chunk is dropped due to a bug in _merge_chunks_with_matching_metadata().
- transformers was introduced as a docling-core dependency.
- _split_by_doc_items() currently assumes that every DocItem has a .text attribute, i.e. that it is a TextItem.
- Unit tests are needed.

TBD

- Too many tokens are allowed when another delimiter, such as "####", is used.
- Table splitting is currently text-based; it could be optimized, e.g. by splitting on rows or cells.
- Serialization logic is duplicated between the new chunker implementation and BaseChunker's serialize(); the former should reuse the latter.
- The new chunker implementation needs general refactoring.
- Chunks are currently split in the middle of sentences; clarify whether that is an issue.
- Use of the external library semchunk is to be clarified.
- Minor: streaming is superficial.
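
The description does not say where the delimiter-related overshoot comes from. One plausible mechanism, sketched below purely as an illustration, is a budget check that counts the tokens of the parts but not of the delimiter that joins them. The whitespace tokenizer and the helper names (count_tokens, fits, fits_ignoring_delim) are assumptions for this sketch, not docling-core code.

```python
def count_tokens(text: str) -> int:
    # Toy tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def fits(headings: list, text: str, max_tokens: int, delim: str = "\n") -> bool:
    # Budget check on the full serialization, delimiter included.
    return count_tokens(delim.join([*headings, text])) <= max_tokens


def fits_ignoring_delim(headings: list, text: str, max_tokens: int) -> bool:
    # Naive check: sums the parts and ignores any tokens the delimiter adds.
    return sum(count_tokens(part) for part in [*headings, text]) <= max_tokens


headings, text = ["Intro"], "one two three"
naive_ok = fits_ignoring_delim(headings, text, max_tokens=4)
newline_ok = fits(headings, text, max_tokens=4, delim="\n")
hashes_ok = fits(headings, text, max_tokens=4, delim=" #### ")
```

With a newline delimiter both checks agree, whereas with a delimiter that itself tokenizes to one or more tokens the naive check accepts a chunk that is actually over budget.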

vagenas · 2024-12-03T22:05:17Z

Converting to draft until all points planned for this PR are addressed (see description).

vagenas · 2024-12-03T23:06:43Z

_split_by_doc_items() currently assumes DocItem has a .text i.e. assumes TextItem

No direct impact after all: the method only splits DocChunks that contain multiple DocItems (e.g. lists). Tables, on the other hand, are not part of such upstream DocChunks (from HierarchicalChunker) and are already handled without problems; any splitting they need is done on text, i.e. by another method.

vagenas · 2024-12-06T07:59:03Z

Main changes with this PR:

Adds a new chunker, currently named TokenAwareChunker, which takes a hybrid approach: it applies tokenization-aware refinements on top of document-layout-based (a.k.a. hierarchical) chunking. More precisely, it starts from the result of the hierarchical chunker and, based on the input tokenizer (typically aligned with the embedding model's tokenizer), it:
- does one pass in which it splits chunks only when needed (i.e. when oversized w.r.t. tokens), and
- does another pass in which it merges chunks only when possible (i.e. undersized successive chunks with the same headings & captions); users can opt out of this step via the param merge_peers (True by default).

Extends the BaseChunker interface with a new method, currently named serialize(), which returns the enriched chunk serialization; by default this is a concatenation of the relevant metadata (headings & captions) with the chunk's .text. This is the representation that should meet the token limits applied by TokenAwareChunker.
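
The two passes and the serialization they measure can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's implementation: the Chunk dataclass, the whitespace-based count_tokens, the word-level splitting, and all function names are hypothetical stand-ins; only the merge_peers parameter name and its default come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Chunk:
    # Hypothetical stand-in for a chunk produced by hierarchical chunking.
    text: str
    headings: list[str] = field(default_factory=list)
    captions: list[str] = field(default_factory=list)


def serialize(chunk: Chunk, delim: str = "\n") -> str:
    # Enriched representation: headings and captions followed by the text.
    return delim.join([*chunk.headings, *chunk.captions, chunk.text])


def count_tokens(text: str) -> int:
    # Toy tokenizer (assumption): one token per whitespace-separated word.
    return len(text.split())


def split_oversized(chunk: Chunk, max_tokens: int) -> list[Chunk]:
    # Pass 1: split only when the serialized chunk exceeds the budget.
    if count_tokens(serialize(chunk)) <= max_tokens:
        return [chunk]
    overhead = count_tokens(serialize(Chunk("", chunk.headings, chunk.captions)))
    budget = max(1, max_tokens - overhead)  # tokens left for the text itself
    words = chunk.text.split()
    return [
        Chunk(" ".join(words[i:i + budget]), chunk.headings, chunk.captions)
        for i in range(0, len(words), budget)
    ]


def merge_undersized(chunks: list[Chunk], max_tokens: int) -> list[Chunk]:
    # Pass 2: merge successive chunks with identical metadata while they fit.
    merged: list[Chunk] = []
    for chunk in chunks:
        if merged:
            prev = merged[-1]
            same_meta = (prev.headings, prev.captions) == (chunk.headings, chunk.captions)
            candidate = Chunk(prev.text + " " + chunk.text, prev.headings, prev.captions)
            if same_meta and count_tokens(serialize(candidate)) <= max_tokens:
                merged[-1] = candidate
                continue
        # Appending here on every non-merge keeps the final chunk in the output.
        merged.append(chunk)
    return merged


def hybrid_chunk(chunks: list[Chunk], max_tokens: int, merge_peers: bool = True) -> list[Chunk]:
    pieces = [p for c in chunks for p in split_oversized(c, max_tokens)]
    return merge_undersized(pieces, max_tokens) if merge_peers else pieces


demo = hybrid_chunk(
    [Chunk("a b", ["H"]), Chunk("c d", ["H"]), Chunk("e f g h i j k l", ["H2"])],
    max_tokens=5,
)
```

In this toy run the two chunks under heading "H" are merged into one, while the oversized chunk under "H2" is split in two, and every resulting chunk stays within the budget when measured on its serialized form rather than on its raw text.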

👉 Any better naming ideas, particularly for the new chunker, are welcome. Some options in no particular order:

HybridChunker
TokenizedLayoutChunker
TokenizedHierarchicalChunker (although it's technically not a subclass of HierarchicalChunker)
AdaptiveChunker

vagenas · 2024-12-06T13:55:21Z

Finally went for HybridChunker as that appears to be the standard industry term for this approach, see e.g.:

vagenas force-pushed the expand-chunking branch from 8791265 to 816c779 December 3, 2024 12:39

vagenas changed the title ~~feat: expand chunking~~ feat: add token-aware chunker Dec 3, 2024

vagenas marked this pull request as ready for review December 3, 2024 13:17

vagenas requested review from dolfim-ibm and ceberam December 3, 2024 13:28

dolfim-ibm reviewed Dec 3, 2024

pyproject.toml (review thread outdated and resolved)

PeterStaar-IBM self-requested a review December 3, 2024 15:58

vagenas requested a review from jwm4 December 3, 2024 17:43

vagenas marked this pull request as draft December 3, 2024 22:05

vagenas force-pushed the expand-chunking branch from c6c54d6 to a9f08e9 December 5, 2024 09:55

vagenas mentioned this pull request Dec 5, 2024

Update advanced_chunking_with_merging.ipynb DS4SD/docling#501

Closed

vagenas and others added 9 commits December 5, 2024 14:44

feat: expand chunking

7b29d0a

Signed-off-by: Panos Vagenas <[email protected]>

add TokenAwareChunker, add serialize to chunkers

7e4c882

Co-authored-by: Bill Murdock <[email protected]> Signed-off-by: Panos Vagenas <[email protected]>

restore inadvertently removed lines

92e510e

Signed-off-by: Panos Vagenas <[email protected]>

factor out new chunker deps as extra

72ef9ea

Signed-off-by: Panos Vagenas <[email protected]>

fix handling of last chunk, minor improvements

e3fabb6

Signed-off-by: Panos Vagenas <[email protected]>

add unit tests

852fdcb

Signed-off-by: Panos Vagenas <[email protected]>

add import error handling

71a041b

Signed-off-by: Panos Vagenas <[email protected]>

fix headings bug (from DS4SD/docling#501)

56bbba8

Co-authored-by: Ben Rood <[email protected]> Signed-off-by: Panos Vagenas <[email protected]>

allow tokenizer name or path besides instance

f3064e8

Signed-off-by: Panos Vagenas <[email protected]>

vagenas force-pushed the expand-chunking branch from e5f5a08 to f3064e8 December 5, 2024 13:54

minor docstring improvements

5511bfb

Signed-off-by: Panos Vagenas <[email protected]>

vagenas marked this pull request as ready for review December 6, 2024 07:38

vagenas requested a review from dolfim-ibm December 6, 2024 07:59

loosen transformers version

0326617

Signed-off-by: Panos Vagenas <[email protected]>

dolfim-ibm previously approved these changes Dec 6, 2024

rename to hybrid chunker

7e04acc

Signed-off-by: Panos Vagenas <[email protected]>

vagenas dismissed dolfim-ibm’s stale review via 7e04acc December 6, 2024 13:45

vagenas requested a review from dolfim-ibm December 6, 2024 13:50

vagenas changed the title ~~feat: add token-aware chunker~~ feat: add hybrid chunker Dec 6, 2024

dolfim-ibm approved these changes Dec 6, 2024

vagenas merged commit 628ab67 into main Dec 6, 2024
7 checks passed

vagenas deleted the expand-chunking branch December 6, 2024 13:56
