feat: add hybrid chunker #68

Merged
merged 12 commits into from
Dec 6, 2024

vagenas
Collaborator

@vagenas vagenas commented Nov 19, 2024

Known points to fix & improve

Planned for this PR

  • the last chunk is dropped due to a bug in _merge_chunks_with_matching_metadata()
  • this PR introduces transformers as a docling-core dependency
  • _split_by_doc_items() currently assumes every DocItem has a .text attribute, i.e. that it is a TextItem
  • unit tests are needed

TBD

  • we allow too many tokens when a different delimiter, e.g. "####", is used
  • table splitting is currently text-based; it could be optimized, e.g. by splitting on rows or cells
  • serialization logic is duplicated between the new chunker implementation and BaseChunker's serialize(); the former should reuse the latter
  • the new chunker implementation needs general refactoring
  • we currently do split in the middle of sentences; to be clarified whether this is an issue
  • use of the external library semchunk; to be clarified
  • minor: streaming is superficial
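
To make the row-based table splitting idea above concrete, here is a minimal sketch. It is not part of this PR: the function name split_table_by_rows, the plain-string row representation, and the whitespace token counter are all assumptions made for illustration. It groups whole rows under a token budget and repeats the header in every piece.

```python
from typing import Callable, List


def split_table_by_rows(
    header: str,
    rows: List[str],
    max_tokens: int,
    # Assumption: a toy whitespace counter; a real setup would count
    # tokens with the embedding model's tokenizer instead.
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Hypothetical helper: pack whole rows into pieces of at most max_tokens.

    The header is repeated in every piece so each piece stays self-describing.
    A single row that exceeds the budget on its own is kept whole here; a
    cell-level fallback would be the natural next refinement.
    """
    pieces: List[str] = []
    current: List[str] = []
    for row in rows:
        candidate = "\n".join([header, *current, row])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this row would overflow: close the current piece first.
            pieces.append("\n".join([header, *current]))
            current = [row]
        else:
            current.append(row)
    if current:
        # Flush the trailing group so the last rows are not lost.
        pieces.append("\n".join([header, *current]))
    return pieces


pieces = split_table_by_rows(
    "name qty", ["apple 1", "pear 2", "plum 3"], max_tokens=6
)
```

With the toy counter, the three rows above end up in two pieces, each starting with the header line.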

@vagenas vagenas changed the title feat: expand chunking feat: add token-aware chunker Dec 3, 2024
@vagenas vagenas marked this pull request as ready for review December 3, 2024 13:17
@vagenas vagenas requested review from dolfim-ibm and ceberam December 3, 2024 13:28
@PeterStaar-IBM PeterStaar-IBM self-requested a review December 3, 2024 15:58
@vagenas vagenas requested a review from jwm4 December 3, 2024 17:43

vagenas commented Dec 3, 2024

Converting to draft until all points planned for this PR are addressed (see description).

@vagenas vagenas marked this pull request as draft December 3, 2024 22:05

vagenas commented Dec 3, 2024

_split_by_doc_items() currently assumes DocItem has a .text i.e. assumes TextItem

No direct impact after all, as the method only splits DocChunks that contain multiple DocItems (e.g. lists). Tables, on the other hand, are not part of such upstream DocChunks (from HierarchicalChunker) and are already handled without problems: any splitting they need is text-based, i.e. done by another method.

vagenas and others added 9 commits December 5, 2024 14:44
Signed-off-by: Panos Vagenas <[email protected]>
Co-authored-by: Bill Murdock <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Co-authored-by: Ben Rood <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
Signed-off-by: Panos Vagenas <[email protected]>
@vagenas vagenas marked this pull request as ready for review December 6, 2024 07:38

vagenas commented Dec 6, 2024

Main changes with this PR:

  1. Adds a new chunker, currently named TokenAwareChunker, which takes a hybrid approach: it applies tokenization-aware refinements on top of document-layout-based (AKA hierarchical) chunking. More precisely, it starts from the result of the hierarchical chunker and, based on the input tokenizer (which should typically match the embedding model's tokenizer):
    • does one pass where it splits chunks only when needed (i.e. when oversized w.r.t. tokens), and
    • does another pass where it merges chunks only when possible (i.e. undersized successive chunks with the same headings & captions); users can opt out of this step via the param merge_peers (True by default)
  2. Extends the BaseChunker interface with a new method, currently named serialize(), which returns the enriched chunk serialization, by default a concatenation of the relevant metadata (headings & captions) with the chunk's .text. This is the representation that should meet the token limits applied by TokenAwareChunker.
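
For readers who want a feel for the two passes, the following is a simplified sketch and not the code in this PR. The Chunk dataclass, the helper names, and the whitespace token counter are assumptions for illustration; only the merge_peers parameter name and the idea of checking the limit against the serialized form (metadata plus text) come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chunk:
    """Hypothetical stand-in for a chunk; not the docling-core DocChunk."""
    text: str
    headings: List[str] = field(default_factory=list)
    captions: List[str] = field(default_factory=list)


def serialize(chunk: Chunk, delim: str = "\n") -> str:
    """Enriched form: headings and captions concatenated with the text."""
    return delim.join([*chunk.headings, *chunk.captions, chunk.text])


def count_tokens(text: str) -> int:
    """Toy whitespace counter (assumption); swap in the real tokenizer."""
    return len(text.split())


def split_oversized(chunks: List[Chunk], max_tokens: int) -> List[Chunk]:
    """Pass 1: split only chunks whose serialized form is too long."""
    out: List[Chunk] = []
    for chunk in chunks:
        if count_tokens(serialize(chunk)) <= max_tokens:
            out.append(chunk)
            continue
        # Metadata is repeated in every piece, so it eats into the budget.
        meta = serialize(Chunk("", chunk.headings, chunk.captions))
        budget = max(1, max_tokens - count_tokens(meta))
        words = chunk.text.split()
        for i in range(0, len(words), budget):
            piece = " ".join(words[i:i + budget])
            out.append(Chunk(piece, chunk.headings, chunk.captions))
    return out


def merge_undersized_peers(chunks: List[Chunk], max_tokens: int) -> List[Chunk]:
    """Pass 2: merge successive chunks with equal metadata when they fit."""
    out: List[Chunk] = []
    for chunk in chunks:
        if out:
            prev = out[-1]
            same_meta = (prev.headings == chunk.headings
                         and prev.captions == chunk.captions)
            merged = Chunk(prev.text + " " + chunk.text,
                           prev.headings, prev.captions)
            if same_meta and count_tokens(serialize(merged)) <= max_tokens:
                out[-1] = merged
                continue
        out.append(chunk)
    return out


def hybrid_chunk(chunks: List[Chunk], max_tokens: int,
                 merge_peers: bool = True) -> List[Chunk]:
    """Run the split pass, then optionally the merge pass."""
    result = split_oversized(chunks, max_tokens)
    return merge_undersized_peers(result, max_tokens) if merge_peers else result


source = [Chunk("a b c d e f", ["H"]), Chunk("g", ["H"]), Chunk("h", ["X"])]
result = hybrid_chunk(source, max_tokens=5)
```

In this toy run, the first chunk is split because its serialization exceeds five tokens, its short tail is then merged with the following chunk under the same heading, and the chunk under a different heading is left alone.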

👉 Any better naming ideas, particularly for the new chunker, are welcome. Some options in no particular order:

  • HybridChunker
  • TokenizedLayoutChunker
  • TokenizedHierarchicalChunker (although it's technically not a subclass of HierarchicalChunker)
  • AdaptiveChunker

@vagenas vagenas requested a review from dolfim-ibm December 6, 2024 07:59
Signed-off-by: Panos Vagenas <[email protected]>
dolfim-ibm previously approved these changes Dec 6, 2024
Signed-off-by: Panos Vagenas <[email protected]>
@vagenas vagenas requested a review from dolfim-ibm December 6, 2024 13:50
@vagenas vagenas changed the title feat: add token-aware chunker feat: add hybrid chunker Dec 6, 2024

vagenas commented Dec 6, 2024

Finally went for HybridChunker as that appears to be the standard industry term for this approach, see e.g.:

@vagenas vagenas merged commit 628ab67 into main Dec 6, 2024
7 checks passed
@vagenas vagenas deleted the expand-chunking branch December 6, 2024 13:56