Hey, super useful tool!
There's been some development in the chunking community. If you'd like to keep your app up to date, here are a few suggestions. Also, considering that all of the current options struggle to identify sentence boundaries correctly (quickly tested with some texts) and tend to chop off parts, it would be nice to have more choice.
Python
- https://github.com/benbrandt/text-splitter - Python API for a Rust package; at some point it should also be available in JS via WebAssembly. It's my personal preference at the moment, as it yields "human-like" chunks.
- https://github.com/umarbutler/semchunk - claims to be faster; I haven't tested it enough yet to evaluate.
JS
- https://github.com/askorama/chunker - haven't tested it yet; looks like a very simple tool with no documentation, afaik.
- https://gist.github.com/hanxiao/3f60354cf6dc5ac698bc9154163b4e6a - JinaAI tokenizer. See the LinkedIn post here and read the first comment for some exceptions; haven't tested it yet.
Another idea would be an option that accepts any user-supplied regex, like we did in SemanticFinder. I tried to come up with a good regex for sentence boundaries, but it's incredibly hard.
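To make the regex idea concrete, here's a rough Python sketch of what such an option could look like. None of this comes from this app or from SemanticFinder: the names `DEFAULT_BOUNDARY`, `split_by_regex` and `pack_chunks` are made up for illustration, and the default pattern is a deliberately naive assumption, not a recommended sentence-boundary regex.

```python
import re

# Hypothetical default: split on whitespace that follows ., ! or ?
# and precedes an uppercase letter or an opening quote. Deliberately naive.
DEFAULT_BOUNDARY = r"(?<=[.!?])\s+(?=[A-Z\"'])"


def split_by_regex(text: str, pattern: str = DEFAULT_BOUNDARY) -> list[str]:
    """Split text wherever the user-supplied pattern matches.

    Note: if the pattern contains capture groups, re.split also returns
    the captured separators as pieces of their own.
    """
    pieces = re.split(pattern, text.strip())
    # Drop empty strings (and None from unmatched optional groups).
    return [p for p in pieces if p]


def pack_chunks(pieces: list[str], max_chars: int) -> list[str]:
    """Greedily join consecutive pieces into chunks of at most max_chars.

    A single piece longer than max_chars is kept whole instead of being cut.
    """
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "Dr. Smith arrived. He was late! Was it raining?"
    # The naive pattern also splits after the abbreviation "Dr.",
    # which is exactly the kind of edge case that makes this hard.
    print(split_by_regex(sample))
    print(pack_chunks(split_by_regex(sample), max_chars=30))
```

Letting users pass their own `pattern` means they can work around edge cases like abbreviations for their own texts, instead of the tool having to ship one regex that handles everything.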