
chore(deps): bump text-splitter from 0.6.3 to 0.10.0 #128

Closed
dependabot[bot] wants to merge 1 commit from dependabot/cargo/text-splitter-0.10.0

Conversation


dependabot[bot] commented on behalf of GitHub · Apr 8, 2024

Bumps text-splitter from 0.6.3 to 0.10.0.

Release notes

Sourced from text-splitter's releases.

v0.10.0

Breaking Changes

Improved (but different) Markdown split points #137. In hindsight, the levels used for determining split points in Markdown text were too granular, which led to some strange split points. Several element types were consolidated into the same levels, which should still provide a good balance between splitting at the right points and not splitting too often.

Because the output of the MarkdownSplitter will be substantially different, especially for smaller chunk sizes, this is considered a breaking change.
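To get a sense of the impact before upgrading, a minimal sketch comparing MarkdownSplitter output across the bump is shown below. The import path, constructor, and chunks method are assumed from this release line's Python bindings; verify them against your installed version.

    # Sketch: the same input can produce noticeably different chunk boundaries
    # under v0.10.0 than under v0.9.x, especially at small chunk capacities.
    # (Import path and method names are assumptions, not confirmed API.)
    from semantic_text_splitter import MarkdownSplitter

    splitter = MarkdownSplitter()
    md = "# Title\n\nIntro paragraph.\n\n## Section\n\n- item one\n- item two\n"

    # Run this under both versions and diff the printed chunks to gauge
    # how the consolidated split levels affect your documents.
    for chunk in splitter.chunks(md, 40):
        print(repr(chunk))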

Full Changelog: benbrandt/text-splitter@v0.9.1...v0.10.0

v0.9.1

What's Changed

Python TextSplitter and MarkdownSplitter now both provide a new chunk_indices method that returns a list of not only the chunks, but also their corresponding character offsets relative to the original text. This should allow for different string comparison and matching operations on the chunks.

def chunk_indices(
    self, text: str, chunk_capacity: Union[int, Tuple[int, int]]
) -> List[Tuple[int, str]]:
    ...

A similar method already existed on the Rust side. The key difference is that these offsets are character offsets, not byte offsets. For Rust strings it is usually more helpful to have the byte offset, but in Python, most string methods and operations deal with character indices.
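A hedged usage sketch follows; the CharacterTextSplitter class name here is an assumption based on this release line's bindings, so substitute whichever splitter class your version exposes.

    # Each returned tuple pairs a character offset with the chunk starting there.
    from semantic_text_splitter import CharacterTextSplitter  # name assumed

    splitter = CharacterTextSplitter()
    text = "Some long document text that needs to be split into chunks."

    # Per the signature above, chunk_capacity may be a single int or a
    # (min, max) tuple.
    for offset, chunk in splitter.chunk_indices(text, 20):
        # Offsets are character-based, so plain slicing recovers each chunk.
        assert text[offset:offset + len(chunk)] == chunk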

by @benbrandt in benbrandt/text-splitter#135

Full Changelog: benbrandt/text-splitter@v0.9.0...v0.9.1

v0.9.0

What's New

More robust handling of Hugging Face tokenizers as chunk sizers.

  • Tokenizers with padding enabled no longer count padding tokens when generating chunks. This caused some unexpected behavior, especially if the chunk capacity didn't perfectly line up with the padding size(s). Now, the tokenizer's padding token is ignored when counting the number of tokens generated in a chunk.
  • In the process, it also became clear there were some false assumptions about how the byte offset ranges were calculated for each token. This has been fixed, and the byte offset ranges should now be more accurate when determining the boundaries of each token. This only affects some optimizations in chunk sizing, and should not affect the actual chunk output.

Breaking Changes

There should only be breaking chunk output for those of you using a Hugging Face tokenizer with padding enabled. Because padding tokens are no longer counted, the chunks will likely be larger than before, and closer to the desired behavior.

Note: This will mean the generated chunks may also be larger than the chunk capacity when tokenized, because padding tokens will be added when you tokenize the chunk. The chunk capacity for these tokenizers reflects the number of tokens used in the text, not necessarily the number of tokens that the tokenizer will generate in total.
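To see why counting padded ids was misleading, here is a small illustration using the Hugging Face tokenizers library directly (the model name and padding length are arbitrary for the example):

    # With padding enabled, the raw id count overstates how many tokens the
    # text itself uses; the attention mask counts only the real tokens.
    from tokenizers import Tokenizer

    tok = Tokenizer.from_pretrained("bert-base-cased")  # downloads from the Hub
    tok.enable_padding(length=16)  # pad every encoding to 16 tokens

    enc = tok.encode("a short chunk")
    print(len(enc.ids))             # 16, inflated by padding tokens
    print(sum(enc.attention_mask))  # only the tokens the text actually uses

As of v0.9.0, the splitter's chunk sizing behaves like the second count: padding tokens are ignored.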

Full Changelog: benbrandt/text-splitter@v0.8.1...v0.9.0

v0.8.1

What's New

  • Updates to documentation and examples.

... (truncated)

Changelog

Sourced from text-splitter's changelog.

v0.10.0

Breaking Changes

Improved (but different) Markdown split points #137. In hindsight, the levels used for determining split points in Markdown text were too granular, which led to some strange split points. Many more element types were consolidated into the same levels, which should still provide a good balance between splitting at the right points and not splitting too often.

Because the output of the MarkdownSplitter will be substantially different, especially for smaller chunk sizes, this is considered a breaking change.

v0.9.1

What's New

Python TextSplitter and MarkdownSplitter now both provide a new chunk_indices method that returns a list of not only the chunks, but also their corresponding character offsets relative to the original text. This should allow for different string comparison and matching operations on the chunks.

def chunk_indices(
    self, text: str, chunk_capacity: Union[int, Tuple[int, int]]
) -> List[Tuple[int, str]]:
    ...

A similar method already existed on the Rust side. The key difference is that these offsets are character offsets, not byte offsets. For Rust strings it is usually more helpful to have the byte offset, but in Python, most string methods and operations deal with character indices.

v0.9.0

What's New

More robust handling of Hugging Face tokenizers as chunk sizers.

  • Tokenizers with padding enabled no longer count padding tokens when generating chunks. This caused some unexpected behavior, especially if the chunk capacity didn't perfectly line up with the padding size(s). Now, the tokenizer's padding token is ignored when counting the number of tokens generated in a chunk.
  • In the process, it also became clear there were some false assumptions about how the byte offset ranges were calculated for each token. This has been fixed, and the byte offset ranges should now be more accurate when determining the boundaries of each token. This only affects some optimizations in chunk sizing, and should not affect the actual chunk output.

Breaking Changes

There should only be breaking chunk output for those of you using a Hugging Face tokenizer with padding enabled. Because padding tokens are no longer counted, the chunks will likely be larger than before, and closer to the desired behavior.

Note: This will mean the generated chunks may also be larger than the chunk capacity when tokenized, because padding tokens will be added when you tokenize the chunk. The chunk capacity for these tokenizers reflects the number of tokens used in the text, not necessarily the number of tokens that the tokenizer will generate in total.

v0.8.1

What's New

  • Updates to documentation and examples.
  • Update pyo3 to 0.21.0 in Python package, which should bring some performance improvements.

v0.8.0

What's New

... (truncated)

Commits
  • dee0ec6 cargo update
  • 08187d3 Update version and changelog
  • 4eabfec feat!: Consolidate Markdown split levels
  • 17bc95a fix: try to make sure the CI isn't using the pip index/cache when installing ...
  • cd4ff47 Prep 0.9.1 release
  • 834d567 Python splitters optionally provide chunk char offsets
  • ae9730c chore: cargo update
  • 716590c Prep 0.9.0 release
  • 3afe119 fix: unneeded details tag in the readme
  • 0e63788 Bump pulldown-cmark from 0.10.0 to 0.10.2
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [text-splitter](https://github.com/benbrandt/text-splitter) from 0.6.3 to 0.10.0.
- [Release notes](https://github.com/benbrandt/text-splitter/releases)
- [Changelog](https://github.com/benbrandt/text-splitter/blob/main/CHANGELOG.md)
- [Commits](benbrandt/text-splitter@v0.6.3...v0.10.0)

---
updated-dependencies:
- dependency-name: text-splitter
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
dependabot[bot] added the dependencies label (Pull requests that update a dependency file) on Apr 8, 2024

dependabot[bot] commented on behalf of GitHub · Apr 22, 2024

Superseded by #133.

dependabot[bot] closed this on Apr 22, 2024
dependabot[bot] deleted the dependabot/cargo/text-splitter-0.10.0 branch on April 22, 2024 at 22:47