
chore(deps): bump text-splitter from 0.6.3 to 0.10.0 #128

Closed
dependabot[bot] wants to merge 1 commit from dependabot/cargo/text-splitter-0.10.0

Conversation


dependabot[bot] commented on behalf of GitHub · Apr 8, 2024

Bumps text-splitter from 0.6.3 to 0.10.0.

Release notes

Sourced from text-splitter's releases.

v0.10.0

Breaking Changes

Improved (but different) Markdown split points #137. In hindsight, the levels used for determining split points in Markdown text were too granular, which led to some strange split points. Several element types were consolidated into the same levels, which should still provide a good balance between splitting at the right points and not splitting too often.

Because the output of the MarkdownSplitter will be substantially different, especially for smaller chunk sizes, this is considered a breaking change.
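To get a sense of the impact before upgrading, a minimal sketch comparing MarkdownSplitter output across the bump is shown below. The import path, constructor, and chunks method are assumed from this release line's Python bindings; verify them against your installed version.

    # Sketch: the same input can produce noticeably different chunk boundaries
    # under v0.10.0 than under v0.9.x, especially at small chunk capacities.
    # (Import path and method names are assumptions, not confirmed API.)
    from semantic_text_splitter import MarkdownSplitter

    splitter = MarkdownSplitter()
    md = "# Title\n\nIntro paragraph.\n\n## Section\n\n- item one\n- item two\n"

    # Run this under both versions and diff the printed chunks to gauge
    # how the consolidated split levels affect your documents.
    for chunk in splitter.chunks(md, 40):
        print(repr(chunk))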

Full Changelog: benbrandt/text-splitter@v0.9.1...v0.10.0

v0.9.1

What's Changed

Python TextSplitter and MarkdownSplitter now both provide a new chunk_indices method that returns a list of not only the chunks, but also their corresponding character offsets relative to the original text. This should allow for different string comparison and matching operations on the chunks.

def chunk_indices(
    self, text: str, chunk_capacity: Union[int, Tuple[int, int]]
) -> List[Tuple[int, str]]:
    ...

A similar method already existed on the Rust side. The key difference is that these offsets are character offsets, not byte offsets. For Rust strings it is usually more helpful to have the byte offset, but in Python, most string methods and operations deal with character indices.
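A hedged usage sketch follows; the CharacterTextSplitter class name here is an assumption based on this release line's bindings, so substitute whichever splitter class your version exposes.

    # Each returned tuple pairs a character offset with the chunk starting there.
    from semantic_text_splitter import CharacterTextSplitter  # name assumed

    splitter = CharacterTextSplitter()
    text = "Some long document text that needs to be split into chunks."

    # Per the signature above, chunk_capacity may be a single int or a
    # (min, max) tuple.
    for offset, chunk in splitter.chunk_indices(text, 20):
        # Offsets are character-based, so plain slicing recovers each chunk.
        assert text[offset:offset + len(chunk)] == chunk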

by @benbrandt in benbrandt/text-splitter#135

Full Changelog: benbrandt/text-splitter@v0.9.0...v0.9.1

v0.9.0

What's New

More robust handling of Hugging Face tokenizers as chunk sizers.

  • Tokenizers with padding enabled no longer count padding tokens when generating chunks. This caused some unexpected behavior, especially if the chunk capacity didn't perfectly line up with the padding size(s). Now, the tokenizer's padding token is ignored when counting the number of tokens generated in a chunk.
  • In the process, it also became clear there were some false assumptions about how the byte offset ranges were calculated for each token. This has been fixed, and the byte offset ranges should now be more accurate when determining the boundaries of each token. This only affects some optimizations in chunk sizing, and should not affect the actual chunk output.

Breaking Changes

There should only be breaking chunk output for those of you using a Hugging Face tokenizer with padding enabled. Because padding tokens are no longer counted, the chunks will likely be larger than before, and closer to the desired behavior.

Note: This will mean the generated chunks may also be larger than the chunk capacity when tokenized, because padding tokens will be added when you tokenize the chunk. The chunk capacity for these tokenizers reflects the number of tokens used in the text, not necessarily the number of tokens that the tokenizer will generate in total.
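To see why counting padded ids was misleading, here is a small illustration using the Hugging Face tokenizers library directly (the model name and padding length are arbitrary for the example):

    # With padding enabled, the raw id count overstates how many tokens the
    # text itself uses; the attention mask counts only the real tokens.
    from tokenizers import Tokenizer

    tok = Tokenizer.from_pretrained("bert-base-cased")  # downloads from the Hub
    tok.enable_padding(length=16)  # pad every encoding to 16 tokens

    enc = tok.encode("a short chunk")
    print(len(enc.ids))             # 16, inflated by padding tokens
    print(sum(enc.attention_mask))  # only the tokens the text actually uses

As of v0.9.0, the splitter's chunk sizing behaves like the second count: padding tokens are ignored.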

Full Changelog: benbrandt/text-splitter@v0.8.1...v0.9.0

v0.8.1

What's New

  • Updates to documentation and examples.

... (truncated)

Changelog

Sourced from text-splitter's changelog.

v0.10.0

Breaking Changes

Improved (but different) Markdown split points #137. In hindsight, the levels used for determining split points in Markdown text were too granular, which led to some strange split points. Many more element types were consolidated into the same levels, which should still provide a good balance between splitting at the right points and not splitting too often.

Because the output of the MarkdownSplitter will be substantially different, especially for smaller chunk sizes, this is considered a breaking change.

v0.9.1

What's New

Python TextSplitter and MarkdownSplitter now both provide a new chunk_indices method that returns a list of not only the chunks, but also their corresponding character offsets relative to the original text. This should allow for different string comparison and matching operations on the chunks.

def chunk_indices(
    self, text: str, chunk_capacity: Union[int, Tuple[int, int]]
) -> List[Tuple[int, str]]:
    ...

A similar method already existed on the Rust side. The key difference is that these offsets are character offsets, not byte offsets. For Rust strings it is usually more helpful to have the byte offset, but in Python, most string methods and operations deal with character indices.

v0.9.0

What's New

More robust handling of Hugging Face tokenizers as chunk sizers.

  • Tokenizers with padding enabled no longer count padding tokens when generating chunks. This caused some unexpected behavior, especially if the chunk capacity didn't perfectly line up with the padding size(s). Now, the tokenizer's padding token is ignored when counting the number of tokens generated in a chunk.
  • In the process, it also became clear there were some false assumptions about how the byte offset ranges were calculated for each token. This has been fixed, and the byte offset ranges should now be more accurate when determining the boundaries of each token. This only affects some optimizations in chunk sizing, and should not affect the actual chunk output.

Breaking Changes

There should only be breaking chunk output for those of you using a Hugging Face tokenizer with padding enabled. Because padding tokens are no longer counted, the chunks will likely be larger than before, and closer to the desired behavior.

Note: This will mean the generated chunks may also be larger than the chunk capacity when tokenized, because padding tokens will be added when you tokenize the chunk. The chunk capacity for these tokenizers reflects the number of tokens used in the text, not necessarily the number of tokens that the tokenizer will generate in total.

v0.8.1

What's New

  • Updates to documentation and examples.
  • Update pyo3 to 0.21.0 in Python package, which should bring some performance improvements.

v0.8.0

What's New

... (truncated)

Commits
  • dee0ec6 cargo update
  • 08187d3 Update version and changelog
  • 4eabfec feat!: Consolidate Markdown split levels
  • 17bc95a fix: try to make sure the CI isn't using the pip index/cache when installing ...
  • cd4ff47 Prep 0.9.1 release
  • 834d567 Python splitters optionally provide chunk char offsets
  • ae9730c chore: cargo update
  • 716590c Prep 0.9.0 release
  • 3afe119 fix: unneeded details tag in the readme
  • 0e63788 Bump pulldown-cmark from 0.10.0 to 0.10.2
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [text-splitter](https://github.com/benbrandt/text-splitter) from 0.6.3 to 0.10.0.
- [Release notes](https://github.com/benbrandt/text-splitter/releases)
- [Changelog](https://github.com/benbrandt/text-splitter/blob/main/CHANGELOG.md)
- [Commits](benbrandt/text-splitter@v0.6.3...v0.10.0)

---
updated-dependencies:
- dependency-name: text-splitter
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
dependabot[bot] added the dependencies label (Pull requests that update a dependency file) on Apr 8, 2024

dependabot[bot] commented on behalf of GitHub · Apr 22, 2024

Superseded by #133.

dependabot[bot] closed this on Apr 22, 2024
dependabot[bot] deleted the dependabot/cargo/text-splitter-0.10.0 branch on April 22, 2024 at 22:47