Contributor
Pull Request Overview
This PR adds performance improvements and benchmarks while also expanding the StringOffsets API with a new len() method. Key changes include:
- Introducing a len() method and adjusting UTF-8/UTF-16 offset conversions in StringOffsets.
- Refactoring the internal loop in new_converter to use a for loop instead of manual index incrementation.
- Replacing assert! calls with debug_assert! in BitRank for performance and updating benchmark configurations.
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| crates/string-offsets/src/lib.rs | Added len() method, modified offset conversion logic, and refactored character iteration. |
| crates/string-offsets/src/bitrank.rs | Changed assertion macros and adjusted BitRankBuilder initialization. |
| crates/string-offsets/benchmarks/performance.rs | Added benchmark suite for StringOffsets construction. |
| crates/string-offsets/Cargo.toml | Included criterion as a dependency and registered the new benchmark. |
| crates/bpe/benchmarks/performance.rs | Updated random number generation for performance benchmarks. |
Files not reviewed (1)
- crates/string-offsets/benchmarks/Cargo.toml.bak: Language not supported
Comments suppressed due to low confidence (2)
crates/string-offsets/src/lib.rs:371
- The loop now iterates over every byte instead of jumping by character length; please verify that this change correctly preserves multi-byte and invalid UTF-8 character handling as intended.
for i in 0..content.len() {
crates/string-offsets/src/bitrank.rs:51
- Switching from assert_eq! to debug_assert_eq! may let duplicate positions pass undetected in release builds; consider using assert_eq! if duplicate detection is critical for production correctness.
debug_assert_eq!(self.bits[chunk_idx] & mask, 0, "toggling bits off indicates that the original data was incorrect, most likely containing duplicate values.");
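The trade-off raised in the suppressed comment can be illustrated with a small sketch. `ToyBits` below is a hypothetical stand-in written for this example, not the crate's `BitRankBuilder`; it only assumes that bits are toggled with XOR, as the assertion message suggests.

```rust
/// Hypothetical toy bit set, used only for illustration.
/// It is NOT the crate's `BitRankBuilder`.
struct ToyBits {
    bits: Vec<u64>,
}

impl ToyBits {
    fn new(words: usize) -> Self {
        Self { bits: vec![0; words] }
    }

    /// Toggles the bit at `pos` (assumed XOR semantics).
    /// The check is compiled out when debug assertions are disabled,
    /// which is the default for `--release` builds.
    fn toggle(&mut self, pos: usize) {
        let chunk_idx = pos / 64;
        let mask = 1u64 << (pos % 64);
        debug_assert_eq!(self.bits[chunk_idx] & mask, 0, "duplicate position {}", pos);
        self.bits[chunk_idx] ^= mask;
    }

    fn is_set(&self, pos: usize) -> bool {
        self.bits[pos / 64] & (1u64 << (pos % 64)) != 0
    }
}
```

With debug assertions enabled, calling `toggle` twice with the same position panics. With them disabled, the second call silently clears the bit again. Using `assert_eq!` instead would keep the check in every build profile at the cost of one extra comparison per call.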
gorzell
approved these changes
Mar 25, 2025
- let mut utf8_builder = BitRankBuilder::with_capacity(n);
- let mut utf16_builder = BitRankBuilder::with_capacity(n);
- let mut line_builder = BitRankBuilder::with_capacity(n);
+ let mut utf8_builder = BitRankBuilder::with_capacity(n + 1);
Collaborator
Author
I moved all bits in this bitrank by 1. As a result, I need one more bit at the end...
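The arithmetic behind the extra bit can be sketched as follows. This is only an illustration under assumptions: `ToyBitBuilder` and `all_marks_fit` are hypothetical names invented here, and the sketch does not reproduce how the crate actually stores or shifts its bits.

```rust
/// Hypothetical stand-in for a bit-vector builder.
/// It is NOT the crate's `BitRankBuilder`.
struct ToyBitBuilder {
    words: Vec<u64>,
    capacity: usize,
}

impl ToyBitBuilder {
    fn with_capacity(bits: usize) -> Self {
        Self {
            words: vec![0; (bits + 63) / 64],
            capacity: bits,
        }
    }

    /// Sets the bit at `pos`; returns false if `pos` is outside the capacity.
    fn try_set(&mut self, pos: usize) -> bool {
        if pos >= self.capacity {
            return false;
        }
        self.words[pos / 64] |= 1u64 << (pos % 64);
        true
    }
}

/// Marks bit `i + shift` for every `i` in `0..n` and reports whether
/// all marks fit into a builder with the given capacity.
fn all_marks_fit(n: usize, shift: usize, capacity: usize) -> bool {
    let mut builder = ToyBitBuilder::with_capacity(capacity);
    (0..n).all(|i| builder.try_set(i + shift))
}
```

For `n` input positions, unshifted marks occupy bits `0..n` and fit in `n` bits. Shifted by one they occupy `1..=n`, so the highest index is `n` and the builder needs `n + 1` bits.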
The compiler can generate faster code when the loop increment is not data-dependent.
Most characters are ASCII anyway.
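A minimal sketch of the two loop shapes, assuming valid UTF-8 input. The function names are hypothetical and the bodies only count characters; the crate's `new_converter` does more work per byte, so this shows the control-flow difference only.

```rust
/// Length in bytes of a UTF-8 sequence, judged from its first byte.
/// Continuation and invalid lead bytes are treated as length 1
/// (an assumption of this sketch).
fn utf8_width(first: u8) -> usize {
    match first {
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF7 => 4,
        _ => 1,
    }
}

/// Variant A: the next index depends on the byte that was just loaded,
/// so each iteration has to wait for the previous load.
fn count_chars_jumping(content: &[u8]) -> usize {
    let mut count = 0;
    let mut i = 0;
    while i < content.len() {
        count += 1;
        i += utf8_width(content[i]);
    }
    count
}

/// Variant B: fixed stride of one byte. Continuation bytes
/// (bit pattern 10xx_xxxx) are visited but not counted.
fn count_chars_bytewise(content: &[u8]) -> usize {
    let mut count = 0;
    for i in 0..content.len() {
        if content[i] & 0xC0 != 0x80 {
            count += 1;
        }
    }
    count
}
```

For valid UTF-8 both variants return the same count. Variant B visits more bytes on non-ASCII text, but its iterations are independent of the data, which gives the compiler more room to unroll or vectorize. On mostly ASCII input the two visit nearly the same number of bytes.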