-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add bench and speed things up #58
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds performance improvements and benchmarks while also expanding the StringOffsets API with a new len() method. Key changes include:
- Introducing a len() method and adjusting UTF-8/UTF-16 offset conversions in StringOffsets.
- Refactoring the internal loop in new_converter to use a for loop instead of manual index incrementation.
- Replacing assert! calls with debug_assert! in BitRank for performance and updating benchmark configurations.
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
crates/string-offsets/src/lib.rs | Added len() method, modified offset conversion logic, and refactored character iteration. |
crates/string-offsets/src/bitrank.rs | Changed assertion macros and adjusted BitRankBuilder initialization. |
crates/string-offsets/benchmarks/performance.rs | Added benchmark suite for StringOffsets construction. |
crates/string-offsets/Cargo.toml | Included criterion as a dependency and registered the new benchmark. |
crates/bpe/benchmarks/performance.rs | Updated random number generation for performance benchmarks. |
Files not reviewed (1)
- crates/string-offsets/benchmarks/Cargo.toml.bak: Language not supported
Comments suppressed due to low confidence (2)
crates/string-offsets/src/lib.rs:371
- The loop now iterates over every byte instead of jumping by character length; please verify that this change correctly preserves multi-byte and invalid UTF-8 character handling as intended.
for i in 0..content.len() {
crates/string-offsets/src/bitrank.rs:51
- Switching from assert_eq! to debug_assert_eq! may let duplicate positions pass undetected in release builds; consider using assert_eq! if duplicate detection is critical for production correctness.
debug_assert_eq!(self.bits[chunk_idx] & mask, 0, "toggling bits off indicates that the original data was incorrect, most likely containing duplicate values.");
Tip: Leave feedback on Copilot's review comments with the 👎 and 👍 buttons to help improve review quality. Learn more
let mut utf8_builder = BitRankBuilder::with_capacity(n); | ||
let mut utf16_builder = BitRankBuilder::with_capacity(n); | ||
let mut line_builder = BitRankBuilder::with_capacity(n); | ||
let mut utf8_builder = BitRankBuilder::with_capacity(n + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Why + 1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved all bits in this bitrank by 1. As a result, I need one more bit at the end...
The compiler can generate faster code if the loop increment is not data dependent.
Most characters are anyways ascii.