Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bench and speed things up #58

Merged
merged 6 commits into from
Mar 26, 2025
Merged

Add bench and speed things up #58

merged 6 commits into from
Mar 26, 2025

Conversation

aneubeck
Copy link
Collaborator

The compiler can generate faster code if the loop increment is not data dependent.
Most characters are anyways ascii.

@Copilot Copilot bot review requested due to automatic review settings March 25, 2025 11:48
@aneubeck aneubeck requested a review from a team as a code owner March 25, 2025 11:48
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds performance improvements and benchmarks while also expanding the StringOffsets API with a new len() method. Key changes include:

  • Introducing a len() method and adjusting UTF-8/UTF-16 offset conversions in StringOffsets.
  • Refactoring the internal loop in new_converter to use a for loop instead of manual index incrementation.
  • Replacing assert! calls with debug_assert! in BitRank for performance and updating benchmark configurations.

Reviewed Changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
crates/string-offsets/src/lib.rs Added len() method, modified offset conversion logic, and refactored character iteration.
crates/string-offsets/src/bitrank.rs Changed assertion macros and adjusted BitRankBuilder initialization.
crates/string-offsets/benchmarks/performance.rs Added benchmark suite for StringOffsets construction.
crates/string-offsets/Cargo.toml Included criterion as a dependency and registered the new benchmark.
crates/bpe/benchmarks/performance.rs Updated random number generation for performance benchmarks.
Files not reviewed (1)
  • crates/string-offsets/benchmarks/Cargo.toml.bak: Language not supported
Comments suppressed due to low confidence (2)

crates/string-offsets/src/lib.rs:371

  • The loop now iterates over every byte instead of jumping by character length; please verify that this change correctly preserves multi-byte and invalid UTF-8 character handling as intended.
for i in 0..content.len() {

crates/string-offsets/src/bitrank.rs:51

  • Switching from assert_eq! to debug_assert_eq! may let duplicate positions pass undetected in release builds; consider using assert_eq! if duplicate detection is critical for production correctness.
debug_assert_eq!(self.bits[chunk_idx] & mask, 0, "toggling bits off indicates that the original data was incorrect, most likely containing duplicate values.");

Tip: Leave feedback on Copilot's review comments with the 👎 and 👍 buttons to help improve review quality. Learn more

let mut utf8_builder = BitRankBuilder::with_capacity(n);
let mut utf16_builder = BitRankBuilder::with_capacity(n);
let mut line_builder = BitRankBuilder::with_capacity(n);
let mut utf8_builder = BitRankBuilder::with_capacity(n + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Why + 1?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved all bits in this bitrank by 1. As a result, I need one more bit at the end...

@aneubeck aneubeck enabled auto-merge March 26, 2025 08:09
@aneubeck aneubeck merged commit e9cc593 into main Mar 26, 2025
7 checks passed
@aneubeck aneubeck deleted the aneubeck/bench branch March 26, 2025 08:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants