Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make features configurable #59

Merged
merged 8 commits into from
Mar 28, 2025
Merged

Make features configurable #59

merged 8 commits into from
Mar 28, 2025

Conversation

aneubeck
Copy link
Collaborator

No description provided.

@Copilot Copilot bot review requested due to automatic review settings March 26, 2025 07:40
@aneubeck aneubeck requested a review from a team as a code owner March 26, 2025 07:40
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR makes the feature set of the string offsets library configurable via a compile‐time configuration type. It introduces a new config module with configuration types (AllConfig and OnlyLines), adapts the StringOffsets API to be generic over a configuration parameter, and updates benchmark and bitrank implementations accordingly.

  • Introduces a generic configuration parameter for StringOffsets.
  • Adds a new configuration module (config.rs) to control feature enablement.
  • Updates the BitRank builder and benchmark files to reflect these changes.

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
crates/string-offsets/src/lib.rs General API update for StringOffsets to support configurable features
crates/string-offsets/src/config.rs New module defining configuration types (AllConfig, OnlyLines)
crates/string-offsets/src/bitrank.rs Updates to bit setting mechanisms and capacity handling in BitRankBuilder
crates/string-offsets/benchmarks/performance.rs Benchmarks updated to use new construction APIs
crates/string-offsets/Cargo.toml Added benchmark dependency and configuration
crates/bpe/benchmarks/performance.rs Benchmarks updated to use the new rand API
Files not reviewed (1)
  • crates/string-offsets/benchmarks/Cargo.toml.bak: Language not supported
Comments suppressed due to low confidence (2)

crates/string-offsets/src/lib.rs:295

  • Changing the calculation in utf8_to_char by adding 1 to byte_number may cause off-by-one errors. Please verify that this adjustment correctly preserves the intended offset conversion behavior for all edge cases.
self.utf8_to_char.rank(byte_number + 1) - 1

crates/string-offsets/src/bitrank.rs:47

  • Switching from a runtime assert to a debug_assert may allow out-of-bound indices to go unchecked in production. Consider whether this check is critical and if it should remain a runtime assert.
debug_assert!(index < BITS_PER_BLOCK);

Tip: Copilot only keeps its highest confidence comments to reduce noise and keep you focused. Learn more

Copy link
Contributor

@hendrikvanantwerpen hendrikvanantwerpen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't review the bitrank changes, but looked at the rest. The type level config trick is nice to support per-instance variations! The lib changes look good. We could simplify the wasm module more, now that it's separate.

Edit: This seems to include changes that are supposed to be merged as part of #58 already? At least the bitrank changes in that PR look the same....

Copy link
Contributor

@jorendorff jorendorff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry -- today was a little crazy. I'll get this reviewed tomorrow. Just a few early comments.

Copy link
Contributor

@jorendorff jorendorff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great.

To revisit my earlier comments about whether type-level config is necessary: I see that it causes Rust to emit custom builds of new_converter.

I wonder what happens if the config is a plain old struct with boolean fields, and new_converter(config: Config, content: &[u8]) and StringOffsets::new_with_config(config: Config, content: &str) are both marked #[inline(always)]. When the caller calls it with a constant config, seems like all the branches should be optimized away, without type-level booleans.

Admittedly, it's impossible to be sure without trying it. :-\

@aneubeck aneubeck enabled auto-merge March 28, 2025 08:46
@aneubeck aneubeck disabled auto-merge March 28, 2025 08:47
@aneubeck aneubeck enabled auto-merge March 28, 2025 08:48
@aneubeck aneubeck merged commit 7ac476f into main Mar 28, 2025
7 checks passed
@aneubeck aneubeck deleted the aneubeck/config branch March 28, 2025 08:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants