Skip to content

CodePointTrie support for normalizer and collator perf improvements#7768

Open
hsivonen wants to merge 5 commits intounicode-org:mainfrom
hsivonen:withtrie
Open

CodePointTrie support for normalizer and collator perf improvements#7768
hsivonen wants to merge 5 commits intounicode-org:mainfrom
hsivonen:withtrie

Conversation

@hsivonen
Copy link
Member

@hsivonen hsivonen commented Mar 11, 2026

Split out of #7526 and #7600. The code here needs to be published to crates.io, before those changes can land, because the utf8_iter and utf16_iter crates need to depend on a version icu_collections that has this code on crates.io.

Changelog

icu_collections: Add CodePointTrie getters for fusing lookup into iterating over text: getters by Latin1, ASCII, two-byte UTF-8, and three-byte UTF-8.

  • New methods: get8(), get7(), get_utf8_two_byte(), get_utf8_three_byte() on CodePointTrie and TypedCodePointTrie
  • New trait: AbstractCodePointTrie

icu_collections: Serde and databake support for typed CodePointTries

  • New impls: databake::Bake, BakeSize, serde::Serialize, serde::Deserialize for FastCodePointTrie and SmallCodePointTrie

icu_collections: Iterators by char and TrieValue pairs for Latin1,str, and delegate iterator over char.

  • New types: CharIndicesWithTrie, CharIndicesWithTrieDefaultForAscii, CharIterWithTrie, CharsWithTrie, CharsWithTrieDefaultForAscii, CharsWithTrieEx, Latin1CharIndicesWithTrie, Latin1CharsWithTrie,
  • New trait: WithTrie
  • New extension traits on str: CharsWithTrieDefaultForAsciiEx, Latin1CharsWithTrieEx

@hsivonen hsivonen requested a review from echeran as a code owner March 11, 2026 16:35
@hsivonen hsivonen requested a review from Manishearth March 11, 2026 16:35
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take some time to properly review

}

#[inline(always)]
unsafe fn get_bit_prefix_suffix_assuming_fast_index(
Copy link
Member

@Manishearth Manishearth Mar 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: please document safety invariants (even if it's obvious from the function name)

edit: it's not; because there are invariants on bit_prefix and bit_suffix.

We should document this and ensure it's upheld by the callers.

pub unsafe fn get7(&self, ascii: u8) -> T {
debug_assert!(ascii < 128);
debug_assert!((ascii as usize) < self.data.len());
// SAFETY: Length of `self.data` checked in the constructor.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: We may add more ctors in the future. This should reference the safety invariants on data, just say // SAFETY: Allowed by datas safety invariant, updating data's invariant to require that it has at least 128 elements and updating the constructor validation to saying something like // data safety invariant upheld here

debug_assert!(low_six <= 0b111_111); // Safety invariant.
debug_assert!(high_five <= 0b11_111); // Safety invariant.
debug_assert!(high_five > 0b1); // Non-shortest form; not safety invariant.
// SAFETY: The highest character representable as a two-byte
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe introduce a newline so that this formats better

@@ -0,0 +1,1240 @@
// This file is part of ICU4X. For terms of use, please see the file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is a lot of unsafe code and I'm not convinced it is justified. Can we reduce the amount of unsafe code in this PR by writing these iterators to wrap CharIndices? There will still be unsafe in this file, but it will be around CPT invariants rather than also around UTF8 decoding.

Separately we can try and justify additional unsafe using benchmarks if needed, and that would be a nice scoped PR that can be easily reviewed and benchmarked.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fusing the trie lookup into UTF-8 decoding is the key point of this changeset: CPT in ICU4C has been designed so that its bit split lines up with the bits in the last UTF-8 trail byte, and we've been using it pessimally in ICU4X.

I guess I will need to port the UTF-16 NFC to NFD throughput benchmark to str and then get exact numbers for the effect here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I see. I feel like using CharIndices (especially with its offset function) you might still be able to get the same benefits, but I understand if that was the point of this change.

In that case we should probably have more careful tracking of the invariant on the contained iterator whenever it is advanced.

pub fn get8(&self, latin1: u8) -> T {
let code_point = u32::from(latin1);
debug_assert!(code_point <= SMALL_TYPE_FAST_INDEXING_MAX);
// SAFETY: `u8` is always below `SMALL_TYPE_FAST_INDEXING_MAX` and,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non blocking): worth documenting on those two constants that their precise values are extremely safety relevant and relied upon by many different checks in this file

debug_assert!(low_six <= 0b111_111); // Safety invariant.
debug_assert!(high_five <= 0b11_111); // Safety invariant.
debug_assert!(high_five > 0b1); // Non-shortest form; not safety invariant.
// SAFETY: The highest character representable as a two-byte
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: the safety invariants on this function are not currently documented, but once they are, this comment should be in terms of those invariants

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a line break.

///
/// `low_six` must not have bit positions other than the lowest 6 set to 1.
///
/// # Intended Invariant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: what is this? Is this a non-safety-relevant invariant?

Perhaps explicitly say it is non-safety relevant.

@hsivonen hsivonen requested a review from a team as a code owner March 17, 2026 11:00
@Manishearth
Copy link
Member

Great, thanks for updating all those invariants, this is looking much better! I'll try and finish review today, so we can get this in for 2.2.

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good.

I am in favor of landing this for the 2.2 release, which we're hoping to make next week. I have convinced myself there is no code where the safety is worse than before (the code for get_bit_prefix_suffix_assuming_fast_index was already lacking justification), and the newly introduced unsafe code is well done. Further cleanups/documentation can be performed (I can file followups for the main unsafe issues)

It would be nice to have as much of these comments addressed as possible, but we should also try to land this before next week.

///
/// # Safety
///
/// `high_ten` must not have bit positions other than the lowest 10 set to 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: we should consider using different types here, like u16 and u8. We can then as cast.

non blocking

/// # Safety
///
/// `ascii` must be less than 128.
unsafe fn ascii(&self, ascii: u8) -> T;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: should these be get_*?

I think they don't have to be, but this is a new public API so we should think about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are mostly internal facing we should namespace them as abstract_cpt_ascii or something.

}
}

impl<'slice, 'trie, T, V> Iterator for CharsWithTrieDefaultForAscii<'slice, 'trie, T, V>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: any hope of sharing code between the CharsWithTrie and CharsWithTrieDefaultForAscii impls?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems straightforward: have a type CharsWithTrieWithDefaultHandling that has .next<F>() where it calls f(lead) in the default case.

@Manishearth Manishearth added the discuss-priority Discuss at the next ICU4X meeting label Mar 24, 2026
@Manishearth
Copy link
Member

One note: This adds a lot of new APIs, and if we want to land this as new public APIs we will need to make sure they are

This adds:

  • New iterator types
  • A trait for abstracting over those iterator types
  • New get methods on CPT (get7, etc), some of them unsafe
  • A trait for abstracting over CPT types, with many methods

How much of this is needed by utf8_iter? Can we mark most of these new APIs as unstable for now and still make a release? I think I see a fair number of naming questions that we should spend time on (marked as discuss-priority to see if we can handle them Thursday)

cc @sffc to look at new APIs and naming as well

/// With debug assertions enabled, panics if the above safety invariants are
/// violated or `high_five` represents non-shortest form.
#[inline(always)]
pub unsafe fn get_utf8_two_byte(&self, high_five: u32, low_six: u32) -> T {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new public function. Should these parameters be narrower types?

/// Method naming intentionally differs from the method naming on
/// those types in order to disambiguate.
#[allow(private_bounds)] // Permit sealing
pub trait AbstractCodePointTrie<'trie, T: TrieValue>: Seal {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we already have a TypedCodePointTrie trait. Can we avoid duplicating a similar trait? How much is this needed? I understand that TypedCodePointTrie cannot abstract over CodePointTrie, but what is the use case for abstracting over all three here?

In retrospect, TypedCodePointTrie should have been designed as a trait with a single associated constant that inherited from AbstractCodePointTrie (probably CodePointTrieLike).

Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the large number of new APIs I'm actually going to wait for us to discuss them more. Might not be worth trying to make this land for 2.2, but perhaps we can resolve everything by Thursday.

If there's an MVP set of changes that enable utf8_iter integration we should try for that.

But also I'm open to doing out-of-cycle releases for this. I dislike doing nontrivial API additions in an out of cycle release but ..... I don't want to rush this release either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

discuss-priority Discuss at the next ICU4X meeting

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants