Add possessive quantifiers to avoid catastrophic backtracking #258
hauntsaninja merged 5 commits into openai:main
Conversation
Force-pushed from 30c92dc to ccd8702
| big_value = "^" * 1000000 | ||
| assert big_value == enc.decode(enc.encode(big_value)) | ||
|
|
||
| big_value = " " + big_value |
A space is often optional at the beginning, so the backtracking can reach the space; let's test that as well.
Force-pushed from 2d616bd to 21c5688
| big_value = " " + big_value | ||
| assert big_value == enc.decode(enc.encode(big_value)) | ||
|
|
||
| big_value = big_value + "\n" |
Some groups require a newline at the end; stress those paths as well.
src/lib.rs

```diff
     pattern: &str,
 ) -> PyResult<Self> {
-    let regex = Regex::new(pattern)
+    let regex = RegexBuilder::new(pattern).backtrack_limit(100_000).build()
```
This doesn't work for values bigger than a million; see fancy-regex/fancy-regex#134. I've set it lower for now, hoping we'll be able to fix the whitespace problem.
Force-pushed from 019de85 to 51c8a8a
```diff
 return {
     "name": "cl100k_base",
-    "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
+    "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s""",
```
It seems cl100k also had some backtracking problems; these possessives improve the situation considerably (e.g. in Java they aren't necessary, see knuddelsgmbh/jtokkit#87).
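For intuition, here's a minimal toy illustration of the effect (using the `regex` module that tiktoken's Python side uses; the pattern below is a classic demo, not the tokenizer's):

```python
import regex

# Classic catastrophic backtracking: on a non-matching input, the engine
# retries every way of splitting the run of "a"s between the inner and
# outer quantifier - exponential time in the length of the run.
slow = regex.compile(r"(?:a+)+$")

# Possessive ++ never gives characters back, so failure is immediate.
fast = regex.compile(r"(?:a++)+$")

s = "a" * 30 + "b"
fast.search(s)  # returns None almost instantly
# slow.search(s) takes exponentially longer as the run of "a"s grows
```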
Can we collapse the \s+(?!\S) too? It should be equivalent to \s+$, no?
It sounds reasonable, but check the tests to understand why they're not the same.
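Concretely (a small sketch with the `regex` module; toy strings, not from the test suite): \s+(?!\S) also matches mid-string, backtracking to leave the last space of a run attached to the following word, while \s++$ only matches trailing whitespace:

```python
import regex

# \s+(?!\S) gives one space back so it can prefix the next word token:
regex.findall(r"\s+(?!\S)", "a   b")  # ['  '] - two of the three spaces
# \s++$ only matches whitespace at the very end of the string:
regex.findall(r"\s++$", "a   b")      # []
regex.findall(r"\s++$", "a   ")       # ['   ']
```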
```python
@pytest.mark.parametrize("make_enc", ENCODING_FACTORIES)
def test_extremely_big_encoding(make_enc: Callable[[], tiktoken.Encoding]):
    enc = make_enc()
    for c in ["^", "0", "a", "'s", " ", "\n"]:
```
Stressing different parts of the regex, making sure none has catastrophic backtracking.
```python
# The pattern in the original GPT-2 release is:
# r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
# This is equivalent, but executes faster:
_legacy_splitter_regex = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}++| ?\p{N}++| ?[^\s\p{L}\p{N}]++|\s++$|\s+(?!\S)|\s"""
```
The whitespace runs can't be possessive (the engine needs to step back when encountering a non-whitespace), but we can rule out the offending backtracking case by adding a possessive trailing-whitespace check.
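A small sketch of why (simplified alternatives, not the full splitter): if the whitespace run were possessive, it would swallow the space that should prefix the following word, changing the token split:

```python
import regex

# Simplified subset of the splitter: letters with an optional leading
# space, then the whitespace alternatives as in the real pattern.
actual = r""" ?\p{L}++|\s++$|\s+(?!\S)|\s"""
regex.findall(actual, "a  b")  # ['a', ' ', ' b'] - 'b' keeps its space

# Hypothetical fully possessive whitespace (NOT the real pattern):
broken = r""" ?\p{L}++|\s++"""
regex.findall(broken, "a  b")  # ['a', '  ', 'b'] - wrong split
```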
```toml
fancy-regex = "0.13.0"
regex = "1.10.3"
```
Not absolutely necessary, but it adds a tiny speed increase.
```diff
     pattern: &str,
 ) -> PyResult<Self> {
-    let regex = Regex::new(pattern)
+    let regex = RegexBuilder::new(pattern).backtrack_limit(10_000).build()
```
After this change we should never backtrack catastrophically; and if we do, this will warn us early.
```python
big_value = big_value + "\n"
assert big_value == enc.decode(enc.encode(big_value))
```
big_value = big_value + "x" would still fail for whitespaces, i.e. " x". That seems less typical than the other cases which are fixed here; not yet sure how to fix this one, though, as fancy-regex seems pretty basic in this regard...
Was the backtrack limit reverted intentionally in 05e66e8? 05e66e8#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759L421-R438 Was there a regression?

Yes, it was reverted intentionally. There are OpenAI internal encodings where setting the limit caused issues.

Thanks for checking @tmm1, @hauntsaninja.
tiktoken-rs v0.9.1 uses fancy-regex for BPE tokenization, which can stack overflow on large inputs due to catastrophic backtracking (openai/tiktoken#245). The Rust code unwraps the error, panics across the FFI boundary, and aborts the entire Ruby process — this is not rescuable from Ruby. A secondary issue is quadratic BPE merge time for long non-whitespace runs (openai/tiktoken#195), which causes the process to hang indefinitely. Python tiktoken fixed the backtracking in v0.8.0 with possessive quantifiers (openai/tiktoken#258), but tiktoken-rs 0.9.1 and tiktoken_ruby 0.0.15.1 have not ported the fix, and no newer versions are available. This adds a safe_encode method that chunks text larger than 50K characters at whitespace boundaries before passing to tiktoken. BPE token boundaries never span whitespace, so chunked encoding produces identical results for normal text. For pathological inputs (e.g. 500K repeated characters with no whitespace), it completes in seconds instead of crashing or hanging. Co-Authored-By: Claude Opus 4.6 <[email protected]>
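A minimal sketch of the chunking idea described above (the real safe_encode is Ruby; this is a hypothetical Python rendering, with the 50K threshold taken from the commit message):

```python
CHUNK_SIZE = 50_000  # threshold from the commit message

def safe_encode(enc, text: str) -> list[int]:
    """Encode text in chunks cut at whitespace so no token spans a cut."""
    if len(text) <= CHUNK_SIZE:
        return enc.encode(text)
    tokens: list[int] = []
    start = 0
    while start < len(text):
        end = min(start + CHUNK_SIZE, len(text))
        if end < len(text):
            # Cut just before the last whitespace in the window, so a
            # token like " bar" stays intact in the next chunk; fall
            # back to a hard cut for pathological whitespace-free runs.
            ws = max(text.rfind(c, start, end) for c in " \t\n")
            if ws > start:
                end = ws
        tokens.extend(enc.encode(text[start:end]))
        start = end
    return tokens
```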
Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via possessive quantifiers.
Interestingly, these possessives make the encoding a lot faster again in fancy-regex.

Before this change (but with the large byte pair merge PR cherry-picked):
Same, with these changes applied:
Updating the regex libs makes it a tiny bit faster still:
This is almost 2x faster than before any of the optimizations.
Opened an issue for increasing the default backtrack limit, see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore.