Add possessive quantifiers to avoid catastrophic backtracking #258
hauntsaninja merged 5 commits into openai:main
Conversation
Force-pushed from 30c92dc to ccd8702
| big_value = "^" * 1000000 | ||
| assert big_value == enc.decode(enc.encode(big_value)) | ||
|
|
||
| big_value = " " + big_value |
A space is often optional at the beginning, so the backtracking can reach the space; let's test that as well.
Force-pushed from 2d616bd to 21c5688
| big_value = " " + big_value | ||
| assert big_value == enc.decode(enc.encode(big_value)) | ||
|
|
||
| big_value = big_value + "\n" |
Some groups require a newline at the end; stress those paths as well.
src/lib.rs

```diff
     pattern: &str,
 ) -> PyResult<Self> {
-    let regex = Regex::new(pattern)
+    let regex = RegexBuilder::new(pattern).backtrack_limit(100_000).build()
```
This doesn't work for values bigger than a million; see fancy-regex/fancy-regex#134. I've set it lower for now, hoping we'll be able to fix the whitespace problem.
Force-pushed from 019de85 to 51c8a8a
```diff
 return {
     "name": "cl100k_base",
-    "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
+    "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s""",
```
It seems cl100k also had some backtracking problems; these possessives improve the situation considerably (e.g. in Java they aren't necessary, see knuddelsgmbh/jtokkit#87).
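For intuition, here's a minimal toy illustration of the effect (using the `regex` module that tiktoken's Python side uses; the pattern below is a classic demo, not the tokenizer's):

```python
import regex

# Classic catastrophic backtracking: on a non-matching input, the engine
# retries every way of splitting the run of "a"s between the inner and
# outer quantifier - exponential time in the length of the run.
slow = regex.compile(r"(?:a+)+$")

# Possessive ++ never gives characters back, so failure is immediate.
fast = regex.compile(r"(?:a++)+$")

s = "a" * 30 + "b"
fast.search(s)  # returns None almost instantly
# slow.search(s) takes exponentially longer as the run of "a"s grows
```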
Can we collapse the \s+(?!\S) too? It should be equivalent to \s+$, no?
It sounds reasonable, but check the tests to understand why they're not the same.
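Concretely (a small sketch with the `regex` module; toy strings, not from the test suite): \s+(?!\S) also matches mid-string, backtracking to leave the last space of a run attached to the following word, while \s++$ only matches trailing whitespace:

```python
import regex

# \s+(?!\S) gives one space back so it can prefix the next word token:
regex.findall(r"\s+(?!\S)", "a   b")  # ['  '] - two of the three spaces
# \s++$ only matches whitespace at the very end of the string:
regex.findall(r"\s++$", "a   b")      # []
regex.findall(r"\s++$", "a   ")       # ['   ']
```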
```python
@pytest.mark.parametrize("make_enc", ENCODING_FACTORIES)
def test_extremely_big_encoding(make_enc: Callable[[], tiktoken.Encoding]):
    enc = make_enc()
    for c in ["^", "0", "a", "'s", " ", "\n"]:
```
Stressing different parts of the regex, making sure none has catastrophic backtracking.
```python
# The pattern in the original GPT-2 release is:
# r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
# This is equivalent, but executes faster:
_legacy_splitter_regex = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}++| ?\p{N}++| ?[^\s\p{L}\p{N}]++|\s++$|\s+(?!\S)|\s"""
```
The whitespace runs can't be possessive (the engine needs to step back when encountering a non-whitespace), but we can rule out the offending backtracking case by adding a possessive trailing-whitespace check.
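A small sketch of why (simplified alternatives, not the full splitter): if the whitespace run were possessive, it would swallow the space that should prefix the following word, changing the token split:

```python
import regex

# Simplified subset of the splitter: letters with an optional leading
# space, then the whitespace alternatives as in the real pattern.
actual = r""" ?\p{L}++|\s++$|\s+(?!\S)|\s"""
regex.findall(actual, "a  b")  # ['a', ' ', ' b'] - 'b' keeps its space

# Hypothetical fully possessive whitespace (NOT the real pattern):
broken = r""" ?\p{L}++|\s++"""
regex.findall(broken, "a  b")  # ['a', '  ', 'b'] - wrong split
```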
```toml
fancy-regex = "0.13.0"
regex = "1.10.3"
```
Not absolutely necessary, but it adds a tiny speed increase.
```diff
     pattern: &str,
 ) -> PyResult<Self> {
-    let regex = Regex::new(pattern)
+    let regex = RegexBuilder::new(pattern).backtrack_limit(10_000).build()
```
After this change we should never backtrack catastrophically; and if we do, this will warn us early.
```python
big_value = big_value + "\n"
assert big_value == enc.decode(enc.encode(big_value))
```
big_value = big_value + "x" would still fail for whitespaces, i.e. " x". That seems less typical than the other cases which are fixed here; not yet sure how to fix this one, though, as fancy-regex seems pretty basic in this regard...
Was the backtrack limit reverted intentionally in 05e66e8? 05e66e8#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759L421-R438 Was there a regression?

Yes, it was reverted intentionally. There are OpenAI internal encodings where setting the limit caused issues.

Thanks for checking @tmm1, @hauntsaninja.
tiktoken-rs v0.9.1 uses fancy-regex for BPE tokenization, which can stack overflow on large inputs due to catastrophic backtracking (openai/tiktoken#245). The Rust code unwraps the error, panics across the FFI boundary, and aborts the entire Ruby process — this is not rescuable from Ruby. A secondary issue is quadratic BPE merge time for long non-whitespace runs (openai/tiktoken#195), which causes the process to hang indefinitely. Python tiktoken fixed the backtracking in v0.8.0 with possessive quantifiers (openai/tiktoken#258), but tiktoken-rs 0.9.1 and tiktoken_ruby 0.0.15.1 have not ported the fix, and no newer versions are available. This adds a safe_encode method that chunks text larger than 50K characters at whitespace boundaries before passing to tiktoken. BPE token boundaries never span whitespace, so chunked encoding produces identical results for normal text. For pathological inputs (e.g. 500K repeated characters with no whitespace), it completes in seconds instead of crashing or hanging. Co-Authored-By: Claude Opus 4.6 <[email protected]>
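A minimal sketch of the chunking idea described above (the real safe_encode is Ruby; this is a hypothetical Python rendering, with the 50K threshold taken from the commit message):

```python
CHUNK_SIZE = 50_000  # threshold from the commit message

def safe_encode(enc, text: str) -> list[int]:
    """Encode text in chunks cut at whitespace so no token spans a cut."""
    if len(text) <= CHUNK_SIZE:
        return enc.encode(text)
    tokens: list[int] = []
    start = 0
    while start < len(text):
        end = min(start + CHUNK_SIZE, len(text))
        if end < len(text):
            # Cut just before the last whitespace in the window, so a
            # token like " bar" stays intact in the next chunk; fall
            # back to a hard cut for pathological whitespace-free runs.
            ws = max(text.rfind(c, start, end) for c in " \t\n")
            if ws > start:
                end = ws
        tokens.extend(enc.encode(text[start:end]))
        start = end
    return tokens
```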
Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via possessive quantifiers.
Interestingly, these possessives make the encoding a lot faster again in fancy-regex.

Before this change (but with the large byte pair merge PR cherry-picked):
Same, with these changes applied:
Updating the regex libs makes it a tiny bit faster still:
This is almost 2x faster than before any of the optimizations.
Opened an issue for increasing the default backtrack limit, see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore.