[Tokenizer] Add encode ordinary mode (no special token matching) #23830

Open: jtuyls wants to merge 1 commit into iree-org:main from jtuyls:fix-tiktoken-encode-ordinary

jtuyls commented Mar 18, 2026

Add IREE_TOKENIZER_ENCODE_FLAG_NO_SPECIAL_TOKEN_MATCHING, an encode flag that skips special token recognition in the input text, so special-token strings are tokenized as ordinary text. This is equivalent to tiktoken's encode_ordinary(). Fixes the 4 failing tests out of 76 in the tiktoken smoketest. Also skips tiktoken models in the HuggingFace smoketest, since they are covered by the dedicated tiktoken smoketest instead.
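
For readers unfamiliar with the distinction, here is a minimal toy sketch of what "no special token matching" means. It is not IREE's or tiktoken's implementation: the function `toy_encode`, the `match_special` parameter, the byte-level vocabulary, and the id 256 are all hypothetical, chosen only to show the behavioral difference.

```python
# Toy illustration only. Names and ids here are hypothetical and do not
# correspond to the IREE tokenizer API or to any real tiktoken vocabulary.

# Ordinary text maps to its UTF-8 byte values (0..255); the single
# special token gets the next free id.
SPECIAL_TOKENS = {"<|endoftext|>": 256}


def toy_encode(text: str, match_special: bool = True) -> list[int]:
    """Encode text to ids, optionally recognizing special-token strings."""
    ids: list[int] = []
    i = 0
    while i < len(text):
        if match_special:
            # Look for a special-token string starting at position i.
            hit = next((s for s in SPECIAL_TOKENS if text.startswith(s, i)), None)
            if hit is not None:
                ids.append(SPECIAL_TOKENS[hit])
                i += len(hit)
                continue
        # Ordinary path: emit the UTF-8 bytes of the current character.
        ids.extend(text[i].encode("utf-8"))
        i += 1
    return ids


text = "hi<|endoftext|>"
print(toy_encode(text))                       # [104, 105, 256]
print(toy_encode(text, match_special=False))  # bytes of the literal string, no 256
```

With matching disabled, the string `<|endoftext|>` is treated like any other text, which is the behavior an "encode ordinary" mode is meant to provide.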

Test Results:

  • HuggingFace smoketest: 1667/1667 passed (tiktoken models skipped, tested separately)
  • Tiktoken smoketest: 76/76 passed (all 4 special_token_endoftext tests now pass with --match_special)

Add IREE_TOKENIZER_ENCODE_FLAG_NO_SPECIAL_TOKEN_MATCHING to skip special
token recognition in input text. Equivalent to tiktoken's encode_ordinary().
Fixes 4/76 tiktoken smoketest failures. Also skips tiktoken models in the
HuggingFace smoketest (tested by the dedicated tiktoken smoketest instead).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Signed-off-by: Jorn <[email protected]>