Tokenizer splits certain tokens but doesn't split other tokens of the same type. #12448

tokenizer.explain shows you where to adjust the settings: there is a suffix pattern that is splitting off "T" in this case.
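A minimal sketch of inspecting the split with `tokenizer.explain`, assuming spaCy v3; "4T" is a hypothetical input that triggers the digit-plus-unit suffix rule in the English defaults:

```python
import spacy

nlp = spacy.blank("en")

# explain() returns (pattern_name, substring) pairs showing which
# tokenizer rule produced each piece of the final tokenization.
for pattern_name, substring in nlp.tokenizer.explain("4T"):
    print(pattern_name, substring)
```

The "T" piece is reported with the key `SUFFIX`, which points at the suffix patterns as the place to adjust.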

In this particular case it's related to the UNITS character class (which includes "T" for terabyte): the suffix rule only splits a unit off when it directly follows a digit, which is why other tokens containing "T" are left intact.
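If that behavior is unwanted, the suffix settings can be adjusted by rebuilding the suffix regex without the unit rule. A hedged sketch, assuming spaCy v3; the substring check against `UNITS` is an assumption about how the English suffix defaults are composed:

```python
import spacy
from spacy.lang.char_classes import UNITS
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")

# Drop the suffix pattern built from the UNITS alternation (assumed to
# contain the UNITS string verbatim), then recompile and reassign the
# tokenizer's suffix_search.
suffixes = [s for s in nlp.Defaults.suffixes if UNITS not in s]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

# "4T" now stays a single token instead of being split into "4" and "T".
print([t.text for t in nlp("4T")])
```

Removing the whole unit rule also affects every other unit (km, kg, MB, ...), so a narrower pattern edit may be preferable in practice.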

Answer selected by adrianeboyd
This discussion was converted from issue #12426 on March 20, 2023 08:17.