start_offset and end_offset are based on the source text (i.e. "㍿").
Thus in this case tokens[0].end_offset is correct, while those of tokens[1] and tokens[2] are wrong (they should be [0,1] and [1,1]).
This should be fixed.
FYI, Sudachi applies its own text normalization and may behave unexpectedly in combination with the icu_normalizer.
You may also want to use the allowEmptyMorpheme Sudachi option to change the tokens[2] offsets from [1,1] to [0,1].
Summary
When using the icu_normalizer, the end_offset of the token is incorrect when the character is a ligature.
Environment
Steps to reproduce
```
POST /sample
POST /sample/_analyze
```
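The original request bodies are not shown above; a minimal sketch of settings that could reproduce the mismatch might look like the following (the index name is from the report, but the analyzer, char filter, and tokenizer names are assumptions):

```
PUT /sample
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_icu_normalizer": { "type": "icu_normalizer" }
      },
      "tokenizer": {
        "my_sudachi_tokenizer": { "type": "sudachi_tokenizer", "split_mode": "C" }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_icu_normalizer"],
          "tokenizer": "my_sudachi_tokenizer"
        }
      }
    }
  }
}

POST /sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "㍿"
}
```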
Expected behavior
I think tokens[0].end_offset should be 4.
Actual behavior
tokens[0].end_offset is 1. The behavior of mode A is correct.
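The length mismatch comes from Unicode normalization: the single source character "㍿" (U+337F) expands to four characters under NFKC, which is the kind of normalization the icu_normalizer applies by default (nfkc_cf). A quick check with Python's standard library illustrates the two candidate offsets:

```python
import unicodedata

src = "㍿"  # U+337F SQUARE CORPORATION, a single ligature character
norm = unicodedata.normalize("NFKC", src)

print(norm)      # 株式会社
print(len(src))  # 1 -> offsets based on the source text end at 1
print(len(norm)) # 4 -> offsets based on the normalized text would end at 4
```

So an end_offset of 4 would only make sense against the normalized text, while offsets reported against the source text end at 1.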