
Add offset correction for split filter #149

Open: wants to merge 13 commits into base: develop
Conversation

@mh-northlander (Collaborator) commented Nov 1, 2024

fix #148

Fix token offsets when the length of a morpheme differs from its actual length in the input text.

Since the correctOffset method is only available on CharFilter and Tokenizer, we need to keep offset-mapping information around for the sudachi_split filter somehow. I chose to put it in MorphemeAttribute because it is morpheme-related data.
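A minimal sketch of the idea, with hypothetical names (the actual change stores this data via Lucene's attribute mechanism in MorphemeAttribute): each token carries a table of original-input offsets for its character boundaries, so a downstream filter can correct sub-token offsets without access to correctOffset. The ㌢ example below is an assumption for illustration.

```java
public class BoundaryOffsets {
    // boundaries[i] = offset in the original input text of the i-th character
    // boundary of the token's term; length = term length + 1.
    private final int[] boundaries;

    public BoundaryOffsets(int[] boundaries) { this.boundaries = boundaries; }

    // Corrected offsets for a sub-token spanning term characters [from, to).
    public int start(int from) { return boundaries[from]; }
    public int end(int to) { return boundaries[to]; }

    public static void main(String[] args) {
        // Assumed example: a char filter normalized "㌢" (1 input char) to
        // "センチ" (3 chars), so the term's boundary table is [0, 0, 0, 1].
        BoundaryOffsets b = new BoundaryOffsets(new int[] {0, 0, 0, 1});
        System.out.println(b.start(0) + ".." + b.end(3)); // whole token: 0..1
        System.out.println(b.start(0) + ".." + b.end(1)); // sub-token "セ": 0..0
    }
}
```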

With combined characters, extended mode behaves badly (e.g. "㍍㍉" is extended to "㍍", "㍉", not "メ", "ー", "ト", "ル", "ミ", "リ").
Using the normalized form would be more natural in this case, but we cannot calculate offsets for it (the mapping between the surface and the normalized form is missing). So I chose to keep using the surface, i.e. the text before normalization.

Also, change the dictionary type for tests from small to core, in order to test "㍿", which is not in the small dictionary.

Note that due to the correctOffset behavior of icu-normalizer, the offsets for the subsplits of ㍿ (株式 + 会社) are now [0,0] + [0,1].
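The [0,0] + [0,1] result follows from how the normalizer maps offsets back through an expansion: every offset strictly inside the 4-character expansion "株式会社" collapses to the start of the original 1-character input, and only the final offset maps to its end. A sketch of that assumed mapping behavior:

```java
public class CorrectOffsetDemo {
    // origLen: length of the original character (1 for ㍿);
    // normLen: length of its normalized expansion (4 for 株式会社).
    static int correctOffset(int offset, int origLen, int normLen) {
        // Offsets before the end of the expansion collapse to the start;
        // the end of the expansion maps to the end of the original character.
        return offset < normLen ? 0 : origLen;
    }

    public static void main(String[] args) {
        // 株式 covers normalized offsets [0,2); 会社 covers [2,4).
        System.out.println("[" + correctOffset(0, 1, 4) + ","
                + correctOffset(2, 1, 4) + "]"); // 株式 -> [0,0]
        System.out.println("[" + correctOffset(2, 1, 4) + ","
                + correctOffset(4, 1, 4) + "]"); // 会社 -> [0,1]
    }
}
```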


Successfully merging this pull request may close these issues.

end_offset of ligature character is wrong when using icu_normalizer