Summary
When using the icu_normalizer char filter, the end_offset of the token is incorrect when the input character is a ligature.
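For context, ㍿ (U+337F) is a single character that NFKC normalization expands to the four characters 株式会社, so the tokenizer sees a 4-character text while the original input is only 1 character long. The normalized token can be observed in isolation with a request along the following lines (a sketch assuming the analysis-icu plugin is installed; the keyword tokenizer is used only to pass the char filter output through unchanged):

POST /_analyze
{
  "char_filter": [
    {
      "type": "icu_normalizer",
      "name": "nfkc",
      "mode": "compose"
    }
  ],
  "tokenizer": "keyword",
  "text": "㍿"
}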
Environment
- OpenSearch version: 2.15.0
- elasticsearch-sudachi version: 3.2.3
Steps to reproduce
POST /sample
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "normalize": {
            "type": "icu_normalizer",
            "name": "nfkc",
            "mode": "compose"
          }
        },
        "filter": {
          "sudachi_split_filter": {
            "type": "sudachi_split",
            "mode": "search"
          }
        },
        "analyzer": {
          "default": {
            "type": "custom",
            "char_filter": [
              "normalize"
            ],
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "sudachi_split_filter"
            ]
          }
        }
      }
    }
  }
}
POST /sample/_analyze
{
  "analyzer": "default",
  "text": "㍿"
}
Expected behavior
I think tokens[0].end_offset should be 4, since the normalized text 株式会社 is four characters long.
{
  "tokens": [
    {
      "token": "株式会社",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "会社",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}
Actual behavior
tokens[0].end_offset is 1. The behavior with mode A is correct.
{
  "tokens": [
    {
      "token": "株式会社",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "会社",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}
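To narrow down which stage produces the wrong offset, it may help to rerun the analyze request with "explain": true, which returns the text produced by the char filter and the tokens (with offsets) emitted by sudachi_tokenizer and sudachi_split_filter separately:

POST /sample/_analyze
{
  "analyzer": "default",
  "text": "㍿",
  "explain": true
}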