end_offset of ligature character is wrong when using icu_normalizer #148

Closed
@hongu-ku

Summary

When the icu_normalizer char filter is used, the end_offset of a token is incorrect if the input character is a ligature.
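
For context, here is a minimal Python sketch of why offsets can drift under NFKC: one ligature code point expands into several characters, so positions in the normalized text no longer match positions in the original. It assumes the analyzed text is the single-character ligature ㍿ (U+337F). That input is an assumption; it is not confirmed by this report. The sketch uses the standard library `unicodedata` module, not the ICU plugin itself.

```python
import unicodedata

# Assumption: the input is the single-character ligature U+337F (SQUARE CORPORATION).
ligature = "\u337f"

# NFKC replaces the ligature with its compatibility decomposition.
normalized = unicodedata.normalize("NFKC", ligature)

original_length = len(ligature)
normalized_length = len(normalized)

# Offsets computed on the normalized text must be mapped back to the original text,
# because the two strings have different lengths.
print(normalized, original_length, normalized_length)
```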

Environment

  • OpenSearch version: 2.15.0
  • elasticsearch-sudachi version: 3.2.3

Steps to reproduce

PUT /sample

{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "normalize": {
            "type": "icu_normalizer",
            "name": "nfkc",
            "mode": "compose"
          }
        },
        "filter": {
          "sudachi_split_filter": {
            "type": "sudachi_split",
            "mode": "search"
          }
        },
        "analyzer": {
          "default": {
            "type": "custom",
            "char_filter": [
              "normalize"
            ],
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "sudachi_split_filter"
            ]
          }
        }
      }
    }
  }
}

POST /sample/_analyze

{
  "analyzer": "default",
  "text": ""
}

Expected behavior

I think tokens[0].end_offset should be 4.

{
  "tokens": [
    {
      "token": "株式会社",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "会社",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}

Actual behavior

tokens[0].end_offset is 1.
The behavior of mode A is correct.

{
  "tokens": [
    {
      "token": "株式会社",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "会社",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}
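
The inconsistency can also be checked mechanically. The following is a rough sketch, not part of the plugin: the helper name `inconsistent_spans` is hypothetical, and it assumes that a token with `positionLength` greater than 1 should span exactly the offsets of the sub-tokens that follow it, which is the expectation stated in this report.

```python
def inconsistent_spans(tokens):
    """Return compound tokens whose offsets do not match the span of their sub-tokens."""
    problems = []
    for i, tok in enumerate(tokens):
        length = tok.get("positionLength", 1)
        if length <= 1:
            continue  # not a compound token
        last_position = tok["position"] + length - 1
        # Sub-tokens: later single-position tokens inside the compound's position range.
        parts = [
            t for t in tokens[i + 1:]
            if tok["position"] <= t["position"] <= last_position
            and t.get("positionLength", 1) == 1
        ]
        if not parts:
            continue
        start = min(t["start_offset"] for t in parts)
        end = max(t["end_offset"] for t in parts)
        if (tok["start_offset"], tok["end_offset"]) != (start, end):
            problems.append(tok["token"])
    return problems


# Offsets copied from the "Actual behavior" response in this report.
actual = [
    {"token": "株式会社", "start_offset": 0, "end_offset": 1, "position": 0, "positionLength": 2},
    {"token": "株式", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "会社", "start_offset": 2, "end_offset": 4, "position": 1},
]
# Same tokens with the end_offset this report expects.
expected = [dict(actual[0], end_offset=4)] + actual[1:]

print(inconsistent_spans(actual))
print(inconsistent_spans(expected))
```

With the actual response the compound token is flagged, while the expected response passes the check.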
