end_offset of ligature character is wrong when using icu_normalizer #148

Open
hongu-ku opened this issue Oct 23, 2024 · 1 comment · May be fixed by #149

@hongu-ku

Summary

When using the icu_normalizer char filter, the end_offset of a token is incorrect when the source character is a ligature.

Environment

  • OpenSearch version: 2.15.0
  • elasticsearch-sudachi version: 3.2.3

Steps to reproduce

POST /sample

{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "normalize": {
            "type": "icu_normalizer",
            "name": "nfkc",
            "mode": "compose"
          }
        },
        "filter": {
          "sudachi_split_filter": {
            "type": "sudachi_split",
            "mode": "search"
          }
        },
        "analyzer": {
          "default": {
            "type": "custom",
            "char_filter": [
              "normalize"
            ],
            "tokenizer": "sudachi_tokenizer",
            "filter": [
              "sudachi_split_filter"
            ]
          }
        }
      }
    }
  }
}

POST /sample/_analyze

{
  "analyzer": "default",
  "text": ""
}
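
For reference, NFKC composition expands the single ligature character "㍿" into the four characters "株式会社", so the normalized text is longer than the source text. The char filter's effect can be checked in isolation with a request roughly like the one below (a sketch against the index defined above; the keyword tokenizer is used only for illustration):

POST /sample/_analyze

{
  "char_filter": [
    "normalize"
  ],
  "tokenizer": "keyword",
  "text": "㍿"
}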

Expected behavior

I think tokens[0].end_offset should be 4.

{
  "tokens": [
    {
      "token": "株式会社",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "会社",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}

Actual behavior

tokens[0].end_offset is 1, while the mode A sub-tokens (tokens[1] and tokens[2]) behave as expected.

{
  "tokens": [
    {
      "token": "株式会社",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "会社",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}
@mh-northlander
Collaborator

start_offset and end_offset are based on the source text (i.e. "㍿").
Thus in this case tokens[0].end_offset is correct, while those of tokens[1] and tokens[2] are wrong (they should be [0,1] and [1,1]).
This should be fixed.
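
Spelled out against the one-character source text "㍿", a corrected response would look roughly like the following (a sketch of the expected output after a fix; "type" fields omitted):

{
  "tokens": [
    {
      "token": "株式会社",
      "start_offset": 0,
      "end_offset": 1,
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 1,
      "position": 0
    },
    {
      "token": "会社",
      "start_offset": 1,
      "end_offset": 1,
      "position": 1
    }
  ]
}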

FYI, Sudachi applies its own text normalization and may behave unexpectedly in combination with the ICU normalization.
You may also want to use the allowEmptyMorpheme Sudachi option to change the offsets of tokens[2] from [1,1] to [0,1]:

{
  "settings": {
    "analysis": {
      "tokenizer": {
        "sudachi_disallow_empty_morpheme": {
          "type": "sudachi_tokenizer",
          "additional_settings": "{\"allowEmptyMorpheme\":false}"
        }
      }
    }
  }
}
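
For completeness, the analyzer would then need to reference this tokenizer instead of the plain sudachi_tokenizer, roughly like this (a sketch extending the settings from the reproduction steps above):

{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "char_filter": [
            "normalize"
          ],
          "tokenizer": "sudachi_disallow_empty_morpheme",
          "filter": [
            "sudachi_split_filter"
          ]
        }
      }
    }
  }
}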

@mh-northlander mh-northlander linked a pull request Nov 1, 2024 that will close this issue