
[Feature Request] Adding compatibility with unidic-lite #161

@acn-gen-tomita


Is your feature request related to a problem? Please describe.

  • Vibrato currently supports tokenization with a MeCab-compatible dictionary (unidic-mecab-2_1_2), but this dictionary produces tokenization results that differ from those of unidic-lite. Because unidic-lite is often used with machine learning models and libraries, supporting it would be very beneficial, especially for those who want to replace a tokenizer written in Python with Vibrato.
  • One example is the word "JIS": unidic-lite keeps it as a single token, whereas Vibrato with unidic-mecab-2_1_2 splits it into "J", "I", "S".
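Differences like this can be checked mechanically once the surface forms from both tokenizers have been collected. The following is a minimal sketch, assuming the surfaces are already available as lists of strings; `boundaries` and `compare_segmentation` are hypothetical helper names, and the sample lists only restate the "JIS" example above rather than calling either tokenizer.

```python
from itertools import accumulate


def boundaries(surfaces):
    """Return the set of character offsets at which each token ends."""
    return set(accumulate(len(s) for s in surfaces))


def compare_segmentation(a, b):
    """Compare two tokenizations of the same text.

    Returns the sorted offsets where only one of the two places a token boundary.
    An empty list means both tokenizations split the text identically.
    """
    if "".join(a) != "".join(b):
        raise ValueError("the two token lists do not cover the same text")
    return sorted(boundaries(a) ^ boundaries(b))


# Surfaces as reported in this issue (not produced by running a tokenizer here).
unidic_lite_surfaces = ["JIS"]
vibrato_surfaces = ["J", "I", "S"]

print(compare_segmentation(unidic_lite_surfaces, vibrato_surfaces))
```

Running this over a larger set of sentences would give a list of the positions where the two dictionaries disagree, which may help to narrow down whether the system dictionary is the cause.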

Describe the solution you'd like

NOTE: I am assuming that the difference in the system dictionary is the reason for the different tokenization results. Please confirm whether this is the case.

  • As unidic-lite is also provided in .dic format, I tried to load it directly with Vibrato, but got the error "The magic number of the input model mismatches.".
  • I also tried to compile it for Vibrato, but the distributed unidic-lite seems to be missing the CSV files, and I cannot figure out how to convert it. The reference I used is https://github.com/daac-tools/vibrato/blob/main/docs/compile.md. Any help on this would be highly appreciated.
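The error message suggests that Vibrato checks the leading bytes of the file and that the unidic-lite .dic file does not start with the bytes it expects. A minimal sketch for comparing the first bytes of two dictionary files is shown below. It makes no assumption about what either header actually contains; `read_header` and `same_header` are hypothetical helper names, and the two paths are supplied on the command line.

```python
import sys


def read_header(path, size=16):
    """Return the first `size` bytes of a file."""
    with open(path, "rb") as fp:
        return fp.read(size)


def same_header(path_a, path_b, size=16):
    """Return True if both files start with the same `size` bytes."""
    return read_header(path_a, size) == read_header(path_b, size)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python headers.py <Vibrato dictionary> <unidic-lite .dic file>
    for path in sys.argv[1:]:
        print(path, read_header(path).hex(" "))
    print("same header:", same_header(sys.argv[1], sys.argv[2]))
```

If the two headers differ, the files are most likely in different binary formats, and the unidic-lite dictionary would need to be rebuilt from its source files rather than loaded as-is.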

Describe alternatives you've considered

  • N/A

Additional context

  • I can't upload the system dictionary file of unidic-lite as it is too large for GitHub, so here are instructions for finding the file:
    • Install unidic-lite with pip install unidic-lite
    • The dictionary is installed in the unidic-lite package directory. You can find it by running the following command:
      python -c "import unidic_lite; print(unidic_lite.DICDIR)"
  • Finally, thank you so much for your great work on this library. It is incredibly fast and has a good interface.
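To check which files the installed package actually ships (for instance, whether any lexicon CSV files are present), the dictionary directory can be listed. This is a minimal sketch, assuming the package is importable as `unidic_lite` and exposes `DICDIR`; `list_dictionary_files` and `has_csv_sources` are hypothetical helper names, and the listing is skipped if the package is not installed.

```python
import os


def list_dictionary_files(dicdir):
    """Return (file name, size in bytes) pairs for regular files in `dicdir`, sorted by name."""
    entries = []
    for name in sorted(os.listdir(dicdir)):
        path = os.path.join(dicdir, name)
        if os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries


def has_csv_sources(dicdir):
    """Return True if the directory contains at least one .csv file."""
    return any(name.lower().endswith(".csv") for name, _ in list_dictionary_files(dicdir))


try:
    # Assumption: unidic-lite is installed and exposes DICDIR.
    import unidic_lite
except ImportError:
    unidic_lite = None

if unidic_lite is not None:
    for name, size in list_dictionary_files(unidic_lite.DICDIR):
        print(f"{size:>12}  {name}")
    print("CSV sources present:", has_csv_sources(unidic_lite.DICDIR))
```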
