Open
Description
Is your feature request related to a problem? Please describe.
- Vibrato currently supports tokenization with a MeCab-compatible dictionary (`unidic-mecab-2_1_2`), but this dictionary produces tokenization results that differ from those of unidic-lite. As the lite version is often used with machine learning models and libraries, adding support for unidic-lite would be very beneficial, especially for those who want to replace a tokenizer written in Python with Vibrato.
- One example of a word that is tokenized differently is "JIS": unidic-lite keeps it as a single token, whereas Vibrato with unidic-mecab-2_1_2 splits it into "J", "I", "S".
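To make the difference easy to reproduce, here is a minimal sketch of a comparison helper. It assumes each tokenizer is wrapped as a plain function that returns a list of surface strings. The two stand-in functions are hypothetical: they only mimic the behaviour reported above for "JIS" and do not load either dictionary.

```python
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def compare_tokenizations(text: str, tokenize_a: Tokenizer, tokenize_b: Tokenizer) -> Dict:
    """Tokenize `text` with both tokenizers and report whether the surfaces match."""
    surfaces_a = tokenize_a(text)
    surfaces_b = tokenize_b(text)
    return {"text": text, "a": surfaces_a, "b": surfaces_b, "same": surfaces_a == surfaces_b}


# Hypothetical stand-in: keeps the input as one token, as reported for unidic-lite.
def unidic_lite_like(text: str) -> List[str]:
    return [text]


# Hypothetical stand-in: splits into single characters, as reported for
# unidic-mecab-2_1_2 with Vibrato on the word "JIS".
def unidic_mecab_like(text: str) -> List[str]:
    return list(text)


if __name__ == "__main__":
    print(compare_tokenizations("JIS", unidic_lite_like, unidic_mecab_like))
```

Replacing the stand-ins with wrappers around the real tokenizers would let the same helper be run over a larger word list.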
Describe the solution you'd like
NOTE: I am assuming that the difference in the system dictionary is the reason for the different tokenization results; please confirm whether this is the case.
- As `unidic-lite` is also provided in `.dic` format, I tried to load it directly with Vibrato, but got the error "The magic number of the input model mismatches.".
- I also tried to compile it for Vibrato, but the distributed `unidic-lite` seems to be missing the CSV files, and I cannot figure out how to convert it. The reference I used is https://github.com/daac-tools/vibrato/blob/main/docs/compile.md. Any help on this is highly appreciated.
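The error message suggests that the start of the file is not what Vibrato expects. Below is a minimal sketch for comparing the leading bytes of two dictionary files, for example the `.dic` file from unidic-lite and a dictionary that Vibrato loads successfully. The file paths in the usage section are placeholders, and the number of bytes to read is arbitrary: I have not confirmed how long the magic number actually is.

```python
from pathlib import Path


def read_header(path, num_bytes: int = 16) -> bytes:
    """Return the first `num_bytes` bytes of the file at `path`."""
    with open(path, "rb") as f:
        return f.read(num_bytes)


def same_header(path_a, path_b, num_bytes: int = 16) -> bool:
    """Check whether two files start with the same `num_bytes` bytes."""
    return read_header(path_a, num_bytes) == read_header(path_b, num_bytes)


if __name__ == "__main__":
    # Placeholder paths: replace with the actual dictionary files to compare.
    candidates = [Path("unidic-lite-system.dic"), Path("vibrato-system.dic")]
    for path in candidates:
        if path.exists():
            print(path, read_header(path))
        else:
            print(path, "not found")
```

If the two headers differ, that would be consistent with the two tools using different binary dictionary formats, but it would not by itself prove it.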
Describe alternatives you've considered
- N/A
Additional context
- I can't upload the system dictionary file of `unidic-lite` as it is too large for GitHub, so I am leaving instructions for finding the file:
  - Download `unidic-lite` with `pip install unidic-lite`.
  - The dictionary is installed in the directory of the `unidic-lite` package. You can find it by running the following command: `python -c "import unidic_lite; print(unidic_lite.DICDIR)"`
- Finally, thank you so much for your great work on this library. It is incredibly fast and performant, with a good interface.
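The steps for locating the dictionary can also be sketched as a small script that prints the directory and counts the files in it by extension, which makes it easy to see whether any CSV files are present. This is only a sketch: the module name `unidic_lite` is my assumption about how the pip package is imported, and both helper functions are illustrative names rather than part of any library.

```python
import importlib
from collections import Counter
from pathlib import Path


def find_dicdir(module_name: str = "unidic_lite"):
    """Return the DICDIR attribute of the given module, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)  # assumed module name
    except ImportError:
        return None
    return getattr(module, "DICDIR", None)


def count_by_extension(directory) -> Counter:
    """Count regular files in `directory` by suffix (or by name if no suffix)."""
    return Counter(
        p.suffix or p.name
        for p in Path(directory).iterdir()
        if p.is_file()
    )


if __name__ == "__main__":
    dicdir = find_dicdir()
    if dicdir is None:
        print("unidic-lite does not appear to be installed")
    else:
        print("dictionary directory:", dicdir)
        for ext, count in sorted(count_by_extension(dicdir).items()):
            print(f"{ext}: {count}")
```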