Description
**Is your feature request related to a problem? Please describe.**
Loading a large dictionary such as UniDic 2023-02 is slow in comparison to mecab.
I've downloaded the latest UniDic cwj 2023-02 from https://clrd.ninjal.ac.jp/unidic/download.html#unidic_bccwj and built my own compiled vibrato dictionary using:

```sh
cargo run --release -p compile -- \
  -l unidic-cwj-202302_full/lex.csv \
  -m unidic-cwj-202302_full/matrix.def \
  -u unidic-cwj-202302_full/unk.def \
  -c unidic-cwj-202302_full/char.def \
  -o system.dic.zst
```
Then I tried tokenizing the example sentence from the docs:

```
> time echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i system.dic.zst
    Running `target/release/tokenize -i system.dic.zst`
Loading the dictionary...
Ready to tokenize
本	名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と	助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー	名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の	助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街	名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町	名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ	助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう	形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ	助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。	補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

________________________________________________________
Executed in   13.96 secs    fish           external
   usr time   13.09 secs    0.00 micros   13.09 secs
   sys time    0.86 secs    0.00 micros    0.86 secs
```

It takes around 14 seconds just to load the dictionary.
In comparison, mecab is near-instant:

```
> time echo "本とカレーの街神保町へようこそ。" | mecab --dicdir="unidic-cwj-202302_full"
本	名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と	助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー	名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の	助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街	名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町	名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ	助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう	形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ	助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。	補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

________________________________________________________
Executed in   28.32 millis    fish           external
   usr time    0.00 millis    0.00 micros    0.00 millis
   sys time   31.25 millis    0.00 micros   31.25 millis
```
I looked at the code, and it seems that all the time is spent deserializing bincode into the `DictionaryInner` struct, specifically in the `read_common` function:
```rust
fn read_common<R>(mut rdr: R) -> Result<DictionaryInner>
where
    R: Read,
{
    let mut magic = [0; MODEL_MAGIC.len()];
    rdr.read_exact(&mut magic)?;
    if magic != MODEL_MAGIC {
        return Err(VibratoError::invalid_argument(
            "rdr",
            "The magic number of the input model mismatches.",
        ));
    }
    let config = common::bincode_config();
    let data = bincode::decode_from_std_read(&mut rdr, config)?;
    Ok(data)
}
```
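To double-check that the decode itself dominates (rather than disk I/O or zstd decompression feeding the reader), one could wrap the reader in a small timing shim before handing it to `read_common`. This is a hypothetical helper for measurement, not part of vibrato:

```rust
use std::io::{self, Read};
use std::time::{Duration, Instant};

/// A `Read` wrapper that records time spent in, and bytes produced by,
/// the inner reader, to separate I/O cost from decode cost.
pub struct TimedReader<R> {
    inner: R,
    pub elapsed: Duration,
    pub bytes: u64,
}

impl<R: Read> TimedReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, elapsed: Duration::ZERO, bytes: 0 }
    }
}

impl<R: Read> Read for TimedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let start = Instant::now();
        let n = self.inner.read(buf)?;
        self.elapsed += start.elapsed();
        self.bytes += n as u64;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    // Demo on an in-memory buffer; in the real experiment this would wrap
    // the file/zstd reader passed to read_common.
    let data = vec![0u8; 1 << 20];
    let mut rdr = TimedReader::new(&data[..]);
    let mut out = Vec::new();
    rdr.read_to_end(&mut out)?;
    println!("inner reads: {} bytes in {:?}", rdr.bytes, rdr.elapsed);
    Ok(())
}
```

If the time inside the wrapper is small relative to the total, the remainder is pure bincode decoding.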
Most of the time is spent in `let data = bincode::decode_from_std_read(&mut rdr, config)?;`, so bincode deserialization appears to be the bottleneck.

How is mecab able to return results so quickly despite not loading everything into memory like vibrato does? mecab seems to use almost no memory, whereas vibrato holds about 1 GB in memory before it can start tokenizing.
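As far as I understand, mecab can start immediately because its compiled dictionary is a flat, offset-based layout that it memory-maps, so the OS pages entries in on demand instead of deserializing everything up front. A stdlib-only sketch of that "touch only the bytes you need" access pattern (the fixed-size record format here is invented for illustration):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

const RECORD_SIZE: u64 = 8;

/// Write `n` fixed-size records; record i holds i as a little-endian u64.
fn write_records(path: &str, n: u64) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    for i in 0..n {
        f.write_all(&i.to_le_bytes())?;
    }
    Ok(())
}

/// Read a single record by seeking straight to its offset, touching only
/// the bytes needed -- the same idea as mmap-backed lazy access.
fn read_record(path: &str, i: u64) -> std::io::Result<u64> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(i * RECORD_SIZE))?;
    let mut buf = [0u8; 8];
    f.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

fn main() -> std::io::Result<()> {
    let path = "records.bin";
    write_records(path, 1_000)?;
    // Only 8 bytes are read here, regardless of how large the file is.
    println!("record 123 = {}", read_record(path, 123)?);
    std::fs::remove_file(path)
}
```

With an eager deserializer like bincode, by contrast, the whole 988 MB structure has to be materialized before the first lookup.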
**Describe the solution you'd like**

Could we use a faster serialization framework such as rkyv? According to its benchmarks, it is much faster than bincode. The rkyv docs say:

> It's similar to other zero-copy deserialization frameworks such as Cap'n Proto and FlatBuffers. However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot. Additionally, rkyv is designed to have little to no overhead, and in most cases will perform exactly the same as native types.

I'm not sure if there's any other way to speed it up. Could we somehow parallelize deserialization?
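Parallelizing would only help if the components of `DictionaryInner` (lexicon, connection matrix, etc.) can be decoded independently, which I haven't verified. As a sketch of the mechanics, using scoped threads from the standard library and a made-up section format of packed little-endian u32s:

```rust
use std::thread;

/// Parse one independent section; here a section is a run of
/// little-endian u32 values (an invented format for illustration).
fn parse_section(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Decode each section on its own thread with scoped threads
/// (stable since Rust 1.63), collecting results in section order.
fn parse_parallel(sections: &[&[u8]]) -> Vec<Vec<u32>> {
    thread::scope(|s| {
        let handles: Vec<_> = sections
            .iter()
            .copied()
            .map(|sec| s.spawn(move || parse_section(sec)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let a: Vec<u8> = [1u32, 2, 3].iter().flat_map(|v| v.to_le_bytes()).collect();
    let b: Vec<u8> = [4u32, 5].iter().flat_map(|v| v.to_le_bytes()).collect();
    println!("{:?}", parse_parallel(&[&a, &b])); // [[1, 2, 3], [4, 5]]
}
```

The dictionary format would need per-section lengths (or offsets) in its header so the sections can be sliced up before decoding starts.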
**Describe alternatives you've considered**

Apparently bincode is slow for structs that contain `Vec` and byte slices, and the recommendation is to use serde_bytes.

Fields such as

```rust
pub struct UnkEntry {
    pub cate_id: u16,
    pub left_id: u16,
    pub right_id: u16,
    pub word_cost: i16,
    pub feature: String,
}

pub struct WordFeatures {
    features: Vec<String>,
}
```

are stored as strings; maybe they could be stored as `Vec<u8>` instead?
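Going a step further than serde_bytes, the per-entry allocations could be avoided entirely by concatenating all feature strings into one buffer with an offset table. A minimal sketch of that layout (this is not vibrato's actual representation):

```rust
/// Flat storage for feature strings: one shared byte buffer plus an
/// offset table, instead of one heap allocation per String.
pub struct FlatFeatures {
    bytes: Vec<u8>,
    // offsets[i]..offsets[i + 1] is the byte range of feature i.
    offsets: Vec<u32>,
}

impl FlatFeatures {
    pub fn from_strs(features: &[&str]) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0u32];
        for f in features {
            bytes.extend_from_slice(f.as_bytes());
            offsets.push(bytes.len() as u32);
        }
        Self { bytes, offsets }
    }

    pub fn get(&self, i: usize) -> &str {
        let (s, e) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        // The buffer was built from valid UTF-8 strings, so this cannot fail.
        std::str::from_utf8(&self.bytes[s..e]).unwrap()
    }
}

fn main() {
    let ff = FlatFeatures::from_strs(&["名詞", "普通名詞", "一般"]);
    println!("{} / {}", ff.get(0), ff.get(2)); // 名詞 / 一般
}
```

Two plain `Vec`s like these should also deserialize as two bulk reads rather than millions of small `String` decodes.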
**Additional context**

I'm using vibrato version 0.5.1. Here are the compiled dictionary sizes:

```
> du -sh system.dic.zst
291M	system.dic.zst
> du -sh system.dic
988M	system.dic
```