Description
**Is your feature request related to a problem? Please describe.**
Loading a large dictionary such as UniDic 2023-02 is slow in comparison to mecab.
I've downloaded the latest UniDic cwj 2023-02 from https://clrd.ninjal.ac.jp/unidic/download.html#unidic_bccwj and built my own compiled vibrato dictionary using:

```sh
cargo run --release -p compile -- \
  -l unidic-cwj-202302_full/lex.csv \
  -m unidic-cwj-202302_full/matrix.def \
  -u unidic-cwj-202302_full/unk.def \
  -c unidic-cwj-202302_full/char.def \
  -o system.dic.zst
```
Then I tried tokenizing the example sentence from the docs:

```
> time echo '本とカレーの街神保町へようこそ。' | cargo run --release -p tokenize -- -i system.dic.zst
    Running `target/release/tokenize -i system.dic.zst`
Loading the dictionary...
Ready to tokenize
本	名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と	助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー	名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の	助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街	名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町	名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ	助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう	形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ	助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。	補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

________________________________________________________
Executed in   13.96 secs    fish           external
   usr time   13.09 secs    0.00 micros   13.09 secs
   sys time    0.86 secs    0.00 micros    0.86 secs
```

It takes around 14 seconds just to load the dictionary.
In comparison, mecab is near-instant:

```
> time echo "本とカレーの街神保町へようこそ。" | mecab --dicdir="unidic-cwj-202302_full"
本	名詞,普通名詞,一般,*,*,*,ホン,本,本,ホン,本,ホン,漢,ホ濁,基本形,*,*,*,*,体,ホン,ホン,ホン,ホン,1,C3,*,9584176605045248,34867
と	助詞,格助詞,*,*,*,*,ト,と,と,ト,と,ト,和,*,*,*,*,*,*,格助,ト,ト,ト,ト,*,"名詞%F1,動詞%F1,形容詞%F2@-1",*,7099014038299136,25826
カレー	名詞,普通名詞,一般,*,*,*,カレー,カレー-curry,カレー,カレー,カレー,カレー,外,*,*,*,*,*,*,体,カレー,カレー,カレー,カレー,0,C2,*,2018162216411648,7342
の	助詞,格助詞,*,*,*,*,ノ,の,の,ノ,の,ノ,和,*,*,*,*,*,*,格助,ノ,ノ,ノ,ノ,*,名詞%F1,*,7968444268028416,28989
街	名詞,普通名詞,一般,*,*,*,マチ,街,街,マチ,街,マチ,和,*,*,*,*,*,*,体,マチ,マチ,マチ,マチ,2,C3,*,9827718430597632,35753
神保町	名詞,固有名詞,地名,一般,*,*,ジンボウチョウ,ジンボウチョウ,神保町,ジンボーチョー,神保町,ジンボーチョー,固,*,*,*,*,*,*,地名,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,ジンボウチョウ,"3,0",*,*,5174035466035712,18823
へ	助詞,格助詞,*,*,*,*,ヘ,へ,へ,エ,へ,エ,和,*,*,*,*,*,*,格助,ヘ,ヘ,ヘ,ヘ,*,名詞%F1,*,9296104558567936,33819
よう	形容詞,非自立可能,*,*,形容詞,連用形-ウ音便,ヨイ,良い,よう,ヨー,よい,ヨイ,和,*,*,*,*,*,*,相,ヨウ,ヨイ,ヨウ,ヨイ,1,C3,*,10716957049496195,38988
こそ	助詞,係助詞,*,*,*,*,コソ,こそ,こそ,コソ,こそ,コソ,和,*,*,*,*,*,*,係助,コソ,コソ,コソ,コソ,*,"形容詞%F2@0,名詞%F2@1,動詞%F2@0",*,3501403402281472,12738
。	補助記号,句点,*,*,*,*,*,。,。,*,。,*,記号,*,*,*,*,*,*,補助,*,*,*,*,*,*,*,6880571302400,25
EOS

________________________________________________________
Executed in   28.32 millis    fish           external
   usr time    0.00 millis    0.00 micros    0.00 millis
   sys time   31.25 millis    0.00 micros   31.25 millis
```
I looked at the code, and it seems that all the time is spent deserializing bincode into the `DictionaryInner` struct, specifically in the `read_common` function:
```rust
fn read_common<R>(mut rdr: R) -> Result<DictionaryInner>
where
    R: Read,
{
    let mut magic = [0; MODEL_MAGIC.len()];
    rdr.read_exact(&mut magic)?;
    if magic != MODEL_MAGIC {
        return Err(VibratoError::invalid_argument(
            "rdr",
            "The magic number of the input model mismatches.",
        ));
    }
    let config = common::bincode_config();
    let data = bincode::decode_from_std_read(&mut rdr, config)?;
    Ok(data)
}
```
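To double-check that the decode itself dominates (rather than disk I/O or zstd decompression feeding the reader), one could wrap the reader in a small timing shim before handing it to `read_common`. This is a hypothetical helper for measurement, not part of vibrato:

```rust
use std::io::{self, Read};
use std::time::{Duration, Instant};

/// A `Read` wrapper that records time spent in, and bytes produced by,
/// the inner reader, to separate I/O cost from decode cost.
pub struct TimedReader<R> {
    inner: R,
    pub elapsed: Duration,
    pub bytes: u64,
}

impl<R: Read> TimedReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner, elapsed: Duration::ZERO, bytes: 0 }
    }
}

impl<R: Read> Read for TimedReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let start = Instant::now();
        let n = self.inner.read(buf)?;
        self.elapsed += start.elapsed();
        self.bytes += n as u64;
        Ok(n)
    }
}

fn main() -> io::Result<()> {
    // Demo on an in-memory buffer; in the real experiment this would wrap
    // the file/zstd reader passed to read_common.
    let data = vec![0u8; 1 << 20];
    let mut rdr = TimedReader::new(&data[..]);
    let mut out = Vec::new();
    rdr.read_to_end(&mut out)?;
    println!("inner reads: {} bytes in {:?}", rdr.bytes, rdr.elapsed);
    Ok(())
}
```

If the time inside the wrapper is small relative to the total, the remainder is pure bincode decoding.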
Most of the time is spent in `let data = bincode::decode_from_std_read(&mut rdr, config)?;`, so bincode deserialization appears to be the bottleneck.

How is mecab able to return results so quickly despite not loading everything into memory like vibrato does? mecab seems to use almost no memory, whereas vibrato holds about 1 GB in memory before it can start tokenizing.
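As far as I understand, mecab can start immediately because its compiled dictionary is a flat, offset-based layout that it memory-maps, so the OS pages entries in on demand instead of deserializing everything up front. A stdlib-only sketch of that "touch only the bytes you need" access pattern (the fixed-size record format here is invented for illustration):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

const RECORD_SIZE: u64 = 8;

/// Write `n` fixed-size records; record i holds i as a little-endian u64.
fn write_records(path: &str, n: u64) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    for i in 0..n {
        f.write_all(&i.to_le_bytes())?;
    }
    Ok(())
}

/// Read a single record by seeking straight to its offset, touching only
/// the bytes needed -- the same idea as mmap-backed lazy access.
fn read_record(path: &str, i: u64) -> std::io::Result<u64> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(i * RECORD_SIZE))?;
    let mut buf = [0u8; 8];
    f.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

fn main() -> std::io::Result<()> {
    let path = "records.bin";
    write_records(path, 1_000)?;
    // Only 8 bytes are read here, regardless of how large the file is.
    println!("record 123 = {}", read_record(path, 123)?);
    std::fs::remove_file(path)
}
```

With an eager deserializer like bincode, by contrast, the whole 988 MB structure has to be materialized before the first lookup.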
**Describe the solution you'd like**

Could we use a faster serialization framework such as rkyv? According to its benchmarks, it is much faster than bincode. The rkyv docs say:

> It's similar to other zero-copy deserialization frameworks such as Cap'n Proto and FlatBuffers. However, while the former have external schemas and heavily restricted data types, rkyv allows all serialized types to be defined in code and can serialize a wide variety of types that the others cannot. Additionally, rkyv is designed to have little to no overhead, and in most cases will perform exactly the same as native types.

I'm not sure if there's any other way to speed it up. Could we somehow parallelize deserialization?
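Parallelizing would only help if the components of `DictionaryInner` (lexicon, connection matrix, etc.) can be decoded independently, which I haven't verified. As a sketch of the mechanics, using scoped threads from the standard library and a made-up section format of packed little-endian u32s:

```rust
use std::thread;

/// Parse one independent section; here a section is a run of
/// little-endian u32 values (an invented format for illustration).
fn parse_section(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Decode each section on its own thread with scoped threads
/// (stable since Rust 1.63), collecting results in section order.
fn parse_parallel(sections: &[&[u8]]) -> Vec<Vec<u32>> {
    thread::scope(|s| {
        let handles: Vec<_> = sections
            .iter()
            .copied()
            .map(|sec| s.spawn(move || parse_section(sec)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let a: Vec<u8> = [1u32, 2, 3].iter().flat_map(|v| v.to_le_bytes()).collect();
    let b: Vec<u8> = [4u32, 5].iter().flat_map(|v| v.to_le_bytes()).collect();
    println!("{:?}", parse_parallel(&[&a, &b])); // [[1, 2, 3], [4, 5]]
}
```

The dictionary format would need per-section lengths (or offsets) in its header so the sections can be sliced up before decoding starts.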
**Describe alternatives you've considered**

Apparently bincode is slow for structs that contain `Vec` and byte slices, and the recommendation is to use serde_bytes.

Fields such as

```rust
pub struct UnkEntry {
    pub cate_id: u16,
    pub left_id: u16,
    pub right_id: u16,
    pub word_cost: i16,
    pub feature: String,
}

pub struct WordFeatures {
    features: Vec<String>,
}
```

are stored as strings; maybe they could be stored as `Vec<u8>` instead?
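Going a step further than serde_bytes, the per-entry allocations could be avoided entirely by concatenating all feature strings into one buffer with an offset table. A minimal sketch of that layout (this is not vibrato's actual representation):

```rust
/// Flat storage for feature strings: one shared byte buffer plus an
/// offset table, instead of one heap allocation per String.
pub struct FlatFeatures {
    bytes: Vec<u8>,
    // offsets[i]..offsets[i + 1] is the byte range of feature i.
    offsets: Vec<u32>,
}

impl FlatFeatures {
    pub fn from_strs(features: &[&str]) -> Self {
        let mut bytes = Vec::new();
        let mut offsets = vec![0u32];
        for f in features {
            bytes.extend_from_slice(f.as_bytes());
            offsets.push(bytes.len() as u32);
        }
        Self { bytes, offsets }
    }

    pub fn get(&self, i: usize) -> &str {
        let (s, e) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        // The buffer was built from valid UTF-8 strings, so this cannot fail.
        std::str::from_utf8(&self.bytes[s..e]).unwrap()
    }
}

fn main() {
    let ff = FlatFeatures::from_strs(&["名詞", "普通名詞", "一般"]);
    println!("{} / {}", ff.get(0), ff.get(2)); // 名詞 / 一般
}
```

Two plain `Vec`s like these should also deserialize as two bulk reads rather than millions of small `String` decodes.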
**Additional context**

I'm using vibrato version 0.5.1. Here are the compiled dictionary sizes:

```
> du -sh system.dic.zst
291M	system.dic.zst
> du -sh system.dic
988M	system.dic
```