Stream dictionary entries #248

Merged
merged 7 commits into from
Nov 28, 2024
4 changes: 3 additions & 1 deletion src/main/java/com/worksap/nlp/sudachi/Dictionary.java
@@ -60,7 +60,9 @@ public interface Dictionary extends AutoCloseable {
/**
* Create a parallel stream of all words in the dictionary as morphemes.
*
- * Entries in the stream are not sorted.
+ * Corresponds to the lines in the lexicon csv, i.e. includes hidden entries and
+ * excludes entries for normalization form. Entries in the stream are not
+ * sorted.
*
* @return a stream of morphemes.
*/
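The contract described in this Javadoc, together with the `lookup` behavior checked in the tests of this change, can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: `EntriesSketch`, `Entry`, `LEXICON`, and the `phantom` flag are hypothetical names invented for illustration and are not part of Sudachi. The sketch also assumes that a hidden entry is one with a negative left id and that lookup skips phantom entries. It is not how the library implements `Dictionary.entries()`.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy model of the entries()/lookup() contract; none of these types are Sudachi classes. */
final class EntriesSketch {

    /** Hypothetical lexicon entry, reduced to the fields this sketch needs. */
    static final class Entry {
        final int wordId;
        final String surface;
        final int leftId;       // assumption: a negative left id marks a hidden entry
        final boolean phantom;  // assumption: true if the entry exists only as a normalized form

        Entry(int wordId, String surface, int leftId, boolean phantom) {
            this.wordId = wordId;
            this.surface = surface;
            this.leftId = leftId;
            this.phantom = phantom;
        }

        boolean isHidden() {
            return leftId < 0;
        }
    }

    /** Three entries that mirror csv lines, plus one phantom entry with no csv line. */
    static final List<Entry> LEXICON = List.of(
            new Entry(0, "東京都", 6, false),
            new Entry(1, "隠し", -1, false),
            new Entry(2, "な。な", 8, false),
            new Entry(3, "なな", 8, true));

    /** Unordered parallel stream: keeps hidden entries, drops phantom ones. */
    static Stream<Entry> entries() {
        return LEXICON.parallelStream().filter(e -> !e.phantom);
    }

    /** Surface lookup: in this model, hidden and phantom entries are never returned. */
    static List<Entry> lookup(String surface) {
        return LEXICON.stream()
                .filter(e -> !e.isHidden() && !e.phantom && e.surface.equals(surface))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Collect into a set so the check does not depend on stream order.
        Set<String> surfaces = entries().map(e -> e.surface).collect(Collectors.toSet());
        System.out.println(surfaces.contains("隠し"));   // hidden entry is streamed
        System.out.println(surfaces.contains("なな"));   // phantom entry is not
        System.out.println(lookup("隠し").isEmpty());    // hidden entry is not found by lookup
    }
}
```

Collecting into a `Set` before asserting reflects the "not sorted" part of the contract: callers of a parallel, unordered stream should avoid relying on the position of any entry.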
@@ -107,6 +107,10 @@ class JapaneseDictionaryTest {
fun entries() {
// contains all morphemes, where all of them have different wordId
assertEquals(41, dict.entries().map { m -> m.getWordId() }.distinct().count())
// includes entry with -1 conjunction cost
assertEquals(1, dict.entries().filter { m -> m.dictionaryForm() == "隠し" }.count())
// excludes phantom entry
assertEquals(0, dict.entries().filter { m -> m.surface() == "なな" }.count())
// use grammar
assertEquals(6, dict.entries().filter { m -> m.partOfSpeech().get(1) == "固有名詞" }.count())
// use lexicon
@@ -138,6 +142,10 @@ class JapaneseDictionaryTest {
assertEquals(1, sudachi.size)
assertEquals("徳島県産", sudachi[0].getUserData())

// cannot find hidden entry
val hidden = dict.lookup("隠し")
assertTrue(hidden.isEmpty())

// will be normalized
val norm = dict.lookup("特A")
assertEquals(1, norm.size)
@@ -110,6 +110,12 @@ class DoubleArrayLexiconTest {
assertEquals("行く", lexicon.string(0, lexicon.getWordInfo(wi.getNormalizedForm()).getHeadword()))
assertEquals("行く", lexicon.string(0, lexicon.getWordInfo(wi.getDictionaryForm()).getHeadword()))

// な。な (phantom normalized form)
wi = lexicon.getWordInfo(getWordId(39))
assertEquals("な。な", lexicon.string(0, wi.getHeadword()))
assertEquals("ナナ", lexicon.string(0, wi.getReadingForm()))
assertEquals("なな", lexicon.string(0, lexicon.getWordInfo(wi.getNormalizedForm()).getHeadword()))

// 東京都
wi = lexicon.getWordInfo(getWordId(6))
assertEquals("東京都", lexicon.string(0, wi.getHeadword()))
2 changes: 1 addition & 1 deletion src/test/resources/dict/lex.csv
@@ -38,5 +38,5 @@ IndexForm,LeftId,RightId,Cost,Headword,POS1,POS2,POS3,POS4,POS5,POS6,Reading_For
012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789,9,9,-9000,,名詞,数詞,*,*,*,*,ゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウゼロイチニサンヨンゴロクナナハチキュウ,,,,,,,
特a,8,8,2914,特A,名詞,普通名詞,一般,*,*,*,トクエー,,,,,,,
隠し,-1,-1,0,,名詞,普通名詞,一般,*,*,*,カクシ,,,,,,,
-な。な,8,8,2914,,名詞,普通名詞,一般,*,*,*,ナナ,,,"アイウ,名詞,普通名詞,一般,*,*,*,アイウ","アイウ,名詞,普通名詞,一般,*,*,*,アイウ",,,
+な。な,8,8,2914,,名詞,普通名詞,一般,*,*,*,ナナ,なな,,"アイウ,名詞,普通名詞,一般,*,*,*,アイウ","アイウ,名詞,普通名詞,一般,*,*,*,アイウ",,,
東東京都,6,8,6320,,名詞,固有名詞,地名,一般,*,*,ヒガシヒガシキョウト,,,,,"東,名詞,普通名詞,一般,*,*,*,ヒガシ/東,名詞,普通名詞,一般,*,*,*,ヒガシ/京都,名詞,固有名詞,地名,一般,*,*,キョウト",,