Description
Compare the following two commands and their output:
$ echo nikinitawiwapamew | divvunspell --always-suggest --zhfst tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst
Reading from stdin...
Input: nikinitawiwapamew [INCORRECT]
nikî-nitawi-wâpamâw 40.458984
naki-nitawi-wâpamêw 49.151657
naki-nitawi-wâpamâw 53.151657
nika-nitawi-wâpamâw 53.151657
nikê-nitawi-wâpamâw 53.151657
nikî-nitawi-asamâw 53.151657
nikî-natawi-wâpamâw 57.151657
nikî-nitawi-wanâmâw 57.151657
niwî-nitawi-wâpamâw 57.151657
kikî-nitawi-wâpamâw 67.15166
$ echo nikinitawiwapamew | hfst-ospell -S -b 16 tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst
"nikinitawiwapamew" is NOT in the lexicon:
Corrections for "nikinitawiwapamew":
nikî-nitawi-wâpamâw 15.458984
naki-nitawi-wâpamêw 24.151655
nikî-nitawi-wanâmâw 27.151655
niwî-nitawi-wâpamâw 27.151655
kikî-nitawi-wâpamâw 27.151655
nikî-natawi-wâpamâw 27.151655
nika-nitawi-wâpamâw 28.151655
nikê-nitawi-wâpamâw 28.151655
nikî-nitawi-asamâw 28.151655
naki-nitawi-wâpamâw 28.151655
nikî-nitawi-wêpimâw 31.151655
What is strange about the weight differences is that the weights are encoded in the FSTs (the acceptor and the error model). The expectation is therefore that identical input should give identical weights for identical suggestions.
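To make the discrepancy easier to see, here is a minimal sketch that compares the two result lists suggestion by suggestion. The weights are copied from the outputs above; the dictionary names and the helper `weight_differences` are hypothetical and exist only for this illustration.

```python
# Weights copied verbatim from the divvunspell output above.
divvunspell = {
    "nikî-nitawi-wâpamâw": 40.458984,
    "naki-nitawi-wâpamêw": 49.151657,
    "naki-nitawi-wâpamâw": 53.151657,
    "nika-nitawi-wâpamâw": 53.151657,
    "nikê-nitawi-wâpamâw": 53.151657,
    "nikî-nitawi-asamâw": 53.151657,
    "nikî-natawi-wâpamâw": 57.151657,
    "nikî-nitawi-wanâmâw": 57.151657,
    "niwî-nitawi-wâpamâw": 57.151657,
    "kikî-nitawi-wâpamâw": 67.15166,
}

# Weights copied verbatim from the hfst-ospell output above.
hfst_ospell = {
    "nikî-nitawi-wâpamâw": 15.458984,
    "naki-nitawi-wâpamêw": 24.151655,
    "nikî-nitawi-wanâmâw": 27.151655,
    "niwî-nitawi-wâpamâw": 27.151655,
    "kikî-nitawi-wâpamâw": 27.151655,
    "nikî-natawi-wâpamâw": 27.151655,
    "nika-nitawi-wâpamâw": 28.151655,
    "nikê-nitawi-wâpamâw": 28.151655,
    "nikî-nitawi-asamâw": 28.151655,
    "naki-nitawi-wâpamâw": 28.151655,
    "nikî-nitawi-wêpimâw": 31.151655,
}


def weight_differences(a, b):
    """Return a[word] - b[word] for every suggestion present in both lists."""
    return {word: round(a[word] - b[word], 3) for word in a if word in b}


diffs = weight_differences(divvunspell, hfst_ospell)
for word, diff in sorted(diffs.items(), key=lambda item: item[1]):
    print(f"{word}\t{diff:+.3f}")
```

For the numbers above, the offset is not constant: six suggestions differ by 25, three by 30, and one by 40. A non-uniform offset of this kind would also account for the two tools ordering the suggestions differently.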
On the surface, it looks like divvunspell is giving wrong weights: adding the acceptor weight of the suggestion to the weight of each editing operation gives a value close to the hfst-ospell weight:
$ echo nikî-nitawi-wâpamâw | hfst-lookup -q tools/spellcheckers/fstbased/desktop/hfst/acceptor.default.hfst
nikî-nitawi-wâpamâw nikî-nitawi-wâpamâw 9,458984
nikî-nitawi-wâpamâw nikî-nitawi-wâpamâw 11,151655
The lowest weight is the one used, and four editing operations are applied to the input string, with the following weights:
# strings.default.regex:
{in} -> {î-n}::1,
{iw} -> {i-w}::1;
# editdist.default.txt:
a â 2
# final_strings.default.txt:
mew:mâw 4
9.458984 + 1 + 1 + 2 + 4 = 17.458984
The hfst-ospell weight (15.458984) is still 2 off from this sum, but that is nevertheless far closer than divvunspell's 40.458984.
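The manual calculation can be reproduced with a short sketch. It assumes, as the reasoning above does, that the expected weight is simply the lowest acceptor weight plus the sum of the error-model weights of the four editing operations. The variable names are hypothetical; the numbers are taken from the outputs and rule files quoted above.

```python
# Lowest acceptor weight reported by hfst-lookup for nikî-nitawi-wâpamâw.
acceptor_weight = 9.458984

# Error-model weights of the four editing operations listed above.
edit_weights = {
    "{in} -> {î-n}": 1.0,  # strings.default.regex
    "{iw} -> {i-w}": 1.0,  # strings.default.regex
    "a -> â": 2.0,         # editdist.default.txt
    "mew:mâw": 4.0,        # final_strings.default.txt
}

# Assumption: total weight = acceptor weight + sum of edit weights.
expected = acceptor_weight + sum(edit_weights.values())

# Weights actually reported by the two tools for the top suggestion.
reported = {"hfst-ospell": 15.458984, "divvunspell": 40.458984}

print(f"expected: {expected:.6f}")
for tool, weight in reported.items():
    print(f"{tool}: {weight:.6f} (off by {weight - expected:+.6f})")
```

Under that assumption, hfst-ospell comes out 2 below the expected sum and divvunspell 23 above it.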
These differences are problematic for two reasons: they indicate a bug in the weight calculation, and they make it hard to debug the suggestions and their ordering.