Bug in extra penalty calculations in divvunspell #18

@snomos

Compare the following two commands and their output:

$ echo nikinitawiwapamew | divvunspell --always-suggest --zhfst tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst 
Reading from stdin...
Input: nikinitawiwapamew		[INCORRECT]
nikî-nitawi-wâpamâw		40.458984
naki-nitawi-wâpamêw		49.151657
naki-nitawi-wâpamâw		53.151657
nika-nitawi-wâpamâw		53.151657
nikê-nitawi-wâpamâw		53.151657
nikî-nitawi-asamâw		53.151657
nikî-natawi-wâpamâw		57.151657
nikî-nitawi-wanâmâw		57.151657
niwî-nitawi-wâpamâw		57.151657
kikî-nitawi-wâpamâw		67.15166

$ echo nikinitawiwapamew | hfst-ospell -S -b 16 tools/spellcheckers/fstbased/desktop/hfst/crk.zhfst 
"nikinitawiwapamew" is NOT in the lexicon:
Corrections for "nikinitawiwapamew":
nikî-nitawi-wâpamâw    15.458984
naki-nitawi-wâpamêw    24.151655
nikî-nitawi-wanâmâw    27.151655
niwî-nitawi-wâpamâw    27.151655
kikî-nitawi-wâpamâw    27.151655
nikî-natawi-wâpamâw    27.151655
nika-nitawi-wâpamâw    28.151655
nikê-nitawi-wâpamâw    28.151655
nikî-nitawi-asamâw    28.151655
naki-nitawi-wâpamâw    28.151655
nikî-nitawi-wêpimâw    31.151655

What is strange about the weight differences is that the weights are encoded in the FSTs (acceptor and error model), so identical input should yield identical weights for identical output.
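To make the mismatch easier to inspect, the two outputs can be compared suggestion by suggestion. The following is only a sketch: the weights are copied by hand from the two runs above, and `weight_gaps` is a hypothetical helper name, not part of either tool.

```python
# Weights copied from the divvunspell run above (suggestion -> weight).
divvunspell = {
    "nikî-nitawi-wâpamâw": 40.458984,
    "naki-nitawi-wâpamêw": 49.151657,
    "naki-nitawi-wâpamâw": 53.151657,
    "nika-nitawi-wâpamâw": 53.151657,
    "nikê-nitawi-wâpamâw": 53.151657,
    "nikî-nitawi-asamâw": 53.151657,
    "nikî-natawi-wâpamâw": 57.151657,
    "nikî-nitawi-wanâmâw": 57.151657,
    "niwî-nitawi-wâpamâw": 57.151657,
    "kikî-nitawi-wâpamâw": 67.15166,
}

# Weights copied from the hfst-ospell run above.
hfst_ospell = {
    "nikî-nitawi-wâpamâw": 15.458984,
    "naki-nitawi-wâpamêw": 24.151655,
    "nikî-nitawi-wanâmâw": 27.151655,
    "niwî-nitawi-wâpamâw": 27.151655,
    "kikî-nitawi-wâpamâw": 27.151655,
    "nikî-natawi-wâpamâw": 27.151655,
    "nika-nitawi-wâpamâw": 28.151655,
    "nikê-nitawi-wâpamâw": 28.151655,
    "nikî-nitawi-asamâw": 28.151655,
    "naki-nitawi-wâpamâw": 28.151655,
    "nikî-nitawi-wêpimâw": 31.151655,
}


def weight_gaps(a, b, ndigits=2):
    """Return {suggestion: a_weight - b_weight} for suggestions present in both."""
    return {s: round(a[s] - b[s], ndigits) for s in a if s in b}


gaps = weight_gaps(divvunspell, hfst_ospell)

# Print the shared suggestions, smallest gap first.
for suggestion, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{suggestion}\t{gap:+.2f}")
```

On the numbers above, the gap comes out as 25 for six suggestions, 30 for three, and 40 for one, so the difference does not look like a single constant offset.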

On the surface, it looks like divvunspell is giving wrong weights: if one takes the acceptor weight of the suggestion plus the weight of each editing operation, one comes close to the hfst-ospell weight:

$ echo nikî-nitawi-wâpamâw | hfst-lookup -q tools/spellcheckers/fstbased/desktop/hfst/acceptor.default.hfst 
nikî-nitawi-wâpamâw	nikî-nitawi-wâpamâw	9,458984
nikî-nitawi-wâpamâw	nikî-nitawi-wâpamâw	11,151655

The lowest weight is the one used, and four editing operations are applied to the input string, with the following weights:

# strings.default.regex:
{in} -> {î-n}::1,
{iw} -> {i-w}::1;

# editdist.default.txt:
a	â	2

# final_strings.default.txt:
mew:mâw	4

9,458984 + 1 + 1 + 2 + 4 = 17,458984

The hfst-ospell weight (15.458984) is still 2 off from this sum, but that is nevertheless far closer than divvunspell's 40.458984.
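The hand calculation above can be reproduced in a few lines, which may help when checking other suggestions. This is a sketch under the assumption stated in this issue, namely that the expected weight is the lowest acceptor weight plus the weights of the applied edits. The `parse_weight` helper and the dictionary keys are illustrative names only.

```python
def parse_weight(text: str) -> float:
    """Parse a weight as printed by hfst-lookup above (comma as decimal separator)."""
    return float(text.replace(",", "."))


# Both analyses reported by hfst-lookup for nikî-nitawi-wâpamâw.
acceptor_weights = [parse_weight(w) for w in ("9,458984", "11,151655")]

# Edit weights quoted above from the error model sources.
edit_weights = {
    "in -> î-n (strings.default.regex)": 1.0,
    "iw -> i-w (strings.default.regex)": 1.0,
    "a -> â (editdist.default.txt)": 2.0,
    "mew -> mâw (final_strings.default.txt)": 4.0,
}

# Assumption: lowest acceptor weight plus the sum of all edit weights.
expected = min(acceptor_weights) + sum(edit_weights.values())

# Weights each tool actually reported for this suggestion.
reported = {"hfst-ospell": 15.458984, "divvunspell": 40.458984}
offsets = {tool: round(w - expected, 6) for tool, w in reported.items()}

print(f"expected\t{expected:.6f}")
for tool, offset in offsets.items():
    print(f"{tool}\t{reported[tool]:.6f}\t({offset:+.6f} vs. expected)")
```

Relative to this sum, hfst-ospell is 2 below and divvunspell is 23 above.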

These differences are problematic for two reasons: they indicate a bug in the weight calculation, and they make it hard to debug the suggestions and their ordering.
