Multiple acceptors and error models #25

Open
@snomos

Both old ideas and new developments suggest a more flexible approach to acceptors and error models. Below is a list of things discussed in the past, plus new ideas inspired by the ongoing machine learning work by @gusmakali on word completion and prediction. Some of the tasks mentioned in #19 are also relevant here.

Multiple error models

The idea is that all of the above could be present in one and the same speller archive, with some configuration specifying when to apply which model. A very tentative idea: a machine learning error model will either get it right with the top hypothesis or fail completely (as determined by filtering the hypotheses against the lexicon), so use that one as a first step. If it fails, fall back to a hand-tuned error model, which could be written to be on the safe side, i.e. not suggest anything outside a certain set of errors. If that fails too, fall back to the default error model.

Exactly how this should work and interact is very much an open question, but divvunspell should provide the machinery so that linguists can experiment with it to reach an optimal setup for a given language and device type.
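To make the fallback idea concrete, here is a minimal sketch in Python. It is not divvunspell code: `suggest_with_fallback`, the toy models and the set-based lexicon are all hypothetical stand-ins, and it assumes that a lower weight means a better suggestion. Each error model is tried in order, its hypotheses are filtered against the lexicon, and the first model with any surviving hypothesis wins.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types for illustration only, not the divvunspell API.
Suggestion = Tuple[str, float]  # (candidate word, weight); lower weight assumed better
ErrorModel = Callable[[str], List[Suggestion]]


def suggest_with_fallback(
    word: str,
    error_models: Sequence[ErrorModel],
    accepts: Callable[[str], bool],
) -> List[Suggestion]:
    """Try each error model in order; return the first non-empty filtered result."""
    for model in error_models:
        # Keep only hypotheses that the lexicon (acceptor) recognises.
        accepted = [(cand, weight) for cand, weight in model(word) if accepts(cand)]
        if accepted:
            return sorted(accepted, key=lambda s: s[1])
    # Every model failed: no suggestions at all.
    return []


# Toy stand-ins: a set instead of an acceptor FST, functions instead of error models.
LEXICON = {"hello", "help"}


def ml_model(word: str) -> List[Suggestion]:
    return [("helllo", 0.1)]  # confident but wrong: not in the lexicon


def hand_tuned_model(word: str) -> List[Suggestion]:
    return []  # plays safe: this error is outside its set


def default_model(word: str) -> List[Suggestion]:
    return [("help", 2.0), ("hello", 1.0)]


result = suggest_with_fallback(
    "helo", [ml_model, hand_tuned_model, default_model], LEXICON.__contains__
)
print(result)  # [('hello', 1.0), ('help', 2.0)]
```

The order of the models is just a list here, so a linguist could reorder or drop models per language or device type through configuration, without touching the lookup logic.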

Multiple acceptors

And possibly other variants too.

There are at least two ideas here:

  • we might want to be more careful about what we suggest. An easy way to do that is to verify suggestions against a more restricted acceptor, e.g. one with no dynamic compounding or derivation (such words would still be accepted, just never suggested). Another way of restricting suggestions, discussed several times in the past, is a configurable weight limit X:
    • never suggest anything with a weight higher than X
  • in productive word formation it is easy to overgenerate, e.g. for compounds, but subtracting illegal paths from an FST is hugely inefficient and space-consuming. A much better option is a rejector FST that contains the invalid strings: anything in that FST should always be rejected, except when the user has explicitly added it to a user dictionary.
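The two ideas above can be sketched roughly as follows. This is an illustration only, not divvunspell code: plain Python sets stand in for the acceptor and rejector FSTs, the names `is_accepted` and `filter_suggestions` are hypothetical, and it assumes that a lower weight means a better suggestion.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical type for illustration; lower weight is assumed to be better.
Suggestion = Tuple[str, float]


def is_accepted(word: str, full_acceptor: Set[str], rejector: Set[str],
                user_words: Set[str]) -> bool:
    """Spell-check a word: the user dictionary overrides the rejector."""
    if word in user_words:
        return True
    if word in rejector:
        return False
    return word in full_acceptor


def filter_suggestions(candidates: Iterable[Suggestion],
                       restricted_acceptor: Set[str], rejector: Set[str],
                       user_words: Set[str], max_weight: float) -> List[Suggestion]:
    """Keep only candidates that are safe to suggest, best (lowest weight) first."""
    kept = []
    for word, weight in candidates:
        if word in rejector and word not in user_words:
            continue  # known-bad string, never suggest
        if weight > max_weight:
            continue  # above the configurable limit X
        if word not in restricted_acceptor:
            continue  # accepted perhaps, but too productive to suggest
        kept.append((word, weight))
    return sorted(kept, key=lambda s: s[1])


# Toy data: the full acceptor overgenerates compounds, the restricted one has none.
FULL = {"house", "boat", "houseboat", "househouse"}
RESTRICTED = {"house", "boat"}
REJECTOR = {"househouse"}

candidates = [("houseboat", 3.0), ("house", 5.0), ("boat", 50.0), ("househouse", 1.0)]
print(filter_suggestions(candidates, RESTRICTED, REJECTOR, set(), max_weight=10.0))
# [('house', 5.0)]
```

Keeping the rejector as a separate lookup means the full acceptor never has to be rebuilt with the invalid paths subtracted, which is the cost the second bullet is trying to avoid.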

As part of this work it is probably necessary to rework the zhfst archive format, most likely by making the bhfst format the standard, including the JSON config file used there.
