epic: dictionary-based word-breakers 🔬 #12142

Draft: wants to merge 40 commits into base: master
425a0a0
feat(common/models/wordbreakers): starting on dictionary-based wordbr…
jahorton Mar 9, 2024
7b456c1
feat(common/models/wordbreakers): actual first-pass implementation
jahorton Mar 10, 2024
2da7b7c
feat(common/models/wordbreakers): pass 2 - should now tokenize full c…
jahorton Mar 10, 2024
5568167
feat(common/models/wordbreakers): dict-breaker helper unit tests (BMP)
jahorton Mar 10, 2024
d40a09d
feat(common/models/wordbreakers): dict-breaker helper unit tests (non…
jahorton Mar 10, 2024
10f91e0
feat(common/models/wordbreakers): dict-breaker unit tests with simpl…
jahorton Mar 10, 2024
1d07e9f
fix(common/models/wordbreakers): blocks empty span output on empty co…
jahorton Mar 10, 2024
3e82434
fix(common/models): update re base branch change
jahorton Mar 14, 2024
56e4052
docs(common/models): updates dict-breaker comments
jahorton Aug 9, 2024
66c156e
change(common/models/wordbreakers): allows wordbreaker unit tests to …
jahorton Mar 10, 2024
3181a4c
feat(common/models/wordbreakers): baby's first khmer wordbreaking test
jahorton Mar 10, 2024
07f6747
chore(common/models/wordbreakers): comment tweak
jahorton Mar 10, 2024
8c5e54a
feat(common/models/wordbreakers): rejoins adjacent single-point spans…
jahorton Mar 10, 2024
766992d
fix(common/models/wordbreakers): handling of penalty transitions
jahorton Mar 10, 2024
99e1f4b
change(common/models): use spread operator to split on codepoints
jahorton Aug 9, 2024
51b62e6
chore(common/models/wordbreakers): Merge branch 'feat/common/models/w…
jahorton Aug 9, 2024
295c70b
chore(common/models/wordbreakers): Merge base branch into feat/common…
jahorton Aug 9, 2024
8ae6a62
chore(common/models): drops unit tests for replaced func
jahorton Aug 9, 2024
258763b
chore(common/models/wordbreakers): Merge branch 'feat/common/models/w…
jahorton Aug 9, 2024
8942883
chore(common/models/wordbreakers): Merge base branch fixes into feat/…
jahorton Aug 9, 2024
001126b
chore: establish dictionary breakers epic
mcdurdin Aug 9, 2024
1b3dfda
Merge pull request #12253 from keymanapp/chore/merge-master-into-dict…
mcdurdin Aug 22, 2024
fcbffc6
chore(common/models/wordbreakers): Merge branch 'epic/dict-breaker' i…
jahorton Aug 23, 2024
d38ed7c
chore(common/models/wordbreakers): Merge branch 'feat/common/models/w…
jahorton Aug 23, 2024
c33c10d
chore(common/models): Apply suggestions from code review
jahorton Aug 26, 2024
9ca5d0a
Merge pull request #12139 from keymanapp/feat/common/models/wordbreak…
jahorton Aug 27, 2024
b0aaf1e
Merge pull request #12140 from keymanapp/change/common/models/wordbre…
jahorton Aug 27, 2024
2e7e9d7
Merge pull request #12141 from keymanapp/feat/common/models/wordbreak…
jahorton Aug 27, 2024
1df99e5
chore: Merge branch 'epic/dict-breaker' into chore/merge-master-into-…
jahorton Aug 29, 2024
cf43c0e
chore: move dict.ts into src
mcdurdin Sep 10, 2024
dad7a6c
Merge pull request #12317 from keymanapp/chore/merge-master-into-dict…
mcdurdin Sep 10, 2024
c2e80b4
chore: Merge branch 'epic/dict-breaker' into chore/merge-master-into-…
mcdurdin Sep 13, 2024
c4991a7
chore: fixup dependency path
mcdurdin Sep 13, 2024
367d32e
Merge pull request #12411 from keymanapp/chore/merge-master-into-dict…
mcdurdin Sep 14, 2024
39430bf
Merge branch 'epic/dict-breaker' into chore/merge-master-into-dict-br…
mcdurdin Oct 10, 2024
04cab9d
Merge pull request #12530 from keymanapp/chore/merge-master-into-dict…
mcdurdin Oct 11, 2024
0db0b26
Merge branch 'epic/dict-breaker' into chore/merge-master-into-dict-br…
mcdurdin Oct 25, 2024
c3f9917
Merge pull request #12575 from keymanapp/chore/merge-master-into-dict…
mcdurdin Oct 25, 2024
e9c530c
Merge branch 'epic/dict-breaker' into chore/merge-master-into-dict-br…
mcdurdin Nov 8, 2024
e22be7f
Merge pull request #12650 from keymanapp/chore/merge-master-into-dict…
mcdurdin Nov 10, 2024
2 changes: 2 additions & 0 deletions web/src/engine/predictive-text/wordbreakers/README.md
@@ -17,3 +17,5 @@ const breakWords = wordBreakers['default'];
console.log(breakWords('Hello, World!').map(span => span.text));
// prints: [ 'Hello', ',', 'World', '!' ]
```

## TODO: dict-breakers
1 change: 1 addition & 0 deletions web/src/engine/predictive-text/wordbreakers/build.sh
@@ -15,6 +15,7 @@ THIS_SCRIPT="$(readlink -f "${BASH_SOURCE[0]}")"
# Note: the raw text files used for data.inc.ts are found within
# /resources/standards-data/unicode-character-database.
builder_describe "Builds the predictive-text wordbreaker implementation module" \
"@../templates test" \
"clean" \
"configure" \
"build" \
@@ -67,7 +67,7 @@ export default function default_(text: string, options?: DefaultWordBreakerOptio
/**
* A span that does not cut out the substring until it absolutely has to!
*/
class LazySpan implements Span {
export class LazySpan implements Span {
private _source: string;
readonly start: number;
readonly end: number;
329 changes: 329 additions & 0 deletions web/src/engine/predictive-text/wordbreakers/src/main/dict/index.ts
@@ -0,0 +1,329 @@
import { LazySpan } from "../default/index.js";
import { Span, LexiconTraversal } from "@keymanapp/common-types";

// Based on the MIN_KEYSTROKE_PROBABILITY penalty used by the lm-worker.
const CHAR_SKIP_PENALTY = -Math.log2(0.0001);

export function splitOnWhitespace(text: string): Span[] {
  const sections: Span[] = [];

  // `undefined` while scanning whitespace; otherwise, the start index of the
  // current token. Starting it as `undefined` prevents emitting an empty span
  // when the text leads with whitespace.
  let start: number | undefined;

  // Surrogate pairs will never overlap \u0020, so we don't need to be
  // surrogate-pair aware here.
  text += ' '; // sentinel space: guarantees the final token is flushed.
  for(let index = 0; index < text.length; index++) {
    const char = text.charAt(index);
    if(char.match(/\s|\u200c/)) {
      if(start !== undefined) {
        sections.push(new LazySpan(text, start, index));
        start = undefined; // we do not emit whitespace tokens here.
      }
    } else if(start === undefined) {
      start = index;
    }
  }

  return sections;
}

export type DictBreakerPath = {
  /**
   * The index of the character immediately before the most recently-available word boundary.
   * Is set to -1 if no such boundary exists.
   */
  boundaryIndex: number;

  // Could add a 'reference' if we create objects for each char in the context - such as for
  // caching & reusing boundary info with future inputs.

  /**
   * An active traversal representing potential words that may become completed, starting
   * immediately after the boundary indicated by `boundaryIndex`.
   */
  traversal: LexiconTraversal;

  /**
   * The path's cumulative cost, measured in -log2(p) of each decision.
   */
  cost: number;

  /**
   * Indicates if this path's most recent traversal enforces a boundary without matching
   * a word in the lexicon.
   */
  wasUnmatchedChar?: boolean;

  /**
   * The path object used to reach the previous boundary.
   */
  parent?: DictBreakerPath;
}

/**
 * Provides dictionary-based wordbreaking, assuming a LexiconTraversal can be specified for
 * the dictionary.
 * @param fullText The full context to be tokenized.
 * @param dictRoot A LexiconTraversal interface from the active LexicalModel,
 *                 allowing efficient dictionary lookups of encountered words.
 * @returns An array of `Span`s representing each tokenized word, indexed against
 *          the full context.
 */
export default function dict(fullText: string, dictRoot: LexiconTraversal): Span[] {
  if(!dictRoot) {
    throw new Error("Cannot use dictionary-based wordbreaker without `LexiconTraversal` dictionary access");
  }

  // Whenever we have a space or a ZWNJ (U+200C), we'll assume a 100%-confirmed wordbreak
  // at that location. We only need to "guess" at anything between them.
  const sections = splitOnWhitespace(fullText);
  let allSpans: Span[] = [];

  for(const section of sections) {
    // Technically, this may give us a 'partial' wordbreak at the section's end, which
    // may be slightly significant for earlier sections. Probably not worth worrying
    // about, though.
    allSpans = allSpans.concat(_dict_break(section, dictRoot));
  }

  return allSpans;
}

// Exposed for testing reasons.
/**
 * Given a section of text without whitespaces and ZWNJs, uses the active lexical-model's
 * entries to detect optimal word-breaking locations.
 * @param span A span representing the section and its position within the context.
 * @param dictRoot A LexiconTraversal interface from the active LexicalModel,
 *                 allowing efficient dictionary lookups of encountered words.
 * @returns An array of `Span`s representing each tokenized word, indexed according to their
 *          location in the section's original context.
 */
export function _dict_break(span: Span, dictRoot: LexiconTraversal): Span[] {
  if(span.length == 0) {
    return [];
  }

  const text = span.text;
  const splitIndex = span.start;

  // 1. Splay the string into individual codepoints.
  const codepointArr = [...text];

  // 2. Initialize tracking vars and prep the loop.
  let bestBoundingPath: DictBreakerPath = {
    boundaryIndex: -1,
    traversal: dictRoot,
    cost: 0
  };

  // Optimization TODO: convert to priority queue?
  let activePaths: DictBreakerPath[] = [bestBoundingPath];

  // 3. Run the master loop.
  // 3a. For each codepoint in the string...
  for(let i = 0; i < codepointArr.length; i++) {
    const codepoint = codepointArr[i];
    let paths: DictBreakerPath[] = [];

    // 3b. Compute all viable paths to continue words & start new ones.
    for(const path of activePaths) {
      const traversal = path.traversal.child(codepoint);
      if(!traversal) {
        continue;
      }

      const pathCtd: DictBreakerPath = {
        boundaryIndex: path.boundaryIndex,
        traversal: traversal,
        cost: (path.parent?.cost ?? 0) - Math.log2(traversal.p),
        parent: path.parent
      };

      paths.push(pathCtd);
    }

    // 3c. Find the minimal-cost new path with a word boundary, if any exist.
    // If the traversal has entries, it's a legal path-end; else it isn't.
    const boundingPaths = paths.filter((path) => !!path.traversal.entries.length);
    // If none exist, this is the fallback.
    const penaltyParent: DictBreakerPath = {
      boundaryIndex: i - 1, // successor will cover one codepoint
      // Penalty paths are never walked further, so any non-null traversal is
      // fine here; it has no `entries`, but... it's fine.
      traversal: dictRoot.child(codepoint) ?? dictRoot,
      // bestBoundingPath is currently a root-level traversal. Its parent corresponds
      // to the previous token.
      cost: bestBoundingPath.cost + CHAR_SKIP_PENALTY,
      parent: bestBoundingPath.parent,
      wasUnmatchedChar: true
    };

    boundingPaths.push(penaltyParent);
    // Sort in cost-ascending order.
    // As we're using negative log likelihood, smaller is better.
    // (The closer to log_2(1) = 0, the better.)
    boundingPaths.sort((a, b) => a.cost - b.cost);

    // We build a new path starting from this specific path; we're modeling a word-end.
    // If it's the "penalty path", we already built it.
    const bestBound = boundingPaths[0];
    const successorPath: DictBreakerPath = {
      boundaryIndex: i,
      traversal: dictRoot,
      cost: bestBound.cost,
      parent: bestBound
    };

    bestBoundingPath = successorPath;
    paths.push(successorPath);

    // 3d. We now shift to the next loop iteration; we use the descendant `paths` set.
    activePaths = paths;
  }

  // 4. When all iterations are done, determine the lowest-cost path that
  //    remains, without regard to whether it supports a word-boundary.
  //
  //    If we happen to end on a potential word-boundary, opt for that one. If two
  //    match aside from boundaryIndex, take the lesser cost; it sorts first, so a
  //    stable sort auto-resolves this.
  activePaths.sort((a, b) => (a.cost - b.cost));
  const winningPath = activePaths[0];

  // 5. Build the spans.
  const spans: (Span & { codepointLength: number, unmatched: boolean })[] = [];
  const pathAsArray: DictBreakerPath[] = [];

  let rewindPath = winningPath;
  while(rewindPath) {
    const start = rewindPath.boundaryIndex + 1;
    const end = codepointArr.length; // consistent because of the effects from the splice below
    const text = codepointArr.splice(start, end - start).join('');

    pathAsArray.unshift(rewindPath);
    spans.unshift({
      start: start, // currently in code points; we'll correct it on the next pass.
      end: end,     // same.
      length: text.length, // Span spec: in code units
      text: text,
      codepointLength: end - start,
      unmatched: !!rewindPath.wasUnmatchedChar
    });
    rewindPath = rewindPath.parent;
  }

  // 6. Span pass 2 - index finalization.
  //    - Remember, split-index is our offset!
  //    - We currently have codepoint `start` and `end`, but need code-unit values.
  let totalLength = splitIndex;
  for(let i = 0; i < spans.length; i++) {
    const baseSpan = spans[i];
    const start = totalLength;
    totalLength += baseSpan.length;

    const trueSpan: typeof spans[0] = {
      ...baseSpan,
      start: start,
      end: totalLength
    };

    spans[i] = trueSpan;
  }

  // If all we had was whitespace, hence no spans, return.
  if(spans.length == 0) {
    return spans;
  }

  // 7. Span pass 3: identify continuous penalty spans. Why split into separate spans
  //    when we can merge all the bits we can't recognize as a big lump instead?
  //    - Looks nicer in the unit tests, if nothing else.
  //    - Has _far_ better potential for 'learning' down the line.

  let spanBucket: Span[] = [];
  const finalSpans: Span[] = [];

  function mergeBucket(spanBucket: Span[]) {
    if(spanBucket.length > 0) {
      const startSpan = spanBucket[0];
      const endSpan = spanBucket[spanBucket.length - 1];

      finalSpans.push({
        start: startSpan.start,
        end: endSpan.end,
        length: endSpan.end - startSpan.start,
        text: spanBucket.map((entry) => entry.text).join('')
      });
    }
  }

  for(const span of spans) {
    if(span.codepointLength == 1 && span.unmatched) {
      spanBucket.push(span);
    } else {
      mergeBucket(spanBucket);
      spanBucket = [];
      finalSpans.push(span);
    }
  }

  mergeBucket(spanBucket);

  // ... and done!
  return finalSpans;

  /*
    Important questions:
    - What is the cheapest way to have a word-break boundary after this character?
      - This is a 100% valid question; the complications arise in moving from an
        "earlier" answer to a "later" answer.

    - What words are possible to reach given recent possible boundaries?
      - idea: keep a running 'tally' of all Traversals that have valid paths at the
        current processing stage, noting what their starting points were.
        - matches the approach seen above.
        - Possible optimization: instead of a 'tally'... a 'priority queue'?
          - cheapest (start point cost) + (current traversal min cost) makes a good
            A* heuristic.
            - valid heuristic - traversal min-cost will never overestimate.
          - would likely avoid a need to search expensive branches this way.
          - current correction considers fat-finger prob * correction cost.
            - lexical probability only factors in 100%-after corrections, as a
              final step, at present, hence why it's not currently available.
      - if no longer valid, drop it from the 'tally' / the 'running'.
      - after each codepoint, we always add a newly-started traversal.
        - worst-case, it comes with a set penalty cost added to the previous
          codepoint's cost.
      - if a current traversal has entries, we have a direct candidate for best
        cost at this location.
        - if using the priority-queue strategy, we may need to process just enough
          entries to where the next node is equal or higher cost than the selected
          entry.
          - unless the cost crossed to reach it IS that cost.
      - If multiple co-located entries, use the best cost of them all. They all
        search-key to each other, anyway, so whatever is best is still "valid enough"
        to trigger a boundary.

    So...

    O(N), N = length of string: loop over each codepoint
      O(N [worst-case]): loop over each still-valid traversal
        - at most, matches the index of the current codepoint
        - in practice, will be NOTABLY fewer after the first few codepoints.
        O(1): check if the Traversal can continue with the new incoming char.
  */

  // Possible future work:
  // - Could prep to build a 'memo' of calcs reusable by later runs?
  //   - may be best to note "still-valid start locations at this node".
  // - May be worth persisting a cache of recent memos, doing a quick diff on the
  //   most recent as a step 0 in future runs, to reuse data.
  //   - but... how to clear that cache on model change?
  //     - validate the passed-in root traversal; if unequal, it's a different model.
}
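The breaker above consumes `LexiconTraversal` only through `child()`, `entries`, and `p`. The following is a minimal, self-contained sketch of a trie-backed traversal with that shape; the `MiniTraversal` interface, `buildTrie` helper, and all word probabilities below are illustrative assumptions, not the real `@keymanapp/common-types` implementation.

```typescript
// Minimal trie-backed stand-in for the LexiconTraversal shape used above.
interface MiniTraversal {
  child(codepoint: string): MiniTraversal | undefined;
  entries: string[]; // words completed exactly at this node
  p: number;         // cumulative probability of this prefix
}

function buildTrie(words: [string, number][]): MiniTraversal {
  type Node = { children: Map<string, Node>; entries: string[]; p: number };
  const root: Node = { children: new Map(), entries: [], p: 1 };
  for (const [word, prob] of words) {
    let node = root;
    for (const cp of word) { // iterate by codepoint, as the breaker does
      let next = node.children.get(cp);
      if (!next) {
        next = { children: new Map(), entries: [], p: 0 };
        node.children.set(cp, next);
      }
      next.p += prob; // prefix probability: sum over words sharing the prefix
      node = next;
    }
    node.entries.push(word);
  }
  const wrap = (n: Node): MiniTraversal => ({
    child: (cp) => { const c = n.children.get(cp); return c && wrap(c); },
    entries: n.entries,
    p: n.p,
  });
  return wrap(root);
}

// Walking "ab" through a toy lexicon of {ab: 0.6, ac: 0.4}:
const root = buildTrie([["ab", 0.6], ["ac", 0.4]]);
const a = root.child("a")!;  // prefix "a": p = 1.0, no entries yet
const ab = a.child("b")!;    // word "ab": p = 0.6, entries = ["ab"]
// Cost of accepting "ab" as a token, mirroring the breaker's cost formula:
const cost = -Math.log2(ab.p); // ≈ 0.737
```

A path whose traversal has non-empty `entries` is exactly what step 3c above treats as a legal word boundary.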
@@ -1,5 +1,6 @@
import placeholder from "./placeholder.js";
import ascii from "./ascii.js";
import dict from "./dict/index.js";
import default_ from "./default/index.js";

export { placeholder, ascii, default_ as default, default_ as defaultWordbreaker };
export { placeholder, ascii, default_ as default, default_ as defaultWordbreaker, dict };
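For intuition on the cost model behind the newly exported `dict` breaker: a quick sketch of why the per-character skip penalty (the same `-log2(0.0001)` formula as in dict/index.ts) loses to any reasonably probable dictionary word. The 0.001 word probability is an illustrative assumption, not taken from any real model.

```typescript
// Same formula as CHAR_SKIP_PENALTY in dict/index.ts.
const CHAR_SKIP_PENALTY = -Math.log2(0.0001); // ~13.29 bits

// Accepting a dictionary word with probability 0.001 (illustrative value)
// costs fewer bits than skipping a single unmatched character:
const wordCost = -Math.log2(0.001); // ~9.97 bits

// So when a matched word is available, its path out-scores the penalty path:
console.log(wordCost < CHAR_SKIP_PENALTY); // prints: true
```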