# TODO items
# - optimise the regexp itself
#   - if all branches of an Alternative are single-character literals, transform it into a character class { 1|2|3|4|5|6... -> [123456] }
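#
#   A minimal sketch of the rewrite above, in Python for brevity. The name alternation_to_class,
#   the list-of-strings input, and the escaping set are assumptions for illustration only, not nlex's
#   actual AST or regexp syntax.

```python
from typing import List, Optional

# Characters assumed to need escaping inside a character class (hypothetical set).
CLASS_SPECIAL = set("\\]^-")


def alternation_to_class(branches: List[str]) -> Optional[str]:
    """Return a character class equivalent to the alternation.

    Returns None when the rewrite does not apply, i.e. when the alternation
    is empty or any branch is not exactly one literal character.
    """
    if not branches or any(len(branch) != 1 for branch in branches):
        return None

    # Drop duplicate branches while keeping their first-seen order.
    unique: List[str] = []
    for ch in branches:
        if ch not in unique:
            unique.append(ch)

    body = "".join("\\" + ch if ch in CLASS_SPECIAL else ch for ch in unique)
    return "[" + body + "]"


if __name__ == "__main__":
    print(alternation_to_class(list("123456")))
    print(alternation_to_class(["ab", "c"]))
```

#   Multi-character branches are left alone here; factoring common prefixes out of them would be a separate pass.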
#
# -
#     option pos_tagger on
#     |tag pos| |every <N> token| (N = 3) |from "model.json"| (path = 'pos-tag.json') |with| |delimiter Punc{.}| (punc{.})
#     tag pos
#       # Error if pos_tagger is off
#       # model structure: https://github.com/alimpfard/citron-tp-test/blob/master/part-6/model.json
#
#     # This requires two variable-sized sentence buffers, each holding N entries of {start, length, tag, errc, modifier, *postag},
#     # where N is the size of the buffer.
#     # __nlex_root then becomes a tail call to itself until we reach the delimiter;
#     # on each call it reads a token into the working buffer and releases a token from the ready buffer.
#     # If we reach a delimiter before running out of ready buffer, we just empty the ready buffer without reading more tokens,
#     # after which we rotate the buffers, run the pos tagger (XXX IDEA could this be made faster? if the tagger is bigram, we could
#     # run it in tandem with the token generation, and only tag the last bit at the end)
#     # and continue as usual
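#
#     The buffer rotation described above could be sketched as follows. This is a Python generator, not the
#     generated __nlex_root tail call; tagged_stream, is_delimiter and tag_sentence are hypothetical names,
#     and a token is reduced to a plain string instead of the full record.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List, Tuple

Tagged = Tuple[str, str]  # (token text, POS tag); a simplified stand-in for the record


def tagged_stream(
    tokens: Iterable[str],
    is_delimiter: Callable[[str], bool],
    tag_sentence: Callable[[List[str]], List[Tagged]],
) -> Iterator[Tagged]:
    """Yield tagged tokens one sentence behind the tokens being read."""
    working: List[str] = []        # sentence currently being read
    ready: Deque[Tagged] = deque()  # previous sentence, already tagged

    for token in tokens:
        # Read one token into the working buffer, release one from the ready buffer.
        working.append(token)
        if ready:
            yield ready.popleft()

        if is_delimiter(token):
            # Empty whatever is left of the ready buffer without reading more tokens.
            while ready:
                yield ready.popleft()
            # Rotate: the finished sentence is tagged and becomes the ready buffer.
            ready = deque(tag_sentence(working))
            working = []

    # End of input: flush the ready buffer, then any unterminated sentence.
    while ready:
        yield ready.popleft()
    if working:
        yield from tag_sentence(working)


if __name__ == "__main__":
    def upper_tagger(sentence: List[str]) -> List[Tagged]:
        # Dummy tagger for the demo: the "tag" is just the upper-cased token.
        return [(tok, tok.upper()) for tok in sentence]

    for item in tagged_stream(["a", "b", ".", "c", "."], lambda t: t == ".", upper_tagger):
        print(item)
```

#     Token order is preserved because the ready buffer is always emptied before the next rotation.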
#
#
# - Figure out how to handle:
#   - store group positions and lengths and make them available to the inline code
#     - effectively makes variable-length lookbehinds possible
#     - discard_group(idx)
#       - might be difficult to splice out a certain portion of the token
#       - what would happen to the fed string?
#     - reread_group(idx) - put the given group back on the stream?
#   - tag an already tagged token if it matches a certain rule (kinda like stopword)
#     - tag word coord_conj and|or|for
#   - rule-based token operations (for common mistakes)
#     - word:a word:b -> word:ab
#     - word:a number:b -> wordnum:ba
#   - sentence boundary detection and optional impl of Viterbi with comptime generated model
#     - provide an implementation (https://github.com/alimpfard/citron-tp-test/tree/master/part-6) of HMM and Viterbi
#   - how to handle lemmatisation on different languages
#     - provide a DSL for writing lemmatisers?
#     - provide a few well-known implementations for specific languages?
#     - allow the user to provide her own implementation that generates LLVM bitcode?
#       - calling convention?
# - add compile-on-need to python wrapper
# - 'option farsi on' to appease the gods
# - Backtracking:
#   - when a branch fails and we have no tags for it, somehow flag this, go back a character, and take this flag into account
#     - whenever we are transitioning somewhere
#     - if that flag fails, we simply do not transition anymore
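#
# The rule-based token operations listed under "Figure out how to handle" (word:a word:b -> word:ab,
# word:a number:b -> wordnum:ba) could be sketched as below. Token, MergeRule, RULES and apply_rules are
# hypothetical names; reading "ba" as "right-hand text first" is an assumption about the notation.

```python
from typing import Iterable, List, NamedTuple, Optional, Sequence


class Token(NamedTuple):
    tag: str
    text: str


class MergeRule(NamedTuple):
    left: str     # tag of the first token
    right: str    # tag of the token that follows it
    result: str   # tag of the merged token
    swap: bool    # True: merged text is right + left (the "ba" case)


# The two example rules from the list above.
RULES: Sequence[MergeRule] = (
    MergeRule("word", "word", "word", swap=False),      # word:a word:b -> word:ab
    MergeRule("word", "number", "wordnum", swap=True),  # word:a number:b -> wordnum:ba
)


def try_merge(prev: Token, cur: Token, rules: Sequence[MergeRule]) -> Optional[Token]:
    """Return the merged token for the first matching rule, or None."""
    for rule in rules:
        if prev.tag == rule.left and cur.tag == rule.right:
            text = cur.text + prev.text if rule.swap else prev.text + cur.text
            return Token(rule.result, text)
    return None


def apply_rules(tokens: Iterable[Token], rules: Sequence[MergeRule] = RULES) -> List[Token]:
    """Merge adjacent tokens left to right; a merged token may merge again."""
    out: List[Token] = []
    for token in tokens:
        merged = try_merge(out[-1], token, rules) if out else None
        if merged is None:
            out.append(token)
        else:
            out[-1] = merged
    return out


if __name__ == "__main__":
    print(apply_rules([Token("word", "a"), Token("word", "b"), Token("number", "1")]))
```

# Because merging is greedy and left to right, a run of words collapses into a single word before any
# following number is considered; whether that is the desired behaviour is still an open question.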