TODO

# TODO items
# - optimise regexp itself
#   - if all branches of an Alternative are single-character literals, transform it into a character class { 1|2|3|4|5|6... -> [123456...] }
#
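#   A minimal sketch of the Alternative -> character class rewrite, in Python.
#   Assumptions: branches arrive as plain strings, and escaping inside the class
#   follows Python's `re` rules, which may differ from nlex's own regexp syntax.
#   `alternation_to_class` is a hypothetical helper name, not an existing function.

```python
import re


def alternation_to_class(branches):
    """Return a character class for single-character literal branches, else None."""
    if not branches or any(len(b) != 1 for b in branches):
        return None  # at least one branch is not a single-character literal
    # Deduplicate while keeping first-seen order.
    chars = list(dict.fromkeys(branches))
    # Escape the characters that are special inside a character class.
    escaped = "".join("\\" + c if c in "\\]^-" else c for c in chars)
    return "[" + escaped + "]"


if __name__ == "__main__":
    cls = alternation_to_class(list("123456"))
    print(cls, bool(re.fullmatch(cls, "4")))
```

#   Returning None for anything else leaves the Alternative untouched, so the
#   rewrite can be applied blindly to every Alternative node.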
# - POS tagger:
#  option pos_tagger on
#  |tag pos| |every <N> token| (N = 3) |from "model.json"| (path = 'pos-tag.json') |with| |delimiter Punc{.}| (punc{.})
#  tag pos
#  # Error if pos_tagger is off
#  # model structure: https://github.com/alimpfard/citron-tp-test/blob/master/part-6/model.json
#
#  # This requires two variable-sized buffers for sentences of {start, length, tag, errc, modifier, *postag} x N
#  #      where N is the size of the buffer
#  # and so, __nlex_root becomes a tailcall to itself until we reach the delimiter.
#  # and in doing so, it reads a token into the working buffer, and releases a token from the ready buffer
#  # if we reach a delimiter before running out of ready buffer, we just empty the ready buffer without reading more tokens
#  # after which time, we rotate the buffers, run the pos tagger (XXX IDEA could this be made faster? if the tagger is bigram, we could
#  #      run it in tandem with the token generation, and only tag the last bit at the end)
#  # and continue as usual
#
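#  # A rough Python model of the two-buffer scheme described above. This is a sketch,
#  # not the generated code: `tagged_stream`, `tagger`, `is_delimiter` and `size` are
#  # hypothetical names, and the tailcall in __nlex_root is modelled as a plain loop.

```python
from collections import deque


def tagged_stream(tokens, tagger, is_delimiter, size=3):
    """Yield tagged tokens in input order using a working and a ready buffer."""
    working = []     # tokens read but not yet tagged
    ready = deque()  # tagged tokens waiting to be released
    for tok in tokens:
        working.append(tok)        # read one token into the working buffer
        if ready:
            yield ready.popleft()  # release one token from the ready buffer
        if is_delimiter(tok) or len(working) == size:
            # Empty the ready buffer without reading more tokens.
            while ready:
                yield ready.popleft()
            # Rotate: tag the working buffer and make it the ready buffer.
            ready.extend(tagger(working))
            working = []
    # End of input: flush whatever is left in both buffers.
    yield from ready
    if working:
        yield from tagger(working)


if __name__ == "__main__":
    demo_tagger = lambda buf: [(t, "X") for t in buf]  # placeholder tagger
    for item in tagged_stream("a b . c d".split(), demo_tagger, lambda t: t == "."):
        print(item)
```

#  # The tagger only ever sees a whole working buffer, which is what makes the
#  # "tag in tandem with token generation" idea a separate optimisation.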
#
# - Figure out how to handle:
#   - store group positions and lengths and make them available to the inline code
#       - effectively makes variable-length lookbehinds possible
#       - discard_group(idx)
#           - might be difficult to splice out a certain portion of the token
#               - what would happen to the fed string?
#                   - reread_group(idx) - put the given group back on the stream?
#   - tag an already tagged token if it matches a certain rule (kinda like stopword)
#       - tag word coord_conj and|or|for
#   - rule-based token operations (for common mistakes)
#       - word:a word:b -> word:ab
#       - word:a number:b -> wordnum:ba
#   - sentence boundary detection and optional impl of Viterbi with comptime generated model
#       - provide an implementation (https://github.com/alimpfard/citron-tp-test/tree/master/part-6) of HMM and Viterbi
#   - how to handle lemmatisation on different languages
#       - provide a DSL for writing lemmatisers?
#       - provide a few well-known implementations for specific languages?
#       - allow the user to provide her own implementation that generates LLVM bitcode?
#           - calling convention?
#   - add compile-on-need to python wrapper
#   - 'option farsi on' to appease the gods
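#   One possible shape for the rule-based token operations listed above, sketched in
#   Python. Assumptions: tokens are (tag, text) pairs, rules are keyed on the pair of
#   adjacent tags, and every matching pair is merged; real rules would likely need an
#   extra condition. `RULES` and `apply_rules` are hypothetical names.

```python
# (left tag, right tag) -> (new tag, function combining the two texts)
RULES = {
    ("word", "word"): ("word", lambda a, b: a + b),       # word:a word:b -> word:ab
    ("word", "number"): ("wordnum", lambda a, b: b + a),  # word:a number:b -> wordnum:ba
}


def apply_rules(tokens, rules=RULES):
    """Merge adjacent tokens left to right wherever a rule matches their tags."""
    out = []
    for tag, text in tokens:
        if out:
            prev_tag, prev_text = out[-1]
            rule = rules.get((prev_tag, tag))
            if rule is not None:
                new_tag, combine = rule
                # Replace the previous token with the merged one.
                out[-1] = (new_tag, combine(prev_text, text))
                continue
        out.append((tag, text))
    return out


if __name__ == "__main__":
    print(apply_rules([("word", "a"), ("number", "1")]))
```

#   Because the merged token is left on the output, it can take part in the next
#   rule as the left-hand side, so chains of matches collapse in a single pass.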


# - Backtracking:
#  - when a branch fails and we have no tags for it, flag this somehow, go back a character,
#    and take the flag into account whenever we are transitioning somewhere
#  - if that flag fails, we simply do not transition anymore