Hopefully a lexer-generator that doesn't suck
Regex token rules can be defined as follows:

    rule_name :: regular_expression

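For instance, a minimal illustrative rule (the name `word` and its pattern are made up here, and this assumes character classes, ranges, and the `+` quantifier are available, as listed in the roadmap below):

    # illustrative only: one or more ASCII letters
    word :: [a-zA-Z]+
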
Literal token rules can be defined as:

    rule_name -- "some" "token" "\uc{SomeCategory}"

where `SomeCategory` is a predefined (internal) name for a category of tokens; the currently supported categories are:

| Category name | Description |
|---|---|
| RGI | the set of RGI emojis as defined in https://unicode.org/Public/emoji/ |

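For example, an illustrative literal rule matching a few fixed strings, and another matching any token in the RGI category via the `\uc{...}` escape shown above (both rule names are made up):

    # illustrative only
    greeting -- "hello" "hi" "hey"
    emoji -- "\uc{RGI}"
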
Non-captured (inlined) literal matches can be defined as follows:

    constant_name :- "literal string to match"
    # === or ===
    constant_name :- -"file/to/read/string/from.txt"

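A sketch of how an inlined constant might be used; note that referencing the constant from a regex rule with the `{{...}}` substitution syntax (listed in the roadmap below for rules) is an assumption here, not something the grammar above guarantees:

    # illustrative only; the {{greeting_word}} reference is an assumption
    greeting_word :- "hello"
    shout :: {{greeting_word}}!
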
Normalisation rules can be defined as follows:

    normalise { codepoints to normalise } to codepoint

For example,

    normalise { b c d e f } to a

means "normalise any of 'bcdef' to 'a'".
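
As a more practical illustration, and assuming codepoints can be written literally as in the example above, typographic apostrophes could be folded to the plain ASCII quote:

    # illustrative only
    normalise { ’ ‘ } to '
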
Stopwords can be specified like so:

    stopword "stop" "word" "and" "or" "whatever"
    # and with file-strings (not yet impl'd)
    stopword -"list-of-stopwords.txt"

To completely omit the result of a rule (discard its match), use the ignore statement:

    ignore [ some rules to omit ]

For example,

    ignore [space punct]

will omit all the tokens that would have matched either of the rules `space` or `punct`.
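
Putting it together, an illustrative sketch that defines whitespace and punctuation rules (assuming escapes inside character classes are available, per the roadmap below) and then discards whatever they match:

    # illustrative only
    space :: [ \t\n]+
    punct :: [.,;:!?]
    ignore [ space punct ]
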
Comments span from the character `#` to the end of the line, except inside regex rules, where you can use a comment group `(# foobar)` instead:

    # Hi, I am a comment

An option is of the form

    option <name> <value>

where `<value>` is either `on` or `off`.

The currently significant options are:

| Option | Effect | Default |
|---|---|---|
| `ignore_stopwords` | entirely removes stopwords from the output stream | off |
| `pure_normaliser` | creates a function `__nlex_pure_normalise` that returns the next normalised character in the stream | off |
| `unsafe_normaliser` | disables the length check on normalised values | off |
| `skip_on_error` | skips unmatchable characters | off |
| `capturing_groups` | enables group captures and generates the functions `nlex_get_group_{{start,end}_ptr,length}` | off |
| `log_{verbose,warning,debug}` | sets the log level (verbose < warning < debug) | (unset) |

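For instance, enabling two of the options listed above (illustrative):

    option skip_on_error on
    option ignore_stopwords on
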
The regex engine is currently very limited in what it supports; however, here is a roadmap:
- execution
- standalone execution
- state traversal codegen
- helper functions codegen
- normalisation codegen
- stopword codegen
- normalisation declarations (`normalise {abcd} to c`)
- stopword declarations (`stopwords "stop" "word"`)
- normal literal character matching
- sequence matching (/ab/ -> match a, match b)
- rule substitution (`a :: {{b}}x` -> a = match b, then match 'x')
- semi-optimised alternative matching
- basic character classes
- ranges in character classes
- character class extensions
- valid escapes in character classes (`[\p{Ll}\p{Lm}]`)
- unicode character classes (`\p{...}`)
- optimised alternative matching
- alternatives with priorities
- File strings
- Zero-width assertions
- simple quantifiers (+, *)
- medium-simple quantifiers ({x,y})
- not-simple quantifiers (?)
- insane quantifiers (??)
- rule actions
- Regex captures (start-end)
- backreferences
- recursive matching
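
As an illustration only, here is a hypothetical rule set exercising a few of the constructs above (character classes, ranges, and the `*`, `+`, and `?` quantifiers); whether each construct works depends on its current roadmap status:

    # illustrative only; names and patterns are made up
    ident   :: [a-zA-Z_][a-zA-Z0-9_]*
    integer :: -?[0-9]+
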
Read `regex_flavour` for further details on the specific flavour of regular expressions used with the generator.
To build the compiler, the following libraries are required:
- LLVM (>= 8)
- TCL (soft dependency, will be removed later)
- Intel TBB
- OpenMP + pthread
How to build:

    $ git clone https://github.com/alimpfard/nlp-lex
    $ cd nlp-lex/src
    $ mkdir build && cd build && cmake ..
    $ make

This is still in alpha stages, so a multistep procedure is used to produce binaries and libraries:

    # To create an executable (mostly for test)
    $ build/nlex -o output_object.o ../examples/test.nlex   # create an object file
    $ clang -static -lc output_object.o -o tokenise         # link it as a static executable

    # To create a shared library
    $ build/nlex -o output_object.o --relocation-model pic ../examples/test.nlex
    $ clang -shared -lc output_object.o -o libtokenise.so

To target other OSes/architectures/etc., use the appropriate `--target[-option]` and `--object-format` parameters. For instance, to create an object file for x86_64 Windows, use:

    $ build/nlex \
        --library \
        -o output_object.obj \
        --target-arch x86_64 \
        --target-sys windows \
        --object-format coff \
        ../examples/test.nlex

    # to create a dll for windows
    $ link /dll /def:output_object.def output_object.obj

Note: Generating executables for Windows is currently not supported (RTS issues).
Expand for commandline options
- `-h`: shows a description of the command-line arguments
- `-g`: generates a graph of what the lexer is supposed to do
- `-r`: dry run (only perform syntax and semantic checks)
- `-o [file]`: sets the output filename; if the target is a binary file and this option is not provided, it will be placed next to the source
- `--library`: builds a pure library (standalone, no libc dependency)
- `--target[-option] <value>`: if 'option' is not provided, sets the target triple (behaves like clang's `-target` option); otherwise, replaces that part of the native target with the provided value
- `--relocation-model <model>`: sets the relocation model; linking the output into a binary will likely require this to be 'pie'
- `--object-format <format>`: outputs a specific kind of object file (the default is ELF)
- `--emit-llvm`: if specified, outputs LLVM IR instead of an object file
- `-mcpu <cpu>`: sets the target CPU family (the default is generic)
- `--features <features>`: sets the target CPU features (no extra features are assumed to exist by default)