-
-
Notifications
You must be signed in to change notification settings - Fork 453
Make lark.lark parse the same grammar as load_grammar.py, and make grammar.md document it more fully. #1388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 6 commits
db1a5a5
9493f81
7a2880f
83a374f
fdffb5f
95c5742
200d6b5
0fb28f9
2ec5ef3
e9c026e
9bf7ddf
7f02bd1
4f7a5eb
40576d2
daac65d
5f37365
697841b
654e102
33d7088
0d01fe2
20302ca
ff01d96
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,25 +1,39 @@ | ||
| # Lark grammar of Lark's syntax | ||
| # Note: Lark is not bootstrapped, its parser is implemented in load_grammar.py | ||
| # This grammar matches that one, but does not enforce some rules that it does. | ||
RossPatterson marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| # If you want to enforce those, you can pass the "LarkValidatorVisitor" over | ||
| # the parse tree, like this: | ||
|
|
||
| # import os | ||
| # import lark | ||
| # from lark.lark_validator_visitor import LarkValidatorVisitor | ||
| # | ||
| # lark_path = os.path.join(os.path.dirname(lark.__file__), 'grammars/lark.lark') | ||
| # lark_parser = Lark.open(lark_path, parser="lalr") | ||
RossPatterson marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| # parse_tree = lark_parser.parse(my_grammar) | ||
| # LarkValidatorVisitor.validate(parse_tree) | ||
|
|
||
| start: (_item? _NL)* _item? | ||
|
|
||
| _item: rule | ||
| | token | ||
| | statement | ||
|
|
||
| rule: RULE rule_params priority? ":" expansions | ||
| token: TOKEN token_params priority? ":" expansions | ||
| rule: rule_modifiers? RULE rule_params priority? ":" expansions | ||
| token: TOKEN priority? ":" expansions | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this comment. The reason will be displayed to describe this comment to others. Learn more.
Author
There was a problem hiding this comment. Choose a reason for hiding this comment. The reason will be displayed to describe this comment to others. Learn more. Yes, but it's different for
Author
There was a problem hiding this comment. Choose a reason for hiding this comment. The reason will be displayed to describe this comment to others. Learn more. @erezsh If my comment of 2024-06-20 is acceptable, let's resolve this point.
Member
There was a problem hiding this comment. Choose a reason for hiding this comment. The reason will be displayed to describe this comment to others. Learn more. I think what I meant was that priority can already be an empty rule, so no point in making it optional. |
||
|
|
||
| rule_modifiers: RULE_MODIFIERS | ||
|
|
||
| rule_params: ["{" RULE ("," RULE)* "}"] | ||
| token_params: ["{" TOKEN ("," TOKEN)* "}"] | ||
|
|
||
| priority: "." NUMBER | ||
|
|
||
| statement: "%ignore" expansions -> ignore | ||
| | "%import" import_path ["->" name] -> import | ||
| | "%import" import_path name_list -> multi_import | ||
| | "%override" rule -> override_rule | ||
| | "%override" (rule | token) -> override | ||
| | "%declare" name+ -> declare | ||
| | "%extend" (rule | token) -> extend | ||
|
|
||
| !import_path: "."? name ("." name)* | ||
| name_list: "(" name ("," name)* ")" | ||
|
|
@@ -39,14 +53,15 @@ name_list: "(" name ("," name)* ")" | |
| ?value: STRING ".." STRING -> literal_range | ||
| | name | ||
| | (REGEXP | STRING) -> literal | ||
| | name "{" value ("," value)* "}" -> template_usage | ||
| | RULE "{" value ("," value)* "}" -> template_usage | ||
|
|
||
| name: RULE | ||
| | TOKEN | ||
|
|
||
| _VBAR: _NL? "|" | ||
| OP: /[+*]|[?](?![a-z])/ | ||
| RULE: /!?[_?]?[a-z][_a-z0-9]*/ | ||
| RULE_MODIFIERS: /(!|![?]?|[?]!?)(?=[_a-z])/ | ||
| RULE: /_?[a-z][_a-z0-9]*/ | ||
| TOKEN: /_?[A-Z][_A-Z0-9]*/ | ||
| STRING: _STRING "i"? | ||
| REGEXP: /\/(?!\/)(\\\/|\\\\|[^\/])*?\/[imslux]*/ | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,93 @@ | ||
| from .lexer import Token | ||
| from .load_grammar import GrammarError | ||
| from .visitors import Visitor | ||
| from .tree import Tree | ||
|
|
||
| class LarkValidatorVisitor(Visitor): | ||
|
|
||
| @classmethod | ||
| def validate(cls, tree: Tree): | ||
| visitor = cls() | ||
| visitor.visit(tree) | ||
| return tree | ||
|
|
||
| def alias(self, tree: Tree): | ||
| # Reject alias names in inner 'expansions'. | ||
| self._reject_aliases(tree.children[0], "Deep aliasing not allowed") | ||
|
|
||
| def ignore(self, tree: Tree): | ||
| # Reject everything except 'literal' and 'name' > 'TOKEN'. | ||
| assert len(tree.children) > 0 # The grammar should pass us some things to ignore. | ||
| if len(tree.children) > 1: | ||
| self._reject_bad_ignore() | ||
| node = tree.children[0] | ||
| if node.data == "expansions": | ||
| if len(node.children) > 1: | ||
| self._reject_bad_ignore() | ||
| node = node.children[0] | ||
| if node.data == "alias": | ||
| if len(node.children) > 1: | ||
| self._reject_bad_ignore() | ||
| node = node.children[0] | ||
| if node.data == "expansion": | ||
| if len(node.children) > 1: | ||
| self._reject_bad_ignore() | ||
| node = node.children[0] | ||
| if node.data == "expr": | ||
| if len(node.children) > 1: | ||
| self._reject_bad_ignore() | ||
| node = node.children[0] | ||
| if node.data == "atom": | ||
| if len(node.children) > 1: | ||
| self._reject_bad_ignore() | ||
| node = node.children[0] | ||
| if node.data == "literal": | ||
| return | ||
| elif node.data == "name": | ||
| if node.children[0].data == "TOKEN": | ||
| return | ||
| elif node.data == "value": | ||
| if node.children[0].data == "literal": | ||
| return | ||
| elif node.children[0].data == "name": | ||
| if node.children[0][0].data == "TOKEN": | ||
| return | ||
| self._reject_bad_ignore() | ||
|
|
||
| def token(self, tree: Tree): | ||
| assert len(tree.children) > 1 # The grammar should pass us at least a token name and an item. | ||
| first_item = 2 if tree.children[1].data == "priority" else 1 | ||
| # Reject alias names in token definitions. | ||
| for child in tree.children[first_item:]: | ||
| self._reject_aliases(child, "Aliasing not allowed in terminals (You used -> in the wrong place)") | ||
| # Reject template usage in token definitions. We do this before checking rules | ||
| # because rule usage looks like template usage, just without parameters. | ||
| for child in tree.children[first_item:]: | ||
| self._reject_templates(child, "Templates not allowed in terminals") | ||
| # Reject rule references in token definitions. | ||
| for child in tree.children[first_item:]: | ||
| self._reject_rules(child, "Rules aren't allowed inside terminals") | ||
|
|
||
| def _reject_aliases(self, item: Tree|Token, message: str): | ||
| if isinstance(item, Tree): | ||
| if item.data == "alias" and len(item.children) > 1 and item.children[1] is not None: | ||
| raise GrammarError(message) | ||
| for child in item.children: | ||
| self._reject_aliases(child, message) | ||
|
|
||
| def _reject_bad_ignore(self): | ||
| raise GrammarError("Bad %ignore - must have a Terminal or other value.") | ||
|
|
||
| def _reject_rules(self, item: Tree|Token, message: str): | ||
| if isinstance(item, Token) and item.type == "RULE": | ||
| raise GrammarError(message) | ||
| elif isinstance(item, Tree): | ||
| for child in item.children: | ||
| self._reject_rules(child, message) | ||
|
|
||
| def _reject_templates(self, item: Tree|Token, message: str): | ||
| if isinstance(item, Tree): | ||
| if item.data == "template_usage": | ||
| raise GrammarError(message) | ||
| for child in item.children: | ||
| self._reject_templates(child, message) |
Uh oh!
There was an error while loading. Please reload this page.