lark-parser · RossPatterson · Feb 1, 2024 · Feb 1, 2024 · Feb 1, 2024 · Feb 2, 2024
diff --git a/docs/grammar.md b/docs/grammar.md
@@ -59,26 +59,37 @@ Terminals are used to match text into symbols. They can be defined as a combinat
 **Syntax:**
 
 ```html
-<NAME> [. <priority>] : <literals-and-or-terminals>
+<NAME> [. <priority>] : <items-to-match>
 ```
 
-Terminal names must be uppercase.
+Terminal names must be uppercase.  They must start with an underscore (`_`) or a letter (`A` through `Z`), and may be composed of letters, underscores, and digits (`0` through `9`).  Terminal names that start with "_" will not be included in the parse tree, unless the `keep_all_tokens` option is specified.
 
 Literals can be one of:
 
-* `"string"`
-* `/regular expression+/`
-* `"case-insensitive string"i`
-* `/re with flags/imulx`
-* Literal range: `"a".."z"`, `"1".."9"`, etc.
+* Literal range: `"a".."z"`, `"1".."9"`, etc. - Each literal must be a single character, and the range represents all values between the two literals, inclusive.
 
-Terminals also support grammar operators, such as `|`, `+`, `*` and `?`.
+Each item is one of:
+
+* `TERMINAL` - Another terminal, which cannot be defined in terms of this terminal.
+* `"string literal"` - Literal, to be matched as-is.
+* `"string literal"i` - Literal, to be matched case-insensitively.
+* `/regexp literal/` - Regular expression literal.  Can include flags.
+* `"character".."character"` - Literal range.  The range represents all values between the two literals, inclusive.
+* `(item item ..)` - Group items
+* `(item | item | ..)` - Alternate items.
+* `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
+* `[item | item | ..]` - Maybe with alternates. Same as `(item | item | ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
+* `item?` - Zero or one instances of item (a "maybe")
+* `item*` - Zero or more instances of item
+* `item+` - One or more instances of item
+* `item ~ n` - Exactly *n* instances of item
+* `item ~ n..m` - Between *n* and *m* instances of item (not recommended for wide ranges, due to performance issues)
 
 Terminals are a linear construct, and therefore may not contain themselves (recursion isn't allowed).
 
 ### Templates
 
-Templates are expanded when preprocessing the grammar.
+Templates are expanded when preprocessing rules in the grammar.  Templates are not allowed with terminals.
 
 Definition syntax:
 
@@ -122,7 +133,7 @@ SIGNED_INTEGER: /
  /x
 ```
 
-Supported flags are one of: `imslux`. See Python's regex documentation for more details on each one.
+Supported flags are one of: `imslux`. See Python's [regex documentation](https://docs.python.org/3/library/re.html#regular-expression-syntax) for more details on each one.
 
 Regexps/strings of different flags can only be concatenated in Python 3.6+
 
@@ -196,25 +207,32 @@ _ambig
 
 **Syntax:**
 ```html
-<name> : <items-to-match>  [-> <alias> ]
+<modifiers><name> : <items-to-match>  [-> <alias> ]
        | ...
 ```
 
-Names of rules and aliases are always in lowercase.
+Names of rules and aliases are always in lowercase.  They must start with an underscore (`_`) or a letter (`a` through `z`), and may be composed of letters, underscores, and digits (`0` through `9`).  Rule names that start with "_" will be inlined into their containing rule.
 
 Rule definitions can be extended to the next line by using the OR operator (signified by a pipe: `|` ).
 
-An alias is a name for the specific rule alternative. It affects tree construction.
+An alias is a name for the specific rule alternative. It affects tree construction (see [Shaping the tree](tree_construction#shaping_the_tree)).
 
+The effect of a rule on the parse tree can be specified by modifiers.  The `!` modifier causes the rule to keep all its tokens, regardless of whether they are named or not.  The `?` modifier causes the rule to be inlined if it only has a single child.  The `?` modifier cannot be used on rules whose names start with an underscore.
 
 Each item is one of:
 
 * `rule`
 * `TERMINAL`
-* `"string literal"` or `/regexp literal/`
+* `"string literal"` - Literal, to be matched as-is.
+* `"string literal"i` - Literal, to be matched case-insensitively.
+* `/regexp literal/` - Regular expression literal.  Can include flags.
+* `"character".."character"` - Literal range.  The range represents all values between the two literals, inclusive.
+* `template{parameter1, parameter2, ..}` - A template to be expanded with the specified parameters.
 * `(item item ..)` - Group items
+* `(item | item | ..)` - Alternate items.  Note that the items cannot have aliases.
 * `[item item ..]` - Maybe. Same as `(item item ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.
-* `item?` - Zero or one instances of item ("maybe")
+* `[item | item | ..]` - Maybe with alternates. Same as `(item | item | ..)?`, but when `maybe_placeholders=True`, generates `None` if there is no match.  Note that the items cannot have aliases.
+* `item?` - Zero or one instances of item (a "maybe")
 * `item*` - Zero or more instances of item
 * `item+` - One or more instances of item
 * `item ~ n` - Exactly *n* instances of item
@@ -297,12 +315,24 @@ Note that `%ignore` directives cannot be imported. Imported rules will abide by
 
 Declare a terminal without defining it. Useful for plugins.
 
+**Syntax:**
+```html
+%declare <TERMINAL>
+%declare <rule>
+```
+
 ### %override
 
 Override a rule or terminals, affecting all references to it, even in imported grammars.
 
 Useful for implementing an inheritance pattern when importing grammars.
 
+**Syntax:**
+```html
+%override <TERMINAL> ... terminal definition ...
+%override <rule> ... rule definition ...
+```
+
 **Example:**
 ```perl
 %import my_grammar (start, number, NUMBER)
@@ -319,6 +349,12 @@ Useful for splitting up a definition of a complex rule with many different optio
 
 Can also be used to implement a plugin system where a core grammar is extended by others.
 
+**Syntax:**
+```html
+%extend <TERMINAL> ... additional terminal alternate ...
+%extend <rule> ... additional rule alternate ...
+```
+
 
 **Example:**
 ```perl

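The naming rules documented above can be checked mechanically. The following is a minimal sketch using only Python's `re` module; the two patterns are copied from `lark/grammars/lark.lark` as it appears in this change, and the helper names `classify` and `is_filtered_or_inlined` are hypothetical. Whether Lark's own grammar loader accepts exactly the same set of names is an assumption here.

```python
import re

# Patterns copied from lark/grammars/lark.lark in this change.
TOKEN = re.compile(r"_?[A-Z][_A-Z0-9]*")
RULE = re.compile(r"_?[a-z][_a-z0-9]*")


def classify(name: str) -> str:
    """Return 'terminal', 'rule', or 'invalid' for a candidate name (hypothetical helper)."""
    if TOKEN.fullmatch(name):
        return "terminal"
    if RULE.fullmatch(name):
        return "rule"
    return "invalid"


def is_filtered_or_inlined(name: str) -> bool:
    """True for valid names starting with '_' (filtered terminals, inlined rules)."""
    return classify(name) != "invalid" and name.startswith("_")


for candidate in ["NUMBER", "_WS", "expr", "_list", "Aa", "__X", "_1", "a1_b"]:
    print(candidate, classify(candidate), is_filtered_or_inlined(candidate))
```

Under these patterns a name may carry at most one leading underscore, and that underscore must be followed by a letter, so `__X` and `_1` are rejected even though they consist only of underscores, letters, and digits.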
diff --git a/docs/tree_construction.md b/docs/tree_construction.md
@@ -74,6 +74,7 @@ Lark will parse "((hello world))" as:
 The brackets do not appear in the tree by design. The words appear because they are matched by a named terminal.
 
 
+<a name="shaping_the_tree"></a>
 ## Shaping the tree
 
 Users can alter the automatic construction of the tree using a collection of grammar features.

diff --git a/lark/grammars/lark.lark b/lark/grammars/lark.lark
@@ -7,46 +7,66 @@ _item: rule
      | token
      | statement
 
-rule: RULE rule_params priority? ":" expansions
-token: TOKEN token_params priority? ":" expansions
+rule: RULE_MODIFIERS? RULE rule_params priority? ":" rule_expansions
+token: TOKEN priority? ":" token_expansions
 
 rule_params: ["{" RULE ("," RULE)* "}"]
-token_params: ["{" TOKEN ("," TOKEN)* "}"]
 
 priority: "." NUMBER
 
-statement: "%ignore" expansions                    -> ignore
+statement: "%ignore" ignore_token                  -> ignore
          | "%import" import_path ["->" name]       -> import
          | "%import" import_path name_list         -> multi_import
          | "%override" rule                        -> override_rule
+         | "%override" token                       -> override_token
          | "%declare" name+                        -> declare
+         | "%extend" rule                          -> extend_rule
+         | "%extend" token                         -> extend_token
+
+ignore_token: ignore_item [ OP | "~" NUMBER [".." NUMBER]]
+ignore_item: STRING | TOKEN | REGEXP
 
 !import_path: "."? name ("." name)*
 name_list: "(" name ("," name)* ")"
 
-?expansions: alias (_VBAR alias)*
+?rule_expansions: rule_alias (_VBAR rule_alias)*
+
+?rule_inner_expansions: rule_expansion (_VBAR rule_expansion)*
+
+?rule_alias: rule_expansion ["->" RULE]
+
+?rule_expansion: rule_expr*
+
+?rule_expr: rule_atom [OP | "~" NUMBER [".." NUMBER]]
+?rule_atom: "(" rule_inner_expansions ")"
+          | "[" rule_inner_expansions "]" -> rule_maybe
+          | rule_value
+
+?rule_value: RULE "{" rule_value ("," rule_value)* "}" -> rule_template_usage
+           | RULE
+           | token_value
 
-?alias: expansion ["->" RULE]
+?token_expansions: token_expansion (_VBAR token_expansion)*
 
-?expansion: expr*
+?token_expansion: token_expr*
 
-?expr: atom [OP | "~" NUMBER [".." NUMBER]]
+?token_expr: token_atom [OP | "~" NUMBER [".." NUMBER]]
 
-?atom: "(" expansions ")"
-     | "[" expansions "]" -> maybe
-     | value
+?token_atom: "(" token_expansions ")"
+           | "[" token_expansions "]" -> token_maybe
+           | token_value
 
-?value: STRING ".." STRING -> literal_range
-      | name
-      | (REGEXP | STRING) -> literal
-      | name "{" value ("," value)* "}" -> template_usage
+?token_value: STRING ".." STRING -> literal_range
+            | TOKEN
+            | (REGEXP | STRING) -> literal
 
 name: RULE
     | TOKEN
 
 _VBAR: _NL? "|"
 OP: /[+*]|[?](?![a-z])/
-RULE: /!?[_?]?[a-z][_a-z0-9]*/
+RULE: /_?[a-z][_a-z0-9]*/
+RULE_MODIFIERS: /!|![?](?=[a-z])|[?]!?(?=[a-z])/
 TOKEN: /_?[A-Z][_A-Z0-9]*/
 STRING: _STRING "i"?
 REGEXP: /\/(?!\/)(\\\/|\\\\|[^\/])*?\/[imslux]*/

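How the modifier prefix is split from the rule name depends on the order of the alternatives in `RULE_MODIFIERS`, because Python's `re` tries alternatives left to right and stops at the first one that succeeds. The sketch below uses only `re` to show what each pattern consumes at the start of a rule definition. `REORDERED` is a hypothetical variant that is not part of this diff, and whether Lark's lexer behaves exactly like a bare `re.match` is an assumption.

```python
import re

# Patterns copied from lark/grammars/lark.lark as changed in this diff.
RULE_MODIFIERS = re.compile(r"!|![?](?=[a-z])|[?]!?(?=[a-z])")
OP = re.compile(r"[+*]|[?](?![a-z])")

# Hypothetical reordering (NOT in the diff): longer alternatives first.
REORDERED = re.compile(r"![?](?=[a-z])|[?]!?(?=[a-z])|!")


def prefix(pattern, text):
    """Return the text matched at position 0, or None if there is no match."""
    m = pattern.match(text)
    return m.group(0) if m else None


for src in ["!a", "?a", "?!a", "!?a", "!?_a"]:
    print(f"{src!r}: as written -> {prefix(RULE_MODIFIERS, src)!r}, "
          f"reordered -> {prefix(REORDERED, src)!r}")

# After '!' is consumed from '!?_a', the remaining '?' is followed by '_',
# so it satisfies OP's negative lookahead and would lex as an operator.
print(prefix(OP, "?_a"))
```

With the pattern as written, `!?a` yields only `!` under `re.match`, since the first alternative already succeeds; the reordered variant yields `!?`. For `!?_a` both variants stop after `!`, which is consistent with the `OP` token reported in `test_inline_with_expand_single`.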
diff --git a/tests/test_grammar_formal.py b/tests/test_grammar_formal.py
@@ -0,0 +1,169 @@
+from __future__ import absolute_import
+
+import os
+from unittest import TestCase, main
+
+from lark import lark, Lark, UnexpectedToken
+from lark.load_grammar import GrammarError
+
+
+# Based on TestGrammar, with lots of tests that can't be run elided.
+class TestGrammarFormal(TestCase):
+    def setUp(self):
+        lark_path = os.path.join(os.path.dirname(lark.__file__), 'grammars/lark.lark')
+        # lark_path = os.path.join(os.path.dirname(lark.__file__), 'grammars/lark.lark-ORIG')
+        with open(lark_path, 'r') as f:
+            # read() keeps the file's own line endings; joining readlines() with "\n" would double them.
+            self.lark_grammar = f.read()
+
+    def test_errors(self):
+        # raise NotImplementedError("Doesn't work yet.")
+        l = Lark(self.lark_grammar, parser="lalr")
+
+        # This is an unrolled form of the test_grammar.py:GRAMMAR_ERRORS tests, because the lark.lark messages vary.
+
+        # 'Incorrect type of value', 'a: 1\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..NUMBER., .1..', l.parse, 'a: 1\n')
+        # 'Unclosed parenthesis', 'a: (\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token.._NL.,', l.parse, 'a: (\n')
+        # 'Unmatched closing parenthesis', 'a: )\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..RPAR.', l.parse, 'a: )\n')
+        # 'Unmatched closing parenthesis', 'a: )\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..RPAR.,', l.parse, 'a: )\n')
+        # 'Unmatched closing parenthesis', 'a: (\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token.._NL.,', l.parse, 'a: (\n')
+        # 'Expecting rule or terminal definition (missing colon)', 'a\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token.._NL.,', l.parse, 'a\n')
+        # 'Expecting rule or terminal definition (missing colon)', 'A\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token.._NL.,', l.parse, 'A\n')
+        # 'Expecting rule or terminal definition (missing colon)', 'a->\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..__ANON_0., .->', l.parse, 'a->\n')
+        # 'Expecting rule or terminal definition (missing colon)', 'A->\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..__ANON_0., .->', l.parse, 'A->\n')
+        # 'Expecting rule or terminal definition (missing colon)', 'a A\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..TOKEN., .A..', l.parse, 'a A\n')
+        # 'Illegal name for rules or terminals', 'Aa:\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..RULE., .a..', l.parse, 'Aa:\n')
+        # 'Alias expects lowercase name', 'a: -> "a"\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..STRING., ."a"..', l.parse, 'a: -> "a"\n')
+        # 'Unexpected colon', 'a::\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..COLON.,', l.parse, 'a::\n')
+        # 'Unexpected colon', 'a: b:\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..COLON.,', l.parse, 'a: b:\n')
+        # 'Unexpected colon', 'a: B:\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..COLON.,', l.parse, 'a: B:\n')
+        # 'Unexpected colon', 'a: "a":\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..COLON.,', l.parse, 'a: "a":\n')
+        # 'Misplaced operator', 'a: b??'
+        self.assertRaisesRegex(UnexpectedToken, r'Unexpected token Token..OP., .\?..', l.parse, 'a: b??')
+        # 'Misplaced operator', 'a: b(?)'
+        self.assertRaisesRegex(UnexpectedToken, r'Unexpected token Token..OP., .\?..', l.parse, 'a: b(?)')
+        # 'Misplaced operator', 'a:+\n'
+        self.assertRaisesRegex(UnexpectedToken, r'Unexpected token Token..OP., .\+..', l.parse, 'a:+\n')
+        # 'Misplaced operator', 'a:?\n'
+        self.assertRaisesRegex(UnexpectedToken, r'Unexpected token Token..OP., .\?..', l.parse, 'a:?\n')
+        # 'Misplaced operator', 'a:*\n'
+        self.assertRaisesRegex(UnexpectedToken, r'Unexpected token Token..OP., .\*..', l.parse, 'a:*\n')
+        # 'Misplaced operator', 'a:|*\n'
+        self.assertRaisesRegex(UnexpectedToken, r'Unexpected token Token..OP., .\*..', l.parse, 'a:|*\n')
+        # 'Expecting option ("|") or a new rule or terminal definition', 'a:a\n()\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..LPAR.,', l.parse, 'a:a\n()\n')
+        # 'Terminal names cannot contain dots', 'A.B\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..TOKEN., .B..', l.parse, 'A.B\n')
+        # 'Expecting rule or terminal definition', '"a"\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..STRING., ."a"..', l.parse, '"a"\n')
+        # '%import expects a name', '%import "a"\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..STRING., ."a"..', l.parse, '%import "a"\n')
+        # '%ignore expects a value', '%ignore %import\n'
+        self.assertRaisesRegex(UnexpectedToken, 'Unexpected token Token..__ANON_2., .%import..', l.parse, '%ignore %import\n')
+
+    # def test_empty_literal(self):
+        # raise NotImplementedError("Breaks tests/test_parser.py:_TestParser:test_backslash2().")
+
+    # def test_ignore_name(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_override_rule_1(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_override_rule_2(self):
+        # raise NotImplementedError("Can't test semantics of grammar, only syntax.")
+
+    # def test_override_rule_3(self):
+        # raise NotImplementedError("Can't test semantics of grammar, only syntax.")
+
+    # def test_override_terminal(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_extend_rule_1(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_extend_rule_2(self):
+        # raise NotImplementedError("Can't test semantics of grammar, only syntax.")
+
+    # def test_extend_term(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_extend_twice(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_undefined_ignore(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    def test_alias_in_terminal(self):
+        l = Lark(self.lark_grammar, parser="lalr")
+        g = """start: TERM
+            TERM: "a" -> alias
+            """
+        # self.assertRaisesRegex( GrammarError, "Aliasing not allowed in terminals", Lark, g)
+        self.assertRaisesRegex(UnexpectedToken, "Unexpected token Token.'__ANON_0', '->'.", l.parse, g)
+
+    # def test_undefined_rule(self):
+        # raise NotImplementedError("Can't test semantics of grammar, only syntax.")
+
+    # def test_undefined_term(self):
+        # raise NotImplementedError("Can't test semantics of grammar, only syntax.")
+
+    # def test_token_multiline_only_works_with_x_flag(self):
+        # raise NotImplementedError("Can't test regex flags in Lark grammar.")
+
+    # def test_import_custom_sources(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_import_custom_sources2(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_import_custom_sources3(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_my_find_grammar_errors(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_ranged_repeat_terms(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_ranged_repeat_large(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_large_terminal(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+    # def test_list_grammar_imports(self):
+        # raise NotImplementedError("Can't test semantics of grammar, only syntax.")
+
+    def test_inline_with_expand_single(self):
+        l = Lark(self.lark_grammar, parser="lalr")
+        grammar = r"""
+        start: _a
+        !?_a: "A"
+        """
+        # self.assertRaisesRegex(GrammarError, "Inlined rules (_rule) cannot use the ?rule modifier.", l.parse, grammar)
+        # TODO Is this really catching the right problem?
+        self.assertRaisesRegex(UnexpectedToken, r"Unexpected token Token.'OP', '\?'.", l.parse, grammar)
+
+
+    # def test_line_breaks(self):
+        # raise NotImplementedError("Can't parse using parsed grammar.")
+
+
+if __name__ == '__main__':
+    main()
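
The patterns passed to `assertRaisesRegex` are regular expressions, so operator characters such as `?`, `+`, and `*` in an expected message need escaping to be matched literally. The sketch below illustrates the difference with plain `re.search`, which is what `assertRaisesRegex` uses for matching. The two messages are hypothetical samples modelled on the patterns asserted in the tests above, not output captured from Lark.

```python
import re

# Hypothetical messages, modelled on the patterns asserted in the tests above.
MSG_QUESTION = "Unexpected token Token('OP', '?') at line 3, column 10."
MSG_PLUS = "Unexpected token Token('OP', '+') at line 3, column 10."

# Unescaped: "'?" means "an optional quote", and the trailing '.' then
# matches whatever operator character follows the opening quote.
LOOSE = "Unexpected token Token.'OP', '?'."

# Escaped: re.escape() makes every metacharacter literal.
STRICT = re.escape("Unexpected token Token('OP', '?')")


def matches(pattern, message):
    """Return True if the pattern is found anywhere in the message."""
    return re.search(pattern, message) is not None


print(matches(LOOSE, MSG_QUESTION), matches(LOOSE, MSG_PLUS))
print(matches(STRICT, MSG_QUESTION), matches(STRICT, MSG_PLUS))
```

The loose pattern accepts both messages, so it cannot distinguish a misplaced `?` from a misplaced `+`; the escaped pattern accepts only the first.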