Description
In Ohm today, we don't have a public API for accessing the concrete syntax tree (CST). When you write an operation or attribute, the arguments to each action are wrapper nodes, through which the properties of the underlying CST node are exposed.
In v18, I'm planning on doing away with the wrappers and directly exposing the CST nodes. The primary motivation is that Ohm v18 will support compiling grammars to WebAssembly (see #511), and the language-specific APIs for processing a match result should be as thin as possible.
The current CST also has some properties that make it a bit awkward to work with:
- Iteration nodes. If the grammar has repetition or optionals, and you do a naive traversal of a node's children, you'll visit nodes "out of order" with respect to where they appear in the input. For example: `start = (letter digit)+` for the input `a1b2`. The action for `start` will be passed two arguments: an `IterNode` with all the letters, and an `IterNode` with all the digits.
- Lookahead. Positive lookahead (`&`) creates a binding, which means that the same input text can be captured by two different nodes. E.g., `start = &letter any` with the input `a`. The action for `start` will be passed two arguments: one for the `letter`, and one for the `any`. At runtime, there's no easy way to know that a node comes from a lookahead.
- No easy way to access skipped spaces. There's no way to access the CST nodes that come from implicit space skipping.
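To make the iteration-node problem concrete, here is a minimal sketch that models the arguments passed to the `start` action for `start = (letter digit)+` on the input `a1b2`. The `Leaf` and `IterNode` interfaces and the helper functions are simplified stand-ins invented for this illustration, not Ohm's actual classes.

```typescript
// Simplified stand-ins for CST nodes (assumed shapes, not Ohm's real classes).
interface Leaf {
  text: string;
  startIdx: number; // offset of this node in the input
}
interface IterNode {
  children: Leaf[];
}

// What the `start` action receives today: one IterNode per "column"
// of the repetition, rather than one node per (letter digit) pair.
const letters: IterNode = {
  children: [
    {text: 'a', startIdx: 0},
    {text: 'b', startIdx: 2},
  ],
};
const digits: IterNode = {
  children: [
    {text: '1', startIdx: 1},
    {text: '2', startIdx: 3},
  ],
};

// Collect the leaves in plain argument order, as a naive traversal would.
function leaves(args: IterNode[]): Leaf[] {
  const out: Leaf[] = [];
  for (const iter of args) {
    for (const leaf of iter.children) out.push(leaf);
  }
  return out;
}

// Naive traversal: every letter is visited before any digit.
function naiveOrder(args: IterNode[]): string {
  return leaves(args).map(l => l.text).join('');
}

// Recovering input order requires an explicit sort by position.
function inputOrder(args: IterNode[]): string {
  return leaves(args)
    .sort((x, y) => x.startIdx - y.startIdx)
    .map(l => l.text)
    .join('');
}

console.log(naiveOrder([letters, digits])); // ab12
console.log(inputOrder([letters, digits])); // a1b2
```

The naive walk yields `ab12` even though the input was `a1b2`, which is the "out of order" behavior described in the first bullet.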
In v18, I'm planning to address all of these issues:
- Repetition and optionals will produce their own nodes; they will no longer be "flattened" into the nearest nonterminal. See New API for optionals (#531) for some discussion of a possible API.
- Lookahead will not produce a binding (same as negative lookahead).
- Terminal and Nonterminal nodes will have a `leadingSpaces?: NonterminalNode` property. To access the final trailing spaces, the current idea is that you could write a `_root` action, which would give you access to the top-level sequence. (When you do `g.match(someInput, 'Start')`, Ohm actually tries to match a sequence of `Start` followed by `end`. So far there has been no way to access the node associated with `end`.)
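As a rough illustration of how a `leadingSpaces` property could be used, the sketch below rebuilds the consumed input, including implicitly skipped whitespace, from a hand-built tree. Only the `leadingSpaces?: NonterminalNode` property comes from the plan above; the other fields (`type`, `text`, `ruleName`, `children`) and the `spaces` and `sourceText` helpers are assumptions made for this example, and the real v18 types may look different.

```typescript
// Hypothetical node shapes. Only `leadingSpaces` is taken from the proposal.
interface TerminalNode {
  type: 'terminal';
  text: string;
  leadingSpaces?: NonterminalNode;
}
interface NonterminalNode {
  type: 'nonterminal';
  ruleName: string;
  children: CstNode[];
  leadingSpaces?: NonterminalNode;
}
type CstNode = TerminalNode | NonterminalNode;

// Rebuild the text covered by a node, with its skipped leading spaces first.
function sourceText(node: CstNode): string {
  const prefix = node.leadingSpaces ? sourceText(node.leadingSpaces) : '';
  if (node.type === 'terminal') {
    return prefix + node.text;
  }
  let body = '';
  for (const child of node.children) {
    body += sourceText(child);
  }
  return prefix + body;
}

// Helper (hypothetical) that builds a `spaces` node holding the skipped text.
function spaces(text: string): NonterminalNode {
  return {type: 'nonterminal', ruleName: 'spaces', children: [{type: 'terminal', text}]};
}

// Hand-built tree for the input "x = 1", with the skipped spaces attached
// to the terminals that follow them.
const tree: NonterminalNode = {
  type: 'nonterminal',
  ruleName: 'Assign',
  children: [
    {type: 'terminal', text: 'x'},
    {type: 'terminal', text: '=', leadingSpaces: spaces(' ')},
    {type: 'terminal', text: '1', leadingSpaces: spaces(' ')},
  ],
};

console.log(JSON.stringify(sourceText(tree))); // "x = 1"
```

Trailing whitespace at the very end of the input has no following node to attach to in this model, which is why the proposal mentions a separate `_root` action for it.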
A few other related changes I'm planning on making:
- Introduce a notion of "pseudo-rules" or "macros". Right now, some of the built-in rules (`any`, `end`, `caseInsensitive`) produce a TerminalNode, which is a bit weird: since they look like applications, you might think they produce nonterminal nodes. It's currently also possible to override them, and that means that whenever we introduce new built-in rules, it's a breaking change (because any user grammar that already declares a rule with that name now needs to use `:=` to override the rule, rather than just declaring a rule).
  To clear up the confusion, I'm planning on making these pseudo-rules look syntactically different: `@any`, `@end`, `@caseInsensitive`, which also means that they cannot be overridden by user-written rules.
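The breaking-change argument can be sketched with a toy model of rule-name resolution. Everything here is illustrative: `checkDeclaration`, `isUserRuleName`, the error text, and the `newline` built-in are hypothetical, and the identifier pattern is a simplification of Ohm's rule-name syntax.

```typescript
type DeclOp = '=' | ':=';

// Toy check: declaring (with `=`) a name that is already a built-in is an
// error, while overriding (with `:=`) is allowed.
function checkDeclaration(builtIns: string[], name: string, op: DeclOp): string | null {
  if (builtIns.indexOf(name) !== -1 && op === '=') {
    return `duplicate declaration for rule '${name}'; use ':=' to override`;
  }
  return null;
}

// Simplified identifier syntax for user-declared rule names (assumption).
function isUserRuleName(name: string): boolean {
  return /^[A-Za-z_][A-Za-z0-9_]*$/.test(name);
}

const builtInsBefore = ['any', 'end', 'letter', 'digit'];
// Suppose a later release adds a hypothetical built-in named `newline`.
const builtInsAfter = ['any', 'end', 'letter', 'digit', 'newline'];

// A user grammar that declares `newline = "\n"` is fine before the release...
console.log(checkDeclaration(builtInsBefore, 'newline', '=') === null); // true
// ...and fails after it, unless the declaration is changed to `:=`.
console.log(checkDeclaration(builtInsAfter, 'newline', '=') === null); // false
console.log(checkDeclaration(builtInsAfter, 'newline', ':=') === null); // true

// A name like `@any` is not a valid user rule name in this model, so a
// user grammar could never declare (or collide with) a pseudo-rule.
console.log(isUserRuleName('@any')); // false
```

Under this model, new `@`-prefixed pseudo-rules can be added without touching the namespace that user grammars declare rules in.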