[v18] Changes to CST representation #532

In Ohm today, we don't have a public API for accessing the concrete syntax tree (CST). When you write an operation or attribute, the arguments to each action are _wrapper nodes_, through which the properties of the underlying CST node are exposed.

In v18, I'm planning on doing away with the wrappers and directly exposing the CST nodes. The primary motivation is that Ohm v18 will support compiling grammars to WebAssembly (see #511), and the language-specific APIs for processing a match result should be as thin as possible.

The current CST also has some properties that make it a bit awkward to work with:

1. **Iteration nodes.** If the grammar has repetition or optionals, and you do a naive traversal of a node's children, you'll visit nodes "out of order" with respect to where they appear in the input. For example, take `start = (letter digit)+` with the input `a1b2`. The action for `start` will be passed two arguments: an `IterNode` with all the letters, and an `IterNode` with all the digits.
2. **Lookahead.** Positive lookahead (`&`) creates a binding, which means that the same input text can be captured by two different nodes. E.g., `start = &letter any` with the input `a`. The action for `start` will be passed two arguments: one for the `letter`, and one for the `any`. At runtime, there's no easy way to know that a node comes from a lookahead.
3. **No easy way to access skipped spaces.** There's no way to access the CST nodes that come from implicit space skipping.
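
To make the first point concrete, here is a minimal sketch that models the current "columnar" shape by hand. It does not use the Ohm API: the `Leaf` and `Iter` types and the `naiveLeaves` helper are hypothetical stand-ins, and only the shape of the two arguments (all letters, then all digits) comes from the description above.

```typescript
// Hypothetical model of today's CST shape for `start = (letter digit)+`
// on the input "a1b2". These types are illustrative, not Ohm's classes.
type Leaf = { kind: 'leaf'; text: string; startIdx: number };
type Iter = { kind: 'iter'; children: Leaf[] };

// The action for `start` receives one iteration node per "column":
// all the letters, then all the digits.
const letters: Iter = {
  kind: 'iter',
  children: [
    { kind: 'leaf', text: 'a', startIdx: 0 },
    { kind: 'leaf', text: 'b', startIdx: 2 },
  ],
};
const digits: Iter = {
  kind: 'iter',
  children: [
    { kind: 'leaf', text: '1', startIdx: 1 },
    { kind: 'leaf', text: '2', startIdx: 3 },
  ],
};

// Naive traversal: visit each argument in turn, then each of its children.
function naiveLeaves(args: Iter[]): Leaf[] {
  const out: Leaf[] = [];
  for (const iter of args) {
    for (const leaf of iter.children) out.push(leaf);
  }
  return out;
}

const visited = naiveLeaves([letters, digits]);

// Visit order does not match input order.
const visitedText = visited.map((l) => l.text).join(''); // "ab12"

// Recovering input order takes an explicit sort on source position.
const inOrderText = visited
  .slice()
  .sort((x, y) => x.startIdx - y.startIdx)
  .map((l) => l.text)
  .join(''); // "a1b2"
```

The extra sort is the awkward part: any code that cares about input order has to know about source positions rather than just walking the tree.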

In v18, I'm planning to address all of these issues:
1. Repetition and optionals will produce their own nodes; they will no longer be "flattened" into the nearest nonterminal. See #531 for some discussion of a possible API.
2. Lookahead will not produce a binding (same as negative lookahead).
3. Terminal and Nonterminal nodes will have a `leadingSpaces?: NonterminalNode` property. To access the final trailing spaces, the current idea is that you could write a `_root` action, which would give you access to the top-level sequence. (When you do `g.match(someInput, 'Start')`, Ohm actually tries to match a sequence of `Start` followed by `end`. So far there has been no way to access the node associated with `end`.)
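
As a rough illustration of where this could lead, the sketch below models a tree in which each iteration keeps its own sequence node and skipped spaces hang off a `leadingSpaces` property. Everything here is an assumption made for illustration: the `CstNode` union, the `term`/`app` helpers, the `sourceText` function, and the choice to attach the spaces to the `letter` nonterminal are all hypothetical, and the real v18 API is still under discussion in #531.

```typescript
// Hypothetical node shapes; only the `leadingSpaces` property name is
// taken from the proposal above.
type CstNode =
  | { kind: 'terminal'; text: string; leadingSpaces?: CstNode }
  | { kind: 'nonterminal'; ruleName: string; children: CstNode[]; leadingSpaces?: CstNode }
  | { kind: 'seq' | 'iter'; children: CstNode[] };

// Small constructors to keep the example tree readable.
const term = (text: string): CstNode => ({ kind: 'terminal', text });
const app = (ruleName: string, children: CstNode[]): CstNode => ({
  kind: 'nonterminal',
  ruleName,
  children,
});

// Assumed tree for `Start = (letter digit)+` on the input "a1 b2":
// one `seq` node per iteration, with the skipped space attached to
// the second `letter` via `leadingSpaces`.
const root: CstNode = {
  kind: 'iter',
  children: [
    {
      kind: 'seq',
      children: [app('letter', [term('a')]), app('digit', [term('1')])],
    },
    {
      kind: 'seq',
      children: [
        {
          kind: 'nonterminal',
          ruleName: 'letter',
          children: [term('b')],
          leadingSpaces: app('spaces', [term(' ')]),
        },
        app('digit', [term('2')]),
      ],
    },
  ],
};

// A plain in-order walk now reproduces the input, skipped spaces included.
function sourceText(node: CstNode): string {
  let out = '';
  if ((node.kind === 'terminal' || node.kind === 'nonterminal') && node.leadingSpaces) {
    out += sourceText(node.leadingSpaces);
  }
  if (node.kind === 'terminal') return out + node.text;
  for (const child of node.children) out += sourceText(child);
  return out;
}

const reconstructed = sourceText(root); // "a1 b2"
```

With this shape, a naive depth-first traversal visits nodes in input order, and no sorting by source position is needed.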

A few other related changes I'm planning on making:
- Introduce a notion of "pseudo-rules" or "macros". Right now, some of the built-in rules (`any`, `end`, `caseInsensitive`) produce a TerminalNode, which is a bit weird — since they look like applications, you might think they produce nonterminal nodes. It's currently also possible to override them, and that means that whenever we introduce new built-in rules, it's a breaking change (because any user grammar that uses that rule now needs to use `:=` to _override_ the rule, rather than just declaring a rule).

To clear up the confusion, I'm planning on making these pseudo-rules look syntactically different: `@any`, `@end`, `@caseInsensitive`, which also means that they cannot be overridden by user-written rules.

