[v18] Changes to CST representation #532

@pdubroy

In Ohm today, we don't have a public API for accessing the concrete syntax tree (CST). When you write an operation or attribute, the arguments to each action are wrapper nodes, through which the properties of the underlying CST node are exposed.

In v18, I'm planning on doing away with the wrappers and directly exposing the CST nodes. The primary motivation is that Ohm v18 will support compiling grammars to WebAssembly (see #511), and the language-specific APIs for processing a match result should be as thin as possible.

The current CST also has some properties that make it a bit awkward to work with:

  1. Iteration nodes. If the grammar has repetition or optionals, and you do a naive traversal of a node's children, you'll visit nodes "out of order" with respect to where they appear in the input. For example: start = (letter digit)+ for the input a1b2. The action for start will be passed two arguments: an IterNode with all the letters, and an IterNode with all the digits.
  2. Lookahead. Positive lookahead (&) creates a binding, which means that the same input text can be captured by two different nodes. E.g., start = &letter any with the input a. The action for start will be passed two arguments: one for the letter, and one for the any. At runtime, there's no easy way to know that a node comes from a lookahead.
  3. No easy way to access skipped spaces. There's no way to access the CST nodes that come from implicit space skipping.
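
To make the first two points concrete, here is a rough, self-contained sketch. It does not use Ohm itself: the `Leaf` and `IterNode` interfaces and the `naiveOrder` / `inputOrder` helpers are hypothetical stand-ins, and the trees are built by hand to mirror what the issue describes for `start = (letter digit)+` on `a1b2` and `start = &letter any` on `a`.

```typescript
// Toy model of the current (pre-v18) shapes; these are not Ohm's real classes.
interface Leaf { text: string; startIdx: number; }
interface IterNode { children: Leaf[]; }

// start = (letter digit)+ on "a1b2": one IterNode per column of the repetition.
const letters: IterNode = { children: [{ text: 'a', startIdx: 0 }, { text: 'b', startIdx: 2 }] };
const digits: IterNode = { children: [{ text: '1', startIdx: 1 }, { text: '2', startIdx: 3 }] };

// Visit each argument's children in turn, as a naive traversal would.
function naiveOrder(args: IterNode[]): string {
  let out = '';
  for (const arg of args) {
    for (const leaf of arg.children) out += leaf.text;
  }
  return out;
}

// Recovering input order means collecting all leaves and sorting by position.
function inputOrder(args: IterNode[]): string {
  const leaves: Leaf[] = [];
  for (const arg of args) {
    for (const leaf of arg.children) leaves.push(leaf);
  }
  leaves.sort((x, y) => x.startIdx - y.startIdx);
  return leaves.map((l) => l.text).join('');
}

// start = &letter any on "a": both bindings cover the same character.
const lookaheadArgs: Leaf[] = [{ text: 'a', startIdx: 0 }, { text: 'a', startIdx: 0 }];
const lookaheadText = lookaheadArgs.map((l) => l.text).join('');

console.log(naiveOrder([letters, digits])); // ab12
console.log(inputOrder([letters, digits])); // a1b2
console.log(lookaheadText); // aa, although the input is just "a"
```

The extra sort in `inputOrder` is the kind of bookkeeping that every consumer of the flattened representation ends up writing for itself.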

In v18, I'm planning to address all of these issues:

  1. Repetition and optionals will produce their own nodes; they will no longer be "flattened" into the nearest nonterminal. See New API for optionals #531 for some discussion about a possible API.
  2. Lookahead will not produce a binding (same as negative lookahead).
  3. Terminal and Nonterminal nodes will have a leadingSpaces?: NonterminalNode property. To access the final trailing spaces, the current idea is that you could write a _root action, which would give you access to the top-level sequence. (When you do g.match(someInput, 'Start'), Ohm actually tries to match a sequence of Start followed by end. So far there has been no way to access the node associated with end.)
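
As an illustration only, the proposed shape might look something like the sketch below. The only name taken from the proposal is `leadingSpaces?: NonterminalNode`; the `kind`, `ruleName`, and `children` fields, the `IterationNode` layout, and the `sourceText` helper are all assumptions made for this example (the real API for iteration nodes is still under discussion in #531). For brevity, letters and digits are modelled as terminals here.

```typescript
// Hypothetical node shapes. Only `leadingSpaces` comes from the proposal.
interface TerminalNode {
  kind: 'terminal';
  text: string;
  leadingSpaces?: NonterminalNode;
}
interface NonterminalNode {
  kind: 'nonterminal';
  ruleName: string;
  children: CstNode[];
  leadingSpaces?: NonterminalNode;
}
// Assumption: an iteration keeps its children in input order.
interface IterationNode {
  kind: 'iteration';
  children: CstNode[];
}
type CstNode = TerminalNode | NonterminalNode | IterationNode;

// Rebuild the consumed input with a plain depth-first walk, skipped spaces included.
function sourceText(node: CstNode): string {
  const lead = node.kind === 'iteration' ? undefined : node.leadingSpaces;
  let out = lead ? sourceText(lead) : '';
  if (node.kind === 'terminal') return out + node.text;
  for (const child of node.children) out += sourceText(child);
  return out;
}

// Hand-built tree for Start = (letter digit)+ on the input "a1 b2".
const space: NonterminalNode = {
  kind: 'nonterminal',
  ruleName: 'spaces',
  children: [{ kind: 'terminal', text: ' ' }],
};
const tree: NonterminalNode = {
  kind: 'nonterminal',
  ruleName: 'Start',
  children: [
    {
      kind: 'iteration',
      children: [
        { kind: 'terminal', text: 'a' },
        { kind: 'terminal', text: '1' },
        { kind: 'terminal', text: 'b', leadingSpaces: space },
        { kind: 'terminal', text: '2' },
      ],
    },
  ],
};

console.log(JSON.stringify(sourceText(tree))); // "a1 b2"
```

With iteration as its own node and spaces attached to the node they precede, a naive traversal visits everything in input order and no re-sorting is needed.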

A few other related changes I'm planning on making:

  • Introduce a notion of "pseudo-rules" or "macros". Right now, some of the built-in rules (any, end, caseInsensitive) produce a TerminalNode, which is a bit weird — since they look like applications, you might think they produce nonterminal nodes. It's currently also possible to override them, and that means that whenever we introduce new built-in rules, it's a breaking change (because any user grammar that uses that rule now needs to use := to override the rule, rather than just declaring a rule).

To clear up the confusion, I'm planning on making these pseudo-rules look syntactically different: @any, @end, @caseInsensitive, which also means that they cannot be overridden by user-written rules.
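
The breaking-change problem can be shown with a toy model of rule declaration. Nothing here is Ohm's implementation: `declare`, the `IDENT` pattern for user rule names, and the new built-in called `token` are all made up for the example.

```typescript
// Assumption: user rule names are plain identifiers, so they can never start with '@'.
const IDENT = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Toy declaration check: '=' declares a new rule, ':=' overrides an existing one.
function declare(builtIns: string[], name: string, op: '=' | ':='): string {
  if (!IDENT.test(name)) return 'error: invalid rule name';
  const exists = builtIns.indexOf(name) !== -1;
  if (op === '=' && exists) return 'error: duplicate declaration';
  if (op === ':=' && !exists) return 'error: nothing to override';
  return 'ok';
}

// Today: built-ins share a namespace with user rules.
const before = ['any', 'end', 'caseInsensitive'];
const after = ['any', 'end', 'caseInsensitive', 'token']; // hypothetical new built-in

console.log(declare(before, 'token', '=')); // ok
console.log(declare(after, 'token', '='));  // error: duplicate declaration

// With pseudo-rules: the built-ins live under '@' names that users cannot declare.
const pseudo = ['@any', '@end', '@caseInsensitive', '@token'];

console.log(declare(pseudo, 'token', '='));  // ok
console.log(declare(pseudo, '@any', ':=')); // error: invalid rule name
```

In the first pair of calls, the same user grammar stops working once the hypothetical built-in is added. In the second pair, new pseudo-rules can be added freely because no user-declared name can collide with them.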
