- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.7k
Parsing Expression Grammars
jq contains all the ingredients required for conveniently writing PEGs and PEG parsers. In effect, one can write a PEG parser for a PEG grammar simply by transcribing the grammar into jq, though some care with respect to jq's "declare before use" rule may be required.
Other tasks, such as evaluation of arithmetic expressions defined by a PEG grammar, can be achieved with either no or only minimal changes to the transcribed grammar. This is illustrated in the second example below.
jq also has the machinery required to add parameterized non-terminals, as illustrated in Example 3 below.
For an introduction to Parsing Expression Grammars, see for example https://en.wikipedia.org/wiki/Parsing_expression_grammar
- Transcribing a PEG grammar to jq
- Example 0: Balanced Square Brackets
- Example 1: a^n b^n c^n : n >= 1
- Example 2: arithmetic
- Example 3: [a-z]+ing
- Example 4: CSV-to-JSON
- Appendix - Helper Functions
Transcription to jq is possible because each of the seven basic[^1] PEG operations can be mapped directly to jq filters with the same semantics. Here is the mapping that will be used in this article:
 Sequence: e1 e2             e1 | e2
 Ordered choice: e1 / e2     e1 // e2
 Zero-or-more: e*            star(E)
 One-or-more: e+             plus(E)
 Optional: e?                optional(E)
 And-predicate: &e           amp(E) 
 Not-predicate: !e           neg(E)
where:
def star(E): (E | star(E)) // .;
def plus(E): E | (plus(E) // . );
def optional(E): E // .;
def amp(E): . as $in | E | $in;
def neg(E): select( [E] == [] );
A string consisting of nothing but square brackets ("[" and "]") may be said to be "balanced" if
the repeated application of gsub("[[][]]"; "") yields the empty string.  To obtain a jq program
that checks whether or not a string of square brackets is balanced can be directly derived from a literal
transcription of a PEG grammar for such strings as follows.
The definitions of the generic eos (end-of-string) and consume filters are given in the Appendix.
# a single expression
def balancedExpression: 
  consume("[[]") | optional(balancedExpression) | consume("[]]") ;
# The empty string is balanced:
def balanced:
  star(balancedExpression) | eos;
def isBalanced:
  {remainder: .} | (balanced | true) // false;
Example:
("", "[][]", "[[[[]", "]]][[[" )
| isBalanced
Following the article on PEGs,
the PEG for the string expressions a^n b^n c^n where n >= 1 can be
written as follows:
 S ← &(A 'c') 'a'+ B !.
 A ← 'a' A? 'b'
 B ← 'b' B? 'c'
This is a nice example as it uses five of the seven basic PEG
operations, including the two look-ahead operations & and !. In
addition, it uses '.' to signify that the input is nonempty.
The PEG can be expressed in jq by simple transcription:
def A: literal("a") | optional(A) | literal("b");
def B: literal("b") | optional(B) | literal("c");
def S: amp(A | literal("c")) | plus(literal("a")) | B | neg(nonempty);
assuming that literal and nonempty have been suitably defined.
Their implementations will depend on the task at hand.
In the following we consider two specific tasks: recognition, and tokenization.
For recognition of an input string, the following would suffice:
def literal($s): select(startswith($s)) | .[$s|length :];
def nonempty: select(length > 0);
If we want the output to be true or false depending on whether
the input string conforms to the grammar, we could write:
(S | true) // false
Example:
Assuming all the above jq code has been assembled into a file, parse.jq:
   $ jq -f parse.jq <<< abc
   true
Let's suppose we wish to build up an array of the recognized terminals.
In that case, we need only modify the functions that consume or
examine the string input, i.e. literal and nonempty.
For convenience, on the jq side, we'll pass a JSON object
of the form {"remainder": _, "result": _} through the filter.
The definitions of literal and nonempty thus become:
def literal($s):
  select(.remainder | startswith($s))
  | .result += [$s]
  | .remainder |= .[$s | length :] ;
def nonempty: select( (.remainder | length) > 0 );
And for convenience, we could define:
def parse: {remainder: .} | S | .result;
So, with parse as our main program, a run would now look like this:
jq -R -cf abc.jq <<< abc
["a","b","c"]
In jq, a filter can only be invoked after it has been declared. This means that when transcribing PEG rules into jq filters, the order is important. To handle circularly-defined PEG expressions, one must take advantage of jq's support for subfunctions, that is, defs within defs. This is illustrated in the next example, also taken from the Wikipedia article.
As presented in the aforementioned Wikipedia article, a PEG for recognizing simple arithmetic expressions for addition, subtraction, multiplication and division applied to non-negative integers is as follows:
Expr    ← Sum
Value   ← [0-9]+ / '(' Expr ')'
Product ← Value (('*' / '/') Value)*
Sum     ← Product (('+' / '-') Product)*
In this case, for the reason mentioned in the previous section, care must be taken with transcription. Since the main goal is in any case to define Expr, it makes sense to define all the other PEG expressions as subfunctions of Expr:
# Expr    ← Sum
def Expr:
  # Value   ← [0-9]+ / '(' Expr ')'
  def Value: parseNumber( "[0-9]+" ) // (literal("(") | Expr | literal(")"));
   
  # Product ← Value (('*' / '/') Value)*
  def Product: (Value | star( (literal("*") // literal("/")) | Value));
  # Sum     ← Product (('+' / '-') Product)*
  def Sum: (Product | star( (literal("+") // literal("-")) | Product));
  Sum ;
With literal as previously defined for a recognition engine, and parseNumber as defined in the Appendix, we thus have a recognition engine.
In this section, we'll illustrate how the transcribed PEG can be adapted to evaluate the recognized arithmetic expressions.
The strategy will be to pass along an object {"remainder":, "result":} as with
the previous example, except that when a parenthesized subexpression is encountered, the result will
be augmented by an array.  Thus the def of Value above will be modified to:
def Value: parseNumber( "[0-9]+" ) // (consume("[(]") | box(Expr) | consume("[)["));
where consume/1 simply consumes the string matching the regex input,
and box/1 records Expr in an array.  These helper functions,
together with an eval function for evaluating the parsed result, are
given in the Appendix below.
We can then define a jq function to evalute a string representing strings defined by the PEG as follows:
def evaluate:
  {remainder: .}
  | Expr.result
  | eval ;
Example:
   $ jq -f arithmetic.jq <<< "(1+2)*(3+4)+(1*1)"
   22
Because a PEG parser does not backtrack, the task of determining the first match of the regex [a-z]+ing in a string cannot be accomplished simply by writing a PEG rule such as:
plus(parse("[a-z]")) | literal("ing")
because the first expression will gobble up all the "i-n-g" characters, resulting in failure.
An appropriate PEG grammar for the task at hand could be achieved in jq as follows:
def atoz: parse("[a-z]");
def ing:  literal("ing");
plus(neg(ing) | atoz) | ing
Better yet, we could define a parameterized non-terminal to capture the pattern for reusability:
def Aplus_B(A; B): plus(neg(B) | A) | B;
So the PEG grammar now looks like the corresponding regular expression:
Aplus_B( atoz; ing )
Here is a PEG-based program for converting a string of comma-separated values to a JSON array of the corresponding fields.
This program can be used as simple CSV-to-TSV translator by adding a call to @tsv.  The definitions of the
helper functions (parse and consume) are given in the Appendix.
# White space
def ws: consume(" *");
def quoted_field_content:
  parse("((\"\")|([^\"]))*")
  | .result[-1] |= gsub("\"\""; "\"");
def unquoted_field: parse("[^,\"]*");
def quoted_field: consume("\"") | quoted_field_content | consume("\"");
def field: (ws | quoted_field | ws) // unquoted_field;
def record: field | star(consume(",") | field);
def csv2json:
  {remainder: .} | record | .result;
# end of string
def eos:
  select((.remainder|length == 0));
def eval:
  if type == "array" then
    if length == 0 then null
    else .[-1] |= eval
    | if length == 1 then .[0]
      else (.[:-2] | eval) as $v
      | if   .[-2] == "*" then $v * .[-1]
        elif .[-2] == "/" then $v / .[-1]
        elif .[-2] == "+" then $v + .[-1] 
        elif .[-2] == "-" then $v - .[-1] 
        else tostring|error
	end
      end
    end
  else .
  end ;
def box(E):
   ((.result = null) | E) as $e
   | .remainder = $e.remainder
   | .result += [$e.result]  # the magic sauce
   ;
# Consume a regular expression rooted at the start of .remainder, or emit empty;
# on success, update .remainder and set .match but do NOT update .result
def consume($re):
  # on failure, match yields empty
  (.remainder | match("^" + $re)) as $match
  | .remainder |= .[$match.length :]
  | .match = $match.string ;
def parse($re):
  consume($re)
  | .result = .result + [.match] ;
def parseNumber($re):
  consume($re)
  | .result = .result + [.match|tonumber] ;
[^1]: Note that some of these can be written in terms of the others, e.g. E+ is just EE*, so "basic" is not meant in the sense of "fundamental".
- Home
- FAQ
- jq Language Description
- Cookbook
- Modules
- Parsing Expression Grammars
- Docs for Oniguruma Regular Expressions (RE.txt)
- Advanced Topics
- Guide for Contributors
- How To
- C API
- jq Internals
- Tips
- Development