Add JSON Lines support #152

@dloss

The JSON Lines text format (aka JSONL or newline-delimited JSON) has one JSON value per line, most commonly an object. It's often used for structured log files or as a well-specified alternative to CSV.

Here are some ideas for how the JSON Lines format could be supported in GoAWK. To be honest, I'm not completely sure this is a good idea, but I've found it interesting to think about. This write-up captures some of my thoughts.

I can imagine different levels of sophistication. We could start simple and then in later versions support more complex input data and ways to interact with it.

One JSON array of scalars per line

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, null]
["Deloise", "2012A", 19, true] 

Suggestions:

  • Add a jsonl input mode.
  • Columns could be parsed to $1, $2, $3, ...
  • Error handling like with CSV

Questions:

  • How to handle JSON booleans (true/false) and null?
  • Does Unicode cause some problems?
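To make the questions above more concrete, here is a rough sketch in Go (the language GoAWK is written in) of how one array-of-scalars line could be turned into string fields. The function name parseArrayLine and the mapping of true/false to "1"/"0" and null to the empty string are my assumptions, not anything GoAWK does today. On the Unicode question: encoding/json decodes \u escapes into UTF-8 strings, so the values themselves should arrive intact.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseArrayLine decodes one JSON Lines record that is an array of scalars
// and returns the elements as AWK-style string fields.
// Assumed mapping (hypothetical): true -> "1", false -> "0", null -> "".
func parseArrayLine(line string) ([]string, error) {
	dec := json.NewDecoder(strings.NewReader(line))
	dec.UseNumber() // keep numbers as written, so 24 stays "24"

	var elems []interface{}
	if err := dec.Decode(&elems); err != nil {
		return nil, err
	}
	fields := make([]string, len(elems))
	for i, e := range elems {
		switch v := e.(type) {
		case string:
			fields[i] = v
		case json.Number:
			fields[i] = v.String()
		case bool:
			if v {
				fields[i] = "1"
			} else {
				fields[i] = "0"
			}
		case nil:
			fields[i] = ""
		default:
			// Nested arrays and objects are out of scope at this level.
			return nil, fmt.Errorf("element %d is not a scalar", i+1)
		}
	}
	return fields, nil
}

func main() {
	fields, err := parseArrayLine(`["May", "2012B", 14, null]`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, f := range fields {
		fmt.Printf("$%d = %q\n", i+1, f)
	}
}
```

With this mapping a null field is indistinguishable from an empty string, which may or may not be acceptable.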

One JSON object per line, with pairs of keys and scalar values

This is used by the Graylog Extended Log Format (GELF).

{"version":"1.1", "host":"example.org", "short_message": "A log message", "facility":"test", "_foo":"bar"}
{"version":"1.1", "host":"test.example.org", "short_message": "Another msg", "facility":"test", "_foo":"baz"}

Users wanting to parse Logfmt messages (like myself, see #149) should be able to convert their data into this format quite easily.

Suggestions:

  • Re-use existing named-field syntax to get the fields (e.g. @"short_message")
  • Update FIELDS array for each line. Don't expect all lines to have the same number or order of fields.
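As a rough illustration of the per-line FIELDS idea, here is a Go sketch that reads one flat object and keeps the keys in the order they appear on that line. It uses the token-level API of encoding/json because decoding into a Go map would lose key order. The function name parseObjectLine and the bool/null mapping are my assumptions. The returned names slice stands in for what FIELDS would hold, and the map for what @"name" would look up.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// parseObjectLine decodes one flat JSON object and returns its keys in
// input order plus a key -> string value map.
// Assumed mapping (hypothetical): true -> "1", false -> "0", null -> "".
func parseObjectLine(line string) ([]string, map[string]string, error) {
	dec := json.NewDecoder(strings.NewReader(line))
	dec.UseNumber() // keep numbers as written

	tok, err := dec.Token()
	if err != nil {
		return nil, nil, err
	}
	if d, ok := tok.(json.Delim); !ok || d != '{' {
		return nil, nil, fmt.Errorf("expected a JSON object")
	}

	var names []string
	values := make(map[string]string)
	for dec.More() {
		keyTok, err := dec.Token()
		if err != nil {
			return nil, nil, err
		}
		key, ok := keyTok.(string)
		if !ok {
			return nil, nil, fmt.Errorf("expected a string key")
		}
		valTok, err := dec.Token()
		if err != nil {
			return nil, nil, err
		}
		switch v := valTok.(type) {
		case string:
			values[key] = v
		case json.Number:
			values[key] = v.String()
		case bool:
			if v {
				values[key] = "1"
			} else {
				values[key] = "0"
			}
		case nil:
			values[key] = ""
		default:
			// A json.Delim here means a nested array or object.
			return nil, nil, fmt.Errorf("field %q is not a scalar", key)
		}
		names = append(names, key)
	}
	// Consume the closing brace so truncated lines are reported.
	if _, err := dec.Token(); err != nil {
		return nil, nil, err
	}
	return names, values, nil
}

func main() {
	line := `{"version":"1.1", "host":"example.org", "short_message": "A log message"}`
	names, values, err := parseObjectLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, name := range names {
		fmt.Printf("%s = %q\n", name, values[name])
	}
}
```

Since names is rebuilt for every line, lines with different numbers or orders of fields need no special handling.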

Nested data

{"one": 1, "four": [1,2,3,4], "five": {"alpha": ["fo", "fum"], "beta": {"hey": "How's tricks?"}}}
{"one": 1, "four": [4], "five": {"alpha": ["fa", "fim"], "beta": {"hi": "How's tracks?"}}}

Suggestions:

  • I guess we want to keep the syntax simple and not support something sophisticated like jsonpath or jmespath syntax to extract fields.
  • Maybe just return nested data as JSON strings: @"four" -> "[1,2,3,4]"
  • Enhance named-field syntax with dots and square brackets to get subfields and array elements: @"five.alpha[1]" returns "fum", @"five.beta" returns "{"hey": "How's tricks?"}" (quoting issue, see below).
  • Add a function to map JSON array elements to AWK fields, e.g. getjsonarr("five.alpha"). Now $1 is fo, $2 is fum.
  • Maybe add a function to set a new root for field extraction, e.g. setjsonroot("five.beta"); print @"hey" returns How's tricks?
  • Use gron's collection of JSON testdata.

Questions:

  • How to escape double quotes in returned JSON strings?
  • How to map the first element of a JSON array to an AWK field? JSON arrays are 0-based, AWK fields are 1-based.
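To get a feel for the dotted-path idea, here is a minimal Go sketch of a lookup that accepts paths like five.alpha[1]. The helper names (lookup, fieldString, getField) are hypothetical. It assumes 0-based array indexes, as in the five.alpha[1] example, returns scalars as plain strings, and returns nested arrays and objects as compact JSON produced by json.Marshal. Keys that themselves contain a dot or bracket cannot be addressed with this simple syntax.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// lookup walks a decoded JSON value along a path such as "five.alpha[1]".
// Array indexes are 0-based here, matching the example in this issue.
func lookup(root interface{}, path string) (interface{}, error) {
	cur := root
	for _, part := range strings.Split(path, ".") {
		name, rest := part, ""
		if i := strings.IndexByte(part, '['); i >= 0 {
			name, rest = part[:i], part[i:]
		}
		if name != "" {
			obj, ok := cur.(map[string]interface{})
			if !ok {
				return nil, fmt.Errorf("%q: not an object", name)
			}
			if cur, ok = obj[name]; !ok {
				return nil, fmt.Errorf("%q: no such key", name)
			}
		}
		// Apply each trailing [n] in turn, e.g. "alpha[1]".
		for rest != "" {
			end := strings.IndexByte(rest, ']')
			if rest[0] != '[' || end < 0 {
				return nil, fmt.Errorf("%q: malformed index", part)
			}
			n, err := strconv.Atoi(rest[1:end])
			if err != nil {
				return nil, fmt.Errorf("%q: malformed index", part)
			}
			arr, ok := cur.([]interface{})
			if !ok || n < 0 || n >= len(arr) {
				return nil, fmt.Errorf("%q: index out of range or not an array", part)
			}
			cur, rest = arr[n], rest[end+1:]
		}
	}
	return cur, nil
}

// fieldString renders a value as an AWK string: scalars as plain text,
// arrays and objects as compact JSON (assumed mapping, as above).
func fieldString(v interface{}) (string, error) {
	switch x := v.(type) {
	case string:
		return x, nil
	case json.Number:
		return x.String(), nil
	case bool:
		if x {
			return "1", nil
		}
		return "0", nil
	case nil:
		return "", nil
	}
	b, err := json.Marshal(v)
	return string(b), err
}

// getField decodes one line and extracts the value at path.
func getField(line, path string) (string, error) {
	dec := json.NewDecoder(strings.NewReader(line))
	dec.UseNumber() // keep numbers as written
	var root interface{}
	if err := dec.Decode(&root); err != nil {
		return "", err
	}
	v, err := lookup(root, path)
	if err != nil {
		return "", err
	}
	return fieldString(v)
}

func main() {
	line := `{"one": 1, "four": [1,2,3,4], "five": {"alpha": ["fo", "fum"], "beta": {"hey": "How's tricks?"}}}`
	for _, path := range []string{"four", "five.alpha[1]", "five.beta"} {
		s, err := getField(line, path)
		fmt.Println(path, "->", s, err)
	}
}
```

In this sketch the quoting question mostly goes away: the returned value is an ordinary string whose content happens to be JSON, so its inner double quotes need no escaping until the string is embedded in another JSON document. Note that json.Marshal writes object keys in sorted order, so a re-serialized object may not match the key order of the input line.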
