The JSON Lines text format (aka JSONL or newline-delimited JSON) has one JSON value per line, typically an object or an array. It's often used for structured log files or as a well-specified alternative to CSV.
Here are some ideas for how the JSON Lines format could be supported in GoAWK. To be honest, I'm not completely sure this is a good idea, but I've found it interesting to think about. This write-up captures some of my thoughts.
I can imagine different levels of sophistication. We could start simple and then in later versions support more complex input data and ways to interact with it.
One JSON array of scalars per line
["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, null]
["Deloise", "2012A", 19, true]
Suggestions:
- Add a `jsonl` input mode.
- Columns could be parsed to $1, $2, $3, ...
- Error handling like with CSV
Questions:
- How to handle JSON booleans (true/false) and null?
- Does Unicode cause any problems?
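To make the questions above more concrete, here is a minimal sketch in Python of how one array line could be split into 1-based fields. It only illustrates possible semantics and is not how GoAWK works today. The function names `split_jsonl_array` and `to_awk_string` are hypothetical, and the mapping of `true` to `1`, `false` to `0`, and `null` to the empty string is just one assumed answer to the boolean/null question.

```python
import json


def to_awk_string(value):
    """Convert one JSON scalar to the string an AWK field would hold."""
    # Assumed mapping, not decided in this proposal:
    # true -> "1", false -> "0", null -> "" (like an empty CSV field).
    if value is None:
        return ""
    if isinstance(value, bool):  # check before numbers: bool is an int subclass
        return "1" if value else "0"
    if isinstance(value, (dict, list)):
        raise ValueError("nested values are not scalars")
    return str(value)


def split_jsonl_array(line):
    """Parse one line holding a JSON array of scalars.

    Returns a dict keyed by 1-based field number, mimicking $1, $2, ...
    """
    record = json.loads(line)  # raises ValueError on malformed JSON
    if not isinstance(record, list):
        raise ValueError("expected a JSON array")
    return {i: to_awk_string(v) for i, v in enumerate(record, start=1)}


fields = split_jsonl_array('["May", "2012B", 14, null]')
print(fields[1], fields[3], repr(fields[4]))
```

Malformed lines raise an error here, which a real implementation could report the same way CSV mode reports bad input. Unicode escapes such as `\u00e9` are decoded by the JSON parser, so the fields hold the actual characters.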
One JSON object per line, with pairs of keys and scalar values
This is used by the Graylog Extended Log Format (GELF).
{"version":"1.1", "host":"example.org", "short_message": "A log message", "facility":"test", "_foo":"bar"}
{"version":"1.1", "host":"test.example.org", "short_message": "Another msg", "facility":"test", "_foo":"baz"}
Users wanting to parse Logfmt messages (like myself, see #149) should be able to convert their data into this format quite easily.
Suggestions:
- Re-use existing named-field syntax to get the fields (e.g. `@"short_message"`).
- Update the FIELDS array for each line. Don't expect all lines to have the same number or order of fields.
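The per-line FIELDS update could look roughly like the following Python sketch. It is only an illustration under assumptions: `read_record` is a hypothetical helper, the first returned dict stands in for the FIELDS array (1-based position to field name), and the second stands in for named-field lookup.

```python
import json


def read_record(line):
    """Parse one JSON object line into (field_names, values).

    field_names maps 1-based positions to keys, in the order they appear
    on this line; values maps each key to its parsed value.
    """
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("expected a JSON object")
    field_names = {i: name for i, name in enumerate(record, start=1)}
    return field_names, record


lines = [
    '{"version":"1.1", "host":"example.org", "short_message": "A log message"}',
    '{"host":"test.example.org", "_foo":"baz"}',
]
for line in lines:
    names, values = read_record(line)
    # A field missing on this line reads as the empty string,
    # like an unset AWK variable.
    print(len(names), names[1], values.get("short_message", ""))
```

Because the names are rebuilt for every line, records with different keys or key order do not interfere with each other.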
Nested data
{"one": 1, "four": [1,2,3,4], "five": {"alpha": ["fo", "fum"], "beta": {"hey": "How's tricks?"}}}
{"one": 1, "four": [4], "five": {"alpha": ["fa", "fim"], "beta": {"hi": "How's tracks?"}}}
Suggestions:
- I guess we want to keep the syntax simple and not support something sophisticated like JSONPath or JMESPath syntax to extract fields.
- Maybe just return nested data as JSON strings: `@"four"` -> `"[1,2,3,4]"`
- Enhance named-field syntax with dots and square brackets to get subfields and array elements: `@"five.alpha[1]"` returns `"fum"`, `@"five.beta"` returns `"{"hey": "How's tricks?"}"` (quoting issue, see below).
- Add a function to map JSON array elements to AWK fields, e.g. `getjsonarr("five.alpha")`. Now $1 is `fo`, $2 is `fum`.
- Maybe add a function to set a new root for field extraction, e.g. `setjsonroot("five.beta"); print @"hey"`, which prints `How's tricks?`.
- Use gron's collection of JSON testdata.
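The dotted-path suggestion above can be sketched in Python as follows. This is an assumption-laden illustration, not a proposed implementation: `lookup` is a hypothetical name, bracket indexes are treated as 0-based (matching the `five.alpha[1]` example), nested results come back as compact JSON text, and keys that themselves contain dots or brackets cannot be addressed.

```python
import json
import re

# One path step: either a key (no dots or brackets) or a [index].
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def lookup(record, path):
    """Resolve a path like 'five.alpha[1]' against a parsed JSON value."""
    current = record
    pos = 0
    while pos < len(path):
        if path[pos] == ".":  # separator between steps
            pos += 1
            continue
        match = _TOKEN.match(path, pos)
        if match is None:
            raise ValueError("bad path: " + path)
        key, index = match.groups()
        if key is not None:
            current = current[key]  # KeyError if the key is missing
        else:
            current = current[int(index)]  # 0-based, as in the example
        pos = match.end()
    # Strings are returned as-is; everything else as compact JSON text.
    if isinstance(current, str):
        return current
    return json.dumps(current, separators=(",", ":"))


line = '{"one": 1, "four": [1,2,3,4], "five": {"alpha": ["fo", "fum"]}}'
record = json.loads(line)
print(lookup(record, "five.alpha[1]"))
print(lookup(record, "four"))
```

A `setjsonroot`-style function could reuse the same walk and simply keep the resolved value as the starting point for later lookups.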
Questions:
- How to escape double quotes in returned JSON strings?
- How to map the first element of a JSON array to an AWK field? JSON arrays are 0-based, AWK fields are 1-based.