feat: optimized json parsing #118
Conversation
Compare 4dfc123 to ea165ed
    }
    decoder.flush()
};
std::iter::from_fn(move || next().transpose()).collect::<Result<Vec<_>, _>>()
Given that this is directly the return value, is it possible that collect could Just Work?
Suggested change:
-std::iter::from_fn(move || next().transpose()).collect::<Result<Vec<_>, _>>()
+std::iter::from_fn(move || next().transpose()).collect()
(it seems to work in the playground, at least)
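For illustration, a self-contained toy example (made-up function and error type, not the PR's code) of why the turbofish is unnecessary when the collected value is returned directly: the function's return type drives the FromIterator inference, including collecting an iterator of Result items into a Result of Vec.

```rust
// Toy example: `collect()` infers `Result<Vec<i32>, String>` from the return type.
fn parse_all(inputs: &[&str]) -> Result<Vec<i32>, String> {
    inputs
        .iter()
        .map(|s| s.parse::<i32>().map_err(|e| e.to_string()))
        // No turbofish needed: the return type tells `collect` what to build,
        // and the first Err short-circuits the collection.
        .collect()
}

fn main() {
    assert_eq!(parse_all(&["1", "2"]), Ok(vec![1, 2]));
    assert!(parse_all(&["1", "oops"]).is_err());
}
```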
            mark = it;
        }
    }
    batches.extend(emit_run(json_strings.len() - mark)?);
Maybe worth a comment to explain: There is always a run in progress, because the loop only emits the previous run after it detects that a new run has started. Corollary: If the batch only contains one run, this will be the only emit_run call.
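A self-contained toy version of that loop shape, for readers following along (emit_runs, rows, and the Option-based null test are stand-ins invented here, not the PR's code):

```rust
// A "run" is a maximal span of values that are all Some(_) or all None.
// Returns (run_len, run_is_null) pairs in order.
fn emit_runs(rows: &[Option<&str>]) -> Vec<(usize, bool)> {
    let mut runs = Vec::new();
    if rows.is_empty() {
        return runs;
    }
    let mut mark = 0;                      // start index of the run in progress
    let mut run_is_null = rows[0].is_none();
    for index in 1..rows.len() {
        let is_null = rows[index].is_none();
        if is_null != run_is_null {
            // A new run starts at `index`; emit the previous run [mark, index).
            runs.push((index - mark, run_is_null));
            run_is_null = is_null;
            mark = index;
        }
    }
    // There is always a run still in progress here, because the loop only
    // emits a run after it sees the start of the next one. If the whole
    // batch is a single run, this is the only emit.
    runs.push((rows.len() - mark, run_is_null));
    runs
}

fn main() {
    let rows = [Some("{}"), Some("{}"), None, None, Some("{}")];
    assert_eq!(emit_runs(&rows), vec![(2, false), (2, true), (1, false)]);
}
```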
    }
};

for it in 1..json_strings.len() {
nit: It's not an iterator...
Suggested change:
-for it in 1..json_strings.len() {
+for index in 1..json_strings.len() {
let result_schema = output_schema.clone();
let mut emit_run = move |run_len: usize| {
    if run_is_null {
        get_nulls(run_len, output_schema.clone())
This is the only call site for get_nulls. Should we delete the function and inline the logic here? Especially because new_null_array calls ArrayData::new_null, which is nearly identical to this code. So we can probably simplify it down to just:

Suggested change:
-get_nulls(run_len, output_schema.clone())
+let nulls = ArrayData::new_null(schema.into(), run_len);
+Ok(vec![StructArray::from(nulls).into()])
(that's a combination of SchemaRef into DataType, and StructArray from ArrayData into RecordBatch)
Erm, I totally misread the From trait chains involving Schema and DataType... let me try that again:

Suggested change:
-get_nulls(run_len, output_schema.clone())
+let nulls = ArrayData::new_null(DataType::Struct(schema.fields), run_len);
+Ok(vec![StructArray::from(nulls).into()])
https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Struct
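For illustration, a self-contained sketch of inlining the null-run idea by building one null array per column instead of a single struct-level ArrayData (which sidesteps the nullable-StructArray-to-RecordBatch conversion entirely); null_run and the numRecords field are made-up names, and it assumes the arrow umbrella crate:

```rust
use std::sync::Arc;

use arrow::array::{new_null_array, Array, ArrayRef};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Illustrative stand-in: a run of `run_len` rows in which every column is
// null, shaped like `output_schema`.
fn null_run(output_schema: SchemaRef, run_len: usize) -> Result<RecordBatch, ArrowError> {
    let columns: Vec<ArrayRef> = output_schema
        .fields()
        .iter()
        .map(|field| new_null_array(field.data_type(), run_len))
        .collect();
    RecordBatch::try_new(output_schema, columns)
}

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "numRecords",
        DataType::Int64,
        true,
    )]));
    let batch = null_run(schema, 3)?;
    assert_eq!(batch.num_rows(), 3);
    assert_eq!(batch.column(0).null_count(), 3);
    Ok(())
}
```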
let batch = read_from_json(get_reader(slice.value_data()), &mut decoder).unwrap();
Ok(batch)
Can we replace that unwrap call with a map_err to produce a compatible error type?
read_from_json(get_reader(slice.value_data()), &mut decoder).map_err(|e| {
...
})
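For reference, a toy example of that map_err shape (the Error enum and the read_from_json stand-in here are hypothetical, not the crate's real types):

```rust
use arrow::error::ArrowError;

// Hypothetical crate error type, just to show the conversion shape.
#[derive(Debug)]
enum Error {
    Arrow(String),
}

// Toy stand-in for read_from_json: anything returning an ArrowError works.
fn read_from_json() -> Result<u32, ArrowError> {
    Err(ArrowError::JsonError("malformed record".to_string()))
}

fn parse() -> Result<u32, Error> {
    // map_err converts the ArrowError into the caller's error type, so the
    // Result can be returned directly (or chained with `?`).
    read_from_json().map_err(|e| Error::Arrow(e.to_string()))
}

fn main() {
    assert!(matches!(parse(), Err(Error::Arrow(_))));
}
```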
Two things we need to figure out before we can trust json parsing in runs like this (as opposed to parsing each row's json record individually):

- If the JSON is truly invalid (malformed, type conflict, etc), is that a hard error that fails the query? Or would it be preferable to treat it as NULL (e.g. disabling data skipping for a file after failing to parse its invalid add.stats)?
  - We don't know how often invalid stats occur in the wild, because spark JSON parsing is "permissive" (= errors become nulls) unless specifically configured with the mode=FAILFAST option (see the Databricks docs, because the Spark docs are woefully incomplete here). Delta spark currently relies on the defaults when parsing stats.
  - Note: The Delta spec doesn't say anything about how clients should handle invalid stats.
- Because stats are a JSON object literal embedded in the add.stats string field, there are multiple ways to encode missing stats:
  - Column just plain missing: { "add": { } }
  - JSON null: { "add": { "stats": null } }
  - JSON string literal for an empty object: { "add": { "stats": "{}" } }
  - JSON string literal for a JSON null (scary case): { "add": { "stats": "null" } }
    - A run of these would cause parsing failure due to invalid JSON such as nullnullnull
  - Empty string (scary and arguably illegal case): { "add": { "stats": "" } }
    - The parser would find no object at all, and would consume the next object in the buffer instead, producing both wrong offsets and the wrong number of stats.
    - I don't know if this "should" count as NULL or error, but either way it would mess up the run-based JSON parsing proposed here (see the sketch after this list).
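To make the empty-string concern concrete, here is a minimal self-contained sketch (not the PR's code) that concatenates a run of stats strings and feeds them to arrow-json's streaming Decoder, the same family of API the PR relies on; it assumes the decoder accepts back-to-back object literals, which the run approach depends on. The empty string in the middle contributes no row, so three files yield a two-row batch and row alignment is silently lost:

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::json::ReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "numRecords",
        DataType::Int64,
        true,
    )]));

    // Three files' worth of add.stats values; the second one is the
    // "empty string" case from the list above.
    let stats = ["{\"numRecords\": 1}", "", "{\"numRecords\": 3}"];

    // Concatenate the run and decode it in one pass, roughly how a
    // run-based approach would feed the string buffer to the decoder.
    let buffer = stats.concat();
    let mut decoder = ReaderBuilder::new(schema).build_decoder()?;
    decoder.decode(buffer.as_bytes())?;
    let batch = decoder.flush()?.expect("decoded at least one row");

    // Three input files, but only two decoded rows: the empty string
    // produced nothing, so every row after it maps to the wrong file.
    assert_eq!(stats.len(), 3);
    assert_eq!(batch.num_rows(), 2);
    Ok(())
}
```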
(adding as a file comment because github has really poor handling of top-level PR comments)
Picking up the discussion where we left off around json parsing.
This PR tries to leverage the optimization in Arrow json parsing at the cost of having more complex code that is a bit harder to reason about.