Support JSON array reading/parsing for DataFusion #19924
base: main
Conversation
Pull request overview
This PR extends DataFusion’s JSON support to handle files where the top-level value is a JSON array ([{...}, {...}]), in addition to the existing newline-delimited JSON (NDJSON) format.
Changes:
- Adds a `format_array: bool` option to `JsonOptions` (config, protobuf, and JSON (de)serialization) and wires it through `JsonFormat`/`JsonSource` into the JSON execution path.
- Implements array-aware schema inference and reading in `datasource-json`, including helper functions to infer schemas and read array JSON into `RecordBatch`es, plus updates to examples and tests (unit tests and sqllogictests).
- Adds new test data (`json_array.json`, `json_empty_array.json`) and sqllogictests to validate array format behavior and the new `OPTIONS ('format.format_array' 'true')` flag (see the usage sketch after this list).
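For illustration, a minimal sketch of how the new option could be used programmatically. The `with_format_array` builder is the one described in the file summary below; the listing-table wiring and file path are assumptions about usage, not code from the PR:

```rust
use std::sync::Arc;
use datafusion::datasource::file_format::json::JsonFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Treat each file as a single top-level JSON array instead of NDJSON
    // (`with_format_array` is the builder added by this PR).
    let format = JsonFormat::default().with_format_array(true);
    let options = ListingOptions::new(Arc::new(format)).with_file_extension(".json");

    // Register the array-format JSON file (path is illustrative) and query it.
    ctx.register_listing_table("t", "tests/data/json_array.json", options, None, None)
        .await?;
    ctx.sql("SELECT * FROM t").await?.show().await?;
    Ok(())
}
```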
Reviewed changes
Copilot reviewed 13 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `datafusion/sqllogictest/test_files/json.slt` | Adds end-to-end SQL tests for JSON array input, including a failure case when `format_array` is not set. |
| `datafusion/proto/src/logical_plan/file_formats.rs` | Extends the `JsonOptionsProto` ↔ `JsonOptions` mapping to include the `format_array` flag for logical plan file format serialization. |
| `datafusion/proto/src/generated/datafusion_proto_common.rs` | Regenerated protobuf Rust bindings to add `format_array` to `JsonOptions`. |
| `datafusion/proto-common/src/to_proto/mod.rs` | Includes `format_array` when converting `JsonOptions` to protobuf for the common proto utilities. |
| `datafusion/proto-common/src/generated/prost.rs` | Regenerated prost definitions to add the `format_array` field to `JsonOptions`. |
| `datafusion/proto-common/src/generated/pbjson.rs` | Extends JSON (de)serialization of `JsonOptions` to handle the `format_array` field. |
| `datafusion/proto-common/src/from_proto/mod.rs` | Maps the protobuf `format_array` flag back into `JsonOptions`. |
| `datafusion/proto-common/proto/datafusion_common.proto` | Adds `bool format_array = 4;` to the `JsonOptions` message definition. |
| `datafusion/datasource-json/src/source.rs` | Threads `format_array` through `JsonOpener`/`JsonSource` and adds `read_json_array_to_batches`; array files are read by loading the full array, converting to NDJSON, and delegating to Arrow's JSON reader. |
| `datafusion/datasource-json/src/file_format.rs` | Updates `JsonFormat` docs and behavior to support array mode, adds `with_format_array`, implements `infer_json_schema_from_json_array`, and passes `format_array` into `JsonSource`. |
| `datafusion/datasource-json/Cargo.toml` | Adds `serde_json` as a dependency to support JSON array parsing. |
| `datafusion/core/tests/data/json_empty_array.json` | Provides an empty JSON array test file used by schema inference tests. |
| `datafusion/core/tests/data/json_array.json` | Provides a sample JSON array file used by sqllogictests. |
| `datafusion/core/src/datasource/file_format/json.rs` | Adds tests covering array-format JSON: schema inference, empty array behavior, inference limit, data reading, and projection handling. |
| `datafusion/common/src/config.rs` | Extends the `JsonOptions` config namespace with a documented `format_array` flag and describes NDJSON vs. array formats. |
| `datafusion-examples/examples/custom_data_source/csv_json_opener.rs` | Updates the custom `JsonOpener::new` example to pass the new `format_array` parameter (set to `false` for NDJSON). |
| `Cargo.lock` | Updates the lockfile to account for the new `serde_json` dependency in `datasource-json`. |
alamb left a comment:
Thank you @zhuqi-lucas -- I think this code looks good and well tested
I think we should reconsider:
- The name of the option
- Performance
My reading of the code is that this PR will parse an input JSON array 3 times (once for schema inference, once to convert to NDJSON, and once during the actual parsing).
I personally recommend we look into avoiding this overhead.
You might be able to re-use the same approach as the arrow-rs parser if you can figure out how to trim the input to remove the first `[` and last `]` rather than having an entirely different code path.
It might be better to add such skipping directly in the arrow reader.
That all being said, I think it would be ok to merge this code as-is and file a ticket to improve it as a follow-on.
| /// {"key1": 2, "key2": "vals"} | ||
| /// ] | ||
| /// ``` | ||
| pub format_array: bool, default = false |
I think `format_array` will be hard to discover / find and we should call this parameter something more standard.
I looked at what other systems did and there is no consistency.
I reviewed Spark's docs and they seem to use `multiLine = true` for what you have labelled `format_array`:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
DuckDB seems to call it `format = newline_delimited`: https://duckdb.org/docs/stable/data/json/loading_json#parameters
Postgres seems to have two separate functions, `row_to_json` and `array_to_json`:
https://www.postgresql.org/docs/9.5/functions-json.html
I think I prefer the DuckDB-style `newline_delimited` option, though maybe Spark's `multiLine` would be more widely understood.
IMO it would be better to use an enum here, e.g. `JSON_FORMAT {NDJSON, ARRAY}`.
It would be clearer than true/false and also easier to extend with a third, fourth, ... format later.
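For illustration, a minimal sketch of what such an enum-style option could look like; the `JsonFileFormat` name, variants, and parser below are hypothetical and not part of this PR:

```rust
/// Hypothetical replacement for the boolean `format_array` flag, as suggested
/// above: an enum is self-describing and can grow more variants later.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum JsonFileFormat {
    /// One JSON object per line (the current default behavior).
    #[default]
    Ndjson,
    /// A single top-level JSON array of objects: `[{...}, {...}]`.
    Array,
}

/// Hypothetical parser for a config value such as "ndjson" or "array".
fn parse_json_format(s: &str) -> Option<JsonFileFormat> {
    match s.to_ascii_lowercase().as_str() {
        "ndjson" | "newline_delimited" => Some(JsonFileFormat::Ndjson),
        "array" => Some(JsonFileFormat::Array),
        _ => None,
    }
}
```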
```rust
let session = SessionContext::new();
let ctx = session.state();
let store = Arc::new(LocalFileSystem::new()) as _;

// Create a temporary file with JSON array format
let tmp_dir = tempfile::TempDir::new()?;
let path = format!("{}/array.json", tmp_dir.path().to_string_lossy());
std::fs::write(
    &path,
    r#"[
    {"a": 1, "b": 2.0, "c": true},
    {"a": 2, "b": 3.5, "c": false},
    {"a": 3, "b": 4.0, "c": true}
]"#,
)?;
```
I think this standard preamble could be reduced so there are fewer test lines (and thus it is easier to verify what is being tested).
For example, it looks like you could maybe make a function like `let file_schema = create_json_with_format("{..}", format);` -- I bet the tests would be less than half the size.
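For illustration, a sketch of the kind of helper this comment suggests; the name `create_json_with_format` comes from the comment, but the signature and body here are assumptions:

```rust
/// Writes `content` to a temporary JSON file and returns the temp dir (kept
/// alive so the file is not deleted) plus the file path. Sketch only: the
/// real helper would also build a `JsonFormat`, apply
/// `with_format_array(format_array)`, and infer the file schema.
fn create_json_with_format(
    content: &str,
    format_array: bool,
) -> std::io::Result<(tempfile::TempDir, String)> {
    let tmp_dir = tempfile::TempDir::new()?;
    let path = format!("{}/data.json", tmp_dir.path().to_string_lossy());
    std::fs::write(&path, content)?;
    let _ = format_array; // schema inference elided in this sketch
    Ok((tmp_dir, path))
}
```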
````rust
///
/// ## Line-Delimited JSON (default)
/// ```text
/// {"key1": 1, "key2": "val"}
````
Just to confirm, is this a change in default behavior? I don't think so but I wanted to double check
```rust
})?;

// Parse as JSON array using serde_json
let values: Vec<serde_json::Value> = serde_json::from_str(&content)
```
this is likely to be super slow -- it parses the entire JSON file (and then throws the parsed results away) -- if there is some way to avoid the whole thing it is probably better (maybe as a follow on PR)
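For context, a simplified sketch of the conversion path this comment refers to (per the file summary above, the array is fully parsed with serde_json, re-serialized as NDJSON, and then parsed again by Arrow's JSON reader); this is illustrative, not the PR's exact code:

```rust
/// Illustrative only: shows why the array path effectively parses the input twice.
fn array_json_to_ndjson(content: &str) -> serde_json::Result<String> {
    // First parse: the whole file is materialized as serde_json values.
    let values: Vec<serde_json::Value> = serde_json::from_str(content)?;
    // Each element is re-serialized as one NDJSON line; Arrow's JSON reader
    // then parses the result a second time to build RecordBatches.
    let lines: Vec<String> = values.iter().map(|v| v.to_string()).collect();
    Ok(lines.join("\n"))
}
```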
Thank you @alamb for the review, good suggestions. I will address the comments and redesign this PR to improve performance, and also rework the option name.
Which issue does this PR close?
Closes #19920
Rationale for this change
DataFusion currently only supports the line-delimited JSON (NDJSON) format. Many data sources provide JSON in array format (`[{...}, {...}]`), which cannot be parsed by the existing implementation.
What changes are included in this PR?
- Adds a `format_array` option to `JsonOptions` to support the JSON array format
- Supports the `OPTIONS ('format.format_array' 'true')` flag in SQL

Are these changes tested?
Yes: unit tests covering array-format schema inference, empty arrays, the inference limit, reading, and projection in `datafusion/core/src/datasource/file_format/json.rs`, plus new sqllogictests in `json.slt` (including a failure case when `format_array` is not set).
Are there any user-facing changes?
Yes. Users can now read JSON array format files by specifying the `format.format_array` option, as in the sketch below.
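A minimal usage sketch from Rust; the table name and file path are illustrative, and the `OPTIONS ('format.format_array' 'true')` flag is the one exercised in the new sqllogictests:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register an array-format JSON file via SQL using the new option
    // (table name and LOCATION are illustrative).
    ctx.sql(
        "CREATE EXTERNAL TABLE json_array_table \
         STORED AS JSON \
         LOCATION 'tests/data/json_array.json' \
         OPTIONS ('format.format_array' 'true')",
    )
    .await?;

    ctx.sql("SELECT * FROM json_array_table").await?.show().await?;
    Ok(())
}
```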