
Conversation

@zhuqi-lucas
Contributor

Which issue does this PR close?

Closes #19920

Rationale for this change

DataFusion currently only supports line-delimited JSON (NDJSON) format. Many data sources provide JSON in array format [{...}, {...}], which cannot be parsed by the existing implementation.
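As a minimal illustration of the distinction (not code from this PR), the two layouts can be told apart by their leading character; real detection would also need to handle a UTF-8 BOM and streamed input:

```rust
/// Illustrative only: a cheap check for whether a JSON payload is in
/// array form ([{...}, {...}]) or newline-delimited (NDJSON) form.
fn looks_like_json_array(input: &str) -> bool {
    input.trim_start().starts_with('[')
}

fn main() {
    // Array format: one top-level JSON array containing the records.
    assert!(looks_like_json_array("[{\"a\": 1}, {\"a\": 2}]"));
    // NDJSON: one JSON object per line, no surrounding brackets.
    assert!(!looks_like_json_array("{\"a\": 1}\n{\"a\": 2}\n"));
    println!("ok");
}
```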

What changes are included in this PR?

  • Add format_array option to JsonOptions to support JSON array format
  • Implement schema inference for JSON array format
  • Implement data reading for JSON array format
  • Add protobuf serialization support for the new option
  • SQL support via OPTIONS ('format.format_array' 'true')

Are these changes tested?

Yes:

  • Unit tests for schema inference (normal, empty array, with limit)
  • Unit tests for data reading and projection
  • SQLLogicTest for end-to-end SQL validation

Are there any user-facing changes?

Yes. Users can now read JSON array format files by specifying the format.format_array option:

CREATE EXTERNAL TABLE my_table
STORED AS JSON
OPTIONS ('format.format_array' 'true')
LOCATION 'path/to/array.json';

Copilot AI review requested due to automatic review settings January 21, 2026 08:40
@github-actions bot added labels on Jan 21, 2026: core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), common (Related to common crate), proto (Related to proto crate), datasource (Changes to the datasource crate)
Contributor

Copilot AI left a comment


Pull request overview

This PR extends DataFusion’s JSON support to handle files where the top-level value is a JSON array ([{...}, {...}]), in addition to the existing newline-delimited JSON (NDJSON) format.

Changes:

  • Adds a format_array: bool option to JsonOptions (config, protobuf, and JSON (de)serialization) and wires it through JsonFormat/JsonSource into the JSON execution path.
  • Implements array-aware schema inference and reading in datasource-json, including helper functions to infer schemas and read array JSON into RecordBatches, plus updates examples and tests (unit tests and sqllogictests).
  • Adds new test data (json_array.json, json_empty_array.json) and SQLLogic tests to validate array format behavior and the new OPTIONS ('format.format_array' 'true') flag.

Reviewed changes

Copilot reviewed 13 out of 17 changed files in this pull request and generated 3 comments.

datafusion/sqllogictest/test_files/json.slt: Adds end-to-end SQL tests for JSON array input, including a failure case when format_array is not set.
datafusion/proto/src/logical_plan/file_formats.rs: Extends the JsonOptionsProto <-> JsonOptions mapping to include the format_array flag for logical plan file format serialization.
datafusion/proto/src/generated/datafusion_proto_common.rs: Regenerated protobuf Rust bindings to add format_array to JsonOptions.
datafusion/proto-common/src/to_proto/mod.rs: Includes format_array when converting JsonOptions to protobuf for common proto utilities.
datafusion/proto-common/src/generated/prost.rs: Regenerated prost definitions to add the format_array field to JsonOptions.
datafusion/proto-common/src/generated/pbjson.rs: Extends JSON (de)serialization of JsonOptions to handle the format_array field.
datafusion/proto-common/src/from_proto/mod.rs: Maps the protobuf format_array flag back into JsonOptions.
datafusion/proto-common/proto/datafusion_common.proto: Adds bool format_array = 4; to the JsonOptions message definition.
datafusion/datasource-json/src/source.rs: Threads format_array through JsonOpener/JsonSource and adds read_json_array_to_batches; array files are read by loading the full array, converting to NDJSON, and delegating to Arrow's JSON reader.
datafusion/datasource-json/src/file_format.rs: Updates JsonFormat docs and behavior to support array mode, adds with_format_array, implements infer_json_schema_from_json_array, and passes format_array into JsonSource.
datafusion/datasource-json/Cargo.toml: Adds serde_json as a dependency to support JSON array parsing.
datafusion/core/tests/data/json_empty_array.json: Provides an empty JSON array test file used by schema inference tests.
datafusion/core/tests/data/json_array.json: Provides a sample JSON array file used by sqllogictests.
datafusion/core/src/datasource/file_format/json.rs: Adds tests covering array-format JSON: schema inference, empty array behavior, inference limit, data reading, and projection handling.
datafusion/common/src/config.rs: Extends the JsonOptions config namespace with a documented format_array flag and describes the NDJSON and array formats.
datafusion-examples/examples/custom_data_source/csv_json_opener.rs: Updates the custom JsonOpener::new example to pass the new format_array parameter (set to false for NDJSON).
Cargo.lock: Updates the lockfile to account for the new serde_json dependency in datasource-json.


Contributor

@alamb alamb left a comment


Thank you @zhuqi-lucas -- I think this code looks good and well tested

I think we should reconsider:

  1. The name of the option
  2. Performance

My reading of the code is that this PR will parse an input JSON array 3 times (once for schema inference, once to convert to NDJSON, and then once during the actual parsing).

I personally recommend we look into avoiding this overhead

You might be able to re-use the same approach as the arrow-rs parser if you can figure out how to trim the input to remove the first [ and last ] rather than having an entirely different code path

It might be better to add such skipping directly in the arrow reader

That all being said, I think it would be ok to merge this code in as is and file a ticket to improve it as a follow on
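The bracket-trimming idea above could be sketched with a std-only helper (names are hypothetical, not part of this PR or of arrow-rs); note that a line-oriented reader would still need to cope with the commas left between objects:

```rust
/// Hypothetical sketch of the trimming idea from the review: strip the
/// outer `[` and `]` of an array-format JSON payload so the body could
/// be handed to a reader that expects a stream of objects. Real input
/// handling (chunked reads, arrays appearing as field values) needs
/// more care than this.
fn trim_array_brackets(input: &str) -> Option<&str> {
    let trimmed = input.trim();
    let body = trimmed.strip_prefix('[')?.strip_suffix(']')?;
    Some(body.trim())
}

fn main() {
    let array = r#"[
{"a": 1, "b": 2.0},
{"a": 2, "b": 3.5}
]"#;
    // Prints the two objects with the brackets removed; the separating
    // comma after the first object is still present.
    println!("{}", trim_array_brackets(array).unwrap());
}
```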

/// {"key1": 2, "key2": "vals"}
/// ]
/// ```
pub format_array: bool, default = false
Contributor


I think format_array will be hard to discover / find and we should call this parameter something more standard.

I looked at what other systems did and there is no consistency.

I reviewed Spark's docs and they seem to use `multiLine = true` for what you have labelled `format_array`:
https://spark.apache.org/docs/latest/sql-data-sources-json.html

DuckDB seems to call it `format = 'newline_delimited'`: https://duckdb.org/docs/stable/data/json/loading_json#parameters

postgres seems to have two separate functions row_to_json and array_to_json
https://www.postgresql.org/docs/9.5/functions-json.html

I think I prefer the DuckDB-style `newline_delimited` option, though maybe Spark's `multiLine` would be more widely understood

Member


IMO it would be better to use an enum here, e.g. JSON_FORMAT {NDJSON, ARRAY}.
It would be clearer than true/false and also easier to extend with a third, fourth, ... format later
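The enum-valued option suggested above could look roughly like this (the name `JsonFileLayout` and its variants are illustrative, not the API added by this PR):

```rust
use std::str::FromStr;

/// Hypothetical sketch of an enum-valued option in place of the
/// boolean `format_array` flag suggested in review.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum JsonFileLayout {
    /// Newline-delimited JSON (the current default).
    #[default]
    Ndjson,
    /// A single top-level array: [{...}, {...}]
    Array,
}

impl FromStr for JsonFileLayout {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Accepting the DuckDB-style spelling as an alias is one way to
        // keep the option discoverable across ecosystems.
        match s.to_ascii_lowercase().as_str() {
            "ndjson" | "newline_delimited" => Ok(Self::Ndjson),
            "array" => Ok(Self::Array),
            other => Err(format!("unknown JSON format: {other}")),
        }
    }
}

fn main() {
    assert_eq!("array".parse::<JsonFileLayout>(), Ok(JsonFileLayout::Array));
    // The default stays NDJSON, so existing behavior is unchanged.
    assert_eq!(JsonFileLayout::default(), JsonFileLayout::Ndjson);
    println!("ok");
}
```

A third variant (say, for concatenated JSON) could then be added without changing the option's type.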

Comment on lines +398 to +413
let session = SessionContext::new();
let ctx = session.state();
let store = Arc::new(LocalFileSystem::new()) as _;

// Create a temporary file with JSON array format
let tmp_dir = tempfile::TempDir::new()?;
let path = format!("{}/array.json", tmp_dir.path().to_string_lossy());
std::fs::write(
    &path,
    r#"[
{"a": 1, "b": 2.0, "c": true},
{"a": 2, "b": 3.5, "c": false},
{"a": 3, "b": 4.0, "c": true}
]"#,
)?;

Contributor


I think this standard preamble could be reduced so there were fewer test lines (and thus it was easier to verify what was being tested)

For example, it looks like you maybe could make a function like

let file_schema = create_json_with_format("{..}", format);

I bet the tests would be less than half the size
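A helper along those lines could be sketched as follows; `write_json_fixture` is a hypothetical name, and a real version would likely return the inferred schema as well, as the review suggests:

```rust
use std::io::Write;
use std::path::PathBuf;

/// Hypothetical test helper: write the JSON payload to a temp file and
/// return its path, so each test states only the data it cares about
/// instead of repeating the session/store/tempdir preamble.
fn write_json_fixture(name: &str, content: &str) -> std::io::Result<PathBuf> {
    let path = std::env::temp_dir().join(name);
    let mut file = std::fs::File::create(&path)?;
    file.write_all(content.as_bytes())?;
    Ok(path)
}

fn main() -> std::io::Result<()> {
    let path = write_json_fixture(
        "df_array_fixture.json",
        r#"[{"a": 1, "b": 2.0, "c": true}]"#,
    )?;
    // Each test would now hand `path` to the schema-inference or read
    // code under test; here we just echo the fixture back.
    println!("{}", std::fs::read_to_string(&path)?);
    std::fs::remove_file(&path)?;
    Ok(())
}
```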

///
/// ## Line-Delimited JSON (default)
/// ```text
/// {"key1": 1, "key2": "val"}
Contributor


Just to confirm, is this a change in default behavior? I don't think so but I wanted to double check

})?;

// Parse as JSON array using serde_json
let values: Vec<serde_json::Value> = serde_json::from_str(&content)
Contributor


this is likely to be super slow -- it parses the entire JSON file (and then throws the parsed results away) -- if there is some way to avoid the whole thing it is probably better (maybe as a follow on PR)

@zhuqi-lucas
Contributor Author


Thank you @alamb for the review, good suggestions. I will address the comments and redesign this PR to improve performance, and also rework the option name.


Development

Successfully merging this pull request may close these issues.

Support JSON arrays reader/parse for datafusion
