Support JSON array reading/parsing for DataFusion #19924
base: main
Conversation
Pull request overview
This PR extends DataFusion’s JSON support to handle files where the top-level value is a JSON array ([{...}, {...}]), in addition to the existing newline-delimited JSON (NDJSON) format.
Changes:
- Adds a `format_array: bool` option to `JsonOptions` (config, protobuf, and JSON (de)serialization) and wires it through `JsonFormat`/`JsonSource` into the JSON execution path.
- Implements array-aware schema inference and reading in `datasource-json`, including helper functions to infer schemas and read array JSON into `RecordBatch`es, plus updates to examples and tests (unit tests and sqllogictests).
- Adds new test data (`json_array.json`, `json_empty_array.json`) and sqllogictests to validate array format behavior and the new `OPTIONS ('format.format_array' 'true')` flag (see the usage sketch after this list).
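For illustration, a minimal sketch of how the new option could be used programmatically. The `with_format_array` builder is the one described in the file summary below; the listing-table wiring and file path are assumptions about usage, not code from the PR:

```rust
use std::sync::Arc;
use datafusion::datasource::file_format::json::JsonFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Treat each file as a single top-level JSON array instead of NDJSON
    // (`with_format_array` is the builder added by this PR).
    let format = JsonFormat::default().with_format_array(true);
    let options = ListingOptions::new(Arc::new(format)).with_file_extension(".json");

    // Register the array-format JSON file (path is illustrative) and query it.
    ctx.register_listing_table("t", "tests/data/json_array.json", options, None, None)
        .await?;
    ctx.sql("SELECT * FROM t").await?.show().await?;
    Ok(())
}
```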
Reviewed changes
Copilot reviewed 13 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `datafusion/sqllogictest/test_files/json.slt` | Adds end-to-end SQL tests for JSON array input, including a failure case when `format_array` is not set. |
| `datafusion/proto/src/logical_plan/file_formats.rs` | Extends the `JsonOptionsProto` ↔ `JsonOptions` mapping to include the `format_array` flag for logical plan file format serialization. |
| `datafusion/proto/src/generated/datafusion_proto_common.rs` | Regenerated protobuf Rust bindings to add `format_array` to `JsonOptions`. |
| `datafusion/proto-common/src/to_proto/mod.rs` | Includes `format_array` when converting `JsonOptions` to protobuf for the common proto utilities. |
| `datafusion/proto-common/src/generated/prost.rs` | Regenerated prost definitions to add the `format_array` field to `JsonOptions`. |
| `datafusion/proto-common/src/generated/pbjson.rs` | Extends JSON (de)serialization of `JsonOptions` to handle the `format_array` field. |
| `datafusion/proto-common/src/from_proto/mod.rs` | Maps the protobuf `format_array` flag back into `JsonOptions`. |
| `datafusion/proto-common/proto/datafusion_common.proto` | Adds `bool format_array = 4;` to the `JsonOptions` message definition. |
| `datafusion/datasource-json/src/source.rs` | Threads `format_array` through `JsonOpener`/`JsonSource` and adds `read_json_array_to_batches`; array files are read by loading the full array, converting to NDJSON, and delegating to Arrow's JSON reader. |
| `datafusion/datasource-json/src/file_format.rs` | Updates `JsonFormat` docs and behavior to support array mode, adds `with_format_array`, implements `infer_json_schema_from_json_array`, and passes `format_array` into `JsonSource`. |
| `datafusion/datasource-json/Cargo.toml` | Adds `serde_json` as a dependency to support JSON array parsing. |
| `datafusion/core/tests/data/json_empty_array.json` | Provides an empty JSON array test file used by schema inference tests. |
| `datafusion/core/tests/data/json_array.json` | Provides a sample JSON array file used by sqllogictests. |
| `datafusion/core/src/datasource/file_format/json.rs` | Adds tests covering array-format JSON: schema inference, empty array behavior, inference limit, data reading, and projection handling. |
| `datafusion/common/src/config.rs` | Extends the `JsonOptions` config namespace with a documented `format_array` flag and describes NDJSON vs. array formats. |
| `datafusion-examples/examples/custom_data_source/csv_json_opener.rs` | Updates the custom `JsonOpener::new` example to pass the new `format_array` parameter (set to `false` for NDJSON). |
| `Cargo.lock` | Updates the lockfile to account for the new `serde_json` dependency in `datasource-json`. |
alamb left a comment:
Thank you @zhuqi-lucas -- I think this code looks good and well tested
I think we should reconsider:
- The name of the option
- Performance
My reading of the code is that this PR will parse an input JSON array 3 times (once for schema inference, once to convert to NDJSON, and once during the actual parsing).
I personally recommend we look into avoiding this overhead.
You might be able to re-use the same approach as the arrow-rs parser if you can figure out how to trim the input to remove the first `[` and last `]` rather than having an entirely different code path.
It might be better to add such skipping directly in the arrow reader.
That all being said, I think it would be ok to merge this code as-is and file a ticket to improve it as a follow-on.
| /// {"key1": 2, "key2": "vals"} | ||
| /// ] | ||
| /// ``` | ||
| pub format_array: bool, default = false |
I think `format_array` will be hard to discover / find and we should call this parameter something more standard.
I looked at what other systems did and there is no consistency.
I reviewed Spark's docs and they seem to use `multiLine = true` for what you have labelled `format_array`:
https://spark.apache.org/docs/latest/sql-data-sources-json.html
DuckDB seems to call it `format = newline_delimited`: https://duckdb.org/docs/stable/data/json/loading_json#parameters
Postgres seems to have two separate functions, `row_to_json` and `array_to_json`:
https://www.postgresql.org/docs/9.5/functions-json.html
I think I prefer the DuckDB-style `newline_delimited` option, though maybe Spark's `multiLine` would be more widely understood.
IMO it would be better to use an enum here, e.g. `JSON_FORMAT {NDJSON, ARRAY}`.
It would be clearer than true/false and also easier to extend with a third, fourth, ... format later.
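For illustration, a minimal sketch of what such an enum-style option could look like; the `JsonFileFormat` name, variants, and parser below are hypothetical and not part of this PR:

```rust
/// Hypothetical replacement for the boolean `format_array` flag, as suggested
/// above: an enum is self-describing and can grow more variants later.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum JsonFileFormat {
    /// One JSON object per line (the current default behavior).
    #[default]
    Ndjson,
    /// A single top-level JSON array of objects: `[{...}, {...}]`.
    Array,
}

/// Hypothetical parser for a config value such as "ndjson" or "array".
fn parse_json_format(s: &str) -> Option<JsonFileFormat> {
    match s.to_ascii_lowercase().as_str() {
        "ndjson" | "newline_delimited" => Some(JsonFileFormat::Ndjson),
        "array" => Some(JsonFileFormat::Array),
        _ => None,
    }
}
```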
```rust
let session = SessionContext::new();
let ctx = session.state();
let store = Arc::new(LocalFileSystem::new()) as _;

// Create a temporary file with JSON array format
let tmp_dir = tempfile::TempDir::new()?;
let path = format!("{}/array.json", tmp_dir.path().to_string_lossy());
std::fs::write(
    &path,
    r#"[
    {"a": 1, "b": 2.0, "c": true},
    {"a": 2, "b": 3.5, "c": false},
    {"a": 3, "b": 4.0, "c": true}
]"#,
)?;
```
I think this standard preamble could be reduced so there are fewer test lines (and thus it is easier to verify what is being tested).
For example, it looks like you could maybe make a function like `let file_schema = create_json_with_format("{..}", format);` -- I bet the tests would be less than half the size.
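For illustration, a sketch of the kind of helper this comment suggests; the name `create_json_with_format` comes from the comment, but the signature and body here are assumptions:

```rust
/// Writes `content` to a temporary JSON file and returns the temp dir (kept
/// alive so the file is not deleted) plus the file path. Sketch only: the
/// real helper would also build a `JsonFormat`, apply
/// `with_format_array(format_array)`, and infer the file schema.
fn create_json_with_format(
    content: &str,
    format_array: bool,
) -> std::io::Result<(tempfile::TempDir, String)> {
    let tmp_dir = tempfile::TempDir::new()?;
    let path = format!("{}/data.json", tmp_dir.path().to_string_lossy());
    std::fs::write(&path, content)?;
    let _ = format_array; // schema inference elided in this sketch
    Ok((tmp_dir, path))
}
```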
````rust
///
/// ## Line-Delimited JSON (default)
/// ```text
/// {"key1": 1, "key2": "val"}
````
Just to confirm, is this a change in default behavior? I don't think so but I wanted to double check
```rust
})?;

// Parse as JSON array using serde_json
let values: Vec<serde_json::Value> = serde_json::from_str(&content)
```
this is likely to be super slow -- it parses the entire JSON file (and then throws the parsed results away) -- if there is some way to avoid the whole thing it is probably better (maybe as a follow on PR)
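For context, a simplified sketch of the conversion path this comment refers to (per the file summary above, the array is fully parsed with serde_json, re-serialized as NDJSON, and then parsed again by Arrow's JSON reader); this is illustrative, not the PR's exact code:

```rust
/// Illustrative only: shows why the array path effectively parses the input twice.
fn array_json_to_ndjson(content: &str) -> serde_json::Result<String> {
    // First parse: the whole file is materialized as serde_json values.
    let values: Vec<serde_json::Value> = serde_json::from_str(content)?;
    // Each element is re-serialized as one NDJSON line; Arrow's JSON reader
    // then parses the result a second time to build RecordBatches.
    let lines: Vec<String> = values.iter().map(|v| v.to_string()).collect();
    Ok(lines.join("\n"))
}
```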
Thank you @alamb for the review, good suggestions. I will address the comments and redesign this PR to improve performance, and also rework the option name.
Which issue does this PR close?
Closes #19920
Rationale for this change
DataFusion currently only supports the line-delimited JSON (NDJSON) format. Many data sources provide JSON in array format (`[{...}, {...}]`), which cannot be parsed by the existing implementation.
What changes are included in this PR?
- Adds a `format_array` option to `JsonOptions` to support the JSON array format
- Supports the `OPTIONS ('format.format_array' 'true')` flag in SQL

Are these changes tested?
Yes: unit tests covering array-format schema inference, empty arrays, the inference limit, reading, and projection in `datafusion/core/src/datasource/file_format/json.rs`, plus new sqllogictests in `json.slt` (including a failure case when `format_array` is not set).
Are there any user-facing changes?
Yes. Users can now read JSON array format files by specifying the `format.format_array` option, as in the sketch below.
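A minimal usage sketch from Rust; the table name and file path are illustrative, and the `OPTIONS ('format.format_array' 'true')` flag is the one exercised in the new sqllogictests:

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Register an array-format JSON file via SQL using the new option
    // (table name and LOCATION are illustrative).
    ctx.sql(
        "CREATE EXTERNAL TABLE json_array_table \
         STORED AS JSON \
         LOCATION 'tests/data/json_array.json' \
         OPTIONS ('format.format_array' 'true')",
    )
    .await?;

    ctx.sql("SELECT * FROM json_array_table").await?.show().await?;
    Ok(())
}
```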