Support JSON arrays reader/parse for datafusion #19920

@zhuqi-lucas

Is your feature request related to a problem or challenge?

DataFusion currently uses arrow-json's LineDelimitedReader, which is optimized for NDJSON format. When we encounter data sources that provide JSON arrays (i.e., [{...}, {...}]), we run into parsing issues.

Describe the solution you'd like

Add a format_array option to JsonOptions to support reading JSON array format:

CREATE EXTERNAL TABLE my_table
STORED AS JSON
OPTIONS ('format.format_array' 'true')
LOCATION 'path/to/array.json';

  • Backward compatible - existing code continues to work unchanged (default is line-delimited)
  • Explicit control - users can specify which format they're working with

Implementation approach

Since arrow-json's ReaderBuilder only supports line-delimited JSON directly, the implementation:

  1. Parses JSON array [{...}, {...}] with serde_json
  2. Converts to NDJSON format for arrow-json's ReaderBuilder to process
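The two steps above can be sketched roughly as follows. This is a Python illustration of the idea only, not the proposed DataFusion code: the issue describes doing this in Rust with serde_json, and the helper name `json_array_to_ndjson` is hypothetical.

```python
import json


def json_array_to_ndjson(text: str) -> str:
    """Convert a JSON array of values into newline-delimited JSON (hypothetical helper)."""
    # Step 1: parse the whole array; this needs the complete document in memory.
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Step 2: emit one compact JSON document per line. json.dumps escapes
    # newlines inside strings, so each record stays on a single line.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


# The second record contains an embedded newline inside a string value.
array_text = '[{"a": 1, "b": "x"},\n {"a": 2, "b": "y\\nz"}]'
ndjson = json_array_to_ndjson(array_text)
lines = ndjson.splitlines()
```

The resulting text is what a line-delimited reader expects: one record per line, with no enclosing brackets or separating commas.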

Note: JSON array format does not support range-based file scanning (repartition_file_scans) since the entire array must be read to parse correctly.
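To see why the range restriction exists, here is a small Python sketch (again an illustration, not DataFusion's scanner). The function `read_ndjson_range` is a hypothetical stand-in for a range-based line reader, and offsets are character offsets over ASCII input for simplicity.

```python
import json


def read_ndjson_range(data: str, start: int, end: int) -> list:
    """Parse the NDJSON lines whose first character lies in [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Skip to the first line that begins at or after `start`.
        nl = data.find("\n", start - 1)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    while pos < end and pos < len(data):
        nl = data.find("\n", pos)
        if nl == -1:
            nl = len(data)
        line = data[pos:nl]
        if line.strip():
            records.append(json.loads(line))
        pos = nl + 1
    return records


ndjson = '{"a":1}\n{"a":2}\n{"a":3}\n'
array = '[{"a":1},{"a":2},{"a":3}]'
mid = 12

# NDJSON: two independent ranges together yield every record exactly once.
first_half = read_ndjson_range(ndjson, 0, mid)
second_half = read_ndjson_range(ndjson, mid, len(ndjson))

# JSON array: a slice of the document is not valid JSON on its own.
try:
    json.loads(array[:mid])
    array_slice_parsed = True
except json.JSONDecodeError:
    array_slice_parsed = False
```

Newlines give NDJSON a cheap record boundary that any range can resynchronize on. A JSON array has no such marker, so a reader starting mid-file cannot tell where a record begins without parsing from the start.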

Describe alternatives you've considered

  • Auto-detection of format: Rejected due to potential errors with large files and added complexity
  • Waiting for arrow-json native support: No timeline for this feature upstream
