Labels
enhancement: New feature or request
Description
Is your feature request related to a problem or challenge?
DataFusion currently uses arrow-json's LineDelimitedReader, which is optimized for NDJSON format. When we encounter data sources that provide JSON arrays (i.e., [{...}, {...}]), we run into parsing issues.
Describe the solution you'd like
Add a format_array option to JsonOptions to support reading JSON array format:
CREATE EXTERNAL TABLE my_table
STORED AS JSON
OPTIONS ('format.format_array' 'true')
LOCATION 'path/to/array.json';
- Backward compatible - existing code continues to work unchanged (default is line-delimited)
- Explicit control - users can specify which format they're working with
Implementation approach
Since arrow-json's ReaderBuilder only supports line-delimited JSON directly, the implementation:
- Parses the JSON array [{...}, {...}] with serde_json
- Converts it to NDJSON format for arrow-json's ReaderBuilder to process
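To illustrate the conversion step, here is a minimal sketch of turning a JSON array into NDJSON. It is written in Python with the standard json module purely for brevity. The real implementation would live in Rust and use serde_json, and the function name json_array_to_ndjson is hypothetical, not part of DataFusion or arrow-json.

```python
import json


def json_array_to_ndjson(text: str) -> str:
    """Convert a top-level JSON array into newline-delimited JSON.

    Hypothetical helper for illustration only; not a DataFusion API.
    """
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # json.dumps without `indent` never emits raw newlines, so each record
    # occupies exactly one line, which is what a line-delimited reader expects.
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)


# Example input: an array spread over several lines.
array_text = '[{"a": 1, "b": "x"},\n {"a": 2, "b": "y"}]'
ndjson = json_array_to_ndjson(array_text)
```

Note that this sketch materializes the whole array in memory before emitting any line, which is the simplest approach but may matter for large files.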
Note: JSON array format does not support range-based file scanning (repartition_file_scans) since the entire array must be read to parse correctly.
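A small sketch of why range-based scanning does not carry over to the array format, again in Python for brevity. It is a simplification: real file scans split on byte ranges and align to record boundaries, while here the strings are simply cut in two. The helper name parses is hypothetical.

```python
import json


def parses(chunk: str) -> bool:
    """Return True if `chunk` is a complete, valid JSON document."""
    try:
        json.loads(chunk)
        return True
    except json.JSONDecodeError:
        return False


ndjson = '{"a":1}\n{"a":2}\n{"a":3}\n{"a":4}\n'
array = '[{"a":1},{"a":2},{"a":3},{"a":4}]'

# NDJSON: cut at the first newline past the midpoint. Every line in each
# partition is still a complete record, so partitions parse independently.
cut = ndjson.index("\n", len(ndjson) // 2) + 1
ndjson_parts = [ndjson[:cut], ndjson[cut:]]
ndjson_ok = all(parses(line) for part in ndjson_parts for line in part.splitlines())

# JSON array: cut at the midpoint. Neither half is a complete document,
# because the opening '[' and closing ']' end up in different partitions.
half = len(array) // 2
array_parts = [array[:half], array[half:]]
array_ok = all(parses(part) for part in array_parts)
```

Under these assumptions the NDJSON partitions parse on their own while the array halves do not, which is why the array format needs the whole file in a single scan.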
Describe alternatives you've considered
- Auto-detection of format: Rejected due to potential errors with large files and added complexity
- Waiting for arrow-json native support: No timeline for this feature upstream
Additional context