Open
Description
Arrow currently supports JSON reading in C++, Python, Java, etc., but it has no equivalent JSON writer. Rust (arrow_json::writer) and Go (arrjson) implement their own serialization, but they do not leverage the shared C++ core.
This results in these limitations:
- Python users (e.g., BigQuery→Arrow→Postgres JSONB, which is my particular use case) must fall back to slow Python-level loops or to orjson, missing out on C++-level performance.
- No feature parity with Rust and Go, which already provide fast JSON serialization.
- Large-scale pipelines suffer from marshalling overhead and poor scaling.
Proposal Overview
1. C++ Core: Add JSON Writer API
- Mirror the existing arrow::json::TableReader with a new arrow::json::TableWriter or RecordBatchWriter.
- Support both output formats:
  - NDJSON (newline-delimited)
  - JSON array
- Configurable via builder-pattern options:
  - Include or omit nulls
  - Binary types encoding (e.g., Base64)
  - Formatting (pretty, flat)
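To make the intended output formats and options concrete, here is a minimal pure-Python sketch of the semantics I have in mind. It is not the proposed C++ API: `write_json_sketch`, `encode_value`, and the `pretty` flag are hypothetical names, and a plain dict of column lists stands in for an Arrow table.

```python
import base64
import json


def encode_value(value, binary_encoding="base64"):
    """Convert one cell into something json.dumps accepts."""
    if isinstance(value, (bytes, bytearray)):
        if binary_encoding == "base64":
            return base64.b64encode(bytes(value)).decode("ascii")
        raise ValueError(f"unsupported binary_encoding: {binary_encoding!r}")
    return value


def write_json_sketch(columns, ndjson=False, include_nulls=True,
                      binary_encoding="base64", pretty=False):
    """Serialize a {column name: list of values} mapping (a stand-in for a table)."""
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    rows = []
    for i in range(num_rows):
        row = {}
        for name in names:
            value = columns[name][i]
            if value is None and not include_nulls:
                continue  # omit the key entirely instead of writing null
            row[name] = encode_value(value, binary_encoding)
        rows.append(row)
    if ndjson:
        # NDJSON: one compact JSON object per line
        return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in rows)
    # JSON array: a single document, optionally pretty-printed
    return json.dumps(rows, indent=2 if pretty else None)


# Hypothetical sample data
table = {"id": [1, 2], "name": ["a", None], "blob": [b"\x00\x01", None]}
print(write_json_sketch(table, ndjson=True, include_nulls=False), end="")
# {"id":1,"name":"a","blob":"AAE="}
# {"id":2}
```

With `include_nulls=True` the second row would instead carry `"name":null` and `"blob":null`; which default is right is open for discussion.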
2. Bindings
- PyArrow: add pyarrow.json.write_json(table_or_batch, sink=None, ndjson=False, include_nulls=True, binary_encoding="base64"), wrapping the new C++ API.
- Arrow Java: introduce a corresponding JsonWriter class to maintain cross-language feature consistency.
3. Functionality & Performance
- Full support for Arrow types: scalars, nested structs/lists, binary, timestamps, dictionaries, nulls.
- Streaming output row-by-row to avoid in-memory buffering.
- Benchmark target: achieve near-native performance, comparable to Rust’s LineDelimitedWriter.
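As a rough illustration of the streaming and type-coverage points above, the following pure-Python sketch writes rows to a sink one at a time instead of building the whole output first. Everything here is an assumption for illustration: `to_jsonable` and `stream_ndjson` are hypothetical helpers, Python dicts, lists, bytes, and datetimes stand in for Arrow structs, lists, binary, and timestamps, and ISO 8601 is just one possible timestamp encoding.

```python
import base64
import datetime as dt
import io
import json


def to_jsonable(value):
    """Recursively map Python stand-ins for Arrow values to JSON-compatible ones."""
    if isinstance(value, dict):                # struct
        return {k: to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):       # list
        return [to_jsonable(v) for v in value]
    if isinstance(value, (bytes, bytearray)):  # binary, encoded as Base64
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, dt.datetime):         # timestamp, encoded as ISO 8601
        return value.isoformat()
    return value                               # int, float, str, bool, None


def stream_ndjson(batches, sink):
    """Write each row to `sink` as soon as it is encoded; return the row count."""
    count = 0
    for batch in batches:      # each batch is a list of row dicts
        for row in batch:
            sink.write(json.dumps(to_jsonable(row), separators=(",", ":")))
            sink.write("\n")
            count += 1
    return count


def make_batches():
    """Hypothetical sample data, yielded lazily like a record batch reader."""
    yield [{"id": 1, "tags": ["x", "y"], "meta": {"ok": True}}]
    yield [{"id": 2, "ts": dt.datetime(2024, 1, 2, 3, 4, 5), "raw": b"hi"}]


sink = io.StringIO()
rows_written = stream_ndjson(make_batches(), sink)
print(sink.getvalue(), end="")
# {"id":1,"tags":["x","y"],"meta":{"ok":true}}
# {"id":2,"ts":"2024-01-02T03:04:05","raw":"aGk="}
```

Because only one row is held at a time, memory use stays flat regardless of table size; a C++ implementation would presumably work per batch or per chunk rather than per Python object.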
*I had this request edited by an LLM, as I'm not very familiar with the Arrow backend architecture. I checked all the claims, but some inaccuracies might have slipped through; if so, sorry.
Component(s)
C++