[C++] Add JSON/NDJSON writer to Arrow C++ core — then expose via PyArrow/Java #47173

@Guthman

Description

Arrow currently supports JSON reading in C++, Python, Java, etc., but it has no equivalent JSON writer. Rust (arrow_json::writer) and Go (arrjson) implement their own serialization, but neither builds on the shared C++ core.

This leads to the following limitations:

  • Python users (e.g., BigQuery→Arrow→Postgres JSONB, which is my particular use case) must fall back to slow Python-level loops or to orjson, missing out on C++-level performance.
  • No feature parity with Rust and Go, which already provide fast JSON serialization.
  • Large-scale pipelines suffer from marshalling overhead and poor scaling.

Proposal Overview

1. C++ Core: Add JSON Writer API

  • Mirror the existing arrow::json::TableReader with a new arrow::json::TableWriter or RecordBatchWriter.
  • Support both output formats:
    • NDJSON (newline-delimited)
    • JSON array
  • Configurable via builder-pattern options:
    • Include or omit nulls
    • Binary types encoding (e.g., Base64)
    • Formatting (pretty, flat)
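
To make the option semantics above concrete, here is a minimal pure-Python sketch of how the two output formats and the null/binary/formatting options might interact. It is not an Arrow API: `JsonWriteOptions` and `render_json` are hypothetical names invented for illustration, and the input is a plain list of dicts (the shape that `Table.to_pylist()` returns) rather than an Arrow table.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class JsonWriteOptions:
    """Hypothetical options mirroring the bullets above; not an Arrow API."""
    ndjson: bool = False            # False -> single JSON array
    include_nulls: bool = True      # False -> drop keys whose value is None
    binary_encoding: str = "base64"
    pretty: bool = False            # only applied to the JSON array format


def _encode_value(value, options):
    # Binary values are not valid JSON, so they are encoded as text.
    if isinstance(value, (bytes, bytearray)):
        if options.binary_encoding != "base64":
            raise ValueError(f"unsupported binary encoding: {options.binary_encoding}")
        return base64.b64encode(bytes(value)).decode("ascii")
    if isinstance(value, dict):     # nested struct
        return _encode_row(value, options)
    if isinstance(value, list):     # nested list
        return [_encode_value(v, options) for v in value]
    return value


def _encode_row(row, options):
    return {
        key: _encode_value(value, options)
        for key, value in row.items()
        if options.include_nulls or value is not None
    }


def render_json(rows, options=None):
    """Render rows (a list of dicts) as NDJSON or as one JSON array."""
    options = options or JsonWriteOptions()
    encoded = [_encode_row(row, options) for row in rows]
    if options.ndjson:
        # One compact object per line; pretty-printing would break the format.
        return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in encoded)
    if options.pretty:
        return json.dumps(encoded, indent=2)
    return json.dumps(encoded, separators=(",", ":"))


rows = [
    {"id": 1, "payload": b"hi", "note": None},
    {"id": 2, "payload": None, "note": "x"},
]
print(render_json(rows, JsonWriteOptions(ndjson=True, include_nulls=False)))
print(render_json(rows))
```

With `include_nulls=False`, null fields are omitted per row, so rows in the same output can carry different key sets. Whether a C++ implementation should behave this way is one of the design questions the options would have to settle.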

2. Bindings

  • PyArrow: add pyarrow.json.write_json(table_or_batch, sink=None, ndjson=False, include_nulls=True, binary_encoding="base64"), wrapping the new C++ API.
  • Arrow Java: introduce a corresponding JsonWriter class to maintain cross-language feature consistency.

3. Functionality & Performance

  • Full support for Arrow types: scalars, nested structs/lists, binary, timestamps, dictionaries, nulls.
  • Streaming output row-by-row to avoid in-memory buffering.
  • Benchmark target: achieve near-native performance, comparable to Rust’s LineDelimitedWriter.
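
The streaming requirement above can be sketched in a few lines of pure Python. This is only an illustration of the idea, assuming rows arrive from an iterator and the sink is any object with a `write` method; `stream_ndjson` and `stream_json_array` are hypothetical helper names, not part of Arrow or PyArrow. The point is that both formats can be emitted one row at a time, including the JSON array, which only needs a separator before every element except the first.

```python
import io
import json


def stream_ndjson(rows, sink):
    """Write each row as one compact JSON object followed by a newline."""
    for row in rows:
        sink.write(json.dumps(row, separators=(",", ":")))
        sink.write("\n")


def stream_json_array(rows, sink):
    """Write rows as a single JSON array without collecting them first."""
    sink.write("[")
    for index, row in enumerate(rows):
        if index:                   # separator before every element but the first
            sink.write(",")
        sink.write(json.dumps(row, separators=(",", ":")))
    sink.write("]")


def example_rows():
    # Stand-in for rows produced lazily, e.g. while iterating record batches.
    yield {"id": 1, "tags": ["a", "b"]}
    yield {"id": 2, "tags": []}


array_sink = io.StringIO()
stream_json_array(example_rows(), array_sink)
print(array_sink.getvalue())

ndjson_sink = io.StringIO()
stream_ndjson(example_rows(), ndjson_sink)
print(ndjson_sink.getvalue(), end="")
```

Only one serialized row is held in memory at a time, so peak memory stays independent of the table size. A native implementation would more likely serialize per record batch than per row, but the same first-element bookkeeping applies.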

*I had this request edited by an LLM, as I'm not very familiar with the Arrow backend architecture. I checked all the claims, but some inaccuracies might have slipped through; if so, sorry.

Component(s)

C++
