[C++] Capacity Error when writing large list columns #47169

@wklimowicz

Describe the bug, including details regarding any error messages, version, and platform.

In R, when writing large list columns to parquet, arrow errors out with:

Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200

Reproducible example; the error occurs with both the CRAN arrow version (20.0.0.2) and the current git version (21.0.0.9000).

library(tibble)
library(arrow)

rows <- 2e6L
elements_each <- 1200L

tbl <- tibble(
  id = seq_len(rows),
  b = replicate(rows, list(seq_len(elements_each)), simplify = FALSE)
)

write_parquet(tbl, "big_list.parquet")

Actual behaviour: Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200.
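
For context, here is a rough sketch of the arithmetic behind the error. It assumes the limit in the message applies to the total number of child elements across all rows of a single list array, since list offsets are 32-bit. The names `total_elements`, `limit`, and `max_rows_per_chunk` are illustrative only and are not part of arrow.

```r
# Use doubles here: 2e6L * 1200L would overflow R's 32-bit integers.
rows <- 2e6
elements_each <- 1200

# Total child elements if the whole column becomes one list array.
total_elements <- rows * elements_each

# Limit quoted in the error message.
limit <- 2147483646

# TRUE means a single list array cannot hold the column.
exceeds <- total_elements > limit

# Largest row count whose child elements still fit in one list array.
max_rows_per_chunk <- floor(limit / elements_each)

print(c(total = total_elements, limit = limit, max_rows = max_rows_per_chunk))
```

Under that assumption, the column would need to be split into at least two chunks before conversion.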

Expected behaviour: Automatically chunking behind the scenes, or a suggestion of how the user should chunk manually.

I think this is similar to #10776, but it happens when writing rather than reading. I'd like clarity on whether this can be chunked automatically, in the spirit of the vignette:

An important thing to note is that “chunking” is not semantically meaningful. It is an implementation detail only: users should never treat the chunk as a meaningful unit.

Alternatively, a workaround would be good. I've tried a few with write_dataset, but I don't understand the internals well enough. Two things that didn't work (same error):

# Approach 1:
# Group by + write_dataset
tbl |>
  dplyr::group_by(id = id %% 10L) |> # Create many groups by ID
  write_dataset("big_list")

# Approach 2:
# max_rows...
tbl |>
  write_dataset(
    "big_list",
    max_rows_per_file = 5000L,
    max_rows_per_group = 5000L
  )
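
One possible manual workaround, sketched here on a small stand-in table, is to convert the data frame to Arrow in row slices and concatenate the resulting tables, so that each list-array chunk stays under the element limit. This is an untested assumption for the full 2e6-row case: it is not confirmed that `write_parquet()` keeps the chunks separate when writing. `chunked_arrow_table()` and `rows_per_chunk` are hypothetical names, not part of arrow; `arrow_table()`, `concat_tables()`, and `write_parquet()` are existing arrow functions.

```r
library(tibble)
library(arrow)

# Hypothetical helper: build an Arrow Table from `df` one row slice at a time,
# so each column ends up as a ChunkedArray with one chunk per slice.
chunked_arrow_table <- function(df, rows_per_chunk) {
  starts <- seq(1L, nrow(df), by = rows_per_chunk)
  pieces <- lapply(starts, function(s) {
    e <- min(s + rows_per_chunk - 1L, nrow(df))
    arrow_table(df[s:e, , drop = FALSE])
  })
  do.call(concat_tables, pieces)
}

# Small stand-in for the real data: 10 rows, 3 elements per list entry.
small <- tibble(
  id = 1:10,
  b = replicate(10, 1:3, simplify = FALSE)
)

tab <- chunked_arrow_table(small, rows_per_chunk = 4L)

out <- tempfile(fileext = ".parquet")
write_parquet(tab, out)
```

For the table in this report, `rows_per_chunk` would have to be chosen so that `rows_per_chunk * elements_each` stays below the limit in the error message, for example `1e6L`.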

session_info()
─ Session info ──────────────────────
 setting  value
 version  R version 4.5.0 (2025-04-11)
 os       Fedora Linux 42 (Workstation Edition)
 system   x86_64, linux-gnu
 ui       X11
 language (EN)
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/London
 date     2025-07-22
 pandoc   3.1.11.1 @ /usr/bin/pandoc
 quarto   99.9.9 @ /home/wojtek/.local/bin/quarto

─ Packages ───────────────────────────
 package     * version     date (UTC) lib source
 arrow       * 21.0.0.9000 2025-07-22 [1] local
 assertthat    0.2.1       2019-03-21 [1] CRAN (R 4.5.0)
 bit           4.6.0       2025-03-06 [1] CRAN (R 4.5.0)
 bit64         4.6.0-1     2025-01-16 [1] CRAN (R 4.5.0)
 cli           3.6.5       2025-04-23 [1] CRAN (R 4.5.0)
 glue          1.8.0       2024-09-30 [1] CRAN (R 4.5.0)
 lifecycle     1.0.4       2023-11-07 [1] CRAN (R 4.5.0)
 magrittr      2.0.3       2022-03-30 [1] CRAN (R 4.5.0)
 pillar        1.11.0      2025-07-04 [1] CRAN (R 4.5.0)
 pkgconfig     2.0.3       2019-09-22 [1] CRAN (R 4.5.0)
 purrr         1.1.0       2025-07-10 [1] CRAN (R 4.5.0)
 R6            2.6.1       2025-02-15 [1] CRAN (R 4.5.0)
 rlang         1.1.6       2025-04-11 [1] CRAN (R 4.5.0)
 sessioninfo   1.2.3       2025-02-05 [1] CRAN (R 4.5.0)
 tibble      * 3.3.0       2025-06-08 [1] CRAN (R 4.5.0)
 tidyselect    1.2.1       2024-03-11 [1] CRAN (R 4.5.0)
 vctrs         0.6.5       2023-12-01 [1] CRAN (R 4.5.0)

 [1] /home/wojtek/.local/share/R/x86_64-pc-linux-gnu-library/4.5
 [2] /opt/R/4.5.0/lib64/R/library
 * ── Packages attached to the search path. 

Component(s)

C++, R
