Description
Describe the bug, including details regarding any error messages, version, and platform.
In R, when writing large list columns to parquet, arrow errors out with:
Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200
Reproducible example. The error occurs with both the CRAN arrow version (20.0.0.2) and the current git version (21.0.0.9000).
library(tibble)
library(arrow)
rows <- 2e6L
elements_each <- 1200L
tbl <- tibble(
  id = seq_len(rows),
  b = replicate(rows, list(seq_len(elements_each)), simplify = FALSE)
)
write_parquet(tbl, "big_list.parquet")
Actual behaviour: Error: Capacity error: List array cannot contain more than 2147483646 elements, have 1200.
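The "have 1200" in the message is easy to misread, since a single row is nowhere near the limit. A rough check of the arithmetic, assuming the limit applies to the total number of values across all rows of the list column (which is what the message suggests, but is not confirmed here):

```r
rows <- 2e6
elements_each <- 1200
limit <- 2147483646  # value quoted in the error message (2^31 - 2)

# Use doubles: 2e6L * 1200L overflows R's integer type and yields NA.
total_elements <- rows * elements_each
exceeds_limit <- total_elements > limit

# Largest row count whose values would still fit under the limit,
# and how many row slices the example table would then need.
max_rows_per_array <- floor(limit / elements_each)
n_slices <- ceiling(rows / max_rows_per_array)

print(c(total = total_elements, max_rows = max_rows_per_array, slices = n_slices))
```

Under that assumption the example holds 2.4e9 values in total, so it would need to be split into at least two pieces before conversion.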
Expected behaviour: Automatically chunking behind the scenes, or a suggestion of how the user should chunk manually.
I think this is a similar bug to #10776, but it happens when writing rather than reading. I'd like clarity on whether this can be chunked automatically, in the spirit of the vignette:
An important thing to note is that “chunking” is not semantically meaningful. It is an implementation detail only: users should never treat the chunk as a meaningful unit.
Alternatively, a workaround would be good. I've tried a couple with write_dataset, but I don't understand the internals well enough. Two things that didn't work (same error):
# Approach 1:
# Group by + write_dataset
tbl |>
  dplyr::group_by(id = id %% 10L) |> # Create many groups by ID
  write_dataset("big_list")
# Approach 2:
# max_rows...
tbl |>
  write_dataset(
    "big_list",
    max_rows_per_file = 5000L,
    max_rows_per_group = 5000L
  )
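Both attempts above hand the whole table to arrow in one call. A possible manual alternative is to slice the rows in R first and write each slice as its own Parquet file. The sketch below assumes the capacity limit applies per converted list array, and it has not been verified against the full 2e6-row example. `write_parquet_in_slices` and its arguments are hypothetical names, not part of arrow; only `write_parquet` and `read_parquet` are real arrow functions.

```r
library(arrow)

# Hypothetical helper (not an arrow function): write `df` as several Parquet
# files, each built from a row slice small enough that its list column stays
# under `limit` values. Assumes no list entry holds more than `elements_each`.
write_parquet_in_slices <- function(df, dir, elements_each, limit = 2147483646) {
  stopifnot(nrow(df) > 0, elements_each > 0)
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)

  rows_per_slice <- max(1, floor(limit / elements_each))
  starts <- seq(1, nrow(df), by = rows_per_slice)
  paths <- character(length(starts))

  for (i in seq_along(starts)) {
    idx <- starts[i]:min(starts[i] + rows_per_slice - 1, nrow(df))
    paths[i] <- file.path(dir, sprintf("part-%04d.parquet", i))
    # Each call converts only this slice, not the whole table.
    write_parquet(df[idx, , drop = FALSE], paths[i])
  }
  invisible(paths)
}

# Small-scale usage with an artificially low limit, so it runs quickly.
small <- data.frame(id = 1:10)
small$b <- replicate(10, 1:3, simplify = FALSE)
out_dir <- file.path(tempdir(), "big_list_slices")
paths <- write_parquet_in_slices(small, out_dir, elements_each = 3, limit = 12)
```

If this holds at full size, the resulting directory could presumably be read back as one dataset with `open_dataset()`, which would keep the slicing an implementation detail, as the vignette intends.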
session_info()
─ Session info ──────────────────────
setting value
version R version 4.5.0 (2025-04-11)
os Fedora Linux 42 (Workstation Edition)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_GB.UTF-8
ctype en_GB.UTF-8
tz Europe/London
date 2025-07-22
pandoc 3.1.11.1 @ /usr/bin/pandoc
quarto 99.9.9 @ /home/wojtek/.local/bin/quarto
─ Packages ───────────────────────────
package * version date (UTC) lib source
arrow * 21.0.0.9000 2025-07-22 [1] local
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.5.0)
bit 4.6.0 2025-03-06 [1] CRAN (R 4.5.0)
bit64 4.6.0-1 2025-01-16 [1] CRAN (R 4.5.0)
cli 3.6.5 2025-04-23 [1] CRAN (R 4.5.0)
glue 1.8.0 2024-09-30 [1] CRAN (R 4.5.0)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.5.0)
magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.5.0)
pillar 1.11.0 2025-07-04 [1] CRAN (R 4.5.0)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.5.0)
purrr 1.1.0 2025-07-10 [1] CRAN (R 4.5.0)
R6 2.6.1 2025-02-15 [1] CRAN (R 4.5.0)
rlang 1.1.6 2025-04-11 [1] CRAN (R 4.5.0)
sessioninfo 1.2.3 2025-02-05 [1] CRAN (R 4.5.0)
tibble * 3.3.0 2025-06-08 [1] CRAN (R 4.5.0)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.5.0)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.5.0)
[1] /home/wojtek/.local/share/R/x86_64-pc-linux-gnu-library/4.5
[2] /opt/R/4.5.0/lib64/R/library
* ── Packages attached to the search path.
Component(s)
C++, R