Description
Problem
I'm using {duckdb} as a dependency in my package {censobr}. I use {duckdb} to merge large data sets, and there is one particular join that DuckDB is not able to handle due to RAM limits. The left join is based on 8 key columns, and the two tables have 20,635,472 and 6,192,332 rows. Please see the reprex below.
P.S. I just wanted to add that {duckdb} is an incredible package and that the R community really appreciates your work on it! Thanks!
Reprex
# install.packages('censobr')
library(censobr)
library(duckdb)
library(dplyr)
library(glue)
# get original tables
# these two lines of code download and cache the data
pop <- censobr::read_population(year = 2010)
hh <- censobr::read_households(year = 2010)
## the reprex code works with this smaller table
# pop <- censobr::read_mortality(year = 2010)
# define key columns
key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
'code_region', 'name_region', 'code_weighting', 'V0300')
# rename household weight column
hh <- dplyr::rename(hh, 'V0010_household' = 'V0010') |>
dplyr::compute()
# drop repeated columns
all_common_vars <- names(pop)[names(pop) %in% names(hh)]
vars_to_drop <- setdiff(all_common_vars, key_vars)
hh <- dplyr::select(hh, -all_of(vars_to_drop)) |>
dplyr::compute()
# create db connection on a temp file
db_path <- tempfile(pattern = 'censobr', fileext = '.duckdb')
con <- duckdb::dbConnect(
duckdb::duckdb(),
dbdir = db_path
)
## Even if I configure a memory limit for duckdb,
## memory usage goes way higher than the limit
# memory_limit <- "15GB"
# query_setup <- glue::glue("SET memory_limit = '{memory_limit}';")
# DBI::dbExecute(con, query_setup)
# register data to db
duckdb::duckdb_register_arrow(con, 'pop', pop)
duckdb::duckdb_register_arrow(con, 'hh', hh)
# Create the JOIN condition by concatenating the key columns
join_condition <- paste(
glue::glue("pop.{key_vars} = hh.{key_vars}"),
collapse = ' AND '
)
query_match <- glue::glue(
"SELECT *
FROM pop
LEFT JOIN hh
ON {join_condition};"
)
merge_query <- duckdb::dbSendQuery(
conn = con,
query_match,
arrow = TRUE
)
# get result of the left join as an arrow table
# this is where things go wrong / out of memory
df_geo <- duckdb::duckdb_fetch_arrow(merge_query)
# check result
head(df_geo) |> collect()
## remove files from cache
# censobr::censobr_cache(delete_file = 'all')
Whenever I run the code above on a machine with 16 GB or 32 GB of RAM, I get the following error message. Even if I configure a memory limit for duckdb, memory usage goes way beyond the limit and I still get the error. The code does work when I run it on a machine with 250 GB of RAM.
Error in duckdb_result():
! rapi_execute: Failed to run query
Error: Out of Memory Error: Allocation failure
Run rlang::last_trace() to see where the error occurred
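For reference, a possible workaround I've been considering. This is only a sketch (untested on the full data), reusing con, pop, hh and key_vars from the reprex above; the table names pop_tbl, hh_tbl and merged, and the 12GB limit, are placeholders I made up. My understanding is that DuckDB can spill intermediate results to disk once a temp_directory is set, and that materializing the registered Arrow streams into native DuckDB tables before the join (the connection already points at an on-disk database file) might let the memory_limit be enforced:
# cap DuckDB's memory and allow it to spill to disk
DBI::dbExecute(con, "SET memory_limit = '12GB';")
DBI::dbExecute(con, glue::glue("SET temp_directory = '{tempdir()}';"))
# materialize the Arrow streams as native DuckDB tables so the join
# can run out-of-core instead of against in-memory Arrow data
DBI::dbExecute(con, "CREATE TABLE pop_tbl AS SELECT * FROM pop;")
DBI::dbExecute(con, "CREATE TABLE hh_tbl AS SELECT * FROM hh;")
# rebuild the join condition for the new table names
join_condition_tbl <- paste(
  glue::glue("pop_tbl.{key_vars} = hh_tbl.{key_vars}"),
  collapse = ' AND '
)
# keep the result in the database rather than pulling 20M+ rows into R
DBI::dbExecute(con, glue::glue(
  "CREATE TABLE merged AS
   SELECT * FROM pop_tbl
   LEFT JOIN hh_tbl
   ON {join_condition_tbl};"
))
# inspect a few rows of the result
DBI::dbGetQuery(con, "SELECT * FROM merged LIMIT 6")
I don't know whether this avoids the allocation failure here, but it at least gives DuckDB a chance to work out-of-core.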
Environment
I'm using the latest version of {duckdb} on Windows. See the sessionInfo() output below.
> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8 LC_CTYPE=Portuguese_Brazil.utf8
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.utf8
time zone: America/Sao_Paulo
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] glue_1.8.0 dplyr_1.1.4 duckdb_1.1.3-2 DBI_1.2.3 censobr_0.4.0
loaded via a namespace (and not attached):
[1] R6_2.6.1 tidyselect_1.2.1 magrittr_2.0.3 tibble_3.2.1 pkgconfig_2.0.3
[6] generics_0.1.3 lifecycle_1.0.4 cli_3.6.4 vctrs_0.6.5 compiler_4.4.1
[11] rstudioapi_0.17.1 tools_4.4.1 pillar_1.10.1 rlang_1.1.5 fs_1.6.5