duckdb_fetch_arrow() consuming way too much memory in R #1065

@rafapereirabr

Problem

I'm using {duckdb} as a dependency in my package {censobr}. I use {duckdb} to merge large data sets, and there is one particular join that DuckDB cannot handle because it runs out of RAM. It is a left join on 8 key columns, and the two tables have 20,635,472 and 6,192,332 rows. Please see the reprex below.

P.S. I just wanted to add that {duckdb} is an incredible package and that the R community really appreciates your work on it! Thanks!

Reprex

# install.packages('censobr')

library(censobr)
library(duckdb)
library(dplyr)
library(glue)

# get original tables
# these two lines of code download and cache the data
pop <- censobr::read_population(year = 2010)
hh <- censobr::read_households(year = 2010)

## the reprex code works with this smaller table
# pop <- censobr::read_mortality(year = 2010)


# define key columns
key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
              'code_region', 'name_region', 'code_weighting', 'V0300')

# rename household weight column
hh <- dplyr::rename(hh, 'V0010_household' = 'V0010') |>
  dplyr::compute()

# drop repeated columns
all_common_vars <- names(pop)[names(pop) %in% names(hh)]
vars_to_drop <- setdiff(all_common_vars, key_vars)
hh <- dplyr::select(hh, -all_of(vars_to_drop)) |>
  dplyr::compute()


# create db connection on a temp file
db_path <- tempfile(pattern = 'censobr', fileext = '.duckdb')

con <- duckdb::dbConnect(
  duckdb::duckdb(),
  dbdir = db_path
)

# # Even if I set a memory limit for duckdb,
# # memory usage goes well above that limit
# memory_limit <- "15GB"
# query_setup <- stringr::str_glue("SET memory_limit = '{memory_limit}';")
# DBI::dbExecute(con, query_setup)

# register data to db
duckdb::duckdb_register_arrow(con, 'pop', pop)
duckdb::duckdb_register_arrow(con, 'hh', hh)


# Create the JOIN condition by concatenating the key columns
join_condition <- paste(
  glue::glue("pop.{key_vars} = hh.{key_vars}"),
  collapse = ' AND '
)

query_match <- glue::glue(
  "SELECT *
    FROM pop
    LEFT JOIN hh
    ON {join_condition};"
)

merge_query <- duckdb::dbSendQuery(
  conn = con,
  query_match,
  arrow = TRUE
)

# get result of the left join as an arrow table
# this is where things go wrong / out of memory
df_geo <- duckdb::duckdb_fetch_arrow(merge_query)

# check result
head(df_geo) |> collect()

## remove files from cache
# censobr::censobr_cache(delete_file = 'all')

Whenever I run the code above on a machine with 16GB or 32GB of RAM, I get the error message below. Even if I set a memory limit for duckdb, memory usage rises well above that limit and I still get the error. The code does work when I run it on a machine with 250GB of RAM.
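As a sanity check, here is a minimal sketch of how the limit could be set and then read back to confirm that it was applied. It uses a separate throwaway connection (the name `con_check` is just for illustration) and the same 15GB value as the commented-out lines in the reprex. It only shows that the setting is registered, not how much memory the join actually uses.

```r
library(DBI)
library(duckdb)

# throwaway connection on a temp file, separate from the reprex connection
con_check <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = tempfile(pattern = 'censobr_check', fileext = '.duckdb')
)

# set the memory limit, same value as in the reprex
DBI::dbExecute(con_check, "SET memory_limit = '15GB';")

# read the setting back to confirm it is registered on this connection
limit <- DBI::dbGetQuery(
  con_check,
  "SELECT current_setting('memory_limit') AS memory_limit;"
)
print(limit)

DBI::dbDisconnect(con_check, shutdown = TRUE)
```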

Error in duckdb_result():
! rapi_execute: Failed to run query
Error: Out of Memory Error: Allocation failure
Run rlang::last_trace() to see where the error occurred

Environment

I'm using the latest version of {duckdb} on Windows. See below.

> sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 22631)

Matrix products: default


locale:
[1] LC_COLLATE=Portuguese_Brazil.utf8  LC_CTYPE=Portuguese_Brazil.utf8   
[3] LC_MONETARY=Portuguese_Brazil.utf8 LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.utf8    

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] glue_1.8.0     dplyr_1.1.4    duckdb_1.1.3-2 DBI_1.2.3      censobr_0.4.0 

loaded via a namespace (and not attached):
 [1] R6_2.6.1          tidyselect_1.2.1  magrittr_2.0.3    tibble_3.2.1      pkgconfig_2.0.3  
 [6] generics_0.1.3    lifecycle_1.0.4   cli_3.6.4         vctrs_0.6.5       compiler_4.4.1   
[11] rstudioapi_0.17.1 tools_4.4.1       pillar_1.10.1     rlang_1.1.5       fs_1.6.5  
