`dbplyr_uncount()` on Redshift tables fails with an opaque error #1601

owenjonesuob · 2025-04-01T15:52:26Z

Description

When using dbplyr_uncount() on a table within a ~~Postgres or~~ AWS Redshift database, we always fail with an error which mentions the seq2() function. It's not immediately obvious why this happens.

Minimal example

The first few lines will need to be adjusted to create a valid connection/table.

# Establish a connection to a Redshift database
con <- RPostgres::dbConnect(...)

# Create a simple demo query - I'm using a "calendar" table I happen to have available here
q <- dplyr::tbl(con, I("lookups.calendar")) |> 
  dplyr::select(date) |> 
  dplyr::filter(date >= "2025-04-01", date <= "2025-04-03")

# Show current results
dplyr::collect(q)
#> # A tibble: 3 × 1
#>   date      
#>   <date>    
#> 1 2025-04-01
#> 2 2025-04-02
#> 3 2025-04-03

# All fine so far!

# Now suppose we want to duplicate every row...
dbplyr::dbplyr_uncount(q, 2)
#> Error in `seq2()`:
#> ! `to` must be length one.

^{Created on 2025-04-01 with reprex v2.1.1}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       Amazon Linux 2023.6.20250123
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       Europe/London
#>  date     2025-04-01
#>  pandoc   3.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bit           4.5.0.1 2024-12-03 [2] CRAN (R 4.3.2)
#>  bit64         4.6.0-1 2025-01-16 [2] CRAN (R 4.3.2)
#>  blob          1.2.4   2023-03-17 [2] CRAN (R 4.3.2)
#>  box           1.2.0   2024-02-06 [2] CRAN (R 4.3.2)
#>  class         7.3-23  2025-01-01 [3] CRAN (R 4.3.2)
#>  classInt      0.4-11  2025-01-08 [2] CRAN (R 4.3.2)
#>  cli           3.6.3   2024-06-21 [2] CRAN (R 4.3.2)
#>  DBI           1.2.3   2024-06-02 [2] CRAN (R 4.3.2)
#>  dbplyr        2.5.0   2024-03-19 [2] CRAN (R 4.3.2)
#>  digest        0.6.37  2024-08-19 [2] CRAN (R 4.3.2)
#>  dplyr         1.1.4   2023-11-17 [2] CRAN (R 4.3.2)
#>  e1071         1.7-16  2024-09-16 [2] CRAN (R 4.3.2)
#>  evaluate      1.0.3   2025-01-10 [2] CRAN (R 4.3.2)
#>  fastmap       1.2.0   2024-05-15 [2] CRAN (R 4.3.2)
#>  fs            1.6.5   2024-10-30 [2] CRAN (R 4.3.2)
#>  generics      0.1.3   2022-07-05 [2] CRAN (R 4.3.2)
#>  glue          1.8.0   2024-09-30 [2] CRAN (R 4.3.2)
#>  hms           1.1.3   2023-03-21 [2] CRAN (R 4.3.2)
#>  htmltools     0.5.8.1 2024-04-04 [2] CRAN (R 4.3.2)
#>  KernSmooth    2.23-26 2025-01-01 [3] CRAN (R 4.3.2)
#>  knitr         1.49    2024-11-08 [2] CRAN (R 4.3.2)
#>  lifecycle     1.0.4   2023-11-07 [2] CRAN (R 4.3.2)
#>  lubridate     1.9.4   2024-12-08 [2] CRAN (R 4.3.2)
#>  magrittr      2.0.3   2022-03-30 [2] CRAN (R 4.3.2)
#>  pillar        1.10.1  2025-01-07 [2] CRAN (R 4.3.2)
#>  pkgconfig     2.0.3   2019-09-22 [2] CRAN (R 4.3.2)
#>  proxy         0.4-27  2022-06-09 [2] CRAN (R 4.3.2)
#>  purrr         1.0.2   2023-08-10 [2] CRAN (R 4.3.2)
#>  R6            2.5.1   2021-08-19 [2] CRAN (R 4.3.2)
#>  Rcpp          1.0.14  2025-01-12 [2] CRAN (R 4.3.2)
#>  reprex        2.1.1   2024-07-06 [2] CRAN (R 4.3.2)
#>  rlang         1.1.5   2025-01-17 [2] CRAN (R 4.3.2)
#>  rmarkdown     2.29    2024-11-04 [2] CRAN (R 4.3.2)
#>  RPostgres     1.4.7   2024-05-27 [2] CRAN (R 4.3.2)
#>  rstudioapi    0.17.1  2024-10-22 [2] CRAN (R 4.3.2)
#>  sessioninfo   1.2.2   2021-12-06 [2] CRAN (R 4.3.2)
#>  sf            1.0-19  2024-11-05 [2] CRAN (R 4.3.2)
#>  tibble        3.2.1   2023-03-20 [2] CRAN (R 4.3.2)
#>  tidyselect    1.2.1   2024-03-11 [2] CRAN (R 4.3.2)
#>  timechange    0.3.0   2024-01-18 [2] CRAN (R 4.3.2)
#>  units         0.8-5   2023-11-28 [2] CRAN (R 4.3.2)
#>  vctrs         0.6.5   2023-12-01 [2] CRAN (R 4.3.2)
#>  withr         3.0.2   2024-10-28 [2] CRAN (R 4.3.2)
#>  xfun          0.50    2025-01-07 [2] CRAN (R 4.3.2)
#>  yaml          2.3.10  2024-07-26 [2] CRAN (R 4.3.2)
#> 
#>  [1] /home/owen10004/efs/R/x86_64-amazon-linux-gnu-library/4.3
#>  [2] /usr/lib64/R/site-library
#>  [3] /usr/lib64/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

This example uses a constant weights value, but we see the same error if we pass a column name for weights.

Cause

~~Postgres converts all (unquoted) column names to lowercase.~~ Redshift converts all column names in results, always, to lowercase. Within the query that's executed in order to calculate n_max, this conversion leads to a name mismatch, resulting in an unexpected NULL result.

Details

Early on in dbplyr::dbplyr_uncount(), a query is constructed calculate the max number of repetitions, and is then executed implicitly via dplyr::pull():

dbplyr/R/verb-uncount.R

Line 50 in 5726930

n_max <- pull(summarise(ungroup(data), max(!!sym(weights_col), na.rm = TRUE)))

Note that the argument provided to the summarise() query isn't named, so it's given a name using the expression itself (I think via set_names()?) - which looks like:

max(..., na.rm = TRUE)

The call is to pull() is dispatched to dbplyr::pull.tbl_sql(), which:

Derives the name of the variable to extract:

dbplyr/R/verb-pull.R

Line 20 in 5726930

var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")

As we just saw, that variable is called something like max(..., na.rm = TRUE).
Collects query results from the database:

dbplyr/R/verb-pull.R

Line 29 in 5726930

.data <- collect(.data)

However, ~~Postgres converts all (unquoted) column names to lowercase~~ Redshift converts all column names, always, in results to lowercase:

There are two types of identifiers, standard identifiers and quoted or delimited identifiers. Identifiers must consist of only UTF-8 printable characters. ASCII letters in standard and delimited identifiers are case-insensitive and are folded to lowercase in the database. In query results, column names are returned as lowercase by default. -- Redshift docs

This means that the collected results contain a column called something like max(..., na.rm = true).
Uses the name to extract the appropriate column from the collected results:

dbplyr/R/verb-pull.R

Line 30 in 5726930

out <- .data[[var]]

Due to the mismatch in names, we end up with out <- NULL

So we end up with n_max <- NULL, hence the error in the seq2() call a little later:

dbplyr/R/verb-uncount.R

Line 63 in 5726930

helper_table <- copy_inline(con, tibble(!!row_id_col := seq2(1, n_max)))

Output from `debug()`

> debugonce(dbplyr:::pull.tbl_sql)
> dbplyr::dbplyr_uncount(q, 2)
debugging in: pull.tbl_sql(summarise(ungroup(data), max(!!sym(weights_col), 
    na.rm = TRUE)))
debug: {
    vars <- tbl_vars(.data)
    var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")
    name_quo <- enquo(name)
    if (!quo_is_null(name_quo)) {
        name <- tidyselect::vars_pull(vars, !!name_quo, error_arg = "name")
    }
    .data <- ungroup(.data)
    .data <- select(.data, all_of(c(var, name)))
    .data <- collect(.data)
    out <- .data[[var]]
    if (!is.null(name)) {
        out <- set_names(out, nm = .data[[name]])
    }
    out
}
Browse[2]> n
debug: vars <- tbl_vars(.data)
Browse[2]> n
debug: var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")
Browse[2]> n
debug: name_quo <- enquo(name)
Browse[2]> n
debug: if (!quo_is_null(name_quo)) {
    name <- tidyselect::vars_pull(vars, !!name_quo, error_arg = "name")
}
Browse[2]> n
debug: .data <- ungroup(.data)
Browse[2]> n
debug: .data <- select(.data, all_of(c(var, name)))
Browse[2]> n
debug: .data <- collect(.data)
Browse[2]> n
debug: out <- .data[[var]]
Browse[2]> n
debug: if (!is.null(name)) {
    out <- set_names(out, nm = .data[[name]])
}
Browse[2]> .data
# A tibble: 1 × 1
  `max(..dbplyr_weight_col, na.rm = true)`
                                     <dbl>
1                                        2
Browse[2]> var
[1] "max(..dbplyr_weight_col, na.rm = TRUE)"
Browse[2]> out
NULL

The text was updated successfully, but these errors were encountered:

owenjonesuob · 2025-04-01T16:59:59Z

I should add for transparency that I'm actually using an AWS Redshift database; and from some further research, I think there's a slight difference between Postgres and Redshift when it comes to lowecase conversion:

Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers FOO, foo, and "foo" are considered the same by PostgreSQL, but "Foo" and "FOO" are different from these three and each other. -- Postgres docs

There are two types of identifiers, standard identifiers and quoted or delimited identifiers. Identifiers must consist of only UTF-8 printable characters. ASCII letters in standard and delimited identifiers are case-insensitive and are folded to lowercase in the database. In query results, column names are returned as lowercase by default. -- Redshift docs

In other words, Redshift coerces names to lowercase in results, even if they are quoted in the query - unless a certain database configuration value has been changed from the default, in which case it behaves identically to Postgres (as far as I can tell).

I'm getting a bit lost in the weeds trying to figure out whether/where names get quoted during {dbplyr}'s query building process, but assuming they are quoted, that might narrow this issue down to just a Redshift issue, rather than Postgres too.

owenjonesuob changed the title ~~dbplyr_uncount() on Postgres/Redshift tables fails with an opaque error~~ dbplyr_uncount() on Redshift tables fails with an opaque error Apr 2, 2025

owenjonesuob pushed a commit to owenjonesuob/dbplyr that referenced this issue Apr 2, 2025

avoid creating column name with uppercase characters (tidyverse#1601)

7a4d9b5

owenjonesuob linked a pull request Apr 2, 2025 that will close this issue

Avoid creating column name with uppercase characters in dbplyr_uncount() #1603

Open

owenjonesuob added a commit to owenjonesuob/dbplyr that referenced this issue Apr 2, 2025

avoid creating column name with uppercase characters (tidyverse#1601)

abbce87

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`dbplyr_uncount()` on Redshift tables fails with an opaque error #1601

`dbplyr_uncount()` on Redshift tables fails with an opaque error #1601

owenjonesuob commented Apr 1, 2025 •

edited

Loading

owenjonesuob commented Apr 1, 2025 •

edited

Loading

Uh oh!

dbplyr_uncount() on Redshift tables fails with an opaque error #1601

dbplyr_uncount() on Redshift tables fails with an opaque error #1601

Comments

owenjonesuob commented Apr 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Minimal example

Cause

Details

owenjonesuob commented Apr 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

`dbplyr_uncount()` on Redshift tables fails with an opaque error #1601

`dbplyr_uncount()` on Redshift tables fails with an opaque error #1601

owenjonesuob commented Apr 1, 2025 •

edited

Loading

owenjonesuob commented Apr 1, 2025 •

edited

Loading