Skip to content

dbplyr_uncount() on Redshift tables fails with an opaque error #1601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
owenjonesuob opened this issue Apr 1, 2025 · 1 comment · May be fixed by #1603
Open

dbplyr_uncount() on Redshift tables fails with an opaque error #1601

owenjonesuob opened this issue Apr 1, 2025 · 1 comment · May be fixed by #1603

Comments

@owenjonesuob
Copy link
Contributor

owenjonesuob commented Apr 1, 2025

Description

When using dbplyr_uncount() on a table within a Postgres or AWS Redshift database, we always fail with an error which mentions the seq2() function. It's not immediately obvious why this happens.

Minimal example

The first few lines will need to be adjusted to create a valid connection/table.

# Establish a connection to a Redshift database
con <- RPostgres::dbConnect(...)

# Create a simple demo query - I'm using a "calendar" table I happen to have available here
q <- dplyr::tbl(con, I("lookups.calendar")) |> 
  dplyr::select(date) |> 
  dplyr::filter(date >= "2025-04-01", date <= "2025-04-03")

# Show current results
dplyr::collect(q)
#> # A tibble: 3 × 1
#>   date      
#>   <date>    
#> 1 2025-04-01
#> 2 2025-04-02
#> 3 2025-04-03

# All fine so far!

# Now suppose we want to duplicate every row...
dbplyr::dbplyr_uncount(q, 2)
#> Error in `seq2()`:
#> ! `to` must be length one.

Created on 2025-04-01 with reprex v2.1.1

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       Amazon Linux 2023.6.20250123
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       Europe/London
#>  date     2025-04-01
#>  pandoc   3.2 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bit           4.5.0.1 2024-12-03 [2] CRAN (R 4.3.2)
#>  bit64         4.6.0-1 2025-01-16 [2] CRAN (R 4.3.2)
#>  blob          1.2.4   2023-03-17 [2] CRAN (R 4.3.2)
#>  box           1.2.0   2024-02-06 [2] CRAN (R 4.3.2)
#>  class         7.3-23  2025-01-01 [3] CRAN (R 4.3.2)
#>  classInt      0.4-11  2025-01-08 [2] CRAN (R 4.3.2)
#>  cli           3.6.3   2024-06-21 [2] CRAN (R 4.3.2)
#>  DBI           1.2.3   2024-06-02 [2] CRAN (R 4.3.2)
#>  dbplyr        2.5.0   2024-03-19 [2] CRAN (R 4.3.2)
#>  digest        0.6.37  2024-08-19 [2] CRAN (R 4.3.2)
#>  dplyr         1.1.4   2023-11-17 [2] CRAN (R 4.3.2)
#>  e1071         1.7-16  2024-09-16 [2] CRAN (R 4.3.2)
#>  evaluate      1.0.3   2025-01-10 [2] CRAN (R 4.3.2)
#>  fastmap       1.2.0   2024-05-15 [2] CRAN (R 4.3.2)
#>  fs            1.6.5   2024-10-30 [2] CRAN (R 4.3.2)
#>  generics      0.1.3   2022-07-05 [2] CRAN (R 4.3.2)
#>  glue          1.8.0   2024-09-30 [2] CRAN (R 4.3.2)
#>  hms           1.1.3   2023-03-21 [2] CRAN (R 4.3.2)
#>  htmltools     0.5.8.1 2024-04-04 [2] CRAN (R 4.3.2)
#>  KernSmooth    2.23-26 2025-01-01 [3] CRAN (R 4.3.2)
#>  knitr         1.49    2024-11-08 [2] CRAN (R 4.3.2)
#>  lifecycle     1.0.4   2023-11-07 [2] CRAN (R 4.3.2)
#>  lubridate     1.9.4   2024-12-08 [2] CRAN (R 4.3.2)
#>  magrittr      2.0.3   2022-03-30 [2] CRAN (R 4.3.2)
#>  pillar        1.10.1  2025-01-07 [2] CRAN (R 4.3.2)
#>  pkgconfig     2.0.3   2019-09-22 [2] CRAN (R 4.3.2)
#>  proxy         0.4-27  2022-06-09 [2] CRAN (R 4.3.2)
#>  purrr         1.0.2   2023-08-10 [2] CRAN (R 4.3.2)
#>  R6            2.5.1   2021-08-19 [2] CRAN (R 4.3.2)
#>  Rcpp          1.0.14  2025-01-12 [2] CRAN (R 4.3.2)
#>  reprex        2.1.1   2024-07-06 [2] CRAN (R 4.3.2)
#>  rlang         1.1.5   2025-01-17 [2] CRAN (R 4.3.2)
#>  rmarkdown     2.29    2024-11-04 [2] CRAN (R 4.3.2)
#>  RPostgres     1.4.7   2024-05-27 [2] CRAN (R 4.3.2)
#>  rstudioapi    0.17.1  2024-10-22 [2] CRAN (R 4.3.2)
#>  sessioninfo   1.2.2   2021-12-06 [2] CRAN (R 4.3.2)
#>  sf            1.0-19  2024-11-05 [2] CRAN (R 4.3.2)
#>  tibble        3.2.1   2023-03-20 [2] CRAN (R 4.3.2)
#>  tidyselect    1.2.1   2024-03-11 [2] CRAN (R 4.3.2)
#>  timechange    0.3.0   2024-01-18 [2] CRAN (R 4.3.2)
#>  units         0.8-5   2023-11-28 [2] CRAN (R 4.3.2)
#>  vctrs         0.6.5   2023-12-01 [2] CRAN (R 4.3.2)
#>  withr         3.0.2   2024-10-28 [2] CRAN (R 4.3.2)
#>  xfun          0.50    2025-01-07 [2] CRAN (R 4.3.2)
#>  yaml          2.3.10  2024-07-26 [2] CRAN (R 4.3.2)
#> 
#>  [1] /home/owen10004/efs/R/x86_64-amazon-linux-gnu-library/4.3
#>  [2] /usr/lib64/R/site-library
#>  [3] /usr/lib64/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

This example uses a constant weights value, but we see the same error if we pass a column name for weights.

Cause

Postgres converts all (unquoted) column names to lowercase. Redshift converts all column names in results, always, to lowercase. Within the query that's executed in order to calculate n_max, this conversion leads to a name mismatch, resulting in an unexpected NULL result.

Details

Early on in dbplyr::dbplyr_uncount(), a query is constructed calculate the max number of repetitions, and is then executed implicitly via dplyr::pull():

n_max <- pull(summarise(ungroup(data), max(!!sym(weights_col), na.rm = TRUE)))

Note that the argument provided to the summarise() query isn't named, so it's given a name using the expression itself (I think via set_names()?) - which looks like:

max(..., na.rm = TRUE)

The call is to pull() is dispatched to dbplyr::pull.tbl_sql(), which:

  1. Derives the name of the variable to extract:

    var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")

    As we just saw, that variable is called something like max(..., na.rm = TRUE).

  2. Collects query results from the database:

    .data <- collect(.data)

    However, Postgres converts all (unquoted) column names to lowercase Redshift converts all column names, always, in results to lowercase:

    There are two types of identifiers, standard identifiers and quoted or delimited identifiers. Identifiers must consist of only UTF-8 printable characters. ASCII letters in standard and delimited identifiers are case-insensitive and are folded to lowercase in the database. In query results, column names are returned as lowercase by default. -- Redshift docs

    This means that the collected results contain a column called something like max(..., na.rm = true).

  3. Uses the name to extract the appropriate column from the collected results:

    out <- .data[[var]]

    Due to the mismatch in names, we end up with out <- NULL

So we end up with n_max <- NULL, hence the error in the seq2() call a little later:

helper_table <- copy_inline(con, tibble(!!row_id_col := seq2(1, n_max)))

Output from `debug()`
> debugonce(dbplyr:::pull.tbl_sql)
> dbplyr::dbplyr_uncount(q, 2)
debugging in: pull.tbl_sql(summarise(ungroup(data), max(!!sym(weights_col), 
    na.rm = TRUE)))
debug: {
    vars <- tbl_vars(.data)
    var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")
    name_quo <- enquo(name)
    if (!quo_is_null(name_quo)) {
        name <- tidyselect::vars_pull(vars, !!name_quo, error_arg = "name")
    }
    .data <- ungroup(.data)
    .data <- select(.data, all_of(c(var, name)))
    .data <- collect(.data)
    out <- .data[[var]]
    if (!is.null(name)) {
        out <- set_names(out, nm = .data[[name]])
    }
    out
}
Browse[2]> n
debug: vars <- tbl_vars(.data)
Browse[2]> n
debug: var <- tidyselect::vars_pull(vars, !!enquo(var), error_arg = "var")
Browse[2]> n
debug: name_quo <- enquo(name)
Browse[2]> n
debug: if (!quo_is_null(name_quo)) {
    name <- tidyselect::vars_pull(vars, !!name_quo, error_arg = "name")
}
Browse[2]> n
debug: .data <- ungroup(.data)
Browse[2]> n
debug: .data <- select(.data, all_of(c(var, name)))
Browse[2]> n
debug: .data <- collect(.data)
Browse[2]> n
debug: out <- .data[[var]]
Browse[2]> n
debug: if (!is.null(name)) {
    out <- set_names(out, nm = .data[[name]])
}
Browse[2]> .data
# A tibble: 1 × 1
  `max(..dbplyr_weight_col, na.rm = true)`
                                     <dbl>
1                                        2
Browse[2]> var
[1] "max(..dbplyr_weight_col, na.rm = TRUE)"
Browse[2]> out
NULL
@owenjonesuob
Copy link
Contributor Author

owenjonesuob commented Apr 1, 2025

I should add for transparency that I'm actually using an AWS Redshift database; and from some further research, I think there's a slight difference between Postgres and Redshift when it comes to lowecase conversion:

Quoting an identifier also makes it case-sensitive, whereas unquoted names are always folded to lower case. For example, the identifiers FOO, foo, and "foo" are considered the same by PostgreSQL, but "Foo" and "FOO" are different from these three and each other. -- Postgres docs

There are two types of identifiers, standard identifiers and quoted or delimited identifiers. Identifiers must consist of only UTF-8 printable characters. ASCII letters in standard and delimited identifiers are case-insensitive and are folded to lowercase in the database. In query results, column names are returned as lowercase by default. -- Redshift docs

In other words, Redshift coerces names to lowercase in results, even if they are quoted in the query - unless a certain database configuration value has been changed from the default, in which case it behaves identically to Postgres (as far as I can tell).

I'm getting a bit lost in the weeds trying to figure out whether/where names get quoted during {dbplyr}'s query building process, but assuming they are quoted, that might narrow this issue down to just a Redshift issue, rather than Postgres too.

@owenjonesuob owenjonesuob changed the title dbplyr_uncount() on Postgres/Redshift tables fails with an opaque error dbplyr_uncount() on Redshift tables fails with an opaque error Apr 2, 2025
owenjonesuob pushed a commit to owenjonesuob/dbplyr that referenced this issue Apr 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant