Possible regression from joins refactor in 2.3.0 #1346

@MichaelChirico

I'm having a bear of a time trying to debug an issue that arises when updating 2.2.1 -> 2.3.3. It's genuinely hard to disentangle where the issue is coming from, since there are so many layers where things may have gone wrong.

The query in the failing test is pretty simple: inner_join() on some 3-row input tables:

personal_data <- tibble(
  name = c("Alice", "Nick", "Bob"),
  age = c(32, 25, 44)
)

employee_data <- tibble(
  name = c("Alice", "Bob", "Michael"),
  employee_id = c(1, 2, 3)
)

# Copy these tables to DB backend
tables <- list(personal_data, employee_data)
table_names <- paste0("r_tmp_", seq_along(tables))

db_conn <- withr::local_db_connection(MakeTestDBIConnection())
db_names <- lapply(
  table_names,
  \(table_name) DBI::Id(namespace = "datascape", table_name = table_name)
)
db_tables <- mapply(dplyr::copy_to,
  df = tables, name = db_names,
  MoreArgs = list(dest = db_conn, temporary = FALSE),
  SIMPLIFY = FALSE
)

dplyr::inner_join(db_tables[[1]], db_tables[[2]], by = "name")

This test (which compares this join's output to the local dplyr equivalent) works as expected on 2.2.1 but breaks on 2.3.3:

Error in `dplyr::collect(x)`: Failed to collect lazy table.
Caused by error in `doTryCatch()`:
! INVALID_ARGUMENT: SQL_ANALYSIS_ERROR: Syntax error: Expected end of input but got "." [at 2:46]
FROM `datascape`.`r_tmp_1` AS `\`datascape\``.`\`r_tmp_1\``

The issue is that the already-escaped name `datascape`.`r_tmp_1` is escaped a second time when it is used as the table alias, which produces invalid SQL.
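To make the failure mode concrete, here is a minimal base-R sketch of what escaping an already-escaped name does. quote_backtick() is a hypothetical helper written for illustration only; it is not dbplyr's or DBI's actual quoting code, and it simply assumes a backend that wraps identifiers in backticks and backslash-escapes embedded backticks.

```r
# Hypothetical helper (an assumption, not part of dbplyr or DBI): wrap a name
# in backticks and backslash-escape any backticks already inside it.
quote_backtick <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  chars[chars == "`"] <- "\\`"
  paste0("`", paste(chars, collapse = ""), "`")
}

parts <- c("datascape", "r_tmp_1")

# Escaping once gives the valid qualified name.
once <- vapply(parts, quote_backtick, character(1), USE.NAMES = FALSE)
escaped_once <- paste(once, collapse = ".")

# Escaping the already-escaped pieces again gives the broken alias.
twice <- vapply(once, quote_backtick, character(1), USE.NAMES = FALSE)
escaped_twice <- paste(twice, collapse = ".")

cat(escaped_once, "\n")
cat(escaped_twice, "\n")
```

The second string has the same shape as the alias in the error message above, which is consistent with the name passing through a quoting step twice.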

Poking around in debugging I'm not able to tell what went wrong. It's possible our own connection methods are doing something unexpected, for example.

Just one observation:

Here, IIUC, we should respect the pre-escaped nature of the input when constructing by$x_as:

dbplyr/R/lazy-join-query.R

Lines 171 to 178 in 5fa4410

op$joins$by <- purrr::map2(
  op$joins$by, seq_along(op$joins$by),
  function(by, i) {
    by$x_as <- table_names_out[op$joins$by_x_table_id[[i]]]
    by$y_as <- table_names_out[i + 1L]
    by
  }
)

Debugging, I see this around that step:

dput(op$joins$by)
# list(list(
#   x = structure("name", class = c("ident", "character")),
#   y = structure("name", class = c("ident", "character")),
#   condition = "==", 
#   on = structure(character(0), class = c("sql", "character")),
#   na_matches = "never"
# ))
dput(table_names_out)
# c("`datascape`.`r_tmp_1`", "`datascape`.`r_tmp_2`")
dput(op$joins$by_x_table_id)
# list(1L)

# but also
dput(op$x$x)
# structure("`datascape`.`r_tmp_1`", class = c("ident_q", "ident", "character"))
dput(op$joins$table)
# list(structure("`datascape`.`r_tmp_2`", class = c("ident_q", "ident", "character")))

Perhaps table_names_out should be ident_q at this step, but even then the result would be the same, because ident() drops the ident_q class:

identical(
  ident(table_names_out[1L]),
  ident(ident_q(table_names_out[1L]))
)
# [1] TRUE

Should ident() have an escape hatch for input that is already "ident"? If so, we should also make sure table_names_out carries the same ident_q class as the inputs x$x and joins$table.
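As a rough sketch of what such an escape hatch could look like, the base-R code below defines two hypothetical constructors, new_ident() and new_ident_q(). They are stand-ins invented for this illustration, not dbplyr's ident() and ident_q(), and the pass-through rule is an assumption about the desired behaviour rather than a description of any existing implementation.

```r
# Hypothetical stand-in for a "pre-quoted identifier" constructor.
new_ident_q <- function(x) {
  structure(as.character(x), class = c("ident_q", "ident", "character"))
}

# Hypothetical stand-in for an identifier constructor with an escape hatch:
# input that is already marked as pre-quoted is returned unchanged.
new_ident <- function(x) {
  if (inherits(x, "ident_q")) {
    return(x)
  }
  structure(as.character(x), class = c("ident", "character"))
}

quoted <- new_ident_q("`datascape`.`r_tmp_1`")
plain <- new_ident("`datascape`.`r_tmp_1`")

# With the pass-through, the pre-quoted marker survives, so a later escaping
# step could tell the two cases apart.
kept_class <- inherits(new_ident(quoted), "ident_q")
differs <- !identical(plain, new_ident(quoted))
```

Under this rule the two values stay distinguishable, unlike the identical() check above. Whether ident() or the code that builds table_names_out is the right place for the change is a separate question.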

Maybe I'm barking up the wrong tree.
