-
Notifications
You must be signed in to change notification settings - Fork 41
Description
I've recently come across a weird bug, in which I'm unable to SET
empty strings to NULL
. This issue only happens with duckdb 1.2.0, when using 1.1.3 I have no problems with this operation.
I've tried creating a small reproducible example, but didn't manage to do it. Instead, I've created a table with 4 million rows in which the issue seems to come up quite frequently (please note that sometimes it works as expected, although it seems to be a small minority of the cases), which is attached to this issue. This table has 2 columns, cbo1994
, the one in which the problem appears, and id_estab
, that holds 14-digit ids. If id_estab
is missing, the issue doesn't come up - i.e. the problem seems to be related with this column, for some strange reason.
Another strange behavior is that the issue only comes up when using the full table, with 4 million rows. If I subset to use only the first 3 million, the code runs fine. The issue is not related to the last 1 million per se, however - I also tried to run the code using only the last million, but it worked fine.
Reprex:
library(data.table)
tmpdb <- tempfile(fileext = ".duckdb")
con <- duckdb::dbConnect(duckdb::duckdb(), dbdir = tmpdb)
test <- readRDS("path/to/test.rds")
duckdb::dbWriteTable(con, "test", test, overwrite = TRUE)
query <- "UPDATE test SET cbo1994 = NULL WHERE cbo1994 = '';"
n_linhas_afetadas <- DBI::dbExecute(con, query)
dplyr::tbl(con, dplyr::sql("SELECT * FROM test"))
#> # Source: SQL [?? x 2]
#> # Database: DuckDB v1.2.0 [dhersz@Windows Server x64:R 4.4.2/C:\Users\dhersz\AppData\Local\Temp\7\RtmpOEiIFH\filed0cc461710fc.duckdb]
#> cbo1994 id_estab
#> <chr> <chr>
#> 1 "58990" 11111111111111
#> 2 "58330" 11111111111111
#> 3 "58330" 11111111111111
#> 4 "98540" 11111111111111
#> 5 "84350" 11111111111111
#> 6 "42190" 11111111111111
#> 7 "" 11111111111111
#> 8 "98590" 11111111111111
#> 9 "73290" 11111111111111
#> 10 "73290" 11111111111111
#> # ℹ more rows
test <- test[1:3000000,]
duckdb::dbWriteTable(con, "test", test, overwrite = TRUE)
query <- "UPDATE test SET cbo1994 = NULL WHERE cbo1994 = '';"
n_linhas_afetadas <- DBI::dbExecute(con, query)
dplyr::tbl(con, dplyr::sql("SELECT * FROM test"))
#> # Source: SQL [?? x 2]
#> # Database: DuckDB v1.2.0 [dhersz@Windows Server x64:R 4.4.2/C:\Users\dhersz\AppData\Local\Temp\7\RtmpOEiIFH\filed0cc461710fc.duckdb]
#> cbo1994 id_estab
#> <chr> <chr>
#> 1 58990 11111111111111
#> 2 58330 11111111111111
#> 3 58330 11111111111111
#> 4 98540 11111111111111
#> 5 84350 11111111111111
#> 6 42190 11111111111111
#> 7 <NA> 11111111111111
#> 8 98590 11111111111111
#> 9 73290 11111111111111
#> 10 73290 11111111111111
#> # ℹ more rows
The exact same code works as expected with duckdb 1.1.3:
library(data.table)
tmpdb <- tempfile(fileext = ".duckdb")
con <- duckdb::dbConnect(duckdb::duckdb(), dbdir = tmpdb)
test <- readRDS("path/to/test.rds")
duckdb::dbWriteTable(con, "test", test, overwrite = TRUE)
query <- "UPDATE test SET cbo1994 = NULL WHERE cbo1994 = '';"
n_linhas_afetadas <- DBI::dbExecute(con, query)
dplyr::tbl(con, dplyr::sql("SELECT * FROM test"))
#> # Source: SQL [?? x 2]
#> # Database: DuckDB v1.1.3 [dhersz@Windows Server x64:R 4.4.2/C:\Users\dhersz\AppData\Local\Temp\7\RtmpAFcMMe\filec71467c348db.duckdb]
#> cbo1994 id_estab
#> <chr> <chr>
#> 1 58990 11111111111111
#> 2 58330 11111111111111
#> 3 58330 11111111111111
#> 4 98540 11111111111111
#> 5 84350 11111111111111
#> 6 42190 11111111111111
#> 7 <NA> 11111111111111
#> 8 98590 11111111111111
#> 9 73290 11111111111111
#> 10 73290 11111111111111
#> # ℹ more rows
test <- test[1:3000000,]
duckdb::dbWriteTable(con, "test", test, overwrite = TRUE)
query <- "UPDATE test SET cbo1994 = NULL WHERE cbo1994 = '';"
n_linhas_afetadas <- DBI::dbExecute(con, query)
dplyr::tbl(con, dplyr::sql("SELECT * FROM test"))
#> # Source: SQL [?? x 2]
#> # Database: DuckDB v1.1.3 [dhersz@Windows Server x64:R 4.4.2/C:\Users\dhersz\AppData\Local\Temp\7\RtmpAFcMMe\filec71467c348db.duckdb]
#> cbo1994 id_estab
#> <chr> <chr>
#> 1 58990 11111111111111
#> 2 58330 11111111111111
#> 3 58330 11111111111111
#> 4 98540 11111111111111
#> 5 84350 11111111111111
#> 6 42190 11111111111111
#> 7 <NA> 11111111111111
#> 8 98590 11111111111111
#> 9 73290 11111111111111
#> 10 73290 11111111111111
#> # ℹ more rows
Again, I've tried reproducing the behavior with a smaller dataset, but wasn't able to, so please see attached a zip file with the required data in .rds
format.
Best, glad to help with anything needed!