Skip to content

Unable to SET empty strings to NULL with duckdb 1.2.0 #1082

@dhersz

Description

@dhersz

I've recently come across a weird bug, in which I'm unable to SET empty strings to NULL. This issue only happens with duckdb 1.2.0, when using 1.1.3 I have no problems with this operation.

I've tried creating a small reproducible example, but didn't manage to do it. Instead, I've created a table with 4 million rows in which the issue seems to come up quite frequently (please note that sometimes it works as expected, although it seems to be a small minority of the cases), which is attached to this issue. This table has 2 columns, cbo1994, the one in which the problem appears, and id_estab, that holds 14-digit ids. If id_estab is missing, the issue doesn't come up - i.e. the problem seems to be related with this column, for some strange reason.

Another strange behavior is that the issue only comes up when using the full table, with 4 million rows. If I subset to use only the first 3 million, the code runs fine. The issue is not related to the last 1 million per se, however - I also tried to run the code using only the last million, but it worked fine.

Reprex:

library(data.table)

tmpdb <- tempfile(fileext = ".duckdb")
con <- duckdb::dbConnect(duckdb::duckdb(), dbdir = tmpdb)

test <- readRDS("path/to/test.rds")
  
duckdb::dbWriteTable(con, "test", test, overwrite = TRUE)
query <- "UPDATE test SET cbo1994 = NULL WHERE cbo1994 = '';"
n_linhas_afetadas <- DBI::dbExecute(con, query)
dplyr::tbl(con, dplyr::sql("SELECT * FROM test"))
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB v1.2.0 [dhersz@Windows Server x64:R 4.4.2/C:\Users\dhersz\AppData\Local\Temp\7\RtmpOEiIFH\filed0cc461710fc.duckdb]
#>    cbo1994 id_estab      
#>    <chr>   <chr>         
#>  1 "58990" 11111111111111
#>  2 "58330" 11111111111111
#>  3 "58330" 11111111111111
#>  4 "98540" 11111111111111
#>  5 "84350" 11111111111111
#>  6 "42190" 11111111111111
#>  7 ""      11111111111111
#>  8 "98590" 11111111111111
#>  9 "73290" 11111111111111
#> 10 "73290" 11111111111111
#> # ℹ more rows

test <- test[1:3000000,]
duckdb::dbWriteTable(con, "test", test, overwrite = TRUE)
query <- "UPDATE test SET cbo1994 = NULL WHERE cbo1994 = '';"
n_linhas_afetadas <- DBI::dbExecute(con, query)
dplyr::tbl(con, dplyr::sql("SELECT * FROM test"))
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB v1.2.0 [dhersz@Windows Server x64:R 4.4.2/C:\Users\dhersz\AppData\Local\Temp\7\RtmpOEiIFH\filed0cc461710fc.duckdb]
#>    cbo1994 id_estab      
#>    <chr>   <chr>         
#>  1 58990   11111111111111
#>  2 58330   11111111111111
#>  3 58330   11111111111111
#>  4 98540   11111111111111
#>  5 84350   11111111111111
#>  6 42190   11111111111111
#>  7 <NA>    11111111111111
#>  8 98590   11111111111111
#>  9 73290   11111111111111
#> 10 73290   11111111111111
#> # ℹ more rows

The exact same code works as expected with duckdb 1.1.3:

library(data.table)

tmpdb <- tempfile(fileext = ".duckdb")
con <- duckdb::dbConnect(duckdb::duckdb(), dbdir = tmpdb)

test <- readRDS("path/to/test.rds")
  
duckdb::dbWriteTable(con, "test", test, overwrite = TRUE)
query <- "UPDATE test SET cbo1994 = NULL WHERE cbo1994 = '';"
n_linhas_afetadas <- DBI::dbExecute(con, query)
dplyr::tbl(con, dplyr::sql("SELECT * FROM test"))
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB v1.1.3 [dhersz@Windows Server x64:R 4.4.2/C:\Users\dhersz\AppData\Local\Temp\7\RtmpAFcMMe\filec71467c348db.duckdb]
#>    cbo1994 id_estab      
#>    <chr>   <chr>         
#>  1 58990   11111111111111
#>  2 58330   11111111111111
#>  3 58330   11111111111111
#>  4 98540   11111111111111
#>  5 84350   11111111111111
#>  6 42190   11111111111111
#>  7 <NA>    11111111111111
#>  8 98590   11111111111111
#>  9 73290   11111111111111
#> 10 73290   11111111111111
#> # ℹ more rows

test <- test[1:3000000,]
duckdb::dbWriteTable(con, "test", test, overwrite = TRUE)
query <- "UPDATE test SET cbo1994 = NULL WHERE cbo1994 = '';"
n_linhas_afetadas <- DBI::dbExecute(con, query)
dplyr::tbl(con, dplyr::sql("SELECT * FROM test"))
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB v1.1.3 [dhersz@Windows Server x64:R 4.4.2/C:\Users\dhersz\AppData\Local\Temp\7\RtmpAFcMMe\filec71467c348db.duckdb]
#>    cbo1994 id_estab      
#>    <chr>   <chr>         
#>  1 58990   11111111111111
#>  2 58330   11111111111111
#>  3 58330   11111111111111
#>  4 98540   11111111111111
#>  5 84350   11111111111111
#>  6 42190   11111111111111
#>  7 <NA>    11111111111111
#>  8 98590   11111111111111
#>  9 73290   11111111111111
#> 10 73290   11111111111111
#> # ℹ more rows

Again, I've tried reproducing the behavior with a smaller dataset, but wasn't able to, so please see attached a zip file with the required data in .rds format.

Best, glad to help with anything needed!

test.zip

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions