What is the canonical way to write parquet to disk using duckdb and dbplyr without collecting first? #207

Open
@nicki-dese

The scenario: I'm working with very large (many-GB) CSV files and want to save them to parquet after processing, without reading/collecting them entirely into memory.

I know how to do this with arrow:

library(arrow)

open_dataset("path/to/csv", format = "csv") |>
  do_something() |>   # placeholder for whatever transformations we need
  write_dataset("path/to/export.parquet", format = "parquet")

I also know how to do this with duckdb and SQL:

COPY
    (SELECT * FROM 'path/to/csv.csv')
    TO 'path/to/export.parquet'
    (FORMAT 'parquet')
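
For what it's worth, I can already run that SQL from R through DBI, and it does stream straight to disk; it's the SQL string itself I'd like to hide from the team:

library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# duckdb streams the CSV through the query and writes parquet directly;
# nothing is collected into R.
dbExecute(con, "
  COPY (SELECT * FROM 'path/to/csv.csv')
  TO 'path/to/export.parquet'
  (FORMAT 'parquet')
")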

But I would like to figure out how to do this with duckdb and dbplyr. The context is that members of my team are unfamiliar with SQL but comfortable with dplyr syntax.

The closest I've got so far is:

library(arrow)
library(dplyr)
library(duckdb)

con <- dbConnect(duckdb())

tbl(con, "read_csv('path/to/csv.csv')") |>  # see note below
  do_something() |>   # placeholder transformations, translated to SQL by dbplyr
  to_arrow() |>       # stream the result out of duckdb as an Arrow RecordBatchReader
  write_dataset("path/to/export", format = "parquet")

Note: in my testing you have to pass tbl() the read function as well as the file path, as a single text string, even when the file path already contains the csv or parquet extension. Is this where tbl_file() comes in? If so, how do you pass read_csv parameters through tbl_file(), such as which columns are dates? (As mentioned in #159, it's unclear from the duckdb documentation how to use the duckdb-specific dbplyr functions.)
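
For example, I can get date columns parsed by writing the parameters into the string myself, using read_csv's own arguments (start_date here is a made-up column name), but I don't know whether this is the intended route:

# date parsing handled by duckdb's read_csv itself, via its types argument
tbl(con, "read_csv('path/to/csv.csv', types = {'start_date': 'DATE'})")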

I'm a bit nervous about my workaround of using to_arrow(), because one of the reasons we're starting to use duckdb as a team in preference to arrow is that arrow's auto-detection of schemas from csvs is nowhere near as good as duckdb's, and arrow is fussy and slow at parsing dates. I've also noticed that arrow interprets empty strings differently from duckdb (arrow leaves them as "", duckdb makes them NA). And I'm cautious about introducing a step that might complicate matters and make it unclear what type casting has occurred in the transfer between the two libraries.

I've also tried various versions of copy_to() and db_copy_to() with temporary = FALSE. I've managed to create an in-memory table literally named "'path/to/export.parquet' (FORMAT 'parquet')" (!), but not to actually save anything to disk.
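
The nearest I've come to a pure-duckdb version is rendering the lazy query's SQL myself with dbplyr::remote_query() and pasting it into a COPY statement, something like the sketch below (sql_copy is just a name I've made up) — but that puts the SQL right back in front of my team:

library(dplyr)
library(duckdb)

con <- dbConnect(duckdb())

query <- tbl(con, "read_csv('path/to/csv.csv')") |>
  do_something()   # placeholder transformations, as above

# Render the lazy tbl to its SQL, then wrap it in COPY ... TO so that
# duckdb streams the result straight to parquet without collecting.
sql_copy <- paste0(
  "COPY (", dbplyr::remote_query(query), ") ",
  "TO 'path/to/export.parquet' (FORMAT 'parquet')"
)

dbExecute(con, sql_copy)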

If I should ask this question elsewhere, please let me know.


Labels: feature (a feature request or enhancement), help wanted ❤️ (we'd love your help!)
