|
1 | 1 | --- |
2 | | -engine: knitr |
3 | 2 | title: An introduction to DuckDB |
4 | 3 | --- |
5 | 4 |
|
6 | | -# ️✅ Learning objectives |
| 5 | +# Welcome! |
7 | 6 |
|
8 | | -## LOs for the entire book |
| 7 | +## Logistics |
| 8 | +- All the important bookclub links are under Slack's "Bookmarks" tab |
| 9 | + - [Claim a chapter](https://docs.google.com/spreadsheets/d/1B9WV0Iiv6XYDX49qKWuLkfiCrM0YUkfiCYB6NZH4DkY/edit?gid=0#gid=0) |
| 10 | + - [GitHub repository](https://github.com/r4ds/bookclub-duckdb) |
| 11 | +- Potential guest speakers |
9 | 12 |
|
10 | | -- FILL IN |
| 13 | +## Agenda |
| 14 | +- Review Chapter 1 |
| 15 | +- Discuss the finer points of contributing to the bookclub GitHub repository |
11 | 16 |
|
12 | | -## LOs for this chapter |
| 17 | +# Chapter 1 |
13 | 18 |
|
14 | | -- FILL IN |
| 19 | +## Learning Goals |
15 | 20 |
|
16 | | -# Title for Group of Slides |
| 21 | +- What is DuckDB and why bother learning to use it? |
17 | 22 |
|
18 | | -## Title for Individual Slide |
| 23 | +## Basics |
19 | 24 |
|
20 | | -- Slide-like content |
| 25 | +- DuckDB is a single-node, in-memory database that integrates well at many points in the data pipeline. |
| 26 | +- DuckDB is very fast - much faster than `dplyr` or `pandas` for data transformation. |
| 27 | +- DuckDB's intellectual property is held by the DuckDB Foundation, a Dutch non-profit (stichting). |
21 | 28 |
|
22 | | -::: notes |
23 | | -- Speaker notes for this slide |
| 29 | +## How Fast? |
| 30 | + |
| 31 | + |
| 32 | +## How Fast? |
| 33 | + |
| 34 | + |
| 35 | + |
| 36 | +## Where Does DuckDB Fit In? |
| 37 | + |
| 38 | +- DuckDB is an "in-process" database: it runs inside the host application's process and memory space, such as an R or Python session. |
| 39 | +- It can speed up traditional "small data" workloads by interfacing with R or Python libraries. |
| 40 | +- It can extend local analysis to data of "a few hundred gigabytes". |
| 41 | +- It can operate directly "at the edge" or "in the cloud." For instance, it can analyze data stored in a cloud S3 bucket in-memory, avoiding costly transfer operations. |
| 42 | + |
| 43 | +## Where Does DuckDB Fit In? |
| 44 | + |
| 45 | +::: {.callout-tip} |
| 46 | +# How much data? |
| 47 | +Despite being in-memory, DuckDB allows analyses to "spill over" onto the hard disk. What are the limits of this fallback? |
24 | 48 | ::: |
| 49 | + |
| 50 | +::: {.callout-tip} |
| 51 | +# Which process? |
| 52 | +What counts as a "process" for in-process? The authors discuss querying S3 files from a Cloud VM instance "in process" - is this a serverless deployment of Python? What's the "process" for the DuckDB CLI? |
| 53 | +::: |
| 54 | + |
| 55 | +## Where Not to Use DuckDB? |
| 56 | + |
| 57 | +- Transformation and analysis of very large data sets |
| 58 | +- "Streaming data" in real time without batching first (but see this DuckDB [blog post](https://duckdb.org/2025/10/13/duckdb-streaming-patterns)) |
| 59 | +- Applications involving concurrent writes - traditional databases are still best for this. |
| 60 | + |
| 61 | +## How Can I use DuckDB? |
| 62 | + |
| 63 | +- Using SQL! |
| 64 | +- DuckDB has R and Python APIs that mirror dplyr and pandas syntax. |
| 65 | +- But the SQL API appears to get "first class" treatment. |
| 66 | + |
| 67 | +## Supported File Types |
| 68 | + |
| 69 | +- Parquet* |
| 70 | +- In-memory dataframes |
| 71 | +- CSV |
| 72 | +- JSON |
| 73 | +- Apache Arrow columnar shaped data |
| 74 | +- Cloud buckets like S3 or Google Cloud Storage (GCS) |
| 75 | +- DuckDB database format (.duckdb) |
| 76 | + |
| 77 | +## DuckDB SQL |
| 78 | +- Supported data types include "traditional" SQL ones: `varchar`, `numeric`, etc. |
| 79 | +- It also supports some data types that are uncommon in databases but well known in programming languages: enums, lists, maps (dictionaries) and structs. |
| 80 | + |
| 81 | +## DuckDB SQL |
| 82 | +- DuckDB's [Friendly SQL](https://duckdb.org/docs/stable/sql/dialect/friendly_sql) eliminates some common SQL pain points. |
| 83 | +- For instance, you can select all of the columns in a table using `SELECT *` instead of enumerating every column name. |
| 84 | +- DuckDB includes a range of aggregation functions, grouping functions, and support for SQL features like common table expressions. |
| 85 | + |
| 86 | +# Interacting with the Bookclub GitHub Repository |
| 87 | + |
| 88 | +## Overview |
| 89 | +- The repo is at https://github.com/r4ds/bookclub-duckdb |
| 90 | +- We will write a "book" as we upload our presentations to this repo |
| 91 | +- Under the hood, the bookclub repository uses Quarto and GitHub Actions to render a slick website |
| 92 | +- The book covers a few different technologies. Here is how I have managed the Python, R and SQL dependencies. |
| 93 | + |
| 94 | +## Fork and Clone |
| 95 | + |
| 96 | +- The repo contains excellent instructions for updating the repository using R and the `usethis` package. |
| 97 | + |
| 98 | +- If you do not use R, the gist is to create a personal fork of the repository, make your changes on that fork, and then submit your changes back to the main repository with a pull request. |
| 99 | + |
| 100 | +- Make sure your fork is up to date with the main repository! |
| 101 | + |
| 102 | +## Edit Chapter Files |
| 103 | + |
| 104 | +- You can edit the `.qmd` file for the chapter you are presenting under `/slides/xx.qmd`. |
| 105 | +- To render in a presentation-friendly slideshow format, use revealjs. |
| 106 | +- From the command line: |
| 107 | +```{bash} |
| 108 | +#| eval: FALSE |
| 109 | +#| echo: TRUE |
| 110 | +quarto render ~/bookclub-duckdb/slides/01.qmd --to revealjs |
| 111 | +``` |
| 112 | +- Or using the `YAML` header: |
| 113 | + |
| 114 | +```{yaml} |
| 115 | +format: revealjs |
| 116 | +``` |
| 117 | + |
| 118 | +- By default, slides render to `/_site/slides/xx.html` |
| 119 | + |
| 120 | +## Dependency Management |
| 121 | + |
| 122 | +- To render the book chapters on your local machine, you need to load the R dependencies listed in the `DESCRIPTION` file. You need these even if your presentation has no R code in it. |
| 123 | + |
| 124 | +- There are no Python dependencies right now, but once we hit that point we will record them in `pyproject.toml` in the repository. |
| 125 | + |
| 126 | +## Installing R dependencies |
| 127 | + |
| 128 | +- Here is an automated way to load the correct R dependencies using `pak` and `renv`: |
| 129 | + |
| 130 | +```{r} |
| 131 | +#| echo: true |
| 132 | +#| eval: false |
| 133 | +
|
| 134 | +## Install pak and renv |
| 135 | +if (!requireNamespace("pak", quietly = TRUE)) { |
| 136 | +  install.packages("pak", repos = sprintf( |
| 137 | +    "https://r-lib.github.io/p/pak/stable/%s/%s/%s", |
| 138 | +    .Platform$pkgType, |
| 139 | +    R.Version()$os, |
| 140 | +    R.Version()$arch |
| 141 | +  )) |
| 142 | +} |
| 143 | +if (!requireNamespace("renv", quietly = TRUE)) { |
| 144 | + pak::pkg_install("renv") |
| 145 | +} |
| 146 | +## Configure renv to use pak to install packages |
| 147 | +options(renv.config.pak.enabled = TRUE) |
| 148 | +## Configure renv to snapshot dependencies from the DESCRIPTION file |
| 149 | +renv::settings$snapshot.type("explicit") |
| 150 | +
|
| 151 | +## Install the project dependencies (package names only, dropping any "R" entry) |
| 152 | +pak::pkg_install(setdiff(renv::dependencies(path = "DESCRIPTION")$Package, "R")) |
| 153 | +``` |
| 154 | + |
| 155 | +## Installing Python dependencies |
| 156 | +- If you add any Python code to your presentation, you will need to add Python dependencies. |
| 157 | +- An easy way to do this is to [install uv](https://docs.astral.sh/uv/getting-started/installation/). |
| 158 | +- You can sync your local machine with the Python dependencies: |
| 159 | +```{bash} |
| 160 | +#| echo: true |
| 161 | +#| eval: false |
| 162 | +uv sync |
| 163 | +source .venv/bin/activate |
| 164 | +``` |
| 165 | + |
| 166 | +## Adding Packages |
| 167 | + |
| 168 | +- If you add any new R or Python packages to your local repository, you should make sure those dependencies get reflected in the main repository. |
| 169 | +- If you add R dependencies, make sure to update `DESCRIPTION` and commit the changes |
| 170 | +```{r} |
| 171 | +#| echo: TRUE |
| 172 | +#| eval: FALSE |
| 173 | +usethis::use_package("duckdb", min_version = TRUE) |
| 174 | +``` |
| 175 | + |
| 176 | +- If you add Python dependencies, make sure to update `pyproject.toml` (`uv add` does this automatically) |
| 177 | + |
| 178 | +```{bash} |
| 179 | +#| echo: TRUE |
| 180 | +#| eval: FALSE |
| 181 | +# install AND update pyproject.toml |
| 182 | +uv add duckdb |
| 183 | +``` |
| 184 | + |
| 185 | +## Adding Packages |
| 186 | +- Generally, with this setup you should commit and push changes to `pyproject.toml` and `DESCRIPTION`, but *not* changes to `renv.lock` and `uv.lock` |
| 187 | + |
| 188 | +## SQL Chunks in Quarto |
| 189 | +- Quarto can execute SQL code chunks if we provide it an appropriate database backend (thanks to [this blog post](https://danielroelfs.com/posts/sql-notebooks-with-quarto/)). |
| 190 | +- You can define the database connection in R or Python, and then pass it to the SQL chunk through the `connection` chunk option. |
| 191 | +- To create the database connection in R or Python: |
| 192 | +```{r} |
| 193 | +#| echo: true |
| 194 | +#| eval: false |
| 195 | +con_flights <- DBI::dbConnect( |
| 196 | + drv = duckdb::duckdb(), |
| 197 | + dbdir = "./data/flights.duckdb", |
| 198 | + read_only = TRUE |
| 199 | +) |
| 200 | +``` |
| 201 | + |
| 202 | +Or, in Python: |
| 203 | + |
| 204 | +```{python} |
| 205 | +#| echo: true |
| 206 | +#| eval: false |
| 207 | +import duckdb |
| 208 | +con_flights = duckdb.connect('flights.duckdb', read_only=True) |
| 209 | +``` |
| 210 | + |
| 211 | +## SQL Chunks in Quarto |
| 212 | +Then, to create the chunk: |
| 213 | + |
| 214 | +```{sql} |
| 215 | +#| echo: fenced |
| 216 | +#| eval: false |
| 217 | +#| connection: con_flights |
| 218 | +
|
| 219 | +SELECT name, carrier FROM airlines LIMIT 10; |
| 220 | +``` |
| 221 | + |
| 222 | +## Quarto Execution Option - Freeze |
| 223 | + |
| 224 | +- The book covers a lot of different technologies, and we may tire of troubleshooting dependencies in the main repo. |
| 225 | + |
| 226 | +- Quarto's "freeze" option is a quick fix for this issue. It tells the rendering pipeline in the main repo not to re-execute that chapter's code and instead to reuse the stored results in `_freeze/slides/xx/`. |
| 227 | + |
| 228 | +- In the top level YAML, use: |
| 229 | +```{yaml} |
| 230 | +#| echo: true |
| 231 | +#| eval: false |
| 232 | +execute: |
| 233 | + freeze: true |
| 234 | +``` |
| 235 | + |
| 236 | +- If you use this method, make sure to commit changes to `_freeze/slides/xx` to the repository! |
| 237 | + |
| 238 | +## Next week |
| 239 | +- Chapter 2 is quite short and Chapter 3 is longer - should we bundle/split? |