Commit 7430ac9

Merge pull request #1 from coadkins/main
update chapter 1 slides
2 parents e352546 + 0cc186f commit 7430ac9

5 files changed

+236
-11
lines changed

DESCRIPTION

Lines changed: 1 addition & 0 deletions
@@ -19,5 +19,6 @@ Imports:
1919
glue,
2020
purrr,
2121
readr,
22+
reticulate (>= 1.44.1),
2223
rmarkdown,
2324
yaml

assets/01/duckdb-5.png

68.6 KB

assets/01/duckdb-50.png

61.3 KB

pyproject.toml

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
1+
[project]
2+
name = "bookclub-duckdb"
3+
version = "0.1.0"
4+
description = "DuckDB in Action Book Club"
5+
readme = "README.md"
6+
requires-python = ">=3.11"
7+
dependencies = [
8+
"duckdb>=1.4.2",
9+
]

slides/01.qmd

Lines changed: 226 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -1,24 +1,239 @@
11
---
2-
engine: knitr
32
title: An introduction to DuckDB
43
---
54

6-
# ️✅ Learning objectives
5+
# Welcome!
76

8-
## LOs for the entire book
7+
## Logistics
8+
- All the important bookclub links are under Slack's "Bookmarks" tab
9+
- [Claim a chapter](https://docs.google.com/spreadsheets/d/1B9WV0Iiv6XYDX49qKWuLkfiCrM0YUkfiCYB6NZH4DkY/edit?gid=0#gid=0)
10+
- [GitHub repository](https://github.com/r4ds/bookclub-duckdb)
11+
- Potential guest speakers
912

10-
- FILL IN
13+
## Agenda
14+
- Review Chapter 1
15+
- Discuss the finer points of contributing to the bookclub GitHub repository
1116

12-
## LOs for this chapter
17+
# Chapter 1
1318

14-
- FILL IN
19+
## Learning Goals
1520

16-
# Title for Group of Slides
21+
- What is DuckDB and why bother learning to use it?
1722

18-
## Title for Individual Slide
23+
## Basics
1924

20-
- Slide-like content
25+
- DuckDB is a single-node, in-memory database that integrates well into many places in the data pipeline.
26+
- DuckDB is very fast - much faster than `dplyr` or `pandas` for data transformation.
27+
- The DuckDB IP is owned by the DuckDB Foundation, a Dutch non-profit (stichting).
2128

22-
::: notes
23-
- Speaker notes for this slide
29+
## How Fast?
30+
![Source: DuckDB Labs](/assets/01/duckdb-5.png)
31+
32+
## How Fast?
33+
34+
![Source: DuckDB Labs](/assets/01/duckdb-50.png)
35+
36+
## Where Does DuckDB Fit In?
37+
38+
- DuckDB is an "in process" database - it runs in another application's memory space, like R or Python.
39+
- It can speed up traditional "small data" workloads by interfacing with R or Python libraries.
40+
- It can extend local analysis to data of "a few hundred gigabytes".
41+
- It can operate directly "at the edge" or "in the cloud." For instance, it can analyze data stored in a cloud S3 bucket in-memory, avoiding costly transfer operations.
42+
43+
## Where Does DuckDB Fit In?
44+
45+
::: {.callout-tip}
46+
# How much data?
47+
Despite being in-memory, DuckDB allows analyses to "spill over" onto disk. What are the limits of this fallback?
2448
:::
49+
50+
::: {.callout-tip}
51+
# Which process?
52+
What counts as a "process" for in-process? The authors discuss querying S3 files from a Cloud VM instance "in process" - is this a serverless deployment of Python? What's the "process" for the DuckDB CLI?
53+
:::
54+
55+
## Where Not to Use DuckDB?
56+
57+
- Transformation and analysis of very large data sets
58+
- "Streaming data" in real-time without batching first (but see this DuckDB [blog post](https://duckdb.org/2025/10/13/duckdb-streaming-patterns))
59+
- Applications involving concurrent writes - traditional databases are still best for this.
60+
61+
## How Can I Use DuckDB?
62+
63+
- Using SQL!
64+
- DuckDB has R and Python APIs that mirror dplyr and pandas syntax.
65+
- But the SQL API appears to get "first class" treatment.
66+
67+
## Supported File Types
68+
69+
- Parquet*
70+
- In-memory dataframes
71+
- CSV
72+
- JSON
73+
- Apache Arrow columnar data
74+
- Cloud buckets such as Amazon S3 or Google Cloud Storage
75+
- DuckDB database format (.duckdb)
76+
77+
## DuckDB SQL
78+
- Supported data types include "traditional" SQL ones: `varchar`, `numeric`, etc.
79+
- It also supports some data types that are uncommon in databases but well known in programming languages: enums, lists, maps (dictionaries) and structs.
80+
81+
## DuckDB SQL
82+
- DuckDB's [Friendly SQL](https://duckdb.org/docs/stable/sql/dialect/friendly_sql) eliminates some common SQL pain points.
83+
- For instance, `SELECT * EXCLUDE (...)` returns every column in a table except the ones you list, so you do not have to enumerate all the others.
84+
- DuckDB includes a range of aggregation functions, grouping functions, and support for SQL features like common table expressions.
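
As a rough illustration of the last bullet, here is a minimal Python sketch that combines a common table expression with grouped aggregates. The `flights` table, its columns, and its rows are hypothetical and invented for this example. The sketch uses the `duckdb` package when it is installed and otherwise falls back to the standard library's `sqlite3`, which happens to accept the same SQL, so it stays runnable either way.

```python
# Prefer DuckDB when it is installed; otherwise fall back to the standard
# library's sqlite3, which accepts the same SQL used below.
try:
    import duckdb

    con = duckdb.connect()  # no path given: in-memory database
except ImportError:
    import sqlite3

    con = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for illustration only.
con.execute("CREATE TABLE flights (carrier VARCHAR, dep_delay INTEGER)")
con.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [["AA", 10], ["AA", 20], ["DL", 0], ["DL", 30], ["UA", 5]],
)

query = """
WITH per_carrier AS (          -- common table expression
    SELECT carrier,
           COUNT(*)       AS n_flights,
           AVG(dep_delay) AS avg_delay
    FROM flights
    GROUP BY carrier
)
SELECT carrier, n_flights, avg_delay
FROM per_carrier
WHERE n_flights >= 2           -- keep carriers with at least two flights
ORDER BY carrier
"""

rows = con.execute(query).fetchall()
for row in rows:
    print(row)
```

The CTE keeps the aggregation step separate from the final filter, which tends to read more clearly than a nested subquery as queries grow.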
85+
86+
# Interacting with the Bookclub GitHub Repository
87+
88+
## Overview
89+
- The repo is at https://github.com/r4ds/bookclub-duckdb
90+
- We will write a "book" as we upload our presentations to this repo
91+
- Under the hood, the bookclub repository uses Quarto and GitHub Actions to render a slick website
92+
- The book covers a few different technologies. Here is how I have managed the Python, R and SQL dependencies
93+
94+
## Fork and Clone
95+
96+
- The repo contains excellent instructions for updating the repository using R and the `usethis` package.
97+
98+
- If you do not use R, the gist is to create a personal fork of the repository, make your changes to that fork, and then open a pull request to merge your changes back into the main repository.
99+
100+
- Make sure your fork is up to date with the main repository!
101+
102+
## Edit Chapter Files
103+
104+
- You can edit the `.qmd` file for the chapter you are presenting at `/slides/xx.qmd`.
105+
- To render in a presentation-friendly slideshow format, use revealjs.
106+
- From the command line:
107+
```{bash}
108+
#| eval: FALSE
109+
#| echo: TRUE
110+
quarto render ~/bookclub-duckdb/slides/01.qmd --to revealjs
111+
```
112+
- Or using the `YAML` header:
113+
114+
```{yaml}
115+
format: revealjs
116+
```
117+
118+
- By default, slides render to `/_site/slides/xx.html`
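
If you would rather drive the render from a script, the command-line call above could be wrapped along these lines. This is only a sketch: the helper names `quarto_render_command` and `render_slides` are made up for illustration, and it assumes the `quarto` executable is on your `PATH`.

```python
import shutil
import subprocess


def quarto_render_command(qmd_path, output_format="revealjs"):
    """Build the argument list for `quarto render <file> --to <format>`."""
    return ["quarto", "render", str(qmd_path), "--to", output_format]


def render_slides(qmd_path, output_format="revealjs"):
    """Run the render; raises CalledProcessError if quarto exits non-zero."""
    if shutil.which("quarto") is None:
        raise RuntimeError("quarto was not found on PATH")
    subprocess.run(quarto_render_command(qmd_path, output_format), check=True)


# Building the command has no side effects, so it is safe to inspect first.
command = quarto_render_command("slides/01.qmd")
print(" ".join(command))
```

Calling `render_slides("slides/01.qmd")` from the repository root would then run the same command shown in the bash chunk.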
119+
120+
## Dependency Management
121+
122+
- To render the book chapters on your local machine, you need to install the R dependencies listed in the `DESCRIPTION` file. You need these even if your presentation has no R code in it.
123+
124+
- There are no Python dependencies right now, but once we hit that point we will record them in `pyproject.toml` in the repository.
125+
126+
## Installing R dependencies
127+
128+
- Here is an automated way to install the correct R dependencies using `pak` and `renv`:
129+
130+
```{r}
131+
#| echo: true
132+
#| eval: false
133+
134+
## Install pak and renv
135+
if (!requireNamespace("pak", quietly = TRUE)) {
136+
install.packages("pak", repos = sprintf(
137+
"https://r-lib.github.io/p/pak/stable/%s/%s/%s",
138+
.Platform$pkgType,
139+
R.Version()$os,
140+
R.Version()$arch
141+
))
142+
}
143+
if (!requireNamespace("renv", quietly = TRUE)) {
144+
pak::pkg_install("renv")
145+
}
146+
## Configure renv to use pak to install packages
147+
options(renv.config.pak.enabled = TRUE)
148+
## Configure renv to snapshot dependencies from the DESCRIPTION file
149+
renv::settings$snapshot.type("explicit")
150+
151+
## Install the project dependencies
152+
pak::pkg_install(unique(renv::dependencies(path = "DESCRIPTION")$Package))
153+
```
154+
155+
## Installing Python dependencies
156+
- If you add any Python code to your presentation, you will need to add Python dependencies.
157+
- An easy way to do this is to [install uv](https://docs.astral.sh/uv/getting-started/installation/).
158+
- You can sync your local machine with the Python dependencies:
159+
```{bash}
160+
#| echo: true
161+
#| eval: false
162+
uv sync
163+
source .venv/bin/activate
164+
```
165+
166+
## Adding Packages
167+
168+
- If you add any new R or Python packages to your local repository, you should make sure those dependencies get reflected in the main repository.
169+
- If you add R dependencies, make sure to update `DESCRIPTION` and commit the changes
170+
```{r}
171+
#| echo: TRUE
172+
#| eval: FALSE
173+
usethis::use_package("duckdb", min_version = TRUE)
174+
```
175+
176+
- If you add Python dependencies, make sure to update `pyproject.toml` (`uv add` does this automatically)
177+
178+
```{bash}
179+
#| echo: TRUE
180+
#| eval: FALSE
181+
# install AND update pyproject.toml
182+
uv add duckdb
183+
```
184+
185+
## Adding Packages
186+
- Generally, with this setup you should commit and push changes to `pyproject.toml` and `DESCRIPTION`, but *not* changes to `renv.lock` and `uv.lock`
187+
188+
## SQL Chunks in Quarto
189+
- Quarto can execute SQL code chunks if we provide it an appropriate database backend (thanks to [this blog post](https://danielroelfs.com/posts/sql-notebooks-with-quarto/)).
190+
- You can define the database in R or Python, and then pass it as a quarto chunk option.
191+
- To create the database connection in R:
192+
```{r}
193+
#| echo: true
194+
#| eval: false
195+
con_flights <- DBI::dbConnect(
196+
drv = duckdb::duckdb(),
197+
dbdir = "./data/flights.duckdb",
198+
read_only = TRUE
199+
)
200+
```
201+
202+
Or, in Python:
203+
204+
```{python}
205+
#| echo: true
206+
#| eval: false
207+
import duckdb
208+
con_flights = duckdb.connect('./data/flights.duckdb', read_only=True)
209+
```
210+
211+
## SQL Chunks in Quarto
212+
Then, to create the chunk:
213+
214+
```{sql}
215+
#| echo: fenced
216+
#| eval: false
217+
#| connection: con_flights
218+
219+
SELECT name, carrier FROM airlines LIMIT 10;
220+
```
221+
222+
## Quarto Execution Option - Freeze
223+
224+
- The book covers a lot of different technologies, and we may tire of troubleshooting dependencies in the main repo.
225+
226+
- Quarto's "freeze" option is a quick fix for this issue. It tells the rendering pipeline in the main repo not to re-execute that chapter's code and instead to reuse the stored results in `_freeze/slides/xx/`.
227+
228+
- In the top level YAML, use:
229+
```{yaml}
230+
# | echo: true
231+
# | eval: false
232+
execute:
233+
freeze: true
234+
```
235+
236+
- If you use this method, make sure to commit changes to `_freeze/slides/xx` to the repository!
237+
238+
## Next week
239+
- Chapter 2 is quite short and Chapter 3 is longer - should we bundle/split?
