Commit 50e3b8c

Adding Chapter 12 notes. (#12)
* Adding Chapter 11 notes.
* Address spelling errors, add Hmisc package to DESCRIPTION file.
* Adding Chapter 12 notes.
* Adding Chapter 12 notes.
1 parent 1a7629b commit 50e3b8c

File tree

7 files changed: +196 -5 lines changed

12_working-with-large-data.Rmd

Lines changed: 194 additions & 4 deletions
@@ -1,10 +1,200 @@
1+
---
2+
output: html_document
3+
editor_options:
4+
chunk_output_type: console
5+
---
6+
17
# Working with Large Data
28

39
**Learning objectives:**
410

5-
- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
11+
- Retrieving multiple seasons of Statcast (Baseball Savant) data
12+
- Using Apache Arrow and Parquet format
13+
- Using DuckDB
14+
- Using MySQL (PostgreSQL)
15+
- Launch Angles and Exit Velocities, Revisited
16+
```{r setup_ch_12, message = FALSE, warning = FALSE}
suppressMessages(library(tidyverse))
# library(RPostgres) # using PostgreSQL instead of MariaDB
library(abdwr3edata)
library(baseballr)
library(fs)
theme_set(theme_classic())

crcblue <- "#2905a1"

crc_fc <- c("#2905a1", "#e41a1c", "#4daf4a", "#984ea3")

options(digits = 3)

options(timeout = max(600, getOption("timeout")))
```
33+
34+
## Introduction
35+
36+
![SQL don'ts](images/bart_sql.jpg)
37+
38+
![SQL select \*](images/will_not_write_select_all.png)
39+
40+
- Chapter 11 - Introduction to MySQL for building baseball databases
41+
- How to use `abdwr3edata` package functions to retrieve multiple seasons of data
42+
- Using R's internal data format (.rds)
43+
- Using the Apache Arrow and Parquet data formats
44+
- Using DuckDB (OLAP)
45+
46+
## Acquiring a Year's Worth of Statcast Data
47+
48+
Let's say that we want to retrieve the full 2023 season data from Statcast.
49+
50+
```{r statcast_2023}
51+
# getting 2023 season statcast data
52+
# data_dir <- "./data"
53+
# statcast_dir <- path(data_dir, "sc_2023")
54+
# if (!dir.exists(statcast_dir)) {
55+
# dir.create(statcast_dir)
56+
# }
57+
#
58+
# statcast_season(year = 2023, dir = statcast_dir)
59+
#
60+
# sc2023 <- statcast_dir |>
61+
# statcast_read_csv(pattern = "sc_2023.+\\.csv")
62+
```
63+
64+
Repeat the same process for the 2021 and 2022 seasons, changing the year accordingly.
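
One way to avoid copying the chunk three times is to loop over the seasons. The sketch below is only an illustration: `make_season_dirs()` is a hypothetical helper (not part of `abdwr3edata`), it writes under `tempdir()` rather than `./data`, and the download call is left commented out, as in the chunk above, because it is slow and needs network access.

```r
# Hypothetical helper: create one directory per season and return the paths,
# named by year. Not part of abdwr3edata.
make_season_dirs <- function(years, data_dir = tempdir()) {
  dirs <- file.path(data_dir, paste0("sc_", years))
  for (d in dirs) {
    if (!dir.exists(d)) {
      dir.create(d, recursive = TRUE)
    }
  }
  setNames(dirs, years)
}

season_dirs <- make_season_dirs(2021:2023)

# Download each season into its own directory (commented out on purpose):
# for (yr in names(season_dirs)) {
#   abdwr3edata::statcast_season(year = as.integer(yr), dir = season_dirs[[yr]])
# }
```

Keeping each season in its own `sc_<year>` directory matches the layout used for 2023 above, so the same `statcast_read_csv()` pattern should carry over with only the year changed.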
65+
66+
Now, let's verify the validity of the 2023 season data.
67+
```{r verify_2023_data}
tempfile_loc <- tempfile()
url <- 'https://statcast-data.atl1.digitaloceanspaces.com/statcast_2023.rds'
download.file(url, tempfile_loc)

sc2023 <- read_rds(tempfile_loc)

dim(sc2023)

sc2023 |>
  head() |>
  glimpse()
```
81+
```{r sc2023}
sc2023 |>
  group_by(game_type) |>
  summarize(
    num_games = n_distinct(game_pk),
    num_pitches = n(),
    num_hr = sum(events == "home_run", na.rm = TRUE)
  )
```
91+
92+
## Storing Large Data Efficiently
93+
94+
A full season of Statcast data contains over 700k rows and roughly 118 variables.
95+
```{r sc2023_size}
sc2023 |>
  object.size() |>
  print(units = "MB")
```
101+
102+
The object occupies around 643 MB in memory. On disk, the CSV files take up about 72% of that in-memory size.
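
To make the comparison between in-memory and on-disk size concrete, here is a small base R sketch. It uses a made-up toy data frame (the column names mimic Statcast, but the values are random), so the ratios it prints will not match the 72% figure above; it only shows how such ratios can be computed.

```r
# Toy data standing in for a Statcast season (values are invented).
set.seed(12)
toy <- data.frame(
  game_pk = sample(1:100, 1e4, replace = TRUE),
  launch_speed = rnorm(1e4, mean = 88, sd = 14),
  events = sample(c("single", "home_run", NA), 1e4, replace = TRUE)
)

# Write the same data to disk as CSV and as .rds.
csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(toy, csv_path, row.names = FALSE)
saveRDS(toy, rds_path) # compressed by default

# Footprints in bytes: in memory, CSV on disk, .rds on disk.
sizes <- c(
  memory = as.numeric(object.size(toy)),
  csv = file.size(csv_path),
  rds = file.size(rds_path)
)

# Each footprint as a fraction of the in-memory size.
round(sizes / sizes[["memory"]], 2)
```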
103+
104+
## Using R's internal data format
105+
```{r statcast_rds}
# disk_space_rds <- path("./data") |>
#   dir_info(regexp = "statcast.*\\.rds") |>
#   select(path, size) |>
#   mutate(
#     path = path_file(path),
#     format = "rds"
#   )
#
# disk_space_rds
```
117+
118+
## Using Apache Arrow and Apache Parquet
119+
120+
Watch the demo in the video.
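
For readers without the video at hand, a minimal sketch of a Parquet workflow with the `arrow` package might look like the following. The toy data frame, the `toy_parquet` directory name, and the choice to partition by `year` are assumptions made for illustration; they are not taken from the demo.

```r
library(arrow)
library(dplyr)

# Toy stand-in for pitch-level data; the values are invented.
toy <- data.frame(
  year = c(2021L, 2021L, 2022L, 2023L),
  launch_speed = c(101.2, 88.4, 95.0, 110.3)
)

# Write the data as Parquet files, one subdirectory per season.
pq_dir <- file.path(tempdir(), "toy_parquet")
write_dataset(toy, pq_dir, format = "parquet", partitioning = "year")

# open_dataset() does not load the rows; the query runs when collect() is called.
mean_2021 <- open_dataset(pq_dir) |>
  filter(year == 2021) |>
  summarize(mean_launch_speed = mean(launch_speed)) |>
  collect()

mean_2021
```

Because the filter is applied before `collect()`, only the matching partition needs to be read, which is the main appeal of this format for multi-season data.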
121+
122+
## Using DuckDB
123+
124+
Watch the demo in the video.
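
As a rough stand-in for the demo, the sketch below loads a toy table into an in-memory DuckDB database and runs a SQL summary similar to the `game_type` summary earlier in these notes. The table name `toy_pitches` and its contents are hypothetical; a real workflow would point `dbdir` at a file on disk and load the full Statcast data.

```r
library(DBI)
library(duckdb)

# Toy stand-in for pitch-level data; the values are invented.
toy <- data.frame(
  game_type = c("R", "R", "R", "P"),
  events = c("home_run", NA, "single", "home_run")
)

# In-memory database for illustration only.
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
dbWriteTable(con, "toy_pitches", toy)

# Count pitches and home runs per game type in SQL.
hr <- dbGetQuery(con, "
  SELECT
    game_type,
    COUNT(*) AS num_pitches,
    SUM(CASE WHEN events = 'home_run' THEN 1 ELSE 0 END) AS num_hr
  FROM toy_pitches
  GROUP BY game_type
  ORDER BY game_type
")

dbDisconnect(con, shutdown = TRUE)

hr
```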
125+
126+
## Performance Comparison
127+
128+
### Computational speed
129+
```{r computational_speed_results}
res <- read_rds('./data/res.rds')

res |>
  select(1:8) |>
  knitr::kable()
```
137+
138+
### Memory footprint
139+
```{r memory_footprint}
#        tbl arrow duckdb
# 2004855136   504  51352
```
144+
145+
### Disk storage footprint
146+
```{r disk_storage_footprint}
# A tibble: 3 × 2
#   format  footprint
#   <chr>   <fs::bytes>
# 1 duckdb        1.95G
# 2 parquet     350.46M
# 3 rds         211.92M
```
155+
156+
### Overall guidelines
157+
158+
- If your data is small (i.e., less than a couple hundred megabytes), just use CSV because it's easy, cross-platform, and versatile.
159+
- If your data is larger than a couple hundred megabytes and you're just working in R (either by yourself or with a few colleagues), use .rds because it's space-efficient and optimized for R.
160+
- If your data is around a gigabyte or more and you need to share your data files across different platforms (i.e., not just R but also Python, etc.) and you don't want to use a SQL-based RDBMS, store your data in the Parquet format and use the arrow package.
161+
- If you want to work in SQL with a local data store, use DuckDB, because it offers more features and better performance than RSQLite, and doesn't require a server-client architecture that can be cumbersome to set up and maintain.
162+
- If you have access to an RDBMS server (hopefully maintained by a professional database administrator), use the appropriate DBI interface (e.g., RMariaDB, RPostgreSQL, etc.) to connect to it.
163+
164+
## Launch Angles and Exit Velocities, Revisited
165+
166+
Consider what happens when we ask the database to give us all the data for a particular player, say Pete Alonso, in a particular year, say 2021.
167+
```{r pete_alonso_res}
# Read balls-in-play data for the seasons begin..end, optionally for one batter.
# The default batter_id is assumed to be Pete Alonso's, per the text above.
read_bip_data <- function(tbl, begin, end = begin,
                          batter_id = 624413) {
  x <- tbl |>
    mutate(year = year(game_date)) |>
    group_by(year) |>
    # type "X" marks pitches that were put in play
    filter(type == "X", year >= begin, year <= end) |>
    select(
      year, game_date, batter, launch_speed, launch_angle,
      estimated_ba_using_speedangle,
      estimated_woba_using_speedangle
    )
  if (!is.null(batter_id)) {
    x <- x |>
      filter(batter == batter_id)
  }
  # collect() pulls the result into memory for lazy backends
  x |>
    collect()
}

pete_alonso_res <- read_rds('./data/pete_alonso_res.rds')

pete_alonso_res |>
  knitr::kable()
```
193+
194+
### Launch angles over time
195+
196+
![Fig. 12.1](images/fig_12_1_woba.png)
6197

7-
## SLIDE 1 {-}
198+
## Further reading
8199

9-
- ADD SLIDES AS SECTIONS (`##`).
10-
- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.
200+
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. 2nd ed. Sebastopol, CA: O'Reilly Media, Inc. <https://r4ds.hadley.nz/>.

DESCRIPTION

Lines changed: 2 additions & 1 deletion
@@ -10,7 +10,8 @@ Depends:
1010
Imports:
1111
abdwr3edata,
1212
bookdown,
13-
baseballr,
13+
baseballr,
14+
fs,
1415
here,
1516
Hmisc,
1617
Lahman,

data/pete_alonso_res.rds

1.08 KB
Binary file not shown.

data/res.rds

652 KB
Binary file not shown.

images/bart_sql.jpg

98.6 KB

images/fig_12_1_woba.png

531 KB
188 KB
