Commit b5c8bd9

trevinflick and lgibson7 authored
add chapter 2 materials (#3)
* add chapter 2 materials
* add abdwr3edata package
* run `use_tidy_description()` to tidy imports

---------

Co-authored-by: lgibson7 <“[email protected]”>
1 parent 60f6db3 commit b5c8bd9

8 files changed

+407
-9
lines changed

02_introduction-to-r.Rmd

Lines changed: 380 additions & 4 deletions
@@ -2,9 +2,385 @@
22

33
**Learning objectives:**
44

5-
- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
5+
- Getting started with R
6+
- Discover different ways to hold data
7+
- Reading and writing data
8+
- Tidyverse verbs
9+
- Understand basic data wrangling
10+
11+
## Downloading and using R
12+
13+
- [Download R language](https://www.r-project.org/)
14+
- [RStudio](https://posit.co/products/open-source/rstudio/) (popular IDE)
15+
- [Positron](https://positron.posit.co/) (new IDE)
16+
17+
- [Setting up macOS as an R data science rig in 2023](https://ivelasq.rbind.io/blog/macos-rig/) by Isabella Velásquez
18+
19+
## Tidyverse
20+
21+
![tidyverse hex logo](images/tidyverse-hex.png) \
22+
23+
24+
25+
Packages that make up the Tidyverse
26+
- **dplyr**, **ggplot2**, **tibble**, **tidyr**, **readr**, **purrr**, **stringr**, **lubridate**, **forcats**
27+
28+
```{r, message=FALSE, eval=TRUE}
library(tidyverse)

Lahman::Teams |>
  dplyr::filter(teamID == "DET") |>
  dplyr::arrange(desc(yearID)) |>
  dplyr::select(yearID, name, W, L) |>
  dplyr::slice_head(n = 10)
```
37+
38+
Other packages for this book:
39+
40+
```{r, message=FALSE, eval=FALSE}
remotes::install_github("beanumber/abdwr3edata")

library(abdwr3edata)
```
45+
46+
## Data Frames
47+
48+
```{r, message=FALSE, eval=TRUE}
library(abdwr3edata)

# First three rows and first ten columns, tidyverse style
spahn |>
  dplyr::slice(1:3) |>
  dplyr::select(1:10)

# Same subset with base R indexing
spahn[1:3, 1:10]
```
57+
58+
### Manipulations with Data
59+
60+
![FIP](images/FIP-book.png) \
61+
62+
```{r, message=FALSE, eval=TRUE}
spahn <- spahn |>
  dplyr::mutate(FIP = (13 * HR + 3 * BB - 2 * SO) / IP)

spahn |>
  dplyr::arrange(FIP) |>
  dplyr::select(Year, Age, W, L, ERA, FIP) |>
  dplyr::slice_head(n = 5)
```
71+
72+
What do you notice about Spahn's FIP?
73+
74+
![FIP from Fangraphs](images/FIP-fg.png)
75+
76+
[Fangraphs library](https://www.fangraphs.com/guts.aspx?type=cn)
77+
78+
[Fangraphs FIP constants](https://www.fangraphs.com/guts.aspx?type=cn)
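
For comparison with the chunk above, here is the FanGraphs form of FIP as it is commonly stated. Treat it as a sketch: the hit-by-pitch term $HBP$ and the season-specific constant $C$ do not appear in the code in these notes, and $K$ corresponds to the `SO` column.

$$
\mathrm{FIP} = \frac{13 \cdot HR + 3 \cdot (BB + HBP) - 2 \cdot K}{IP} + C
$$

The constant $C$ is chosen each season so that league-average FIP matches league-average ERA, which is why a FIP computed without it sits well below ERA.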
79+
80+
You can combine data by stacking rows or with joins.
81+
82+
```{r, message=FALSE, eval=TRUE}
# Stack the rows of the two league tables
batting <- dplyr::bind_rows(NLbatting, ALbatting)

dplyr::dim_desc(NLbatting)
dplyr::dim_desc(ALbatting)
dplyr::dim_desc(batting)

# Match batting and pitching rows by team and combine their columns
NL <- dplyr::inner_join(NLbatting, NLpitching, by = "Tm")
dplyr::dim_desc(NLpitching)
dplyr::dim_desc(NLbatting)
dplyr::dim_desc(NL)
```
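
A minimal sketch of the difference between stacking and joining, using small made-up tables (the team codes and numbers below are hypothetical, not taken from `abdwr3edata`):

```r
library(dplyr)

# Hypothetical toy tables; values are invented for illustration
nl_bat <- data.frame(Tm = c("ATL", "CHC"), HR = c(150, 120))
al_bat <- data.frame(Tm = c("BOS", "NYY"), HR = c(170, 200))
nl_pit <- data.frame(Tm = c("ATL", "CHC"), ERA = c(3.5, 4.1))

# bind_rows() stacks rows: the row count grows, the columns stay the same
stacked <- bind_rows(nl_bat, al_bat)
dim(stacked)

# inner_join() matches rows on the key: the columns grow, matched rows are kept
joined <- inner_join(nl_bat, nl_pit, by = "Tm")
dim(joined)
```

Stacking gives 4 rows and 2 columns; joining gives 2 rows and 3 columns.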
94+
95+
96+
## Vectors
97+
98+
A sequence of values of the **same** type (e.g. numeric or character).
99+
100+
If you mix types, R automatically coerces every element to a single common type.
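
A quick sketch of that coercion, using made-up values:

```r
# Mixing numeric, character, and logical values yields a character vector
mixed <- c(21, "PHI", TRUE)
class(mixed)
mixed

# Logical and integer values combine to integer
class(c(TRUE, 2L))

# Integer and double values combine to double ("numeric")
class(c(1L, 2.5))
```

The general order is logical, then integer, then double, then character: the result takes the most flexible type present.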
101+
102+
```{r, message=FALSE, eval=TRUE}
# Spahn's wins and losses after the war (this is a code comment)

W <- c(8, 21, 15, 21, 21, 22, 14)
L <- c(5, 10, 12, 14, 17, 14, 19)

win_pct <- 100 * W / (W + L)
Year <- seq(from = 1946, to = 1952) # Same: Year <- 1946:1952
```
111+
112+
R has a lot of built-in functions for vectors
113+
114+
```{r, message=FALSE, eval=TRUE}
# total wins over post-war span
sum(W)

# number of seasons post-war
length(W)

# avg. winning pct.
mean(win_pct)
```
124+
125+
Ways to select data with vector indices and logicals.
126+
127+
```{r, message=FALSE, eval=TRUE}
W[c(1, 2, 5)]

W[1 : 4]

W[-c(1, 6)]
```
134+
135+
How many times did Spahn exceed 20 wins? What years?
136+
137+
```{r, message=FALSE, eval=TRUE}
W > 20

sum(W > 20)

Year[W > 20]
```
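
Logical conditions can also be combined. A short sketch that rebuilds the vectors from these notes so it runs on its own:

```r
W <- c(8, 21, 15, 21, 21, 22, 14)
L <- c(5, 10, 12, 14, 17, 14, 19)
Year <- 1946:1952
win_pct <- 100 * W / (W + L)

# Seasons with more than 20 wins AND fewer than 15 losses
Year[W > 20 & L < 15]

# which() turns a logical vector into the positions that are TRUE
which(W > 20)

# Season with the best winning percentage
Year[which.max(win_pct)]
```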
144+
145+
## Objects and Containers in R
146+
147+
Characters and data frames
148+
149+
```{r, message=FALSE, eval=TRUE}
Year <- 2008 : 2017
NL <- c("PHI", "PHI", "SFN", "SLN", "SFN",
        "SLN", "SFN", "NYN", "CHN", "LAN")
AL <- c("TBA", "NYA", "TEX", "TEX", "DET",
        "BOS", "KCA", "KCA", "CLE", "HOU")
Winner <- c("NL", "AL", "NL", "NL", "NL",
            "AL", "NL", "AL", "NL", "AL")
N_Games <- c(5, 6, 5, 7, 4, 7, 7, 5, 7, 7)

WS_results <- tibble::tibble(
  Year = Year, NL_Team = NL, AL_Team = AL,
  N_Games = N_Games, Winner = Winner)

WS_results

WS <- WS_results |>
  dplyr::group_by(Winner) |>
  dplyr::summarize(N = dplyr::n())

WS

ggplot2::ggplot(WS, ggplot2::aes(x = Winner, y = N)) +
  ggplot2::geom_col()
```
174+
175+
Factors
176+
177+
```{r, message=FALSE, eval=TRUE}
# Alphabetical order
WS_results |>
  dplyr::group_by(NL_Team) |>
  dplyr::summarize(N = dplyr::n())

# Use a factor to order by division
WS_results <- WS_results |>
  dplyr::mutate(
    NL_Team = factor(
      NL_Team,
      levels = c("NYN", "PHI", "CHN", "SLN", "LAN", "SFN")
    )
  )

# Now ordered by division
WS_results |>
  dplyr::group_by(NL_Team) |>
  dplyr::summarize(N = dplyr::n())
```
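
A small sketch of why the factor changes the ordering: sorting uses the level order rather than the alphabet. The three-team vector here is made up for illustration; the levels are the ones used above.

```r
teams <- c("SFN", "CHN", "PHI")

# Character vectors sort alphabetically
sort(teams)

# Factors sort by the position of each value in `levels`
teams_f <- factor(teams, levels = c("NYN", "PHI", "CHN", "SLN", "LAN", "SFN"))
as.character(sort(teams_f))

# Underneath, each value is stored as the integer index of its level
as.integer(teams_f)
```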
197+
198+
Lists
199+
200+
```{r, message=FALSE, eval=TRUE}
world_series <- list(
  Winner = Winner,
  Number_Games = N_Games,
  Seasons = "2008 to 2017"
)

world_series
```
209+
Many ways to pull data:
210+
211+
```{r, message=FALSE, eval=TRUE}
world_series$Number_Games

world_series[[2]]

purrr::pluck(world_series, "Number_Games")

world_series["Number_Games"]
```
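
The last form differs from the first three: single brackets return a list, not the element itself. A minimal sketch with a shortened copy of the list (only the first two seasons):

```r
world_series <- list(
  Winner = c("NL", "AL"),
  Number_Games = c(5, 6)
)

# [[ ]] (like $ and pluck) extracts the element: a numeric vector
games_vec <- world_series[["Number_Games"]]
is.list(games_vec)

# [ ] keeps the container: a list of length one
games_list <- world_series["Number_Games"]
is.list(games_list)

# Vector functions work on the extracted element
sum(games_vec)
```

Calling `sum(games_list)` would fail, because `sum()` does not accept a list.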
220+
221+
222+
```{r, message=FALSE, eval=TRUE}
WS_results$NL_Team

# same
dplyr::pull(WS_results, NL_Team)
```
228+
229+
## Collection of R Commands
230+
231+
```{r, eval=FALSE}
# We can save this as a script to run later

library(Lahman)
library(tidyverse)

crcblue <- "#2905a1"

ws <- SeriesPost |>
  filter(yearID >= 1903, round == "WS", wins + losses < 8)
ggplot(ws, aes(x = wins + losses)) +
  geom_bar(fill = crcblue) +
  labs(x = "Number of games", y = "Frequency")
```
245+
246+
247+
```{r}
# running the script

source(here::here("scripts/WorldSeriesLength.R"), echo = TRUE)
```
252+
253+
254+
```{r}
source(here::here("scripts/hr_rates.R"))

# Mickey Mantle stats from 1951 to 1961

HR <- c(13, 23, 21, 27, 37, 52, 34, 42, 31, 40, 54)
AB <- c(341, 549, 461, 543, 517, 533, 474, 519, 541, 527, 514)
Age <- 19 : 29
hr_rates(Age, HR, AB)
```
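
The contents of `scripts/hr_rates.R` are not shown in this diff. As a hedged sketch only, the function might look like the following, assuming it computes home runs per 100 at-bats rounded to one decimal and returns a list with elements `x` and `y` (the `y` element is used elsewhere in these notes; the rest is an assumption):

```r
# Assumed definition; the real script may differ
hr_rates <- function(age, hr, ab) {
  rates <- round(100 * hr / ab, 1)  # home runs per 100 at-bats
  list(x = age, y = rates)
}

HR <- c(13, 23, 21, 27, 37, 52, 34, 42, 31, 40, 54)
AB <- c(341, 549, 461, 543, 517, 533, 474, 519, 541, 527, 514)
Age <- 19:29

mantle <- hr_rates(Age, HR, AB)
mantle$y
```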
264+
265+
266+
## Reading and Writing Data
267+
268+
```{r, eval=FALSE}
# Read data
getwd()

spahn <- readr::read_csv(here::here("data/spahn.csv"))
```
274+
275+
276+
```{r, eval=FALSE}
# Write data
mantle_hr_rates <- hr_rates(Age, HR, AB)
Mantle <- tibble::tibble(
  Age, HR, AB, Rates = mantle_hr_rates$y
)

readr::write_csv(Mantle, here::here("data/mantle.csv"))
```
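
To try the write and read steps without touching the project's `data/` folder, one option is a round trip through a temporary file. This is a sketch: `mantle_demo` is a hypothetical three-row table built from the first three Mantle seasons above.

```r
library(readr)

# Hypothetical small table for illustration
mantle_demo <- data.frame(
  Age = 19:21,
  HR  = c(13, 23, 21),
  AB  = c(341, 549, 461)
)

# Write to a temporary CSV, then read it back
tmp <- tempfile(fileext = ".csv")
write_csv(mantle_demo, tmp)
mantle_back <- read_csv(tmp, show_col_types = FALSE)

mantle_back
```

`read_csv()` returns a tibble and guesses column types from the file, so integer columns come back as doubles.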
285+
286+
287+
## Packages
288+
289+
More than 20,000 packages are currently available via CRAN.
290+
291+
```{r, eval=FALSE}
# Install package
install.packages("Lahman")

library(Lahman)
```
297+
298+
299+
You can also install packages from GitHub.
300+
301+
```{r, eval=FALSE}
remotes::install_github("beanumber/abdwr3edata")

library(abdwr3edata)
```
306+
307+
Use a question mark to learn more about a package's contents.
308+
309+
```{r, eval=FALSE}
# to learn more about the Batting data set in Lahman
?Batting
```
313+
314+
315+
## Splitting, Applying, and Combining Data
316+
317+
```{r}
library(Lahman)

Batting |>
  dplyr::filter(yearID >= 1960, yearID <= 1969) |>
  dplyr::group_by(playerID) |>
  dplyr::summarize(HR = sum(HR)) |>
  dplyr::arrange(desc(HR)) |>
  dplyr::slice(1:4)
```
327+
What if we want to find the top HR hitters for each decade?
328+
329+
```{r}
hr_leader <- function(data) {
  data |>
    dplyr::group_by(playerID) |>
    dplyr::summarize(HR = sum(HR)) |>
    dplyr::arrange(desc(HR)) |>
    dplyr::slice(1)
}
```
338+
339+
Do you see any potential issues with this function?
340+
341+
```{r}
Batting_decade <- Batting |>
  dplyr::mutate(decade = 10 * floor(yearID / 10)) |>
  dplyr::group_by(decade)

decades <- Batting_decade |>
  dplyr::group_keys() |>
  dplyr::pull("decade")

decades
```
352+
353+
```{r}
Batting_decade |>
  dplyr::group_split() |>
  purrr::map(hr_leader) |>
  purrr::set_names(decades) |>
  dplyr::bind_rows(.id = "decade")
```
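
The same split-apply-combine idea can also be written as one grouped pipeline. This is a sketch of an alternative, not the approach used above; `toy` is a hypothetical stand-in for `Batting` with made-up players and totals.

```r
library(dplyr)

# Hypothetical toy data standing in for Lahman::Batting
toy <- data.frame(
  playerID = c("a", "a", "b", "b", "c", "c"),
  yearID   = c(1961, 1962, 1961, 1972, 1971, 1972),
  HR       = c(30, 25, 40, 10, 20, 35)
)

leaders <- toy |>
  mutate(decade = 10 * floor(yearID / 10)) |>
  group_by(decade, playerID) |>
  # Total HR per player within each decade; stay grouped by decade
  summarize(HR = sum(HR), .groups = "drop_last") |>
  # Keep the top row(s) per decade; ties are all retained
  slice_max(HR, n = 1, with_ties = TRUE) |>
  ungroup()

leaders
```

Here player `a` leads the 1960s and player `c` leads the 1970s, each with 55 home runs. Unlike `slice(1)`, `slice_max()` with `with_ties = TRUE` returns every player tied for the lead.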
360+
361+
## Getting Help
362+
363+
```{r}
# Will open documentation for the function
?geom_point
```
367+
368+
```{r}
# search the help pages of all installed packages for this string
??geom_point
```
372+
373+
374+
## Further Reading
375+
376+
- [R for Data Science (2e)](https://r4ds.hadley.nz/)
377+
- [Modern Data Science with R](https://www.routledge.com/Modern-Data-Science-with-R/Baumer-Kaplan-Horton/p/book/9780367191498) $$
378+
- [A Modern Dive into R and the Tidyverse](https://moderndive.com/)
379+
- [R for the Rest of Us](https://book.rfortherestofus.com/)
380+
- [R Packages](https://r-pkgs.org/)
381+
- [Happy Git and GitHub for the useR](https://happygitwithr.com)
382+
383+
384+
6385

7-
## SLIDE 1 {-}
8386

9-
- ADD SLIDES AS SECTIONS (`##`).
10-
- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.
