|
2 | 2 |
|
3 | 3 | **Learning objectives:** |
4 | 4 |
|
5 | | -- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY |
| 5 | +- Getting started with R |
| 6 | +- Discover different ways to hold data |
| 7 | +- Reading and writing data |
| 8 | +- Tidyverse verbs |
| 9 | +- Understand basic data wrangling |
| 10 | + |
| 11 | +## Downloading and using R |
| 12 | + |
| 13 | +- [Download R language](https://www.r-project.org/) |
| 14 | +- [RStudio](https://posit.co/products/open-source/rstudio/) (popular IDE) |
| 15 | +- [Positron](https://positron.posit.co/) (new IDE) |
| 16 | + |
| 17 | +- [Setting up macOS as an R data science rig in 2023](https://ivelasq.rbind.io/blog/macos-rig/) by Isabella Velásquez |
| 18 | + |
| 19 | +## Tidyverse |
| 20 | + |
| 21 | + \ |
| 22 | + |
| 23 | + |
| 24 | + |
| 25 | +Packages that make up the Tidyverse |
| 26 | +- **dplyr**, **ggplot2**, **tibble**, **tidyr**, **readr**, **purrr**, **stringr**, **lubridate**, **forcats** |
| 27 | + |
| 28 | +```{r, message=FALSE, eval=TRUE} |
| 29 | +library(tidyverse) |
| 30 | +
|
| 31 | +Lahman::Teams |> |
| 32 | + dplyr::filter(teamID == "DET") |> |
| 33 | + dplyr::arrange(desc(yearID)) |> |
| 34 | + dplyr::select(yearID, name, W, L) |> |
| 35 | + dplyr::slice_head(n = 10) |
| 36 | +``` |
| 37 | + |
| 38 | +Other packages for this book: |
| 39 | + |
| 40 | +```{r, message=FALSE, eval=FALSE} |
| 41 | +remotes::install_github("beanumber/abdwr3edata") |
| 42 | +
|
| 43 | +library(abdwr3edata) |
| 44 | +``` |
| 45 | + |
| 46 | +## Data Frames |
| 47 | + |
| 48 | +```{r, message=FALSE, eval=TRUE} |
| 49 | +library(abdwr3edata) |
| 50 | +
|
| 51 | +spahn |> |
| 52 | + dplyr::slice(1:3) |> |
| 53 | + dplyr::select(1:10) |
| 54 | +
|
| 55 | +spahn[1:3, 1:10] |
| 56 | +``` |
| 57 | + |
| 58 | +### Manipulations with Data |
| 59 | + |
| 60 | + \ |
| 61 | + |
| 62 | +```{r, message=FALSE, eval=TRUE} |
| 63 | +spahn <- spahn |> |
| 64 | + dplyr::mutate(FIP = (13 * HR + 3 * BB - 2 * SO) / IP) |
| 65 | +
|
| 66 | +spahn |> |
| 67 | + dplyr::arrange(FIP) |> |
| 68 | + dplyr::select(Year, Age, W, L, ERA, FIP) |> |
| 69 | + dplyr::slice_head(n = 5) |
| 70 | +``` |
| 71 | + |
| 72 | +What do you notice about Spahn's FIP? |
| 73 | + |
| 74 | + |
| 75 | + |
| 76 | +[FanGraphs library](https://www.fangraphs.com/guts.aspx?type=cn) |
| 77 | + |
| 78 | +[FanGraphs FIP constants](https://www.fangraphs.com/guts.aspx?type=cn) |
| 79 | + |
| 80 | +You can combine data frames by stacking rows (`bind_rows()`) or by matching rows on a key column with joins (`inner_join()`). |
| 81 | + |
| 82 | +```{r, message=FALSE, eval=TRUE} |
| 83 | +batting <- dplyr::bind_rows(NLbatting, ALbatting) |
| 84 | +
|
| 85 | +dplyr::dim_desc(NLbatting) |
| 86 | +dplyr::dim_desc(ALbatting) |
| 87 | +dplyr::dim_desc(batting) |
| 88 | +
|
| 89 | +NL <- dplyr::inner_join(NLbatting, NLpitching, by = "Tm") |
| 90 | +dplyr::dim_desc(NLpitching) |
| 91 | +dplyr::dim_desc(NLbatting) |
| 92 | +dplyr::dim_desc(NL) |
| 93 | +``` |
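
As a rough illustration of the difference between stacking and joining, here is a minimal sketch on made-up toy tables. The objects `nl_bat`, `al_bat`, and `nl_pit` and their values are hypothetical and are not the book's data.

```{r, message=FALSE, eval=TRUE}
# Hypothetical toy tables, invented for illustration only
nl_bat <- tibble::tibble(Tm = c("ATL", "CHC"), HR = c(150, 170))
al_bat <- tibble::tibble(Tm = c("BOS", "NYY"), HR = c(180, 200))
nl_pit <- tibble::tibble(Tm = c("ATL", "CHC"), ERA = c(3.50, 4.10))

# bind_rows() stacks tables: more rows, same columns
stacked <- dplyr::bind_rows(nl_bat, al_bat)

# inner_join() matches rows on the key: same rows, more columns
joined <- dplyr::inner_join(nl_bat, nl_pit, by = "Tm")

dim(stacked)  # 4 2
dim(joined)   # 2 3
```

Stacking grows the row count, while the join keeps one row per matching team and adds the pitching column.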
| 94 | + |
| 95 | + |
| 96 | +## Vectors |
| 97 | + |
| 98 | +A sequence of values of the **same** type (e.g. numeric or character). |
| 99 | + |
| 100 | +If you mix types, R will automatically coerce every element to a single common type. |
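
A small sketch of that coercion, using made-up values (the names `mixed_chr` and `mixed_num` are only for illustration):

```{r, message=FALSE, eval=TRUE}
# Numbers and logicals mixed with a string all become character
mixed_chr <- c(21, "PHI", TRUE)
class(mixed_chr)  # "character"

# Logicals mixed with numbers become numeric (TRUE = 1, FALSE = 0)
mixed_num <- c(8, TRUE, FALSE)
class(mixed_num)  # "numeric"
mixed_num         # 8 1 0
```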
| 101 | + |
| 102 | +```{r, message=FALSE, eval=TRUE} |
| 103 | +# Spahn's wins and losses after the war (this is a code comment) |
| 104 | +
|
| 105 | +W <- c(8, 21, 15, 21, 21, 22, 14) |
| 106 | +L <- c(5, 10, 12, 14, 17, 14, 19) |
| 107 | +
|
| 108 | +win_pct <- 100 * W / (W + L) |
| 109 | +Year <- seq(from = 1946, to = 1952) # Same: Year <- 1946:1952 |
| 110 | +``` |
| 111 | + |
| 112 | +R has many built-in functions that operate on whole vectors. |
| 113 | + |
| 114 | +```{r, message=FALSE, eval=TRUE} |
| 115 | +# total wins over post-war span |
| 116 | +sum(W) |
| 117 | +
|
| 118 | +# number of seasons post-war |
| 119 | +length(W) |
| 120 | +
|
| 121 | +# avg. winning pct. |
| 122 | +mean(win_pct) |
| 123 | +``` |
| 124 | + |
| 125 | +You can select elements of a vector by position (index vectors) or with logical conditions. |
| 126 | + |
| 127 | +```{r, message=FALSE, eval=TRUE} |
| 128 | +W[c(1, 2, 5)] |
| 129 | +
|
| 130 | +W[1 : 4] |
| 131 | +
|
| 132 | +W[-c(1, 6)] |
| 133 | +``` |
| 134 | + |
| 135 | +How many times did Spahn exceed 20 wins? What years? |
| 136 | + |
| 137 | +```{r, message=FALSE, eval=TRUE} |
| 138 | +W > 20 |
| 139 | +
|
| 140 | +sum(W > 20) |
| 141 | +
|
| 142 | +Year[W > 20] |
| 143 | +``` |
| 144 | + |
| 145 | +## Objects and Containers in R |
| 146 | + |
| 147 | +Characters and data frames |
| 148 | + |
| 149 | +```{r, message=FALSE, eval=TRUE} |
| 150 | +Year <- 2008 : 2017 |
| 151 | +NL <- c("PHI", "PHI", "SFN", "SLN", "SFN", |
| 152 | + "SLN", "SFN", "NYN", "CHN", "LAN") |
| 153 | +AL <- c("TBA", "NYA", "TEX", "TEX", "DET", |
| 154 | + "BOS", "KCA", "KCA", "CLE", "HOU") |
| 155 | +Winner <- c("NL", "AL", "NL", "NL", "NL", |
| 156 | + "AL", "NL", "AL", "NL", "AL") |
| 157 | +N_Games <- c(5, 6, 5, 7, 4, 7, 7, 5, 7, 7) |
| 158 | +
|
| 159 | +WS_results <- tibble::tibble( |
| 160 | + Year = Year, NL_Team = NL, AL_Team = AL, |
| 161 | + N_Games = N_Games, Winner = Winner) |
| 162 | +
|
| 163 | +WS_results |
| 164 | +
|
| 165 | +WS <- WS_results |> |
| 166 | + dplyr::group_by(Winner) |> |
| 167 | + dplyr::summarize(N = dplyr::n()) |
| 168 | +
|
| 169 | +WS |
| 170 | +
|
| 171 | +ggplot2::ggplot(WS, ggplot2::aes(x = Winner, y = N)) + |
| 172 | + ggplot2::geom_col() |
| 173 | +``` |
| 174 | + |
| 175 | +Factors |
| 176 | + |
| 177 | +```{r, message=FALSE, eval=TRUE} |
| 178 | +# Alphabetical order |
| 179 | +WS_results |> |
| 180 | + dplyr::group_by(NL_Team) |> |
| 181 | + dplyr::summarize(N = dplyr::n()) |
| 182 | +
|
| 183 | +# Use a factor to order by division |
| 184 | +WS_results <- WS_results |> |
| 185 | + dplyr::mutate( |
| 186 | + NL_Team = factor( |
| 187 | + NL_Team, |
| 188 | + levels = c("NYN", "PHI", "CHN", "SLN", "LAN", "SFN") |
| 189 | + ) |
| 190 | + ) |
| 191 | +
|
| 192 | +# Now ordered by division |
| 193 | +WS_results |> |
| 194 | + dplyr::group_by(NL_Team) |> |
| 195 | + dplyr::summarize(N = dplyr::n()) |
| 196 | +``` |
| 197 | + |
| 198 | +Lists |
| 199 | + |
| 200 | +```{r, message=FALSE, eval=TRUE} |
| 201 | +world_series <- list( |
| 202 | + Winner = Winner, |
| 203 | + Number_Games = N_Games, |
| 204 | + Seasons = "2008 to 2017" |
| 205 | +) |
| 206 | +
|
| 207 | +world_series |
| 208 | +``` |
| 209 | +There are several ways to extract an element from a list: |
| 210 | + |
| 211 | +```{r, message=FALSE, eval=TRUE} |
| 212 | +world_series$Number_Games |
| 213 | +
|
| 214 | +world_series[[2]] |
| 215 | +
|
| 216 | +purrr::pluck(world_series, "Number_Games") |
| 217 | +
|
| 218 | +world_series["Number_Games"] |
| 219 | +``` |
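
Not all of these return the same kind of object. The sketch below uses a small hypothetical list, `ws`, to show that single brackets keep the list wrapper while double brackets return the element itself.

```{r, message=FALSE, eval=TRUE}
# Hypothetical two-element list, for illustration only
ws <- list(Winner = c("NL", "AL"), Number_Games = c(5, 6))

single <- ws["Number_Games"]    # still a list, of length 1
double <- ws[["Number_Games"]]  # the numeric vector itself

is.list(single)     # TRUE
is.numeric(double)  # TRUE
sum(double)         # 11
```

Calling `sum(single)` would fail because `single` is still a list, so `[[ ]]`, `$`, or `purrr::pluck()` is usually what you want before doing arithmetic.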
| 220 | + |
| 221 | + |
| 222 | +```{r, message=FALSE, eval=TRUE} |
| 223 | +WS_results$NL_Team |
| 224 | +
|
| 225 | +# same |
| 226 | +dplyr::pull(WS_results, NL_Team) |
| 227 | +``` |
| 228 | + |
| 229 | +## Collection of R Commands |
| 230 | + |
| 231 | +```{r, eval=FALSE} |
| 232 | +# We can save this as a script to run later |
| 233 | +
|
| 234 | +library(Lahman) |
| 235 | +library(tidyverse) |
| 236 | +
|
| 237 | +crcblue <- "#2905a1" |
| 238 | +
|
| 239 | +ws <- SeriesPost |> |
| 240 | + filter(yearID >= 1903, round == "WS", wins + losses < 8) |
| 241 | +ggplot(ws, aes(x = wins + losses)) + |
| 242 | + geom_bar(fill = crcblue) + |
| 243 | + labs(x = "Number of games", y = "Frequency") |
| 244 | +``` |
| 245 | + |
| 246 | + |
| 247 | +```{r} |
| 248 | +# running the script |
| 249 | +
|
| 250 | +source(here::here("scripts/WorldSeriesLength.R"), echo = TRUE) |
| 251 | +``` |
| 252 | + |
| 253 | + |
| 254 | +```{r} |
| 255 | +source(here::here("scripts/hr_rates.R")) |
| 256 | +
|
| 257 | +# Mickey Mantle stats from 1951 to 1961 |
| 258 | +
|
| 259 | +HR <- c(13, 23, 21, 27, 37, 52, 34, 42, 31, 40, 54) |
| 260 | +AB <- c(341, 549, 461, 543, 517, 533, 474, 519, 541, 527, 514) |
| 261 | +Age <- 19 : 29 |
| 262 | +hr_rates(Age, HR, AB) |
| 263 | +``` |
| 264 | + |
| 265 | + |
| 266 | +## Reading and Writing Data |
| 267 | + |
| 268 | +```{r, eval=FALSE} |
| 269 | +# Read data |
| 270 | +getwd() |
| 271 | +
|
| 272 | +spahn <- readr::read_csv(here::here("data/spahn.csv")) |
| 273 | +``` |
| 274 | + |
| 275 | + |
| 276 | +```{r, eval=FALSE} |
| 277 | +# Write data |
| 278 | +mantle_hr_rates <- hr_rates(Age, HR, AB) |
| 279 | +Mantle <- tibble::tibble( |
| 280 | + Age, HR, AB, Rates = mantle_hr_rates$y |
| 281 | +) |
| 282 | +
|
| 283 | +readr::write_csv(Mantle, here::here("data/mantle.csv")) |
| 284 | +``` |
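
To try a write/read round trip without touching the project's `data/` folder, here is a minimal sketch. The toy tibble and the use of `tempfile()` are assumptions for illustration.

```{r, message=FALSE, eval=TRUE}
# Hypothetical toy data written to a temporary file
toy <- tibble::tibble(Age = 19:21, HR = c(13, 23, 21))
path <- tempfile(fileext = ".csv")

readr::write_csv(toy, path)

# show_col_types = FALSE silences the column specification message
toy_back <- readr::read_csv(path, show_col_types = FALSE)

names(toy_back)
nrow(toy_back)
```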
| 285 | + |
| 286 | + |
| 287 | +## Packages |
| 288 | + |
| 289 | +Over 20,000 packages are currently available via CRAN. |
| 290 | + |
| 291 | +```{r, eval=FALSE} |
| 292 | +# Install package |
| 293 | +install.packages("Lahman") |
| 294 | +
|
| 295 | +library(Lahman) |
| 296 | +``` |
| 297 | + |
| 298 | + |
| 299 | +You can also install packages from GitHub. |
| 300 | + |
| 301 | +```{r, eval=FALSE} |
| 302 | +remotes::install_github("beanumber/abdwr3edata") |
| 303 | +
|
| 304 | +library(abdwr3edata) |
| 305 | +``` |
| 306 | + |
| 307 | +Use a question mark (`?`) to learn more about a package's contents. |
| 308 | + |
| 309 | +```{r, eval=FALSE} |
| 310 | +# to learn more about the Batting data set in Lahman |
| 311 | +?Batting |
| 312 | +``` |
| 313 | + |
| 314 | + |
| 315 | +## Splitting, Applying, and Combining Data |
| 316 | + |
| 317 | +```{r} |
| 318 | +library(Lahman) |
| 319 | +
|
| 320 | +Batting |> |
| 321 | + dplyr::filter(yearID >= 1960, yearID <= 1969) |> |
| 322 | + dplyr::group_by(playerID) |> |
| 323 | + dplyr::summarize(HR = sum(HR)) |> |
| 324 | + dplyr::arrange(desc(HR)) |> |
| 325 | + dplyr::slice(1:4) |
| 326 | +``` |
| 327 | +What if we want to find the top HR hitters for each decade? |
| 328 | + |
| 329 | +```{r} |
| 330 | +hr_leader <- function(data) { |
| 331 | + data |> |
| 332 | + dplyr::group_by(playerID) |> |
| 333 | + dplyr::summarize(HR = sum(HR)) |> |
| 334 | + dplyr::arrange(desc(HR)) |> |
| 335 | + dplyr::slice(1) |
| 336 | +} |
| 337 | +``` |
| 338 | + |
| 339 | +Do you see any potential issues with this function? |
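
One possible issue (an assumption about what the question is hinting at) is ties: `slice(1)` keeps only the first row, so a co-leader is silently dropped. The sketch below uses made-up data in `toy` and compares it with `dplyr::slice_max()`, which keeps ties by default.

```{r, message=FALSE, eval=TRUE}
# Hypothetical players: "a" and "b" both total 25 HR
toy <- tibble::tibble(
  playerID = c("a", "a", "b", "c"),
  HR = c(10, 15, 25, 5)
)

totals <- toy |>
  dplyr::group_by(playerID) |>
  dplyr::summarize(HR = sum(HR))

# Same approach as hr_leader(): only one row survives
first_only <- totals |>
  dplyr::arrange(dplyr::desc(HR)) |>
  dplyr::slice(1)

# slice_max() keeps every row tied for the maximum
with_ties <- totals |>
  dplyr::slice_max(HR, n = 1, with_ties = TRUE)

nrow(first_only)  # 1
nrow(with_ties)   # 2
```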
| 340 | + |
| 341 | +```{r} |
| 342 | +Batting_decade <- Batting |> |
| 343 | + dplyr::mutate(decade = 10 * floor(yearID / 10)) |> |
| 344 | + dplyr::group_by(decade) |
| 345 | +
|
| 346 | +decades <- Batting_decade |> |
| 347 | + dplyr::group_keys() |> |
| 348 | + dplyr::pull("decade") |
| 349 | +
|
| 350 | +decades |
| 351 | +``` |
| 352 | + |
| 353 | +```{r} |
| 354 | +Batting_decade |> |
| 355 | + dplyr::group_split() |> |
| 356 | + purrr::map(hr_leader) |> |
| 357 | + purrr::set_names(decades) |> |
| 358 | + dplyr::bind_rows(.id = "decade") |
| 359 | +``` |
| 360 | + |
| 361 | +## Getting Help |
| 362 | + |
| 363 | +```{r} |
| 364 | +# Will open documentation for the function |
| 365 | +?geom_point |
| 366 | +``` |
| 367 | + |
| 368 | +```{r} |
| 369 | +# Searches the installed help pages for this character string |
| 370 | +??geom_point |
| 371 | +``` |
| 372 | + |
| 373 | + |
| 374 | +## Further Reading |
| 375 | + |
| 376 | +- [R for Data Science (2e)](https://r4ds.hadley.nz/) |
| 377 | +- [Modern Data Science with R](https://www.routledge.com/Modern-Data-Science-with-R/Baumer-Kaplan-Horton/p/book/9780367191498) $$ |
| 378 | +- [A Modern Dive into R and the Tidyverse](https://moderndive.com/) |
| 379 | +- [R for the Rest of Us](https://book.rfortherestofus.com/) |
| 380 | +- [R Packages](https://r-pkgs.org/) |
| 381 | +- [Happy Git and GitHub for the useR](https://happygitwithr.com) |
| 382 | + |
| 383 | + |
| 384 | + |
6 | 385 |
|
7 | | -## SLIDE 1 {-} |
8 | 386 |
|
9 | | -- ADD SLIDES AS SECTIONS (`##`). |
10 | | -- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF. |
|