|
1 | 1 | # The Baseball Datasets |
2 | 2 |
|
| 3 | + |
3 | 4 | **Learning objectives:** |
4 | 5 |
|
5 | | -- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY |
| 6 | +- Book club overview |
| 7 | + |
| 8 | +- Introduction to the data sets used in the book |
| 9 | + |
| 10 | +## Overview {-} |
| 11 | + |
| 12 | +- Book: [Analyzing Baseball Data with R](https://beanumber.github.io/abdwr3e/) |
| 13 | +- Code repo: [abdwr3edata](https://github.com/beanumber/abdwr3edata) |
| 14 | +- Goal: Answer baseball questions by learning how to: |
| 15 | + |
| 16 | + - Access baseball data |
| 17 | + |
| 18 | + - Manipulate and analyze baseball data with R and tidyverse. |
| 19 | + |
| 20 | + - Communicate results with Quarto and Shiny |
| 21 | + |
| 22 | + |
| 23 | +## Baseball terms {-} |
| 24 | + |
| 25 | +Some useful resources for definitions of baseball terms and statistics: |
| 26 | + |
| 27 | +[MLB Glossary](https://www.mlb.com/glossary) |
| 28 | + |
| 29 | +[Baseball Reference](https://www.baseball-reference.com/about/) |
| 30 | + |
| 31 | +[MLB Rules 2025](https://mktg.mlbstatic.com/mlb/official-information/2025-official-baseball-rules.pdf) |
| 32 | + |
| 33 | +## Lahman Database {-} |
| 34 | + |
| 35 | +- Season-by-season data from 1871 to the most recent season |
| 36 | + |
| 37 | +- Consists of multiple tables: |
| 38 | + |
| 39 | + - People : Player names, DOB, etc |
| 40 | + |
| 41 | + - Batting : Batting statistics by player / year / 'stint' |
| 42 | + |
| 43 | + - Pitching : Pitching statistics by player / year / 'stint' |
| 44 | + |
| 45 | + - Fielding : Fielding statistics by player / year / 'stint' |
| 46 | + |
| 47 | + - Teams : Team performance by year. |
| 48 | + |
| 49 | +## Lahman from `R` {-} |
| 50 | + |
| 51 | + |
| 52 | +```{r} |
| 53 | +library(Lahman) |
| 54 | +library(tidyverse) |
| 55 | +Teams |> |
| 56 | + slice_tail(n = 3) |
| 57 | +``` |
| 58 | +## Example Uses for Lahman {-} |
| 59 | + |
| 60 | +- Good for answering questions like: |
| 61 | + - What is the average number of home runs per game recorded in each decade? Does the rate of strikeouts show any correlation with the rate of home runs? (*Teams* table) |
| 62 | + |
| 63 | + - How does the percentage of games completed by the starting pitcher from 2000 to 2010 compare to the percentage of games 100 years before? (*Pitching* table) |
| 64 | + |
| 65 | + - Which player had the most walks per plate appearance (BB%) in a given year? (*Batting* table) |
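The first question above can be sketched in a few lines of base R. This is only an illustration: `teams_toy` is a made-up stand-in for `Lahman::Teams` (it reuses the `yearID`, `G`, `HR`, and `SO` column names), and `rates_by_decade()` is a hypothetical helper, not something from the book or the `Lahman` package.

```r
# Toy stand-in for Lahman::Teams. The numbers are invented for illustration;
# only the column names (yearID, G, HR, SO) mirror the real table.
teams_toy <- data.frame(
  yearID = c(1961, 1968, 1998, 1999),
  G      = c(162, 162, 162, 162),
  HR     = c(162, 81, 162, 243),
  SO     = c(810, 972, 1134, 1134)
)

# Hypothetical helper: per-team-game home run and strikeout rates by decade.
rates_by_decade <- function(teams) {
  decade <- 10 * (teams$yearID %/% 10)
  # Sum games, home runs, and strikeouts within each decade.
  # na.rm = TRUE because some early seasons may have missing values.
  totals <- aggregate(
    teams[, c("G", "HR", "SO")],
    by = list(decade = decade),
    FUN = sum, na.rm = TRUE
  )
  totals$hr_per_game <- totals$HR / totals$G
  totals$so_per_game <- totals$SO / totals$G
  totals
}

rates_toy <- rates_by_decade(teams_toy)
print(rates_toy)
```

With the `Lahman` package loaded, `rates_by_decade(Teams)` should work the same way, and `cor(rates$hr_per_game, rates$so_per_game)` on the result is one rough way to look at the strikeout question.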
| 66 | + |
| 67 | + |
| 68 | +## Retrosheet Game-by-Game Data {-} |
| 69 | + |
| 70 | +- Game logs going back to 1871 |
| 71 | + |
| 72 | + - Team offensive / defensive stats, starting players, etc. |
| 73 | + |
| 74 | + - All 161 fields documented [here](https://www.retrosheet.org/gamelogs/glfields.txt). |
| 75 | + |
| 76 | +- [Retrosheet game logs](https://www.retrosheet.org/gamelogs/index.html) are provided for each season as zipped csv files. |
| 77 | + |
| 78 | +- Some sample data is provided in the book's code repo. |
| 79 | + |
| 80 | +- Example question: In which months are home runs more likely to occur? |
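A minimal sketch of the month question, assuming the game logs have already been read into a data frame. The column names `date`, `visitor_hr`, and `home_hr` are hypothetical labels chosen for this example (the raw files have no header row), and the toy values are invented. The only format detail relied on is that the game-log date is stored as `yyyymmdd`.

```r
# Toy game-log rows; values are made up for illustration only.
gl_toy <- data.frame(
  date       = c("20230401", "20230415", "20230704", "20230719", "20230728"),
  visitor_hr = c(1, 0, 2, 1, 3),
  home_hr    = c(0, 1, 2, 0, 1)
)

# Hypothetical helper: average home runs per game (both teams) by month.
hr_by_month <- function(gl) {
  # Characters 5-6 of a yyyymmdd date are the month.
  month <- as.integer(substr(as.character(gl$date), 5, 6))
  total_hr <- gl$visitor_hr + gl$home_hr
  out <- aggregate(total_hr, by = list(month = month), FUN = mean)
  names(out)[2] <- "hr_per_game"
  out
}

monthly_toy <- hr_by_month(gl_toy)
print(monthly_toy)
```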
| 81 | + |
| 82 | +## Retrosheet Play-by-Play Data {-} |
| 83 | + |
| 84 | +- Event files provided for each game since 1913 |
| 85 | + |
| 86 | +- Data for each play (similar to what you might find on a baseball scoresheet). |
| 87 | + |
| 88 | +- [Event files](https://www.retrosheet.org/game.htm) are provided as zipped collections by season. Each file covers one team's home games for the season. |
| 89 | + |
| 90 | +- [Detailed description](https://www.retrosheet.org/eventfile.htm) |
| 91 | + |
| 92 | +- Example line: |
| 93 | +`play,3,1,bichb001,01,CX,D8/L89XD+.1-3` |
| 94 | + |
| 95 | +Translation: |
| 96 | + |
| 97 | +>In the bottom of the 3rd inning (the `1` flag means the home team, here the Blue Jays, is at bat), Bo Bichette doubled on a 0-1 count (`CX`: a called strike, then the ball put in play). The hit (`D8/L89XD+`) was a line drive toward deep right-center, fielded by the center fielder. A runner on first base advanced to third on the play (`1-3`). |
| 98 | + |
| 99 | + |
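To make the field layout of a `play` record concrete, here is a small base R sketch that splits the example line into its comma-separated parts. `parse_play()` is a hypothetical helper written for these notes; it is not one of the tools the book uses, and it only handles the basic layout of the example (it does not decode the event string itself).

```r
line <- "play,3,1,bichb001,01,CX,D8/L89XD+.1-3"

# Hypothetical helper: split one Retrosheet "play" record into named pieces.
parse_play <- function(line) {
  f <- strsplit(line, ",", fixed = TRUE)[[1]]
  stopifnot(length(f) == 7, f[1] == "play")

  # The event field is "<play>.<runner advances>"; advances are ";"-separated.
  event_parts <- strsplit(f[7], ".", fixed = TRUE)[[1]]
  advances <- if (length(event_parts) > 1) {
    strsplit(event_parts[2], ";", fixed = TRUE)[[1]]
  } else {
    character(0)
  }

  list(
    inning       = as.integer(f[2]),
    batting_side = if (f[3] == "1") "home" else "visitor",  # 0 = visitor, 1 = home
    batter_id    = f[4],
    # Count is two characters: balls then strikes. Unknown counts ("??")
    # would become NA here (with a coercion warning).
    balls        = as.integer(substr(f[5], 1, 1)),
    strikes      = as.integer(substr(f[5], 2, 2)),
    pitches      = f[6],
    play         = event_parts[1],
    advances     = advances
  )
}

parsed <- parse_play(line)
str(parsed)
```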
| 100 | +## Accessing the data {-} |
| 101 | + |
| 102 | +- Retrosheet provides some (DOS!) tools for parsing the data |
| 103 | + |
| 104 | +- Instead, the book recommends using the tools presented in Appendix A |
| 105 | + |
| 106 | +- Example question: What is the major league batting average when the ball/strike count is 0-2? What about on 2-0? |
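Once play-by-play data is parsed into one row per plate appearance, the count question reduces to a grouped average. The sketch below assumes a data frame with hypothetical columns `count`, `is_at_bat`, and `is_hit`; these names and the toy values are invented for illustration and are not the column names produced by the Appendix A tools. It also takes the simplest reading of the question: the count on which the plate appearance ended.

```r
# Toy plate appearances; values are made up for illustration only.
pa_toy <- data.frame(
  count     = c("02", "02", "02", "02", "20", "20", "20"),
  is_at_bat = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE),
  is_hit    = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Hypothetical helper: batting average (hits / at-bats) by final count.
avg_by_count <- function(pa) {
  ab <- pa[pa$is_at_bat, ]  # walks, sacrifices, etc. are not at-bats
  out <- aggregate(ab$is_hit, by = list(count = ab$count), FUN = mean)
  names(out)[2] <- "batting_avg"
  out
}

avg_toy <- avg_by_count(pa_toy)
print(avg_toy)
```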
| 107 | + |
| 108 | +## Pitch-by-Pitch Data {-} |
| 109 | + |
| 110 | + |
| 111 | + |
| 112 | +- Even more detailed data: Ball release point, trajectory, location at plate, etc. |
| 113 | + |
| 114 | +- PITCHf/x data from 2008-2017 (?). Replaced by Statcast |
| 115 | + |
| 116 | +- Example question: What are the chances of a successful steal when the pitcher throws a fastball compared to when a curve is delivered? |
| 117 | + |
| 118 | +## Statcast {-} |
| 119 | + |
| 120 | +- Tracks more than PITCHf/x, including the movement of individual players |
| 121 | + |
| 122 | +- Limited data is provided to the public by [Baseball Savant](https://baseballsavant.mlb.com) |
| 123 | + |
| 124 | +- The [`baseballr`](https://billpetti.github.io/baseballr/) package provides tools to download Baseball Savant data (and more!) |
| 125 | + |
| 126 | +At the moment, this does not seem to work with the CRAN version, so the development version needs to be installed: |
| 127 | + |
| 128 | +``` |
| 129 | +# Install the remotes package if you haven't already |
| 130 | +install.packages("remotes") |
| 131 | +remotes::install_github("BillPetti/baseballr") |
| 132 | +``` |
| 133 | + |
| 134 | +```{r, eval = FALSE} |
| 135 | +library(baseballr) |
| 136 | +noah <- statcast_search(start_date = "2016-04-06", |
| 137 | + end_date = "2016-04-15", |
| 138 | + playerid = 592789, |
| 139 | + player_type = 'pitcher') |
| 140 | +``` |
| 141 | + |
| 142 | +- Example Question: How frequently do MLB teams employ infield shifts? |
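One rough way to approach the shift question, assuming a data frame of Statcast pitches has already been downloaded. The sketch assumes the data has an `if_fielding_alignment` column with the value `"Infield shift"` marking shifted alignments; treat both the column name and the label as assumptions to check against the actual download. `shift_rate()` is a hypothetical helper and the toy values are invented.

```r
# Toy stand-in for downloaded Statcast pitches; values are made up.
pitches_toy <- data.frame(
  if_fielding_alignment = c("Standard", "Standard", "Infield shift",
                            "Standard", NA)
)

# Hypothetical helper: share of pitches thrown with the infield shifted,
# ignoring rows where the alignment is missing or blank.
shift_rate <- function(pitches) {
  alignment <- pitches$if_fielding_alignment
  alignment <- alignment[!is.na(alignment) & alignment != ""]
  mean(alignment == "Infield shift")
}

shift_toy <- shift_rate(pitches_toy)
print(shift_toy)
```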
| 143 | + |
| 144 | +## Other data on baseballr {-} |
| 145 | + |
| 146 | +- Scrape extensive data from [Baseball Reference](https://www.baseball-reference.com) and [FanGraphs](https://www.fangraphs.com) |
| 147 | + |
| 148 | +```{r, eval = FALSE} |
| 149 | +library(baseballr) |
| 150 | +bref_standings_on_date("2025-04-12", "NL East", from = FALSE) |
| 151 | +``` |
| 152 | + |
| 153 | +- Retrosheet - but needs a special CLI (See Appendix A) |
| 154 | + |
| 155 | +## Data used in the book {-} |
| 156 | + |
| 157 | +- Lahman data from `Lahman` package |
| 158 | +- Small examples for other data are in `abdwr3edata` package |
| 159 | +- Large examples will require downloading data separately. |
| 160 | + |
| 161 | +## Exercises {-} |
| 162 | + |
| 163 | +### Exercise 1: {-} |
| 164 | + |
| 165 | +This chapter has given an overview of the Lahman database, the Retrosheet game logs, the Retrosheet play-by-play files, the PITCHf/x database, and the Statcast database. Describe the relevant data among these five sources that can be used to answer the following baseball questions. |
| 166 | + |
| 167 | +How has the rate of walks (per team for nine innings) changed over the history of baseball? |
| 168 | +What fraction of baseball games in 1968 were shutouts? Compare this fraction with the fraction of shutouts in the 2012 baseball season. |
| 169 | + |
| 170 | +What percentage of first pitches are strikes? If the count is 2-0, what fraction of the pitches are strikes? |
| 171 | + |
| 172 | +Which players are most likely to hit groundballs? Of these players, compare the speeds at which these groundballs are hit. |
6 | 173 |
|
7 | | -## SLIDE 1 {-} |
| 174 | +Is it easier to steal second base or third base? (Compare the fraction of successful steals of second base with the fraction of successful steals of third base.) |
8 | 175 |
|
9 | | -- ADD SLIDES AS SECTIONS (`##`). |
10 | | -- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF. |
|