Commit 60f6db3

Chapter1 (#2)
* Add titles * add tidyverse * add more slides * add 1 exercise * mark baseballr code blocks eval false
1 parent c1d6866 commit 60f6db3

File tree

4 files changed

+174
-6
lines changed

01_the-baseball-datasets.Rmd

Lines changed: 169 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,175 @@
11
# The Baseball Datasets
22

3+
34
**Learning objectives:**
45

5-
- THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY
6+
- Book club overview
7+
8+
- Introduction to data sets we will use for the book
9+
10+
## Overview {-}
11+
12+
- Book: [Analyzing Baseball Data with R](https://beanumber.github.io/abdwr3e/)
13+
- Code repo: [abdwr3edata](https://github.com/beanumber/abdwr3edata)
14+
- Goal: Learn how to answer baseball questions by learning how to:
15+
16+
- Access baseball data
17+
18+
- Manipulate and analyze baseball data with R and tidyverse.
19+
20+
- Communicate results with Quarto and Shiny
21+
22+
23+
## Baseball terms {-}
24+
25+
Some useful resources for definitions of baseball terms and statistics:
26+
27+
[MLB Glossary](https://www.mlb.com/glossary)
28+
29+
[Baseball reference](https://www.baseball-reference.com/about/)
30+
31+
[MLB Rules 2025](https://mktg.mlbstatic.com/mlb/official-information/2025-official-baseball-rules.pdf)
32+
33+
## Lahman Database {-}
34+
35+
- Season-by-season data from 1871 to the current season
36+
37+
- Consists of multiple tables:
38+
39+
- People : Player names, DOB, etc
40+
41+
- Batting : Batting statistics by player / year / 'stint'
42+
43+
- Pitching : Pitching statistics by player / year / 'stint'
44+
45+
- Fielding : Fielding statistics by player / year / 'stint'
46+
47+
- Teams : Team performance by year.
48+
49+
## Lahman from `R` {-}
50+
51+
52+
```{r}
53+
library(Lahman)
54+
library(tidyverse)
55+
Teams |>
56+
slice_tail(n = 3)
57+
```
58+
## Example Uses for Lahman {-}
59+
60+
- Good for answering questions like:
61+
- What is the average number of home runs per game recorded in each decade? Does the rate of strikeouts show any correlation with the rate of home runs? (*Teams* table)
62+
63+
- How does the percentage of games completed by the starting pitcher from 2000 to 2010 compare to the percentage of games 100 years before? (*Pitching* table)
64+
65+
- Which player had the most walks per plate appearance (BB%) in a given year? (*Batting* table)
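The walk-rate question above can be sketched in a few lines. This is a minimal illustration, not the book's solution: the rows below are made up, only the column names follow the `Batting` table, plate appearances are approximated as AB + BB + HBP + SH + SF, and base R is used so the block runs without any packages.

```r
# Toy rows using Lahman Batting column names; all numbers are invented.
batting <- data.frame(
  playerID = c("aaa01", "aaa01", "bbb01", "ccc01"),
  yearID   = c(2020, 2020, 2020, 2020),
  stint    = c(1, 2, 1, 1),
  AB       = c(200, 100, 450, 500),
  BB       = c(40, 20, 45, 25),
  HBP      = c(2, 1, 3, 5),
  SH       = c(0, 0, 1, 0),
  SF       = c(3, 1, 1, 5)
)

# Sum counting stats across stints so each player-season is one row.
totals <- aggregate(
  cbind(AB, BB, HBP, SH, SF) ~ playerID + yearID,
  data = batting,
  FUN = sum
)

# Assumption: plate appearances approximated as AB + BB + HBP + SH + SF.
totals$PA <- with(totals, AB + BB + HBP + SH + SF)
totals$BB_pct <- totals$BB / totals$PA

# Highest walk rate first.
leaders <- totals[order(-totals$BB_pct), ]
print(leaders[, c("playerID", "yearID", "PA", "BB_pct")])
```

Summing over `stint` first matters because a player traded mid-season appears on more than one row for the same year.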
66+
67+
68+
## Retrosheet Game-by-Game Data {-}
69+
70+
- Game logs going back to 1871
71+
72+
- Team offensive / defensive stats, starting players, etc.
73+
74+
- All 161 fields documented [here](https://www.retrosheet.org/gamelogs/glfields.txt).
75+
76+
- [Retrosheet game logs](https://www.retrosheet.org/gamelogs/index.html) are provided for each season as zipped csv files.
77+
78+
- Some sample data is provided in the book's code repo.
79+
80+
- Example question: In which months are home runs more likely to occur?
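A rough sketch of how the month question could be approached once game logs are loaded. The data frame below is invented, and the column names `Date`, `VisitorHR`, and `HomeHR` are assumptions for illustration; the raw Retrosheet files have no header row, so the names depend on how you read them in. The only fact relied on is that game-log dates are stored as `yyyymmdd`.

```r
# Toy game-log rows; column names are assumed and all values are made up.
game_logs <- data.frame(
  Date      = c("20230405", "20230418", "20230702", "20230715", "20230720"),
  VisitorHR = c(1, 0, 2, 3, 1),
  HomeHR    = c(0, 1, 2, 1, 2)
)

# Dates are yyyymmdd strings, so characters 5-6 hold the month.
game_logs$month <- as.integer(substr(game_logs$Date, 5, 6))

# Total home runs hit by both teams in each game.
game_logs$HR <- game_logs$VisitorHR + game_logs$HomeHR

# Average home runs per game within each month.
hr_by_month <- aggregate(HR ~ month, data = game_logs, FUN = mean)
print(hr_by_month)
```

Averaging per game, rather than summing, keeps months with fewer scheduled games comparable to full months.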
81+
82+
## Retrosheet Play-by-Play Data {-}
83+
84+
- Event files provided for each game since 1913
85+
86+
- Data for each play (similar to what you might find on a baseball scoresheet).
87+
88+
- [Event files](https://www.retrosheet.org/game.htm) are provided as zipped collections by season. Each file covers one team's home games for that season.
89+
90+
- [Detailed description](https://www.retrosheet.org/eventfile.htm)
91+
92+
- Example line:
93+
`play,3,1,bichb001,01,CX,D8/L89XD+.1-3`
94+
95+
Translation:
96+
97+
>In the 3rd inning, with the home team (Blue Jays) at bat (the field after the inning is `1`; `0` would mean the visiting team), Bo Bichette (`bichb001`) had a 0-1 count after a called strike (`C`) and put the next pitch in play (`X`). He hit a double fielded by the center fielder (`D8`), a line drive (`L`) to extra-deep right-center (`89XD`). The runner on first base advanced to third on the play (`1-3`).
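To make the field layout concrete, here is a small base R sketch that splits the example record into its parts. The field names are labels chosen for this illustration, not official Retrosheet names; the layout (inning, home/visitor flag with `0` for visitor and `1` for home, player id, count, pitch sequence, event) follows the Retrosheet event file description linked above. It is not a full parser.

```r
record <- "play,3,1,bichb001,01,CX,D8/L89XD+.1-3"

# Split on commas and attach illustrative (hypothetical) field names.
fields <- strsplit(record, ",", fixed = TRUE)[[1]]
names(fields) <- c("type", "inning", "home_flag", "player_id",
                   "count", "pitches", "event")

# Home/visitor flag: 0 = visiting team batting, 1 = home team batting.
batting_side <- if (fields[["home_flag"]] == "1") "home" else "visitor"

# The count is two digits: balls, then strikes.
balls   <- as.integer(substr(fields[["count"]], 1, 1))
strikes <- as.integer(substr(fields[["count"]], 2, 2))

# In the event field, the text before the first "." describes the play
# and the text after it lists runner advances.
event_parts <- strsplit(fields[["event"]], ".", fixed = TRUE)[[1]]
play_desc   <- event_parts[1]
advances    <- if (length(event_parts) > 1) event_parts[2] else NA_character_

print(list(inning = fields[["inning"]], side = batting_side,
           balls = balls, strikes = strikes,
           play = play_desc, advances = advances))
```

Real event files also contain other record types (for example `id`, `start`, `sub`), which is one reason the book points to dedicated tools instead of hand-written parsing.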
98+
99+
100+
## Accessing the data {-}
101+
102+
- Retrosheet provides some (DOS!) tools for parsing the data
103+
104+
- Instead, the book recommends using the tools presented in Appendix A
105+
106+
- Example question: What is the major league batting average when the ball/strike count is 0-2? What about on 2-0?
107+
108+
## Pitch-by-Pitch Data {-}
109+
110+
![PitchLocation](images/pitchlocation.png)
111+
112+
- Even more detailed data: Ball release point, trajectory, location at plate, etc.
113+
114+
- PITCHf/x data from 2008-2017 (?). Replaced by Statcast
115+
116+
- Example question: What are the chances of a successful steal when the pitcher throws a fastball compared to when a curve is delivered?
117+
118+
## Statcast {-}
119+
120+
- Tracks more than PITCHf/x, including the movement of individual players
121+
122+
- Limited data is provided to the public by [Baseball Savant](https://baseballsavant.mlb.com)
123+
124+
- [Baseballr](https://billpetti.github.io/baseballr/) package provides tools to download Baseball Savant data (and more!)
125+
126+
At the time of writing, the CRAN version of baseballr does not seem to work for this, so install the development version from GitHub:
127+
128+
```
129+
# Install the remotes package if you haven't already
130+
install.packages("remotes")
131+
remotes::install_github("BillPetti/baseballr")
132+
```
133+
134+
```{r, eval = FALSE}
135+
library(baseballr)
136+
noah <- statcast_search(start_date = "2016-04-06",
137+
end_date = "2016-04-15",
138+
playerid = 592789,
139+
player_type = 'pitcher')
140+
```
141+
142+
- Example Question: How frequently do MLB teams employ infield shifts?
143+
144+
## Other data on baseballr {-}
145+
146+
- Scrape extensive data from [Baseball Reference](https://www.baseball-reference.com) and [FanGraphs](https://www.fangraphs.com)
147+
148+
```{r, eval = FALSE}
149+
library(baseballr)
150+
bref_standings_on_date("2025-04-12", "NL East", from = FALSE)
151+
```
152+
153+
- Retrosheet - but needs a special CLI (See Appendix A)
154+
155+
## Data used in the book {-}
156+
157+
- Lahman data from `Lahman` package
158+
- Small examples for other data are in `abdwr3edata` package
159+
- Large examples will require downloading data separately.
160+
161+
## Exercises {-}
162+
163+
### Exercise 1: {-}
164+
165+
This chapter has given an overview of the Lahman database, the Retrosheet game logs, the Retrosheet play-by-play files, the PITCHf/x database, and the Statcast database. Describe the relevant data among these five databases that can be used to answer the following baseball questions.
166+
167+
How has the rate of walks (per team for nine innings) changed over the history of baseball?
168+
What fraction of baseball games in 1968 were shutouts? Compare this fraction with the fraction of shutouts in the 2012 baseball season.
169+
170+
What percentage of first pitches are strikes? If the count is 2-0, what fraction of the pitches are strikes?
171+
172+
Which players are most likely to hit groundballs? Of these players, compare the speeds at which these groundballs are hit.
6173

7-
## SLIDE 1 {-}
174+
Is it easier to steal second base or third base? (Compare the fraction of successful steals of second base with the fraction of successful steals of third base.)
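As a starting point for the first question (walks per team per nine innings), one possible approach uses season-level team totals. The sketch below uses invented numbers and hypothetical team IDs; only the column names `yearID`, `BBA` (walks allowed), and `IPouts` (outs pitched) are taken from the Lahman `Teams` table. Since nine innings is 27 outs, the rate is `27 * BBA / IPouts`.

```r
# Toy team-season rows with Lahman Teams column names; values are made up.
teams <- data.frame(
  yearID = c(1968, 1968, 2012, 2012),
  teamID = c("AAA", "BBB", "AAA", "BBB"),
  BBA    = c(400, 450, 500, 520),
  IPouts = c(4350, 4380, 4320, 4350)
)

# Pool walks allowed and outs pitched across all teams in a season.
season <- aggregate(cbind(BBA, IPouts) ~ yearID, data = teams, FUN = sum)

# Nine innings = 27 outs, so walks per nine innings is 27 * BBA / IPouts.
season$BB_per_9 <- 27 * season$BBA / season$IPouts
print(season)
```

Pooling before dividing weights each team by the innings it actually pitched, which is slightly different from averaging the per-team rates.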
8175

9-
- ADD SLIDES AS SECTIONS (`##`).
10-
- TRY TO KEEP THEM RELATIVELY SLIDE-LIKE; THESE ARE NOTES, NOT THE BOOK ITSELF.

DESCRIPTION

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -9,5 +9,9 @@ Depends:
99
R (>= 3.1.0)
1010
Imports:
1111
bookdown,
12-
rmarkdown
12+
rmarkdown,
13+
tidyverse,
14+
remotes,
15+
Lahman,
16+
baseballr
1317
Encoding: UTF-8

bookclub-r_baseball.Rproj

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,4 @@
11
Version: 1.0
2-
ProjectId: be451784-a7bb-46ae-b058-c9212b09a2b2
32

43
RestoreWorkspace: Default
54
SaveWorkspace: Default

images/pitchlocation.png

42 KB