
Commit 37672c9

Update chapter 4 notes (#13)
* start ch 4 notes
* added in more LaTeX and fixed long fig captions
* added 3 figures
* notes on empirical comparison, poisson regression and GLM
* add logistic regression formulas
* add prediction formula
* added some extra details
* create 04_presentation.qmd
* embed resources on 04_presentation.qmd
* remove directory from pyproject.toml

---------

Co-authored-by: Jon Harmon <[email protected]>
1 parent 16ed2ee commit 37672c9

File tree

11 files changed: +387 -6 lines changed


04_notes.qmd

Lines changed: 209 additions & 0 deletions
# Notes {-}
## What is classification?
> Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class.
### Some questions that can be solved with classification
- A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
- An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
- On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.
### Why not linear regression?
- Linear regression should not be used to predict a qualitative or categorical variable with more than 2 levels that does not have a natural ordering.
- Linear regression should not be used to predict a qualitative or categorical variable with more than 2 levels that does not have a reasonably similar gap between each level.
- If your qualitative outcome variable only has 2 levels, you could recode it as a dummy variable with 0/1 coding. Even then, linear regression is not recommended because the probability estimates may not be meaningful; for example, they can be negative.
- A linear regression model to predict the relationship between `default` and `balance` leads to negative probabilities of default for bank balances close to zero.

What are some examples of categorical variables that are not appropriate as outcome variables for linear regression?
## Logistic Regression
> Rather than modeling this response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category
```{r}
#| label: fig4-2
#| echo: false
#| fig-cap: Figure 4.2
#| out-width: 100%
knitr::include_graphics("images/04-fig4_2.png")
```
Classification using the `Default` data. Left: Estimated probability of `default` using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for `default` (`No` or `Yes`). Right: Predicted probabilities of `default` using logistic regression. All probabilities lie between 0 and 1.
- Logistic regression uses the logistic function, which gives probability estimates between 0 and 1 for all values of $X$.
- The logistic function always produces an S-shaped curve: probabilities come close to, but never fall below, zero and come close to, but never exceed, one.
- The probability of the response $Y$ can be predicted for any value of $X$.
- In linear regression, ${\beta_1}$ is the average change in $Y$ associated with a one-unit increase in $X$. By contrast, in a logistic regression model, increasing $X$ by one unit changes the log odds by ${\beta_1}$.
- Regardless of the value of $X$, if ${\beta_1}$ is positive then increasing $X$ will be associated with increasing $p(X)$, and if ${\beta_1}$ is negative then increasing $X$ will be associated with decreasing $p(X)$.
$$p(X) = \beta_{0} + \beta_{1}X \quad \Longrightarrow \quad \text{Linear regression}$$
$$p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \quad \Longrightarrow \quad \text{Logistic function}$$
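
Rearranging the logistic function gives the *log odds* (or *logit*), which is linear in $X$:

$$\log \left(\frac{p(X)}{1 - p(X)}\right) = \beta_{0} + \beta_{1}X$$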
### Maximum Likelihood
Using maximum likelihood, the regression coefficients are chosen so that the predicted probability is as close as possible to the observed response $Y$ for each case in the training data.
> The estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ are chosen to maximize this likelihood function.
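
For the single-predictor case, the likelihood function being maximized is

$$\ell(\beta_{0}, \beta_{1}) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)$$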
```{r}
#| label: tab4-1
#| echo: false
#| fig-cap: For the `Default` data, estimated coefficients of the logistic regression model that predicts the probability of `default` using `balance`. A one-unit increase in `balance` is associated with an increase in the log odds of `default` by 0.0055 units.
#| out-width: 100%
knitr::include_graphics("images/04-tab4_1.png")
```
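
As a minimal sketch (not from the book's code), the coefficients in this table can be reproduced with `glm()`, assuming the `ISLR2` package is installed to supply the `Default` data; `fit_balance` is an illustrative name:

```{r}
#| label: fit-default-logistic
#| eval: false
library(ISLR2)

# Logistic regression of default on balance, fit by maximum likelihood
fit_balance <- glm(default ~ balance, data = Default, family = binomial)
summary(fit_balance)$coefficients
```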
### Making Predictions
Once we have the estimated coefficients, we can plug any value of $X$ into the model and predict the probability of $Y$. Using the bank example, we can predict the probability of `default` for any `balance`; for example, plugging in a balance of 1,000:
$$\hat{p}(X) = \frac{e^{\hat{\beta}_{0} + \hat{\beta}_{1}X}}{1 + e^{\hat{\beta}_{0} + \hat{\beta}_{1}X}} = \frac{e^{-10.6513 + 0.0055 \times 1000}}{1 + e^{-10.6513 + 0.0055 \times 1000}} = 0.00576$$
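
The same calculation can be sketched with `predict()`, reusing the illustrative `fit_balance` object from the chunk above:

```{r}
#| label: predict-default-balance
#| eval: false
# Predicted probability of default for a balance of 1,000
predict(fit_balance, newdata = data.frame(balance = 1000), type = "response")
```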
### Qualitative predictors
Instead of a quantitative predictor like credit balance, we could use a qualitative, or categorical, variable, like whether or not someone is a student, to predict whether or not someone will default.
## Multiple Logistic Regression
The equation for simple logistic regression can be rewritten to include coefficient estimates for $p$ predictors.
$$\log \left(\frac{p(X)}{1 - p(X)}\right) = \beta_{0} + \beta_{1}X_1 + \cdots + \beta_{p}X_p$$
$$p(X) = \frac{e^{\beta_{0} + \beta_{1}X_1 + \cdots + \beta_{p}X_p}}{1 + e^{\beta_{0} + \beta_{1}X_1 + \cdots + \beta_{p}X_p}}$$
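
A minimal sketch of such a fit on the `Default` data (assuming `ISLR2` is loaded as above; `fit_multi` is an illustrative name):

```{r}
#| label: fit-default-multiple
#| eval: false
# Multiple logistic regression with balance, income, and student status
fit_multi <- glm(default ~ balance + income + student,
                 data = Default, family = binomial)
summary(fit_multi)$coefficients
```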
```{r}
#| label: fig4-3
#| echo: false
#| fig-cap: Figure 4.3
#| out-width: 100%
knitr::include_graphics("images/04-fig4_3.png")
```
Confounding in the Default data. Left: Default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of balance, while the horizontal broken lines display the overall default rates. Right: Boxplots of balance for students (orange) and non-students (blue) are shown.
## Multinomial Logistic Regression
- Multinomial logistic regression is used in the setting where there are $K > 2$ classes. We select a single class to serve as the baseline.
- However, the interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline.
- Alternatively, you can use *softmax* coding, where we treat all $K$ classes symmetrically rather than selecting a baseline. This means we estimate coefficients for all $K$ classes, rather than for only $K - 1$ classes.
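
Under softmax coding, the probability of each class $k = 1, \ldots, K$ takes the same form:

$$\Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}$$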
## Generative Models for Classification
> model the distribution of the predictors $X$ separately in each of the response classes (i.e. for each value of $Y$).
**Why not logistic regression?**
- When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable.
- If the distribution of the predictors $X$ is approximately normal in each of the classes and the sample size is small, then generative modelling may be more accurate than logistic regression.
- Generative modelling can be naturally extended to the case of more than two response classes.
$\pi_k$ = the overall or prior probability that a randomly chosen observation comes from the $k$th class.

$f_k(X) \equiv \Pr(X \mid Y = k)$ = the density function of $X$ for an observation that comes from the $k$th class. It is relatively large if there is a high probability that an observation in the $k$th class has $X \approx x$, and $f_k(X)$ is small if it is very unlikely that an observation in the $k$th class has $X \approx x$.

$p_k(X)$ = the posterior probability that an observation belongs to the $k$th class, given the predictor value for that observation.
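
Bayes' theorem combines the prior $\pi_k$ and the density $f_k(x)$ into this posterior probability:

$$p_k(x) = \Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$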
### 3 classifiers to approximate the Bayes classifier
- Linear Discriminant Analysis (LDA) - normal distribution, covariance matrix that is common to all $K$ classes
- Quadratic Discriminant Analysis (QDA) - normal distribution, covariance matrix that is NOT common to all $K$ classes
- Naive Bayes - predictors assumed independent within each class, useful when $n$ is small, no single distributional assumption for the predictors
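
A rough sketch of fitting all three classifiers on the `Default` data, assuming the `MASS` and `e1071` packages are available (object names are illustrative):

```{r}
#| label: generative-classifiers
#| eval: false
library(ISLR2)  # Default data
library(MASS)   # lda(), qda()
library(e1071)  # naiveBayes()

lda_fit <- lda(default ~ balance + student, data = Default)
qda_fit <- qda(default ~ balance + student, data = Default)
nb_fit  <- naiveBayes(default ~ balance + student, data = Default)

# Predicted classes for the training observations
head(predict(lda_fit)$class)
head(predict(qda_fit)$class)
head(predict(nb_fit, Default))
```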
## Comparison of classification methods
### Empirical comparison
The book uses 6 scenarios to empirically compare the performance of logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naive Bayes, and K-nearest neighbors (KNN). Each of the 6 scenarios was a binary classification problem with 2 quantitative predictors.

**Linear Bayes decision boundaries**

**Scenario 1:** 20 training observations for each class, with uncorrelated, normally distributed predictors

**Scenario 2:** Similar to Scenario 1, but the predictors had a correlation of -0.5.

**Scenario 3:** Substantial negative correlation between the predictors, which were generated from the t-distribution, with 50 training observations for each class.
```{r}
#| label: fig4-11
#| echo: false
#| fig-cap: Figure 4.11 Boxplots of the test error rates for each of the linear scenarios described in the main text.
#| out-width: 100%
knitr::include_graphics("images/04-fig4_11.png")
```
**Non-linear Bayes decision boundaries**

**Scenario 4:** Normal distribution, with a correlation of 0.5 between the predictors in the first class and a correlation of -0.5 between the predictors in the second class

**Scenario 5:** Normal distribution with uncorrelated predictors; the responses were sampled from the logistic function applied to a complicated non-linear function of the predictors

**Scenario 6:** Normal distribution with a different diagonal covariance matrix for each class, and a very small sample size of $n = 6$ in each class
```{r}
#| label: fig4-12
#| echo: false
#| fig-cap: Figure 4.12 Boxplots of the test error rates for each of the non-linear scenarios described in the main text.
#| out-width: 100%
knitr::include_graphics("images/04-fig4_12.png")
```
## Poisson Regression
### Why not linear regression?
Similar to the problem of predicting probabilities for qualitative variables, negative predictions for count data are not meaningful.
```{r}
#| label: fig4-13
#| echo: false
#| fig-cap: Figure 4.13
#| out-width: 100%
knitr::include_graphics("images/04-fig4_13.png")
```
178+
179+
Left: The coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: The coefficients associated with the hour of the day. Bike usage is highest during peak commute times, and lowest overnight.
```{r}
#| label: fig4-14
#| echo: false
#| fig-cap: Figure 4.14
#| out-width: 100%
knitr::include_graphics("images/04-fig4_14.png")
```
188+
189+
190+
Left: On the Bikeshare dataset, the number of bikers is displayed on the y-axis, and the hour of the day is displayed on the x-axis. For the most part, as the mean number of bikers increases, so does the variance in the number of bikers, violating the assumption of homoscedasticity. A smoothing spline fit is shown in green. Right: The log of the number of bikers is now displayed on the y-axis.
191+
192+
193+
### Poisson distribution
The **Poisson distribution** is typically used to model counts, which arise when $Y$ is neither strictly quantitative (like sales) nor qualitative (like whether or not someone will default on a loan). In a Poisson distribution, the variance equals the mean.
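
For counts $k = 0, 1, 2, \ldots$ and mean $\lambda = \mathrm{E}(Y) = \mathrm{Var}(Y)$, the Poisson probability mass function is

$$\Pr(Y = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$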
> the larger the mean of $Y$, the larger its variance.
### Poisson regression
> rather than modeling the number of bikers, $Y$, as a Poisson distribution with a fixed mean value like $\lambda$ = 5, we would like to allow the mean to vary as a function of the covariates.
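
Concretely, the mean $\lambda$ is linked to the predictors through a log link:

$$\log\bigl(\lambda(X_1, \ldots, X_p)\bigr) = \beta_{0} + \beta_{1}X_1 + \cdots + \beta_{p}X_p$$

A minimal sketch of such a fit with `glm(family = poisson)`, assuming the `Bikeshare` data from the `ISLR2` package (variable names follow the book's lab; `pois_fit` is an illustrative name):

```{r}
#| label: fit-bikeshare-poisson
#| eval: false
library(ISLR2)

# Poisson regression of hourly bike rentals on month, hour, and weather
pois_fit <- glm(bikers ~ mnth + hr + workingday + temp + weathersit,
                data = Bikeshare, family = poisson)
coef(pois_fit)[1:6]
```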
## Generalized Linear Models
Linear, logistic, and Poisson regression are three examples of a broader class of models known as generalized linear models (GLMs).
Each of them uses predictors $X$ to predict a response $Y$.
> In general, we can perform a regression by modeling the response $Y$ as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors... Any regression approach that follows this very general recipe is known as a **generalized linear model**.
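
As a minimal sketch (with a placeholder data frame `dat`, response `y`, and predictor `x`), all three fits share the `glm()` interface in R and differ only in the `family` argument:

```{r}
#| label: glm-families
#| eval: false
glm(y ~ x, data = dat, family = gaussian)  # linear regression
glm(y ~ x, data = dat, family = binomial)  # logistic regression
glm(y ~ x, data = dat, family = poisson)   # Poisson regression
```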
