# Notes {-}

## What is classification?
> Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class.

### Some questions that can be solved with classification
- A person arrives at the emergency room with a set of symptoms that could possibly be attributed to one of three medical conditions. Which of the three conditions does the individual have?
- An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user’s IP address, past transaction history, and so forth.
- On the basis of DNA sequence data for a number of patients with and without a given disease, a biologist would like to figure out which DNA mutations are deleterious (disease-causing) and which are not.

### Why not linear regression?

- Linear regression should not be used to predict a qualitative (categorical) variable with more than two levels that has no natural ordering.
- Even with a natural ordering, linear regression should not be used unless the gap between each pair of adjacent levels is reasonably similar.
- If the qualitative outcome variable has only two levels, you could recode it as a dummy variable with 0/1 coding and fit a linear regression. This is still not recommended, because some of the probability estimates will likely not be meaningful (for example, they can be negative; see the sketch below).
- A linear regression model of the relationship between `default` and `balance` produces negative probabilities of default for bank balances close to zero.
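
A minimal sketch of this problem (assuming the `Default` data from the `ISLR2` package is available): a linear regression on the 0/1-coded response yields fitted "probabilities" below zero for small balances.

```{r}
#| label: lm-default-sketch
library(ISLR2)

# Recode default as 0/1 and fit a linear regression
lm_fit <- lm(I(default == "Yes") ~ balance, data = Default)

# Fitted values for balances near zero dip below 0
predict(lm_fit, newdata = data.frame(balance = c(0, 500, 1000, 2000)))
```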

Discussion: what are some examples of categorical variables that are not appropriate as outcome variables for linear regression?

## Logistic Regression
> Rather than modeling this response $Y$ directly, logistic regression models the probability that $Y$ belongs to a particular category.

```{r}
#| label: fig4-2
#| echo: false
#| fig-cap: Figure 4.2
#| out-width: 100%
knitr::include_graphics("images/04-fig4_2.png")
```

Classification using the `Default` data. Left: Estimated probability of `default` using linear regression. Some estimated probabilities are negative! The orange ticks indicate the 0/1 values coded for `default` (`No` or `Yes`). Right: Predicted probabilities of `default` using logistic regression. All probabilities lie between 0 and 1.

- Logistic regression uses the logistic function, which gives probability estimates between 0 and 1 for all values of $X$.
- The logistic function always produces an S-shaped curve: probabilities come close to, but never below, zero and close to, but never above, one.
- The probability of the response $Y$ can be predicted for any value of $X$.
- In linear regression, ${\beta_1}$ is the average change in $Y$ associated with a one-unit increase in $X$. By contrast, in a logistic regression model, increasing $X$ by one unit changes the log odds by ${\beta_1}$.
- Regardless of the value of $X$, if ${\beta_1}$ is positive then increasing $X$ is associated with increasing $p(X)$, and if ${\beta_1}$ is negative then increasing $X$ is associated with decreasing $p(X)$.


$$p(X) = \beta_0 + \beta_1 X \quad \Longrightarrow \quad \text{Linear regression}$$
$$p(X) = \frac{e^{\beta_{0} + \beta_{1}X}}{1 + e^{\beta_{0} + \beta_{1}X}} \quad \Longrightarrow \quad \text{Logistic function}$$
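
Rearranging the logistic function gives the log odds, or logit, which is linear in $X$ and is what ${\beta_1}$ acts on:

$$\log \left(\frac{p(X)}{1 - p(X)}\right) = \beta_{0} + \beta_{1}X$$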


### Maximum Likelihood
Using maximum likelihood, the regression coefficients are chosen so that the predicted probability is as close as possible to the observed response $Y$ for each case in the training data.

> The estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ are chosen to maximize this likelihood function.
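
The likelihood function being maximized multiplies the predicted probability of the observed class over all training observations:

$$\ell(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i': y_{i'} = 0} \bigl(1 - p(x_{i'})\bigr)$$
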
```{r}
#| label: tab4-1
#| echo: false
#| fig-cap: For the `Default` data, estimated coefficients of the logistic regression model that predicts the probability of `default` using `balance`. A one-unit increase in `balance` is associated with an increase in the log odds of `default` by 0.0055 units.
#| out-width: 100%
knitr::include_graphics("images/04-tab4_1.png")
```

### Making Predictions

Once we have our estimated coefficients, we can plug any value of $X$ into the model and predict the probability of $Y$. Using the bank example, we can predict `default` for any `balance`.


$$\hat{p}(X) = \frac{e^{\hat{\beta}_{0} + \hat{\beta}_{1}X}}{1 + e^{\hat{\beta}_{0} + \hat{\beta}_{1}X}}
= \frac{e^{-10.6513 + 0.0055 \times 1000}}{1 + e^{-10.6513 + 0.0055 \times 1000}}
= 0.00576$$

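A short sketch of the same calculation using `glm()` (again assuming the `ISLR2` package provides `Default`):

```{r}
#| label: glm-default-predict
library(ISLR2)

# Fit the logistic regression from Table 4.1
glm_fit <- glm(default ~ balance, data = Default, family = binomial)

# Predicted probability of default for a balance of $1,000
predict(glm_fit, newdata = data.frame(balance = 1000), type = "response")
```
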
### Qualitative predictors
Instead of a quantitative predictor like credit balance, we could use a qualitative, or categorical, variable, like whether or not someone is a student, to predict whether or not someone will default.
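
The book does this with a dummy variable for `student`; in R the formula interface handles the 0/1 coding automatically (a sketch, with the same `ISLR2` assumption as above):

```{r}
#| label: glm-default-student
library(ISLR2)

# Logistic regression with a single qualitative predictor
glm_student <- glm(default ~ student, data = Default, family = binomial)
coef(summary(glm_student))
```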

## Multiple Logistic Regression
The equation for simple logistic regression can be rewritten to include coefficient estimates for $p$ predictors:

$$\log \left(\frac{p(X)}{1- p(X)}\right) = \beta_{0} + \beta_{1}X_1 + \ldots + \beta_{p}X_p$$
$$p(X) = \frac{e^{\beta_{0} + \beta_{1}X_1 + \ldots + \beta_{p}X_p}}{1 + e^{\beta_{0} + \beta_{1}X_1 + \ldots + \beta_{p}X_p}}$$
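
A sketch of the corresponding fit on the `Default` data, using the book's three predictors (`balance`, `income`, `student`):

```{r}
#| label: glm-default-multi
library(ISLR2)

# Multiple logistic regression with all three predictors
glm_multi <- glm(default ~ balance + income + student,
                 data = Default, family = binomial)
coef(summary(glm_multi))
```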

```{r}
#| label: fig4-3
#| echo: false
#| fig-cap: Figure 4.3
#| out-width: 100%
knitr::include_graphics("images/04-fig4_3.png")
```

Confounding in the `Default` data. Left: Default rates are shown for students (orange) and non-students (blue). The solid lines display default rate as a function of `balance`, while the horizontal broken lines display the overall default rates. Right: Boxplots of `balance` for students (orange) and non-students (blue) are shown.

## Multinomial Logistic Regression
- This is used in the setting of $K > 2$ classes. In multinomial logistic regression, we select a single class to serve as the baseline (see the sketch after this list).

- However, the interpretation of the coefficients in a multinomial logistic regression model must be done with care, since it is tied to the choice of baseline.

- Alternatively, you can use *softmax* coding, which treats all $K$ classes symmetrically rather than selecting a baseline. This means we estimate coefficients for all $K$ classes, rather than estimating coefficients for $K - 1$ classes.
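
The chapter does not fit one at this point; as a hypothetical sketch, `nnet::multinom()` fits a baseline-coded multinomial logistic regression, shown here on the built-in three-class `iris` data:

```{r}
#| label: multinom-sketch
library(nnet)

# Multinomial logistic regression; the first factor level ("setosa")
# serves as the baseline class
multi_fit <- multinom(Species ~ Sepal.Length + Sepal.Width,
                      data = iris, trace = FALSE)
summary(multi_fit)$coefficients
```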

## Generative Models for Classification
> model the distribution of the predictors $X$ separately in each of the response classes (i.e. for each value of $Y$).


**Why not logistic regression?**

- When there is substantial separation between the two classes, the parameter estimates for the logistic regression model are surprisingly unstable.

- If the distribution of the predictors $X$ is approximately normal in each of the classes and the sample size is small, then generative modeling may be more accurate than logistic regression.

- Generative modeling can be naturally extended to the case of more than two response classes.

$\pi_k$ = the overall, or prior, probability that a randomly chosen observation comes from the $k$th class.

$f_k(X) \equiv \Pr(X \mid Y = k)$ = the density function of $X$ for an observation that comes from the $k$th class. It is relatively large if there is a high probability that an observation in the $k$th class has $X \approx x$, and $f_k(X)$ is small if it is very unlikely that an observation in the $k$th class has $X \approx x$.

$p_k(X)$ = the posterior probability that an observation belongs to the $k$th class, given the predictor value for that observation.
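
These quantities combine via Bayes' theorem to give the posterior probability:

$$p_k(X) = \Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$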

### 3 classifiers to approximate the Bayes classifier

- Linear Discriminant Analysis (LDA): normal distribution of the predictors, with a covariance matrix that is common to all $K$ classes
- Quadratic Discriminant Analysis (QDA): normal distribution of the predictors, with a covariance matrix that is NOT common to all $K$ classes
- Naive Bayes: predictors assumed independent within each class, useful when $n$ is small, no single distributional assumption (a sketch of all three fits follows this list)
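
A minimal sketch fitting all three on the `Default` data (assuming the `ISLR2`, `MASS`, and `e1071` packages are installed):

```{r}
#| label: generative-sketch
library(ISLR2)
library(MASS)   # lda(), qda()
library(e1071)  # naiveBayes()

# The same formula for all three generative classifiers
lda_fit <- lda(default ~ balance + student, data = Default)
qda_fit <- qda(default ~ balance + student, data = Default)
nb_fit  <- naiveBayes(default ~ balance + student, data = Default)

# Posterior class probabilities for a hypothetical new applicant
new_obs <- data.frame(balance = 1500, student = "Yes")
predict(lda_fit, new_obs)$posterior
```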


## Comparison of classification methods

### Empirical comparison

The book uses six simulated scenarios to empirically compare the performance of logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naive Bayes, and K-nearest neighbors (KNN). Each of the six scenarios is a binary classification problem with two quantitative predictors.

**Linear Bayes decision boundaries**

**Scenario 1:** 20 training observations in each class; uncorrelated, normally distributed predictors.

**Scenario 2:** Similar to Scenario 1, but the predictors had a correlation of -0.5.

**Scenario 3:** Substantial negative correlation, with predictors generated from the t-distribution and 50 training observations in each class.


```{r}
#| label: fig4-11
#| echo: false
#| fig-cap: Figure 4.11 Boxplots of the test error rates for each of the linear scenarios described in the main text.
#| out-width: 100%
knitr::include_graphics("images/04-fig4_11.png")
```
| 145 | + |
| 146 | + |
| 147 | +Non-linear Bayes decision boundaries |
| 148 | + |
| 149 | +**Scenario 4:** normal distribution, correlation of 0.5 between the predictors in the first class, and correlation of 0.5 between the predictors in the second class |
| 150 | + |
| 151 | +**Scenario 5:** normal distribution with uncorrelated predictors, responses were sampled from the logistic function applied to a complicated non-linear function of the predictors |
| 152 | + |
| 153 | +**Scenario 6:** normal distribution with a different diagonal covariance matrix for each class and a very small sample size of 6 in each class |
| 154 | + |
| 155 | + |
| 156 | +```{r} |
| 157 | +#| label: fig4-12 |
| 158 | +#| echo: false |
| 159 | +#| fig-cap: Figure 4.12 Boxplots of the test error rates for each of the non-linear scenarios described in the main text. |
| 160 | +#| out-width: 100% |
| 161 | +knitr::include_graphics("images/04-fig4_12.png") |
| 162 | +``` |


## Poisson Regression

### Why not linear regression?
As with predicting the probability of a qualitative variable, negative predictions of count data are not meaningful.

```{r}
#| label: fig4-13
#| echo: false
#| fig-cap: Figure 4.13
#| out-width: 100%
knitr::include_graphics("images/04-fig4_13.png")
```

Left: The coefficients associated with the month of the year. Bike usage is highest in the spring and fall, and lowest in the winter. Right: The coefficients associated with the hour of the day. Bike usage is highest during peak commute times, and lowest overnight.

```{r}
#| label: fig4-14
#| echo: false
#| fig-cap: Figure 4.14
#| out-width: 100%
knitr::include_graphics("images/04-fig4_14.png")
```


Left: On the `Bikeshare` dataset, the number of bikers is displayed on the y-axis, and the hour of the day is displayed on the x-axis. For the most part, as the mean number of bikers increases, so does the variance in the number of bikers, violating the assumption of homoscedasticity. A smoothing spline fit is shown in green. Right: The log of the number of bikers is now displayed on the y-axis.


### Poisson distribution
The **Poisson distribution** is typically used to model counts, which arise when $Y$ is neither quantitative (like sales) nor qualitative (like whether or not someone will default on a loan). In a Poisson distribution, the variance equals the mean.

> the larger the mean of $Y$, the larger its variance.

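Concretely, if $Y$ follows a Poisson distribution with mean $\lambda > 0$, then

$$\Pr(Y = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}, \quad k = 0, 1, 2, \ldots$$

and it can be shown that $\mathrm{E}(Y) = \mathrm{Var}(Y) = \lambda$, which is exactly the mean-variance link described in the quote above.
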
### Poisson regression

> rather than modeling the number of bikers, $Y$, as a Poisson distribution with a fixed mean value like $\lambda = 5$, we would like to allow the mean to vary as a function of the covariates.

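This is done with a log link, so that $\log(\lambda(X_1, \ldots, X_p)) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$. A sketch along the lines of the book's `Bikeshare` model (assuming the `ISLR2` package is installed):

```{r}
#| label: poisson-bikeshare-sketch
library(ISLR2)

# Poisson regression: the log of the mean biker count is linear
# in the covariates
pois_fit <- glm(bikers ~ mnth + hr + workingday + temp + weathersit,
                data = Bikeshare, family = poisson)
head(coef(pois_fit))
```
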
## Generalized Linear Models

Linear, logistic, and Poisson regression are three examples of the class of models known as generalized linear models (GLMs).

Each of them uses predictors $X$ to predict a response $Y$.

> In general, we can perform a regression by modeling the response $Y$ as coming from a particular member of the exponential family, and then transforming the mean of the response so that the transformed mean is a linear function of the predictors... Any regression approach that follows this very general recipe is known as a **generalized linear model**.
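
In R, this shared recipe shows up directly in the `family` argument of `glm()`. A sketch, where `y` and `x` stand in for any response and predictor:

```{r}
#| label: glm-families-sketch
#| eval: false
# Three members of the exponential family, one fitting function
glm(y ~ x, family = gaussian)  # linear regression
glm(y ~ x, family = binomial)  # logistic regression
glm(y ~ x, family = poisson)   # Poisson regression
```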