Change subscript of X_{ik} to X_{ki} to make it consistent with lhs. #142

Open · wants to merge 2 commits into base: master
332 changes: 166 additions & 166 deletions 07_RegressionModels/02_01_multivariate/index.Rmd
---
title : Multivariable regression
subtitle :
author : Brian Caffo, Roger Peng and Jeff Leek
job : Johns Hopkins Bloomberg School of Public Health
logo : bloomberg_shield.png
framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js # {highlight.js, prettify, highlight}
hitheme : tomorrow #
url:
lib: ../../librariesNew
assets: ../../assets
widgets : [mathjax] # {mathjax, quiz, bootstrap}
mode : selfcontained # {standalone, draft}
---
```{r setup, cache = F, echo = F, message = F, warning = F, tidy = F, results='hide'}
# make this an external chunk that can be included in any file
library(knitr)  # provides opts_chunk and knit_hooks used below
options(width = 100)
opts_chunk$set(message = F, error = F, warning = F, comment = NA, fig.align = 'center', dpi = 100, tidy = F, cache.path = '.cache/', fig.path = 'fig/')
options(xtable.type = 'html')
# round numeric inline results; collapse character vectors to comma-separated strings
knit_hooks$set(inline = function(x) {
  if (is.numeric(x)) {
    round(x, getOption('digits'))
  } else {
    paste(as.character(x), collapse = ', ')
  }
})
knit_hooks$set(plot = knitr:::hook_plot_html)
runif(1)
```
## Multivariable regression analyses
* If I were to present evidence of a relationship between
breath mint usage (mints per day, X) and pulmonary function
(measured in FEV), you would be skeptical.
* Likely, you would say, 'Smokers tend to use more breath mints than non-smokers, and smoking is related to a loss in pulmonary function. That's probably the culprit.'
* If asked what would convince you, you would likely say, 'If non-smoking breath mint users had lower lung function than non-smoking non-breath-mint users and, similarly, if smoking breath mint users had lower lung function than smoking non-breath-mint users, I'd be more inclined to believe you.'
* In other words, to even consider my results, I would have to demonstrate that they hold with smoking status held fixed, as the toy simulation below illustrates.
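A toy simulation of this story (everything here, including the variable names `smoker`, `mints`, and `fev` and the numbers behind them, is made up purely for illustration) shows how adjusting for smoking can dissolve a spurious marginal association:
```{r}
n <- 1000
smoker <- rbinom(n, 1, 0.5)
mints  <- 2 + 3 * smoker + rnorm(n)           # smokers use more mints
fev    <- 4 - 1 * smoker + rnorm(n, sd = .5)  # only smoking lowers FEV
coef(lm(fev ~ mints))           # unadjusted: mints appear harmful
coef(lm(fev ~ mints + smoker))  # adjusted: the mint coefficient is near zero
```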

---
## Multivariable regression analyses
* An insurance company is interested in how last year's claims can predict a person's time in the hospital this year.
* They want to use an enormous amount of data contained in claims to predict a single number. Simple linear regression is not equipped to handle more than one predictor.
* How can one generalize SLR to incorporate lots of regressors for
the purpose of prediction?
* What are the consequences of adding lots of regressors?
* Surely there must be consequences to throwing in variables that aren't related to Y?
* Surely there must be consequences to omitting variables that are?

---
## The linear model
* The general linear model extends simple linear regression (SLR)
by adding terms linearly into the model.
$$
Y_i = \beta_1 X_{1i} + \beta_2 X_{2i} + \ldots +
\beta_{p} X_{pi} + \epsilon_{i}
= \sum_{k=1}^p X_{ki} \beta_k + \epsilon_{i}
$$
* Here $X_{1i}=1$ typically, so that an intercept is included.
* Least squares (and hence ML estimates under iid Gaussianity
of the errors) minimizes
$$
\sum_{i=1}^n \left(Y_i - \sum_{k=1}^p X_{ki} \beta_k\right)^2
$$
* Note that the important linearity is linearity in the coefficients.
Thus
$$
Y_i = \beta_1 X_{1i}^2 + \beta_2 X_{2i}^2 + \ldots +
\beta_{p} X_{pi}^2 + \epsilon_{i}
$$
is still a linear model. (We've just squared the elements of the
predictor variables.)
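As a quick check of this point (a minimal sketch on simulated data; `x1`, `x2`, and `y` are made-up names), `lm()` fits a model that is quadratic in the predictors but linear in the coefficients without any special handling:
```{r}
n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1^2 - 3 * x2^2 + rnorm(n, sd = .1)
# I() protects the squaring inside the formula; the fit is still linear in the betas
coef(lm(y ~ I(x1^2) + I(x2^2)))
```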

---
## How to get estimates
* Recall that the LS estimate for regression through the origin, $E[Y_i]=X_{1i}\beta_1$, was $\sum X_{1i} Y_i / \sum X_{1i}^2$.
* Let's consider two regressors, $E[Y_i] = X_{1i}\beta_1 + X_{2i}\beta_2 = \mu_i$.
* Least squares tries to minimize
$$
\sum_{i=1}^n (Y_i - X_{1i} \beta_1 - X_{2i} \beta_2)^2
$$
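A small sanity check of the first bullet (a sketch on simulated data; `x` and `y` are illustrative names): the closed-form estimate matches `lm()` with the intercept suppressed.
```{r}
# regression through the origin: closed form vs. lm() without an intercept
n <- 50; x <- rnorm(n); y <- 2 * x + rnorm(n)
c(closed_form = sum(x * y) / sum(x^2),
  lm_fit = unname(coef(lm(y ~ x - 1))))
```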

---
## Result
$$\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2}$$
* That is, the regression estimate for $\beta_1$ is the regression
through the origin estimate having regressed $X_2$ out of both
the response and the predictor.
* (Similarly, the regression estimate for $\beta_2$ is the regression through the origin estimate having regressed $X_1$ out of both the response and the predictor.)
* More generally, a multivariable regression estimate is exactly the estimate obtained after removing the linear relationship of the other variables
from both the regressor and the response.

---
## Example with two variables, simple linear regression
* $Y_{i} = \beta_1 X_{1i} + \beta_2 X_{2i}$ where $X_{2i} = 1$ is an intercept term.
* Notice the fitted coefficient from regressing $Y_{i}$ on $X_{2i}$ alone is $\bar Y$
* The residuals are $e_{i, Y | X_2} = Y_i - \bar Y$
* Notice the fitted coefficient from regressing $X_{1i}$ on $X_{2i}$ alone is $\bar X_1$
* The residuals are $e_{i, X_1 | X_2} = X_{1i} - \bar X_1$
* Thus
$$
\hat \beta_1 = \frac{\sum_{i=1}^n e_{i, Y | X_2} e_{i, X_1 | X_2}}{\sum_{i=1}^n e_{i, X_1 | X_2}^2} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}
= Cor(X, Y) \frac{Sd(Y)}{Sd(X)}
$$
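A quick numerical check of this identity (a sketch on simulated data; `x` and `y` are made-up names):
```{r}
# slope from lm() vs. the Cor(X, Y) * Sd(Y) / Sd(X) identity
n <- 100; x <- rnorm(n); y <- 0.5 + 1.5 * x + rnorm(n)
c(lm_slope = unname(coef(lm(y ~ x))[2]),
  identity = cor(x, y) * sd(y) / sd(x))
```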

---
## The general case
* Least squares solutions have to minimize
$$
\sum_{i=1}^n (Y_i - X_{1i}\beta_1 - \ldots - X_{pi}\beta_p)^2
$$
* The least squares estimate for a coefficient in a multivariable regression model is exactly a regression-through-the-origin estimate, with the linear relationships with the other regressors removed from both the regressor and the outcome by taking residuals.
* In this sense, multivariable regression "adjusts" a coefficient for the linear impact of the other variables.

---
## Demonstration that it works using an example
### Linear model with two variables
```{r}
# simulate a response that depends on three regressors
n = 100; x = rnorm(n); x2 = rnorm(n); x3 = rnorm(n)
y = 1 + x + x2 + x3 + rnorm(n, sd = .1)
# residuals of y and of x after removing the linear effect of x2 and x3
ey = resid(lm(y ~ x2 + x3))
ex = resid(lm(x ~ x2 + x3))
# regression through the origin on those residuals ...
sum(ey * ex) / sum(ex ^ 2)
coef(lm(ey ~ ex - 1))
# ... matches the coefficient of x from the full fit
coef(lm(y ~ x + x2 + x3))
```

---
## Interpretation of the coefficients
$$E[Y | X_1 = x_1, \ldots, X_p = x_p] = \sum_{k=1}^p x_{k} \beta_k$$

$$
E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] = (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k
$$

$$
E[Y | X_1 = x_1 + 1, \ldots, X_p = x_p] - E[Y | X_1 = x_1, \ldots, X_p = x_p]$$
$$= (x_1 + 1) \beta_1 + \sum_{k=2}^p x_{k} \beta_k - \sum_{k=1}^p x_{k} \beta_k = \beta_1 $$
So the interpretation of a multivariable regression coefficient is the expected change in the response per unit change in the regressor, holding all of the other regressors fixed.
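A small numerical illustration of this (simulated data; the data frame `dat` and its variables are made-up names): raising $x_1$ by one unit while holding the other regressor fixed changes the prediction by exactly the fitted coefficient.
```{r}
n <- 100; dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)
new0 <- data.frame(x1 = 0, x2 = 1); new1 <- data.frame(x1 = 1, x2 = 1)
# the prediction difference equals the fitted coefficient on x1
c(prediction_diff = unname(predict(fit, newdata = new1) - predict(fit, newdata = new0)),
  coefficient = unname(coef(fit)["x1"]))
```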

In the next lecture, we'll do examples and go over context-specific
interpretations.

---
## Fitted values, residuals and residual variation
All of our SLR quantities can be extended to linear models; a short `lm()` sketch follows this list.
* Model $Y_i = \sum_{k=1}^p X_{ki} \beta_{k} + \epsilon_{i}$ where $\epsilon_i \sim N(0, \sigma^2)$
* Fitted responses $\hat Y_i = \sum_{k=1}^p X_{ki} \hat \beta_{k}$
* Residuals $e_i = Y_i - \hat Y_i$
* Variance estimate $\hat \sigma^2 = \frac{1}{n-p} \sum_{i=1}^n e_i ^2$
* To get predicted responses at new values, $x_1, \ldots, x_p$, simply plug them into the linear model $\sum_{k=1}^p x_{k} \hat \beta_{k}$
* Coefficients have standard errors, $\hat \sigma_{\hat \beta_k}$, and
$\frac{\hat \beta_k - \beta_k}{\hat \sigma_{\hat \beta_k}}$
follows a $T$ distribution with $n-p$ degrees of freedom.
* Predicted responses have standard errors, so we can calculate both confidence intervals for the expected response and prediction intervals for new responses.
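Here is a short sketch (on simulated data; all object names below are illustrative) of how each quantity above can be pulled from a fitted `lm` object:
```{r}
n <- 100; x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + 2 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
yhat <- fitted(fit)                         # fitted responses
e <- resid(fit)                             # residuals
s_hat <- sqrt(sum(e^2) / df.residual(fit))  # hat sigma, using the n - p residual df
coef(summary(fit))                          # estimates, standard errors, t statistics
newdat <- data.frame(x1 = 0, x2 = 0)
predict(fit, newdata = newdat, interval = "confidence")  # interval for the expected response
predict(fit, newdata = newdat, interval = "prediction")  # interval for a new response
```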

---
## Linear models
* Linear models are the single most important applied statistical and machine learning technique, *by far*.
* Some amazing things that you can accomplish with linear models:
  * Decompose a signal into its harmonics (see the sketch after this list).
  * Flexibly fit complicated functions.
  * Fit factor variables as predictors.
  * Uncover complex multivariate relationships with the response.
  * Build accurate prediction models.
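As one example of the first item (a minimal sketch; the frequencies, amplitudes, and variable names are made up for illustration), `lm()` recovers the harmonics of a noisy periodic signal by regressing on sine and cosine terms:
```{r}
tm <- seq(0, 1, length.out = 200)
signal <- 2 * sin(2 * pi * 3 * tm) + 0.5 * cos(2 * pi * 6 * tm) + rnorm(200, sd = .2)
# no intercept: the signal is modeled purely by its two harmonic terms
fit <- lm(signal ~ sin(2 * pi * 3 * tm) + cos(2 * pi * 6 * tm) - 1)
coef(fit)  # approximately the true amplitudes 2 and 0.5
```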
