
Commit aff69d9

Author: Andrew G. Dunn (committed)
Lecture on Estimation and Inference for Ordinary Least Squares Regression
1 parent 8dd3b93 commit aff69d9

6 files changed, +239 −5 lines changed


Makefile

+1 −1
@@ -5,7 +5,7 @@ study: study.md study.bib Makefile
	--bibliography=study.bib \
	--csl=templates/acm-siggraph.csl \
	--template=templates/compact.latex \
-	-V geometry:margin=0.7in,nohead,nofoot
+	-V geometry:margin=0.7in

clean:
	rm study.pdf

assignment/1/assignment1.md

+29
@@ -251,6 +251,35 @@ proc sgscatter data=ames;
![Five variables with LOESS Overlay](images/5varwloess.png "Five variables with LOESS Overlay")

# Investigate Potential Categorical Predictor Variables, with respect to Sale Price

We run the Pearson correlation on the categorical variables from the data set.
Using the data dictionary and the CONTENTS procedure, we initially limit the
categorical variables that we're looking at (to exclude arbitrary identifiers
and nominal variables with fewer than 10 levels):

SubClass Zoning Street Alley LotShape LandContour Utilities LotConfig
LandSlope Neighborhood

~~~{.fortran}
proc corr data=ames nosimple rank;
var MasVnrArea BsmtFinSF1 BsmtUnfSF TotalBsmtSF FirstFlrSF GrLivArea GarageArea;
with SalePrice;
run;
~~~

We get results for the variables as follows:

| Variable | Pearson Correlation Coefficient | Prob > $|r|$ under $H_0$: $\rho$=0 | Number of Observations |
|:-:|:-:|:-:|:-:|
| GrLivArea | 0.70678 | <.0001 | 2930 |
| GarageArea | 0.64040 | <.0001 | 2929 |
| TotalBsmtSF | 0.63228 | <.0001 | 2929 |
| FirstFlrSF | 0.62168 | <.0001 | 2930 |
| MasVnrArea | 0.50828 | <.0001 | 2907 |
| BsmtFinSF1 | 0.43291 | <.0001 | 2929 |
| BsmtUnfSF | 0.18286 | <.0001 | 2929 |

# Conclusion / Reflection

The Exploratory Data Analysis that we've done indicates that there are some

assignment/1/assignment1.sas

+2
@@ -6,6 +6,8 @@ libname mydata '/scs/crb519/PREDICT_410/SAS_Data/' access=readonly;
data ames;
	set mydata.ames_housing_data;

+proc
+
/*
* initial examination of the correlation to saleprice;
proc corr data=ames nosimple;

study.bib

+33 −1
@@ -1,11 +1,27 @@
-@online{bhatti2011,
+@online{bhattiprelim,
  author = {Dr. Chad Bhatti},
  title = {Statistical Preliminaries and Mathematical Notation},
  year = 2011,
  url = {http://nwuniversity.adobeconnect.com/p2u4z1zop3a/},
  urldate = {2015-04-31}
}

@online{bhattiolsassums,
  author = {Dr. Chad Bhatti},
  title = {Statistical Assumptions for Ordinary Least Squares Regression},
  year = 2011,
  url = {http://nwuniversity.adobeconnect.com/p14gl2rughs/},
  urldate = {2015-05-10}
}

@online{bhattiestimols,
  author = {Dr. Chad Bhatti},
  title = {Estimation and Inference for Ordinary Least Squares Regression},
  year = 2011,
  url = {http://nwuniversity.adobeconnect.com/p9rbxcf6431/},
  urldate = {2015-05-10}
}

@book{montgomery2012introduction,
  title={Introduction to linear regression analysis},
  author={Montgomery, Douglas C and Peck, Elizabeth A and Vining, G Geoffrey},
@@ -21,3 +37,19 @@ @misc{ wiki:regressionanalysis
  url = "\url{http://en.wikipedia.org/w/index.php?title=Regression_analysis&oldid=647603059}",
  note = "[Online; accessed 1-April-2015]"
}

@misc{ wiki:homoscedasticity,
  author = "Wikipedia",
  title = "Homoscedasticity --- Wikipedia{,} The Free Encyclopedia",
  year = "2015",
  url = "\url{http://en.wikipedia.org/w/index.php?title=Homoscedasticity&oldid=650997386}",
  note = "[Online; accessed 10-April-2015]"
}

@misc{ wiki:qrdecomposition,
  author = "Wikipedia",
  title = "QR decomposition --- Wikipedia{,} The Free Encyclopedia",
  year = "2015",
  url = "\url{http://en.wikipedia.org/w/index.php?title=QR_decomposition&oldid=655839188}",
  note = "[Online; accessed 10-April-2015]"
}

study.md

+174 −3
@@ -1,14 +1,13 @@
-study
+Study
=====

My study notes will draw heavily from the required texts and multimedia. I will
also draw from external sources that I find to be adept at explaining a
particular topic. If something is referenced here, it is because I found it to
be very useful in understanding a topic.

# Standard Mathematical and Statistical Notation
-Notes below are from the following sources; [@bhatti2011].
+Notes below are from the following sources: [@bhattiprelim].

## Vector and Matrix Notation

@@ -83,6 +82,175 @@ Let $X$ and $Y$ be random variables with a joint distribution function. (In the

Here the reader should note that in general $\mathrm{Cov}[aX+b,cY+d] = ac\mathrm{Cov}[X,Y]$. If $X$ and $Y$ are independent random variables, then $\mathrm{Cov}[X,Y] = 0$. The converse of this statement is not true except when both $X$ and $Y$ are normally distributed. In general $\mathrm{Cov}[X,Y] = 0$ does not imply that $X$ and $Y$ are independent random variables.

\newpage

# Statistical Assumptions for Ordinary Least Squares Regression
Notes below are from the following sources: [@bhattiolsassums].

- In Ordinary Least Squares (OLS) regression we wish to model a continuous random variable $Y$ (the response variable) given a set of _predictor variables_ $X_1, X_2, \ldots, X_k$.

- While we require that the response variable $Y$ be continuous, or approximately continuous, the _predictor variables_ $X_1, X_2, \ldots, X_k$ can be either continuous or discrete.

- It is fairly standard notation to reserve $k$ for the number of predictor variables in the regression model, and $p$ for the number of parameters (regression coefficients or $\beta$s).

- When formulating a regression model, we want to explain the variation in the response variable by the variation in the predictor variables.

## Statistical Assumptions for OLS Regression

There are two primary assumptions for OLS regression (a small simulation sketch follows the list):

1. The regression model can be expressed in the form $$Y = \beta_0 + \beta_1X_1 + \ldots + \beta_kX_k + \epsilon$$ Notice that the model formulation specifies the error term $\epsilon$ to be additive, and that the model parameters ($\beta$s) enter the model linearly, that is, $\beta_i$ represents the change in $Y$ for a one unit increase in $X_i$ when $X_i$ is a continuous predictor variable. Any statistical model in which the parameters enter the model linearly is referred to as a _linear model_.

2. The response variable $Y$ is assumed to come from an independent and identically distributed (iid) random sample from a $N(\mathbf{X\beta},\sigma^2)$ distribution, where the variance $\sigma^2$ is a fixed but unknown quantity. The statistical notation for this assumption is $Y \sim N(\mathbf{X\beta},\sigma^2)$.

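To make these two assumptions concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture; the coefficients, sample size, and error variance are arbitrary) that simulates a response satisfying both the additive linear form and the iid normal error assumption:

~~~{.python}
import numpy as np

rng = np.random.default_rng(410)

n = 200                                        # sample size (arbitrary)
beta = np.array([10.0, 2.5, -1.0])             # beta_0, beta_1, beta_2 (arbitrary)
sigma = 3.0                                    # fixed but, in practice, unknown

X = np.column_stack([np.ones(n),               # intercept column
                     rng.uniform(0, 10, n),    # X_1: continuous predictor
                     rng.integers(0, 2, n)])   # X_2: discrete predictor

eps = rng.normal(0.0, sigma, n)                # additive iid N(0, sigma^2) errors
Y = X @ beta + eps                             # so Y ~ N(X beta, sigma^2)
~~~
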
## Linear Versus Nonlinear Regression

Remember that a _linear model_ is linear in the parameters, not the predictor variables.

- The following regression models are all linear regression models (see the sketch after this list): $$Y = \beta_0 + \beta_1X_1+\beta_2X_1^2 + \epsilon$$ $$Y = \beta_0 + \beta_1\ln(X_1) + \epsilon$$

- The following regression models are all nonlinear regression models: $$Y = \beta_0\exp(\beta_1X_1) + \epsilon$$ $$Y = \beta_0 + \beta_2\sin(\beta_1X_1) + \epsilon$$

- If you know a little calculus, then there is an easy mathematical definition of a nonlinear regression model: in a nonlinear regression model at least one of the partial derivatives with respect to the parameters will itself depend on a model parameter.

- Any quantity that has a $\beta$ in front of it counts as a degree of freedom used, and consequently counts as a predictor variable.

- A hint for identifying a nonlinear model is that a parameter appears inside a function, specifically a nonlinear function.

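As a quick numerical illustration (my own sketch, with arbitrary simulated values), a model containing $X_1^2$ is still a linear model: the squared term is just one more column of the design matrix and the fit is ordinary least squares.

~~~{.python}
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 100)
y = 4.0 + 1.5 * x - 0.2 * x ** 2 + rng.normal(0, 1.0, 100)

# Y = beta_0 + beta_1*X_1 + beta_2*X_1^2 + eps: the parameters enter linearly,
# so the squared predictor is simply another column of the design matrix.
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # estimates of beta_0, beta_1, beta_2
~~~

A model such as $Y = \beta_0\exp(\beta_1X_1) + \epsilon$ cannot be rewritten this way, which is what makes it nonlinear.
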
## Distributional Assumptions for OLS Regression

The assumption $Y \sim N(\mathbf{X\beta},\sigma^2)$ can also be presented in terms of the error term $\epsilon$. Most introductory books present the distributional assumption in terms of the error term $\epsilon$, but more advanced books will use the standard Generalized Linear Model (GLM) presentation in terms of the response variable $Y$.

In terms of the error term $\epsilon$ the distributional assumption can also be presented as:

- The error term $\epsilon \sim N(0,\sigma^2)$. Since $Y \sim N(\mathbf{X\beta},\sigma^2)$, $\epsilon = Y - \mathbf{X\beta}$ has a $N(0,\sigma^2)$ distribution.

## Distributional Assumptions in Terms of the Error

1. The errors are normally distributed.
2. The errors are mean zero.
3. The errors are independent and identically distributed (iid).
4. The errors are _homoscedastic_, i.e. they all have the same finite variance; their spread does not change "in time or space".

When we build statistical models, we will check the assumptions about the errors by assessing the model _residuals_, which are our estimates of the error term.

_Homoscedasticity_: a sequence or vector of random variables is _homoscedastic_ if all random variables in the sequence or vector have the same finite variance. This is also known as _homogeneity of variance_. [@wiki:homoscedasticity]

You'd need a fairly gross violation of homoscedasticity for it to matter in the kinds of problems that we work with today.

## Further Notation and Details

When we estimate an OLS regression model, we will be working with a random sample of response variables $Y_1, Y_2, \ldots, Y_n$, each with a vector of predictor variables $[X_{1i}, X_{2i},\ldots,X_{ki}]$. In matrix notation we will denote the regression problem by $$Y_{(n \times 1)} = X_{(n \times p)}\beta_{(p \times 1)} + \epsilon_{(n \times 1)}$$ where the matrix size is denoted by the subscript. Note that $X = [1, X_1, X_2, \ldots, X_k]$ and $\beta = [\beta_0, \beta_1, \beta_2, \ldots, \beta_k]$.

- When we want to express the regression in terms of a single observation, we typically use the $i$ subscript notation $$Y_i = \mathbf{X_i\beta} + \epsilon_i$$ or simply $$Y_i = \beta_0 + \beta_1X_{1i} + \ldots + \beta_kX_{ki} + \epsilon_i$$

\newpage

# Estimation and Inference for Ordinary Least Squares Regression
Notes below are from the following sources: [@bhattiestimols].

It's important to understand some aspects of estimation and inference for every statistical method that is used.

## Estimation - Simple Linear Regression

- A _simple linear regression_ is the special case of an OLS regression model with a single predictor variable. $$Y = \beta_0 + \beta_1X + \epsilon$$

- For the $i$th observation we will denote the regression model by $$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$$

- For the random sample $Y_1, Y_2, \ldots, Y_n$ we can estimate the parameters $\beta_0$ and $\beta_1$ by minimizing the sum of the squared errors, $$\min\sum_{i=1}^{n}\epsilon_i^2$$ which is equivalent to minimizing $$\min\sum_{i=1}^{n}(Y_i - \beta_0 - \beta_1X_i)^2$$

## Estimators and Estimates for Simple Linear Regression

- The estimators for $\beta_0$ and $\beta_1$ can be computed analytically and are given by $$\hat{\beta_1} = \frac{\sum(Y_i - \bar{Y})(X_i - \bar{X})}{\sum(X_i - \bar{X})^2} = \frac{\mathrm{Cov}(Y,X)}{\mathrm{Var}(X)}$$ and $$\hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X}$$

- The regression line always goes through the centroid $(\bar{X},\bar{Y})$.

- We refer to the formulas for $\hat{\beta_0}$ and $\hat{\beta_1}$ as estimators and the values that these formulas take for a given random sample as the estimates.

- In statistics we put hats on all estimators and estimates.

- Given $\hat{\beta_0}$ and $\hat{\beta_1}$ the predicted value or fitted value is given by $$\hat{Y} = \hat{\beta_0} + \hat{\beta_1}X$$ (a small numerical sketch of these formulas follows this list).

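Here is a minimal numerical check of these formulas (my own sketch; the data are simulated with arbitrary true values):

~~~{.python}
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 50)     # true beta_0 = 3, beta_1 = 2

# beta_1_hat = sum((Y_i - Ybar)(X_i - Xbar)) / sum((X_i - Xbar)^2) = Cov(Y,X)/Var(X)
beta1_hat = np.sum((y - y.mean()) * (x - x.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()
y_hat = beta0_hat + beta1_hat * x              # fitted values

print(beta0_hat, beta1_hat)
# the fitted line passes through the centroid (Xbar, Ybar)
print(np.isclose(beta0_hat + beta1_hat * x.mean(), y.mean()))
~~~
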
## Estimation - The General Case

- We seldom build regression models with a single predictor variable. Typically we have multiple predictor variables denoted by $X_1, X_2, \ldots, X_k$, and hence the standard regression case is sometimes referred to as _multiple regression_ in introductory regression texts.

- We can still think about the estimation of $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ in the same manner as the simple linear regression case $$\min\sum_{i=1}^n(Y_i - \beta_0 - \beta_1X_{1i} - \beta_2X_{2i} - \ldots - \beta_kX_{ki})^2$$ but the computations will be performed as matrix computations.

## General Estimation - Matrix Notation

Before we set up the matrix formulation for the OLS model, let's begin by defining some matrix notation.

- The error vector $\epsilon = [\epsilon_1, \ldots, \epsilon_n]^T$.
- The response vector $Y = [Y_1, \ldots, Y_n]^T$.
- The design matrix or predictor matrix $X = [1, X_1, X_2, \ldots, X_k]$.
- The parameter vector $\beta = [\beta_0, \beta_1, \beta_2, \ldots, \beta_k]^T$.

- All vectors are column vectors, and the superscript $T$ denotes the vector or matrix _transpose_.

## General Estimation - Matrix Computations

- We minimize the sum of squared errors by minimizing $S(\beta) = \epsilon^T\epsilon$, which can be re-expressed as $$S(\beta) = (Y - X\beta)^T(Y - X\beta)$$

- Taking the matrix derivative of $S(\beta)$, we get $$S_\beta(\hat{\beta}) = -2X^TY + 2X^TX\hat{\beta}$$

- Setting the matrix derivative to zero, we can write the least squares _normal equations_ $$X^TX\hat{\beta} = X^TY$$ which yield the estimator $$\hat{\beta} = (X^TX)^{-1}X^TY$$

- The estimator form $\hat{\beta} = (X^TX)^{-1}X^TY$ assumes that the inverse matrix $(X^TX)^{-1}$ exists and can be computed. In practice your statistical software will directly solve the normal equations using a QR factorization (a small sketch comparing the two routes follows the definitions below).

_normal equations_: they express the orthogonal projection of $Y$ onto the column space of $X$ (a projection of a linear space onto a subspace), which ensures that a solution exists.

_QR Factorization_: the QR decomposition of a matrix is a decomposition of a matrix $A$ into a product $A = QR$ of an orthogonal matrix $Q$ and an upper triangular matrix $R$ [@wiki:qrdecomposition].

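A minimal NumPy sketch (my own illustration, with arbitrary simulated data) of the two routes to $\hat{\beta}$ described above: solving the normal equations directly, and using a QR factorization as statistical software typically does.

~~~{.python}
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # design matrix [1, X_1, X_2]
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.5, n)

# Route 1: solve the normal equations X^T X beta = X^T Y
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Route 2: factor X = QR, then solve the triangular system R beta = Q^T Y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ Y)

print(np.allclose(beta_normal, beta_qr))   # True: both give the OLS estimator
~~~
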
## Statistical Inference with the t-Test

- In OLS regression the statistical inference for the individual regression coefficients can be performed using a t-test.

_t-test_: any statistical test that uses a t-statistic to derive the test and the p-value for the test. Alternatively, any statistical test that uses a t-statistic as the decision variable.

_statistical test_: a statistical test has a null and an alternative hypothesis, and a test statistic with a known distribution.

- When performing a t-test there are three primary components: (1) stating the null and alternative hypotheses, (2) computing the value of the test statistic, and (3) deriving a statistical conclusion based on a desired significance level.

- Step 1: The null and alternative hypotheses for $\beta_i$ are given by $$H_0:\beta_i = 0 \text{ versus } H_1:\beta_i \neq 0$$

- Step 2: The t statistic for $\beta_i$ is computed by $$t_i = \frac{\hat{\beta_i}}{SE(\hat{\beta_i})}$$ and has degrees of freedom equal to the sample size minus the number of model parameters, i.e. $df = n - \dim(\text{model})$. For example, if you had a regression model with two predictor variables and an intercept estimated on a sample of size 50, then the t statistic would have 47 degrees of freedom.

- Step 3: Reject $H_0$ or fail to reject $H_0$ based on the value of your t statistic and your significance level. This decision can be made by using the p-value of your t statistic or by using the critical value for your significance level. (A small numerical sketch of the three steps follows.)

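The three steps can be carried out numerically as follows (my own sketch, reusing the simulated-data setup from the previous sketch; `scipy` is assumed to be available for the t distribution, and the standard error formula is the one detailed in the Further Notation and Details section below):

~~~{.python}
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # [1, X_1, X_2]
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.5, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
df = n - X.shape[1]                            # n minus the number of parameters
sigma2_hat = np.sum((Y - X @ beta_hat) ** 2) / df
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

# Step 1: H0: beta_i = 0 versus H1: beta_i != 0, for each coefficient
t_stats = beta_hat / se                        # Step 2: t_i = beta_i_hat / SE(beta_i_hat)
p_values = 2 * stats.t.sf(np.abs(t_stats), df)

alpha = 0.05                                   # Step 3: compare p-values to alpha
print(t_stats, p_values, p_values < alpha)
~~~
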
## Confidence Intervals for Parameter Estimates

An alternative to performing a formal hypothesis test is to use a confidence interval for your parameter estimate. There is a duality between confidence intervals and formal hypothesis testing for regression parameters.

- The confidence interval for $\hat{\beta_i}$ is given by $$\hat{\beta_i} \pm t(df,\frac{\alpha}{2}) \times SE(\hat{\beta_i})$$ where $t(df,\frac{\alpha}{2})$ is a t value from a theoretical t distribution, not a t statistic value.

- If the confidence interval does not contain zero, then this is equivalent to rejecting the null hypothesis $H_0:\beta_i = 0$ (see the short sketch below).

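A short sketch of the interval and the duality (my own illustration; the estimate, standard error, and degrees of freedom below are hypothetical values, as if taken from a fitted model):

~~~{.python}
from scipy import stats

# hypothetical values for illustration only
beta_i_hat, se_i, df, alpha = 2.03, 0.11, 97, 0.05

t_crit = stats.t.ppf(1 - alpha / 2, df)        # t(df, alpha/2) from the t distribution
ci = (beta_i_hat - t_crit * se_i, beta_i_hat + t_crit * se_i)
print(ci)   # if the interval excludes zero, H0: beta_i = 0 is rejected at level alpha
~~~
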
## Statistical Intervals for Predicted Values

The phrase _predicted value_ is used in statistics to refer to the in-sample _fitted values_ from the estimated model or to refer to the out-of-sample _forecasted values_. The dual use of this phrase can be confusing. A better habit is to use the phrase _in-sample fitted values_ and the phrase _out-of-sample predicted values_ to clearly reference these different values.

_Inference_ is an in-sample activity, measuring the quality of the model based on in-sample performance. _Predictive modeling_ is an out-of-sample activity, measuring the quality of the model based on out-of-sample performance.

- Given $\hat{\beta} = (X^TX)^{-1}X^TY$ the vector of fitted values can be computed by $\hat{Y} = X\hat{\beta} = HY$, where $H = X(X^TX)^{-1}X^T$. The matrix $H$ is called the _hat matrix_ since it puts the hat on $Y$.

- The point estimate $\hat{Y_0}$ at the point $x_0$ can be computed by $\hat{Y_0} = x_0^T\hat{\beta}$.

- The confidence interval for an in-sample point $x_0$ on the estimated regression function is given by $$x_0^T\hat{\beta} \pm \hat{\sigma} \sqrt{x_0^T(X^TX)^{-1}x_0}$$

- The prediction interval for the point estimator $\hat{Y_0}$ for an out-of-sample $x_0$ is given by $$x_0^T\hat{\beta} \pm \hat{\sigma} \sqrt{1 + x_0^T(X^TX)^{-1}x_0}$$

- Note that the out-of-sample prediction interval is always wider than the in-sample confidence interval (a small sketch follows this list).

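A small sketch of these quantities (my own illustration, with arbitrary simulated data and a hypothetical new point `x0`; the notes above give the half-widths in units of $\hat{\sigma}$, and the comment marks where the usual $t(df, \frac{\alpha}{2})$ multiplier would enter for a $(1-\alpha)$ interval):

~~~{.python}
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.5, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
H = X @ XtX_inv @ X.T                          # hat matrix: it puts the hat on Y
print(np.allclose(X @ beta_hat, H @ Y))        # True: fitted values equal H Y

sigma_hat = np.sqrt(np.sum((Y - X @ beta_hat) ** 2) / (n - X.shape[1]))
x0 = np.array([1.0, 0.3, -1.2])                # hypothetical new point [1, x_1, x_2]
y0_hat = x0 @ beta_hat                         # point estimate x_0^T beta_hat

ci_half = sigma_hat * np.sqrt(x0 @ XtX_inv @ x0)       # in-sample confidence half-width
pi_half = sigma_hat * np.sqrt(1 + x0 @ XtX_inv @ x0)   # out-of-sample prediction half-width
# multiply each half-width by t(df, alpha/2) for a (1 - alpha) interval
print(y0_hat, ci_half, pi_half, pi_half > ci_half)     # the prediction interval is wider
~~~
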
## Further Notation and Details

In order to compute the t statistic you need the standard error of the parameter estimate. Most statistical software packages will provide this estimate and compute the t statistic for you. However, it is always a good idea to know where this number comes from. Here are the details needed to compute the standard error for $\hat{\beta_i}$.

- The estimated parameter vector $\hat{\beta}$ has the covariance matrix given by $$\mathrm{Cov}(\hat{\beta}) = \hat{\sigma}^2(X^TX)^{-1}$$ where $$\hat{\sigma}^2 = \frac{SSE}{n - k - 1}$$

- The variance of $\hat{\beta_i}$ is the $i$th diagonal element of the covariance matrix, $$\mathrm{Var}(\hat{\beta_i}) = \hat{\sigma}^2\left[(X^TX)^{-1}\right]_{ii}$$ and $SE(\hat{\beta_i})$ is its square root (a small check against a library implementation follows).

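As a sanity check (my own sketch; `statsmodels` is assumed to be installed), the standard errors, t statistics, p-values, and confidence intervals computed by hand above can be compared against a library fit on the same simulated data:

~~~{.python}
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # includes the intercept column
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(0, 0.5, n)

fit = sm.OLS(Y, X).fit()
print(fit.params)      # beta_hat
print(fit.bse)         # SE(beta_i_hat): sqrt of the diagonal of sigma2_hat * (X^T X)^{-1}
print(fit.tvalues)     # t_i = beta_i_hat / SE(beta_i_hat)
print(fit.pvalues)     # two-sided p-values for H0: beta_i = 0
print(fit.conf_int())  # beta_i_hat +/- t(df, alpha/2) * SE(beta_i_hat)
~~~
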
\newpage

# Study Questions for Ordinary Least Squares Regression

@@ -230,6 +398,8 @@ __Question__: Variable Selection: How does forward variable selection work? How
does backward variable selection work? How does stepwise variable selection
work?

\newpage

# Study Questions for Multivariate Analysis

## Principal Components Analysis
@@ -276,4 +446,5 @@ __Question__: Do the data need to be treated before we perform a cluster
analysis?

\newpage

# References

study.pdf

45.5 KB
Binary file not shown.
