We see that the data contain information on life expectancy (lifeExp), population (pop), and gross domestic product per capita (gdpPercap, a rough measure of economic wealth) for many countries across many years. A very naive working hypothesis that you may come to is that life expectancy grew with time. This would be represented in R with `lifeExp ~ year`.

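As a small aside, here is a sketch of what such a formula is in R: an object that names the response on the left of `~` and the predictor on the right. The variable name `life_formula` is only for illustration and is not used elsewhere in the tutorial.

```r
# A formula is stored without looking up the variables it mentions,
# so this works even before any data set is loaded.
life_formula <- lifeExp ~ year

class(life_formula)     # "formula"
all.vars(life_formula)  # "lifeExp" "year": response first, then predictor
```

The formula only becomes tied to actual numbers when it is handed to a function together with a data set.
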
Let's explore this hypothesis graphically. Using the `gapminder` data set, create a scatterplot with `year` on the x-axis and `lifeExp` on the y-axis. Remember to create human-readable labels!

```{r gapminder-plot, exercise = TRUE}
# to customize the labels, add `+` after geom_point() and uncomment the last line
<dataset> %>%
  ggplot(aes(x = <x-axis>, y = <y-axis>)) +
  geom_point()
  #labs(x = "x-axis", y = "y-axis")
```

```{r gapminder-plot-solution}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point()
```

```{r setup-gapminder-plot, include = FALSE}
library(gapminder)
```

Although there is very high variance, we do see a certain trend, with mean life expectancy increasing over time. Similarly, we can naively hypothesize that life expectancy is higher where the per-capita GDP is higher. In R, this is `lifeExp ~ gdpPercap`.

Again, let's explore this hypothesis graphically. Using the `gapminder` data set, create a scatterplot with `gdpPercap` on the x-axis and `lifeExp` on the y-axis. Have the x-axis label be "Per-capita GDP" and the y-axis label be "Life expectancy (yrs)".

```{r gapminder-plot2, exercise = TRUE}
# fill in the placeholders
<dataset> %>%
  ggplot(aes(x = <x-axis>, y = <y-axis>)) +
  <geom> +
  labs(x = <x-axis label>, y = <y-axis label>)
```

```{r gapminder-plot2-solution}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  # gdpPercap is on the x-axis and lifeExp on the y-axis
  labs(x = "Per-capita GDP", y = "Life expectancy (yrs)")
```

```{r setup-gapminder-plot2, include = FALSE}
library(gapminder)
```

### Linear models
A **linear regression** model describes the change of a *dependent variable*, say `lifeExp`, as a *linear* function of one or more *explanatory* variables, say `year`. This means that increasing the variable `year` by $x$ will have an effect of $\beta \cdot x$ on the *dependent* variable `lifeExp`, whatever the value of $x$. In mathematical terms:
\[
\mbox{lifeExp} = \alpha + \beta \cdot \mbox{year}
\]

We call $\alpha$ the intercept of the model: the value of `lifeExp` when `year` is equal to zero. When we go forward in time, increasing `year`, `lifeExp` increases (if $\beta$ is positive; otherwise it decreases):
\[
\alpha + \beta \cdot \left(\mbox{year} + x \right) = \alpha + \beta \cdot \mbox{year} + \beta \cdot x = \mbox{lifeExp} + \beta \cdot x
\]

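The algebra above can be checked numerically with a minimal sketch. The values of `alpha` and `beta` below are hypothetical round numbers chosen for illustration; they are not estimates from the `gapminder` data, and the helper `predict_life_exp()` is not part of any package.

```r
# Hypothetical coefficients, chosen only to illustrate the algebra;
# they are NOT estimates from the gapminder data.
alpha <- -585
beta  <- 0.33

# The linear model written as a function of year
predict_life_exp <- function(year) alpha + beta * year

x <- 10

# Moving forward by x years changes the prediction by beta * x,
# whatever the starting year is
diff_1960 <- predict_life_exp(1960 + x) - predict_life_exp(1960)
diff_2000 <- predict_life_exp(2000 + x) - predict_life_exp(2000)

c(diff_1960, diff_2000, beta * x)  # all three agree up to rounding
```
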
### Key assumptions
A number of assumptions must be satisfied for a linear model to be reliable. These are:

* the predictor variables should be measured with not too much error (**weak exogeneity**)
* the variance of the response variable should be roughly the same across the range of its predictors (**homoscedasticity**, a fancy-pants word for "constant variance")
* the discrepancies between observed and predicted values should be **independent**
* the predictors themselves should be **non-collinear** (a rather technical issue, arising from the way we solve the model, that may happen when two predictors are perfectly correlated or when we try to estimate the effect of too many predictors with too little data).

Here, we only mention these assumptions; for more details, take a look at [Wikipedia](https://en.wikipedia.org/wiki/Linear_regression#Assumptions).

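To make the last assumption more tangible, here is a small sketch of what happens when it is violated. The data frame `toy` is made up for this illustration: its second predictor is an exact multiple of the first, so the two are perfectly correlated.

```r
# Made-up toy data: x2 is an exact multiple of x1 (perfect collinearity)
set.seed(1)
toy <- data.frame(x1 = 1:10)
toy$x2 <- 2 * toy$x1
toy$y  <- 3 + 0.5 * toy$x1 + rnorm(10, sd = 0.2)

cor(toy$x1, toy$x2)  # perfect correlation

# lm() cannot separate the two effects and reports NA for the second one
fit_collinear <- lm(y ~ x1 + x2, data = toy)
coef(fit_collinear)
```

The `NA` is R's way of saying that the effect of `x2` cannot be estimated separately from that of `x1`.
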
When we have only one predictive variable (what is called a _simple_ linear regression model), the formula we just introduced describes a straight line. The task of a linear regression method is identifying the _best fitting_ slope and intercept of that straight line. But what does _best fitting_ mean in this context? We will first adopt a heuristic definition of it but will rigorously define it later on.

Let's consider a bunch of straight lines in our first plot:

**So, which line best describes the data? To determine this, we must fit a linear model to the data.**

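One common way to compare candidate lines, sketched here under the assumption that "best" means "smallest sum of squared vertical distances" (the least-squares criterion that `lm()` uses), is to compute that sum for each line. The data frame `toy` and the helper `rss()` are made up for this illustration.

```r
# Made-up toy data, for illustration only
set.seed(42)
toy <- data.frame(x = 1:20)
toy$y <- 2 + 0.5 * toy$x + rnorm(20)

# Residual sum of squares of the line y = intercept + slope * x
rss <- function(intercept, slope) {
  sum((toy$y - (intercept + slope * toy$x))^2)
}

# A few hand-picked candidate lines
rss(0, 0.5)
rss(2, 0.3)
rss(4, 0.4)

# The line found by lm() has the smallest RSS of all straight lines
fit <- lm(y ~ x, data = toy)
best_rss <- rss(coef(fit)[["(Intercept)"]], coef(fit)[["x"]])
best_rss
```

No hand-picked line can do better than the fitted one on this criterion, which is why we let R search for the slope and intercept.
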
### Simple linear regression
To obtain the slope and intercept of the green line, we can use the built-in R function `lm()`. This function works very similarly to the `aov()` function we used earlier for our ANOVA model.

Let's explore our earlier hypothesis that life expectancy grew with time. Create an `lm` model object called `lifeExp_model1` using the `gapminder` data set.

```{r gapminder-model1, exercise = TRUE}
<object_name> <- lm(lifeExp ~ year,
                    data = <data_set>)
```

```{r gapminder-model1-solution}
lifeExp_model1 <- lm(lifeExp ~ year,
                     data = gapminder)
```

```{r setup-gapminder-model1, include = FALSE}
library(gapminder)
```

Now, use the function `summary()` to see all the relevant results.

```{r gapminder-model1-sum, exercise = TRUE}
summary(<object_name>)
```

```{r gapminder-model1-sum-solution}
summary(lifeExp_model1)
```

```{r setup-gapminder-model1-sum, include = FALSE}
# load the data before fitting, so this chunk works on its own
library(gapminder)
lifeExp_model1 <- lm(lifeExp ~ year,
                     data = gapminder)
```

Now, however, we are interested in more than just the p-values.

The `Estimate` values are the best fit for the intercept, $\alpha$, and the slope, $\beta$. The slope, the parameter that links `year` to `lifeExp`, is a positive value: with every passing year, life expectancy increases by $`r summary(lifeExp_model1)$coefficients[2]`$ years. This is in line with our hypothesis. Moreover, its p-value, the probability of finding a relationship at least as strong between the predictor and the response variable if there were in fact none, is rather low at $`r summary(lifeExp_model1)$coefficients[8]`$ (but see [this](https://backyardbrains.com/experiments/p-value) for a cautionary tale about p-values!).

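The positions `[2]` and `[8]` used above refer to cells of the coefficient table. As a sketch of a more readable alternative, the same numbers can be pulled out by row and column name; the names `coef_table`, `slope` and `p_value` are chosen here for illustration.

```r
library(gapminder)

lifeExp_model1 <- lm(lifeExp ~ year, data = gapminder)

# The coefficient table is a matrix with one row per parameter and
# the columns Estimate, Std. Error, t value and Pr(>|t|)
coef_table <- coef(summary(lifeExp_model1))
coef_table

# Indexing by name gives the same cells as positions 2 and 8
slope   <- coef_table["year", "Estimate"]
p_value <- coef_table["year", "Pr(>|t|)"]
```
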
Now, using the slope and intercept, we can plot the best fit line on our data. We can use the `geom_smooth()` function to do this.

Here are the key arguments:

* `method` is the smoothing method to be used. Possible values include `lm`, `glm`, `gam`, `loess`, `rlm`.
    * `lm` fits a linear model (this is the one we will be using for this example)
* `se` is a boolean value. If set to `TRUE`, the confidence interval will be displayed.
* `color`, `size`, `linetype` change the line color, size and type
* `fill` changes the fill color of the confidence region

For this example, we will be adding a best fit line to our previous plot of life expectancy over time. Use the `geom_smooth()` function to create the line. Set `method` to `lm`, with no confidence intervals, and make the line green.

```{r gapminder-line1, exercise = TRUE}
# play around with the key arguments!
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point() +
  <add_geom>(method = <method>, se = <se>, color = <color>)
```

```{r gapminder-line1-solution}
gapminder %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green")
```

```{r setup-gapminder-line1, include = FALSE}
library(gapminder)
```

Another important bit of information in our results is the R-squared values, both the _Multiple_ and the _Adjusted R-squared_. These tell us how much of the _variance_ in the life expectancy data is explained by the year. In this case, not much (`r summary(lifeExp_model1)$r.squared` and `r summary(lifeExp_model1)$adj.r.squared`, respectively).

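To see where these two numbers come from, here is a sketch that recomputes them by hand using the standard definitions. The variable names are chosen for illustration, and `p` is the number of predictors (one, since `year` is the only one).

```r
library(gapminder)

lifeExp_model1 <- lm(lifeExp ~ year, data = gapminder)

# Variation left unexplained by the model
rss <- sum(residuals(lifeExp_model1)^2)
# Total variation of life expectancy around its mean
tss <- sum((gapminder$lifeExp - mean(gapminder$lifeExp))^2)

# Multiple R-squared: the share of the total variation that is explained
r_squared <- 1 - rss / tss

# Adjusted R-squared: penalizes for the number of predictors p
n <- nrow(gapminder)
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

c(r_squared, adj_r_squared)
```
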
### Residuals
We can further explore our linear model by plotting some diagnostic plots. Base R provides a quick and easy way to view all of these plots at once with `plot()`.

```{r gapminder-res, exercise = TRUE}
# Set plot frame to 2 by 2
par(mfrow = c(2, 2))
# Create diagnostic plots
plot(<model_object>)
```

```{r gapminder-res-solution}
par(mfrow = c(2, 2))
plot(lifeExp_model1)
```

```{r setup-gapminder-res, include = FALSE}
# load the data before fitting, so this chunk works on its own
library(gapminder)
lifeExp_model1 <- lm(lifeExp ~ year,
                     data = gapminder)
```

Whoa, right!? Let's break it down. Overall, these diagnostic plots are useful for understanding the model *residuals*. The residuals are the discrepancies between the life expectancy predicted by the model and the observed values in the available data. In other words, they are the vertical distances between the straight line and the actual data points. In a linear regression model, the residuals are what we try to make as small as possible when we fit a straight line (more precisely, `lm()` minimizes the sum of their squares).

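As a quick check of this definition, the following sketch recomputes the residuals by hand and compares them with what R stores in the model object. The name `manual_residuals` is chosen for illustration.

```r
library(gapminder)

lifeExp_model1 <- lm(lifeExp ~ year, data = gapminder)

# Residual = observed value - value predicted by the fitted line
manual_residuals <- gapminder$lifeExp - fitted(lifeExp_model1)

# Same numbers as the residuals stored in the model
all.equal(unname(manual_residuals), unname(residuals(lifeExp_model1)))

# The quantity that the least-squares fit makes as small as possible
sum(residuals(lifeExp_model1)^2)
```
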
There is a lot of information contained in these 4 plots, and you can find in-depth explanations [here](https://data.library.virginia.edu/diagnostic-plots/). For our purposes today, let's focus on just the **Residuals vs Fitted** and **Normal Q-Q** plots.

The **Residuals vs Fitted** plot shows the differences between the best fit line and all the available data points. When the model is a good fit for the data, this plot should have no discernible pattern. That is, the red line should *not* form a shape like an 'S' or a parabola. Another way to look at it is that the points should look like 'stars in the sky', *i.e.* random. This second description is not great for these data since year is an integer (whole number), but we do see that the red line is relatively straight and without pattern.

The **Normal Q-Q plot** compares the distribution of the standardized residuals with the normal distribution they should follow if the model is appropriate. For a good model, the points adhere closely to the dotted line, and points that fall off the line do not portray any pattern. In our case, this plot indicates that this simple linear model may not be the best fit for these data. Notice how either end deviates more and more from the line and the plot forms somewhat of an 'S' pattern.

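For readers who want to see what goes into that plot, here is a sketch that draws a similar Q-Q plot by hand with base R. It may differ in cosmetic details from the panel produced by `plot()`; the name `std_res` is chosen for illustration.

```r
library(gapminder)

lifeExp_model1 <- lm(lifeExp ~ year, data = gapminder)

# Standardized residuals: residuals rescaled to have roughly unit variance
std_res <- rstandard(lifeExp_model1)

# Sample quantiles of the residuals against theoretical normal quantiles
qqnorm(std_res)
# Reference line that the points would follow if they were normal
qqline(std_res, lty = 2)
```
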
These ends are particularly important in a linear model. Because we have chosen to use a simple linear model, *outliers* (observed values that are very far away from the best fit line) are very important. They can have a high *leverage* (see the fourth diagnostic plot). This is especially true if the outliers are at the edge of the predictor's range, such as we see in our Q-Q plot.
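
Leverage can also be inspected directly. The sketch below uses the base R function `hatvalues()` and averages the leverage within each year; the names `leverage` and `lev_by_year` are chosen for illustration.

```r
library(gapminder)

lifeExp_model1 <- lm(lifeExp ~ year, data = gapminder)

# Leverage of every observation (diagonal of the hat matrix)
leverage <- hatvalues(lifeExp_model1)

# Mean leverage per year: largest at both ends of the time range
lev_by_year <- tapply(leverage, gapminder$year, mean)
lev_by_year
```

With a single predictor, leverage depends only on how far an observation's `year` is from the mean year, so points in the first and last survey years can pull the line the most.
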
Fit a linear model of life expectancy as a function of per-capita GDP from the `gapminder` data set. Use the plot you previously created to help you out.

* Create a model variable called `lm_exercise`
* Use the `summary()` function to examine the output.
* Use the `plot()` function to create the diagnostic plots

Do you think this is a good fit for these data?

**Note: Hints will be provided for exercises but no solution. Try to figure it out!**