Commit: switched to bookdown

benwhalley committed Jun 21, 2017
1 parent 18afca3 commit 236393e
Showing 174 changed files with 39,257 additions and 2,476 deletions.
51 changes: 51 additions & 0 deletions _bookdown.yml
@@ -0,0 +1,51 @@
book_filename: "Just enough R"
new_session: yes
output_dir: "docs"

rmd_files: [
  "index.Rmd",
  "start_here.Rmd",
  "datasets.Rmd",
  "working-with-dataframes.Rmd",
  "summarising-data.Rmd",
  "graphics.Rmd",
  "linear-models-simple.Rmd",

  "real-data.Rmd",
  "correlations.Rmd",
  "t-tests.Rmd",
  "anova.Rmd",
  "understanding-interactions.Rmd",
  "predictions-and-margins.Rmd",
  "mediation.Rmd",
  "multilevel-models.Rmd",
  "cfa.Rmd",

  "installation.Rmd",
  "packages.Rmd",
  "help.Rmd",
  "missing-values.Rmd",
  "rmarkdown-tricks.Rmd",
  "confidence-vs-prediction-intervals.Rmd"
]

14 changes: 14 additions & 0 deletions _output.yml
@@ -0,0 +1,14 @@
bookdown::gitbook:
  css: style.css
  config:
    toc:
      before: |
        <li><a href="./">Just enough R</a></li>
    download: ["pdf", "epub"]
bookdown::pdf_book:
  includes:
    in_header: preamble.tex
  latex_engine: xelatex
  citation_package: natbib
  keep_tex: yes
bookdown::epub_book: default
32 changes: 32 additions & 0 deletions anova.Rmd
@@ -0,0 +1,32 @@
---
title: 'Anova'
output:
  bookdown::tufte_html2
---




# Anova

```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE)
library(tidyverse)
library(ez)
```




## Anova for between subjects designs

TODO
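
*A placeholder sketch until this section is written: a factorial between-subjects Anova with `ez`. The `id` column is invented here (each row of `warpbreaks` is treated as one subject) because `ezANOVA()` requires a case identifier:*

```{r, eval=FALSE}
# A hedged sketch, not the final content for this section.
# warpbreaks is a built-in factorial dataset; ezANOVA() needs a `wid`
# (case identifier) column, so we add a fake one (one row = one subject).
wb <- warpbreaks
wb$id <- factor(seq_len(nrow(wb)))
ezANOVA(data = wb, dv = .(breaks), wid = .(id), between = .(wool, tension))
```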


## Repeated measures

TODO
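
*Similarly, a placeholder sketch for a repeated measures Anova, using the `ANT` example dataset that ships with `ez` (assuming its usual `rt`, `subnum`, `cue` and `flank` columns):*

```{r, eval=FALSE}
# A hedged sketch: cue and flank are within-subject factors and subnum
# identifies each participant; ezANOVA() collapses over trials itself.
data(ANT, package = "ez")
ezANOVA(data = ANT, dv = .(rt), wid = .(subnum), within = .(cue, flank))
```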




20 changes: 20 additions & 0 deletions bibliography.bib
@@ -0,0 +1,20 @@
@article{newman2012bar,
title={Bar graphs depicting averages are perceptually misinterpreted: The within-the-bar bias},
author={Newman, George E and Scholl, Brian J},
journal={Psychonomic bulletin \& review},
volume={19},
number={4},
pages={601--607},
year={2012},
publisher={Springer}
}
@article{kockelman2002driver,
title={Driver injury severity: an application of ordered probit models},
author={Kockelman, Kara Maria and Kweon, Young-Jun},
journal={Accident Analysis \& Prevention},
volume={34},
number={3},
pages={313--321},
year={2002},
publisher={Elsevier}
}
13 changes: 7 additions & 6 deletions cfa.Rmd
@@ -5,14 +5,15 @@ output:
---


```{r setup, include=FALSE}
# ignore all this for the moment
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE)
library(tufte)
library(tidyverse)
```


# Confirmatory factor analysis

*This section adapted from a guide originally produced by Jon May*


@@ -36,7 +37,7 @@ hz %>% glimpse()
```


-# Defining the model
+## Defining the model

Define the CFA model you want to run using 'lavaan model syntax'^[A full guide is here: <http://lavaan.ugent.be/tutorial/syntax1.html>]. For example:
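
*(The file's own example is collapsed in this diff. As a sketch, a three-factor measurement model in lavaan syntax, assuming the classic Holzinger-Swineford indicators `x1` to `x9`, would look like this:)*

```{r, eval=FALSE}
# Illustrative only: latent factors on the left of =~, indicators on the right
hz.model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
```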

@@ -79,7 +80,7 @@ The output has three parts:



-# Model fit
+## Model fit

To examine the model fit:

@@ -91,7 +92,7 @@ fitmeasures(hz.fit, c('cfi', 'rmsea', 'rmsea.ci.upper', 'bic'))
This looks OK, but could be improved.


-# Modification indices
+## Modification indices

To examine the modification indices (here sorted to see the largest first) we type:

@@ -123,7 +124,7 @@ fitmeasures(hz.fit.2, c('cfi', 'rmsea', 'rmsea.ci.upper', 'bic'))
RMSEA has improved marginally, but we'd probably want to investigate this model further, and make additional improvements to it.


-# Missing data
+## Missing data

If you have missing data, add the argument `missing='fiml'` to the `cfa()` function to use full information maximum likelihood estimation (the robust `estimator='MLM'` requires complete data). There are no missing data in this dataset, but it would look like this:
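
*A sketch, reusing the `hz.model` and `hz` objects from above (and assuming `lavaan` is loaded):*

```{r, eval=FALSE}
# Full information maximum likelihood handles the missing values:
hz.fit.fiml <- cfa(hz.model, data = hz, missing = 'fiml')
summary(hz.fit.fiml, fit.measures = TRUE)
```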

251 changes: 0 additions & 251 deletions cfa.html

This file was deleted.

79 changes: 79 additions & 0 deletions confidence-vs-prediction-intervals.Rmd
@@ -0,0 +1,79 @@
---
title: 'Confidence, credible and prediction intervals'
output:
  bookdown::tufte_html2
bibliography: bibliography.bib
---

```{r setup}
library(tidyverse)
```


# Confidence, credible and prediction intervals


TODO: EXPAND ON THESE DEFINITIONS AND USE GRAPHICS AND PLOTS TO ILLUSTRATE

*Confidence interval*: The range within which we would expect the true value to fall, 95% of the time, if we replicated the study.

*Prediction interval*: The range within which we expect 95% of new observations to fall. If we're considering the prediction interval for a specific point prediction (i.e. where we set predictors to specific values), then this interval would be for new observations *with the same predictor values*.
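
A minimal sketch of the difference, using `predict()` on a simple linear model (the model and data here are chosen purely for illustration):

```{r}
fit <- lm(Ozone ~ Temp, data = airquality)
new_obs <- data.frame(Temp = 80)
predict(fit, new_obs, interval = "confidence")  # range for the *mean* response
predict(fit, new_obs, interval = "prediction")  # range for a *new observation*
```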


## The problem with confidence intervals

Confidence intervals are helpful when we want to think about how *precise our estimate* is. For example, in an RCT we will want to estimate the difference between treatment groups, and it's reasonable to want to know the range within which the true effect would fall 95% of the time if we replicated our study many times.

If we run a study with small N, intuitively we know that we have less information about the difference between our RCT treatments, and so we'd like the CI to expand accordingly.

*So — all things being equal — the confidence interval reduces as our sample size increases.*
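
A quick numerical illustration (a sketch using the normal approximation, with an assumed population SD of 10):

```{r}
tibble(n = c(10, 40, 160, 640)) %>%
  mutate(ci_half_width = 1.96 * 10 / sqrt(n))  # halves as n quadruples
```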


The problem with confidence intervals comes because many researchers and clinicians read them incorrectly. Typically, they either:

- Forget that the CI represents the *precision of the estimate*
- Misinterpret the CI as the range in which we are 95% sure the true value lies.


## Forgetting that the CI depends on sample size

By forgetting that the CI contracts as the sample size increases, researchers can become overconfident about their ability to predict new observations. Imagine that we sample data from two populations with the same mean, but different variability:

```{r}
set.seed(1234)
# simulate two populations with the same mean (100) but different SDs (v);
# sampling v from c(1, 3, 3, 3) makes the high variability group larger,
# but both groups are big enough to compare
df <- expand.grid(v = c(1, 3, 3, 3), i = 1:1000) %>%
  mutate(y = rnorm(n(), 100, v)) %>%
  mutate(samp = factor(v, labels = c("Low variability", "High variability")))
```


```{r}
df %>%
  ggplot(aes(y)) +
  geom_histogram() +
  facet_grid(~samp)
```


If we sampled 100 individuals from each population, the confidence interval around the sample mean would be wider in the high variability group.

If we increase our sample size we would become more confident about the location of the mean, and this confidence interval would shrink.

But imagine taking a single *new sample* from either population. These new observations would be extra points added to the histograms above. It does not matter how much extra data we have collected, or how sure we are about each group's mean: *we would always be less certain when making predictions for new observations in the high variability group*.

The important insight here is that *if our data are noisy and highly variable we can never make firm predictions for new individuals, even if we collect so much data that we are very certain about the location of the mean*.
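
A rough sketch of this using the simulated data (normal-theory half-widths, for illustration only):

```{r}
df %>%
  group_by(samp) %>%
  summarise(
    n = n(),
    ci_half_width = 1.96 * sd(y) / sqrt(n()),   # shrinks as n grows
    prediction_half_width = 1.96 * sd(y)        # depends only on the SD
  )
```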




## What does this mean for my work on [insert speciality here]?








138 changes: 138 additions & 0 deletions correlations.Rmd
@@ -0,0 +1,138 @@
---
title: 'Correlations'
output:
  bookdown::tufte_html2
---

```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE)
library(tidyverse)
library(pander)
panderOptions('digits', 2)
panderOptions('round', 3)
panderOptions('keep.trailing.zeros', TRUE)
```


# Correlations


The base R `cor()` function provides a simple way to get Pearson correlations, but to get a correlation matrix as you might expect from SPSS or Stata it's best to use the `corr.test()` function in the `psych` package.
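
For comparison, the base R version is a one-liner (a sketch; `use = "pairwise.complete.obs"` handles the missing values in `airquality`):

```{r}
# Coefficients only: no p values or confidence intervals
airquality %>%
  select(-Month, -Day) %>%
  cor(use = "pairwise.complete.obs")
```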

Before you start though, plotting the correlations might be the best way of getting to grips with the patterns of relationship in your data. A pairs plot is a nice way of doing this:


```{r}
airquality %>%
select(-Month, -Day) %>%
pairs
```


If we were satisfied the relationships were (reasonably) linear, we could also visualise correlations themselves with a 'corrgram', using the `corrgram` library:


```{r, fig.cap="A corrgram, showing pearson correlations (above the diagonal), variable distributions (on the diagonal) and ellipses and smoothed lines of best fit (below the diagnonal). Long, narrow ellipses denote large correlations; circular ellipses indicate small correlations."}
library("corrgram")
airquality %>%
select(-Month, -Day) %>%
corrgram(lower.panel=corrgram::panel.ellipse,
upper.panel=panel.cor,
diag.panel=panel.density)
```


The `ggpairs` function from the `GGally` package is also a nice way of plotting relationships between a combination of categorical and continuous data, packing a lot of information into a limited space:


```{r, message=F}
mtcars %>%
mutate(cyl = factor(cyl)) %>%
select(mpg, wt, drat, cyl) %>%
GGally::ggpairs()
```



## Obtaining a correlation matrix


The `psych::corr.test()` function is a quick way to obtain a pairwise correlation matrix for an entire dataset, along with p values and confidence intervals which the base R `cor()` function will not provide:


```{r}
mycorrelations <- psych::corr.test(airquality)
mycorrelations
```


One thing to be aware of is that by default `corr.test()` produces p values that are adjusted for multiple comparisons in the top right hand triangle (i.e. above the diagonal). If you want the uncorrected values, use the values below the diagonal (or pass `adjust="none"` when calling the function).
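
For example, to see unadjusted p values everywhere (a sketch):

```{r, eval=F}
psych::corr.test(airquality, adjust = "none")
```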


## Making correlation tables for publication

If you want to produce output tables for publication, you might find it useful to extract the *r* and *p* values as dataframes which can then be saved to a csv and opened in Excel, or converted to a table. You can do this by storing the `corr.test()` output in a variable, and then accessing the `$r` and `$p` values within it:


```{r}
write.csv(mycorrelations$r, file="airquality-r-values.csv")
write.csv(mycorrelations$p, file="airquality-p-values.csv")
mycorrelations$p
mycorrelations$r
```



If you wanted to merge these and produce a table for publication, you could do something like this:

```{r, echo=T}
corr.table.with.p <- function(df, r.format="%+0.2f", p.format="%.3f"){
  corrtests <- psych::corr.test(df)
  # format each r with its p value in brackets, e.g. "+0.68 (0.000)"
  m <- matrix(
    paste(sprintf(r.format, corrtests$r),
          " (",
          sprintf(p.format, corrtests$p),
          ") ",
          sep=""
    ),
    ncol = length(rownames(corrtests$r)),
    nrow = length(rownames(corrtests$r))) %>%
    as.data.frame()
  # label the columns, and add the variable names as a first column
  names(m) <- rownames(corrtests$r)
  m %>% mutate(` ` = rownames(corrtests$r)) %>%
    select(` `, everything())
}
mtcars %>% select(wt, mpg, cyl, drat) %>%
corr.table.with.p %>%
pander()
```



You can also access the CI for each pairwise correlation as a table:

```{r}
mycorrelations$ci %>%
head() %>%
pander()
```


## Other types of correlation

By default `corr.test()` produces Pearson correlations, but you can pass the `method` argument to `psych::corr.test()` to use Spearman or Kendall correlations:

```{r, echo=T, eval=F}
psych::corr.test(airquality, method="spearman")
psych::corr.test(airquality, method="kendall")
```



