Commit: switched to bookdown

benwhalley committed Jun 21, 2017
1 parent 18afca3 commit 236393e
Showing 174 changed files with 39,257 additions and 2,476 deletions.
51 changes: 51 additions & 0 deletions _bookdown.yml
@@ -0,0 +1,51 @@
book_filename: "Just enough R"
new_session: yes
output_dir: "docs"

rmd_files: [
  "index.Rmd",
  "start_here.Rmd",
  "datasets.Rmd",
  "working-with-dataframes.Rmd",
  "summarising-data.Rmd",
  "graphics.Rmd",
  "linear-models-simple.Rmd",

  "real-data.Rmd",
  "correlations.Rmd",
  "t-tests.Rmd",
  "anova.Rmd",
  "understanding-interactions.Rmd",
  "predictions-and-margins.Rmd",
  "mediation.Rmd",
  "multilevel-models.Rmd",
  "cfa.Rmd",

  "installation.Rmd",
  "packages.Rmd",
  "help.Rmd",
  "missing-values.Rmd",
  "rmarkdown-tricks.Rmd",
  "confidence-vs-prediction-intervals.Rmd"
]

14 changes: 14 additions & 0 deletions _output.yml
@@ -0,0 +1,14 @@
bookdown::gitbook:
  css: style.css
  config:
    toc:
      before: |
        <li><a href="./">Just enough R</a></li>
    download: ["pdf", "epub"]
bookdown::pdf_book:
  includes:
    in_header: preamble.tex
  latex_engine: xelatex
  citation_package: natbib
  keep_tex: yes
bookdown::epub_book: default
32 changes: 32 additions & 0 deletions anova.Rmd
@@ -0,0 +1,32 @@
---
title: 'Anova'
output:
  bookdown::tufte_html2
---




# Anova

```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE)
library(tidyverse)
library(ez)
```




## Anova for between subjects designs

TODO
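
*A placeholder sketch until this section is written: a factorial between-subjects Anova with `ez`. The `id` column is invented here (each row of `warpbreaks` is treated as one subject) because `ezANOVA()` requires a case identifier:*

```{r, eval=FALSE}
# A hedged sketch, not the final content for this section.
# warpbreaks is a built-in factorial dataset; ezANOVA() needs a `wid`
# (case identifier) column, so we add a fake one (one row = one subject).
wb <- warpbreaks
wb$id <- factor(seq_len(nrow(wb)))
ezANOVA(data = wb, dv = .(breaks), wid = .(id), between = .(wool, tension))
```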


## Repeated measures

TODO
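
*Similarly, a placeholder sketch for a repeated measures Anova, using the `ANT` example dataset that ships with `ez` (assuming its usual `rt`, `subnum`, `cue` and `flank` columns):*

```{r, eval=FALSE}
# A hedged sketch: cue and flank are within-subject factors and subnum
# identifies each participant; ezANOVA() collapses over trials itself.
data(ANT, package = "ez")
ezANOVA(data = ANT, dv = .(rt), wid = .(subnum), within = .(cue, flank))
```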




20 changes: 20 additions & 0 deletions bibliography.bib
@@ -0,0 +1,20 @@
@article{newman2012bar,
title={Bar graphs depicting averages are perceptually misinterpreted: The within-the-bar bias},
author={Newman, George E and Scholl, Brian J},
journal={Psychonomic bulletin \& review},
volume={19},
number={4},
pages={601--607},
year={2012},
publisher={Springer}
}
@article{kockelman2002driver,
title={Driver injury severity: an application of ordered probit models},
author={Kockelman, Kara Maria and Kweon, Young-Jun},
journal={Accident Analysis \& Prevention},
volume={34},
number={3},
pages={313--321},
year={2002},
publisher={Elsevier}
}
13 changes: 7 additions & 6 deletions cfa.Rmd
@@ -5,14 +5,15 @@ output:
---


```{r setup, include=FALSE}
# ignore all this for the moment
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE)
library(tufte)
library(tidyverse)
```


# Confirmatory factor analysis

*This section adapted from a guide originally produced by Jon May*


@@ -36,7 +37,7 @@ hz %>% glimpse()
```


-# Defining the model
+## Defining the model

Define the CFA model you want to run using 'lavaan model syntax'^[A full guide is here: <http://lavaan.ugent.be/tutorial/syntax1.html>]. For example:
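
*(The file's own example is collapsed in this diff. As a sketch, a three-factor measurement model in lavaan syntax, assuming the classic Holzinger-Swineford indicators `x1` to `x9`, would look like this:)*

```{r, eval=FALSE}
# Illustrative only: latent factors on the left of =~, indicators on the right
hz.model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'
```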

@@ -79,7 +80,7 @@ The output has three parts:



-# Model fit
+## Model fit

To examine the model fit:

@@ -91,7 +92,7 @@ fitmeasures(hz.fit, c('cfi', 'rmsea', 'rmsea.ci.upper', 'bic'))
This looks OK, but could be improved.


-# Modification indices
+## Modification indices

To examine the modification indices (here sorted to see the largest first) we type:

@@ -123,7 +124,7 @@ fitmeasures(hz.fit.2, c('cfi', 'rmsea', 'rmsea.ci.upper', 'bic'))
RMSEA has improved marginally, but we'd probably want to investigate this model further, and make additional improvements to it.


-# Missing data
+## Missing data

If you have missing data, add the argument `missing='fiml'` to the `cfa()` function to use full information maximum likelihood estimation (the robust `estimator='MLM'` requires complete data). There are no missing data in this dataset, but it would look like this:
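
*A sketch, reusing the `hz.model` and `hz` objects from above (and assuming `lavaan` is loaded):*

```{r, eval=FALSE}
# Full information maximum likelihood handles the missing values:
hz.fit.fiml <- cfa(hz.model, data = hz, missing = 'fiml')
summary(hz.fit.fiml, fit.measures = TRUE)
```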

251 changes: 0 additions & 251 deletions cfa.html

This file was deleted.

79 changes: 79 additions & 0 deletions confidence-vs-prediction-intervals.Rmd
@@ -0,0 +1,79 @@
---
title: 'Confidence, credible and prediction intervals'
output:
  bookdown::tufte_html2
bibliography: bibliography.bib
---

```{r setup}
library(tidyverse)
```


# Confidence, credible and prediction intervals


TODO: EXPAND ON THESE DEFINITIONS AND USE GRAPHICS AND PLOTS TO ILLUSTRATE

*Confidence interval*: The range within which we would expect the true value to fall, 95% of the time, if we replicated the study.

*Prediction interval*: The range within which we expect 95% of new observations to fall. If we're considering the prediction interval for a specific point prediction (i.e. where we set predictors to specific values), then this interval would be for new observations *with the same predictor values*.
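
A minimal sketch of the difference, using `predict()` on a simple linear model (the model and data here are chosen purely for illustration):

```{r}
fit <- lm(Ozone ~ Temp, data = airquality)
new_obs <- data.frame(Temp = 80)
predict(fit, new_obs, interval = "confidence")  # range for the *mean* response
predict(fit, new_obs, interval = "prediction")  # range for a *new observation*
```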


## The problem with confidence intervals

Confidence intervals are helpful when we want to think about how *precise our estimate* is. For example, in an RCT we will want to estimate the difference between treatment groups, and it's reasonable to want to know the range within which the true effect would fall 95% of the time if we replicated our study many times.

If we run a study with small N, intuitively we know that we have less information about the difference between our RCT treatments, and so we'd like the CI to expand accordingly.

*So — all things being equal — the confidence interval reduces as our sample size increases.*
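
A quick numerical illustration (a sketch using the normal approximation, with an assumed population SD of 10):

```{r}
tibble(n = c(10, 40, 160, 640)) %>%
  mutate(ci_half_width = 1.96 * 10 / sqrt(n))  # halves as n quadruples
```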


The problem with confidence intervals comes because many researchers and clinicians read them incorrectly. Typically, they either:

- Forget that the CI represents the *precision of the estimate*
- Misinterpret the CI as the range in which we are 95% sure the true value lies.


## Forgetting that the CI depends on sample size

By forgetting that the CI contracts as the sample size increases, researchers can become overconfident about their ability to predict new observations. Imagine that we sample data from two populations with the same mean, but different variability:

```{r}
set.seed(1234)
# simulate two populations with the same mean (100) but different SDs (v);
# sampling v from c(1, 3, 3, 3) makes the high variability group larger,
# but both groups are big enough to compare
df <- expand.grid(v = c(1, 3, 3, 3), i = 1:1000) %>%
  mutate(y = rnorm(n(), 100, v)) %>%
  mutate(samp = factor(v, labels = c("Low variability", "High variability")))
```


```{r}
df %>%
  ggplot(aes(y)) +
  geom_histogram() +
  facet_grid(~samp)
```


If we sampled 100 individuals from each population, the confidence interval around the sample mean would be wider in the high variability group.

If we increase our sample size we would become more confident about the location of the mean, and this confidence interval would shrink.

But imagine taking a single *new sample* from either population. These new observations would be extra points added to the histograms above. It does not matter how much extra data we have collected, or how sure we are about each group's mean: *we would always be less certain when making predictions for new observations in the high variability group*.

The important insight here is that *if our data are noisy and highly variable we can never make firm predictions for new individuals, even if we collect so much data that we are very certain about the location of the mean*.
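
A rough sketch of this using the simulated data (normal-theory half-widths, for illustration only):

```{r}
df %>%
  group_by(samp) %>%
  summarise(
    n = n(),
    ci_half_width = 1.96 * sd(y) / sqrt(n()),   # shrinks as n grows
    prediction_half_width = 1.96 * sd(y)        # depends only on the SD
  )
```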




## What does this mean for my work on [insert speciality here]?








138 changes: 138 additions & 0 deletions correlations.Rmd
@@ -0,0 +1,138 @@
---
title: 'Correlations'
output:
  bookdown::tufte_html2
---

```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE)
library(tidyverse)
library(pander)
panderOptions('digits', 2)
panderOptions('round', 3)
panderOptions('keep.trailing.zeros', TRUE)
```


# Correlations


The base R `cor()` function provides a simple way to get Pearson correlations, but to get a correlation matrix as you might expect from SPSS or Stata it's best to use the `corr.test()` function in the `psych` package.
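
For comparison, the base R version is a one-liner (a sketch; `use = "pairwise.complete.obs"` handles the missing values in `airquality`):

```{r}
# Coefficients only: no p values or confidence intervals
airquality %>%
  select(-Month, -Day) %>%
  cor(use = "pairwise.complete.obs")
```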

Before you start though, plotting the correlations might be the best way of getting to grips with the patterns of relationship in your data. A pairs plot is a nice way of doing this:


```{r}
airquality %>%
select(-Month, -Day) %>%
pairs
```


If we were satisfied the relationships were (reasonably) linear, we could also visualise correlations themselves with a 'corrgram', using the `corrgram` library:


```{r, fig.cap="A corrgram, showing pearson correlations (above the diagonal), variable distributions (on the diagonal) and ellipses and smoothed lines of best fit (below the diagnonal). Long, narrow ellipses denote large correlations; circular ellipses indicate small correlations."}
library("corrgram")
airquality %>%
select(-Month, -Day) %>%
corrgram(lower.panel=corrgram::panel.ellipse,
upper.panel=panel.cor,
diag.panel=panel.density)
```


The `ggpairs` function from the `GGally` package is also a nice way of plotting relationships between a combination of categorical and continuous data, packing a lot of information into a limited space:


```{r, message=F}
mtcars %>%
mutate(cyl = factor(cyl)) %>%
select(mpg, wt, drat, cyl) %>%
GGally::ggpairs()
```



## Obtaining a correlation matrix


The `psych::corr.test()` function is a quick way to obtain a pairwise correlation matrix for an entire dataset, along with p values and confidence intervals which the base R `cor()` function will not provide:


```{r}
mycorrelations <- psych::corr.test(airquality)
mycorrelations
```


One thing to be aware of is that by default `corr.test()` produces p values that are adjusted for multiple comparisons in the top right hand triangle (i.e. above the diagonal). If you want the uncorrected values, use the values below the diagonal (or pass `adjust="none"` when calling the function).
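
For example, to see unadjusted p values everywhere (a sketch):

```{r, eval=F}
psych::corr.test(airquality, adjust = "none")
```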


## Making correlation tables for publication

If you want to produce output tables for publication, you might find it useful to extract the *r* and *p* values as dataframes which can then be saved to a csv and opened in Excel, or converted to a table. You can do this by storing the `corr.test()` output in a variable, and then accessing the `$r` and `$p` values within it:


```{r}
write.csv(mycorrelations$r, file="airquality-r-values.csv")
write.csv(mycorrelations$p, file="airquality-p-values.csv")
mycorrelations$p
mycorrelations$r
```



If you wanted to merge these and produce a table for publication, you could do something like this:

```{r, echo=T}
corr.table.with.p <- function(df, r.format="%+0.2f", p.format="%.3f"){
  corrtests <- psych::corr.test(df)
  # format each r with its p value in brackets, e.g. "+0.68 (0.000)"
  m <- matrix(
    paste(sprintf(r.format, corrtests$r),
          " (",
          sprintf(p.format, corrtests$p),
          ") ",
          sep=""
    ),
    ncol = length(rownames(corrtests$r)),
    nrow = length(rownames(corrtests$r))) %>%
    as.data.frame()
  # label the columns, and add the variable names as a first column
  names(m) <- rownames(corrtests$r)
  m %>% mutate(` ` = rownames(corrtests$r)) %>%
    select(` `, everything())
}
mtcars %>% select(wt, mpg, cyl, drat) %>%
corr.table.with.p %>%
pander()
```



You can also access the CI for each pairwise correlation as a table:

```{r}
mycorrelations$ci %>%
head() %>%
pander()
```


## Other types of correlation

By default `corr.test()` produces Pearson correlations, but you can pass the `method` argument to `psych::corr.test()` to use Spearman or Kendall correlations:

```{r, echo=T, eval=F}
psych::corr.test(airquality, method="spearman")
psych::corr.test(airquality, method="kendall")
```



