benwhalley committed May 13, 2019
1 parent 8993f59 commit f124232
Showing 52 changed files with 3,879 additions and 3,496 deletions.
10 changes: 5 additions & 5 deletions bayes-mcmc.Rmd
@@ -5,9 +5,9 @@ title: 'Bayesian linear modelling via MCMC'
```{r, include=F}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE, message=F, warning=F)
library(tidyverse)
library(pander)
library(lmerTest)
```

@@ -84,7 +84,7 @@ http://doingbayesiandataanalysis.blogspot.co.uk/2012/04/why-to-use-highest-densi
-->

```{r}
params.of.interest <-
pain.model.mcmc %>%
@@ -94,7 +94,7 @@ params.of.interest
group_by(variable)
params.of.interest %>%
  tidybayes::mean_hdi() %>%
pander::pandoc.table(caption="Estimates and 95% credible intervals for the parameters of interest")
```

776 changes: 424 additions & 352 deletions cfa-sem.Rmd

Large diffs are not rendered by default.

9 changes: 3 additions & 6 deletions cleaning-up-your-mess.Rmd
@@ -2,18 +2,15 @@
title: 'Cleaning up the mess'
---

```{r, include=FALSE, message=F}
library(tidyverse)
library(reshape2)
library(broom)
library(pander)
```

## Cleaning up the mess: dealing with raw data {- #raw-data-mess}


XXX TODO expand on [multiple files example](#multiple-raw-data-files) and show
it worked all the way through, merging multiple files with left_join and
bind_rows/cols
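
In the meantime, a minimal sketch of the pattern the TODO describes (the file
and column names here are hypothetical):

```{r, eval=F}
# Hypothetical file names: read several raw data files, stack them with
# bind_rows (via map_df), then merge in participant-level data with left_join
raw <- c("session1.csv", "session2.csv") %>%
  map_df(read_csv, .id = "file")

participants <- read_csv("participants.csv")

combined <- raw %>%
  left_join(participants, by = "participant")
```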
32 changes: 21 additions & 11 deletions clustering.Rmd
@@ -1,29 +1,39 @@
---
title: 'Clustered data'
---

# Non-independence {#clustering}

Psychological data often contains natural _groupings_. In intervention research,
multiple patients may be treated by individual therapists, or children taught
within classes, which are further nested within schools; in experimental
research participants may respond on multiple occasions to a variety of stimuli.

Although disparate in nature, these groupings share a common characteristic:
they induce _dependency_ between the observations we make. That is, our data
points are _not independently sampled_ from one another.

What this means is that observations _within_ a particular grouping will tend,
all other things being equal, to be more alike than those from a different
group.

#### Why does this matter? {-}

Think of the last quantitative experiment you read about. If you were the author
of that study, and were offered 10 additional datapoints for 'free', which would
you choose:

1. 10 extra datapoints from existing participants.
2. 10 datapoints from 10 new participants.

In general you will gain more _new information_ from data from a new
participant. Intuitively we know this is correct because an extra observation
from someone we have already studied is _less likely to surprise us_ or be
different from the data we already have than an observation from a new
participant.

Most traditional statistical models, however, assume that data _are_ sampled
independently. And the precision of the inferences we can draw from statistical
models is based on the _amount of information we have available_. This means
that if we violate this assumption of independent sampling we will trick our
model into thinking we have more information than we really do, and our
inferences may be wrong.
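
As a minimal sketch of this problem (the data are simulated, and `lmerTest` is
assumed to be loaded, as elsewhere in these materials):

```{r}
# Children nested within classes: ignoring the clustering makes the
# standard error of the mean look smaller than it should be
set.seed(42)
sim <- expand.grid(class = 1:20, child = 1:10) %>%
  group_by(class) %>%
  mutate(class.effect = rnorm(1, 0, 2)) %>%
  ungroup() %>%
  mutate(score = 100 + class.effect + rnorm(n(), 0, 2))

# naive model, treating all 200 observations as independent
coef(summary(lm(score ~ 1, data = sim)))

# random intercept for class: a larger, more honest, standard error
coef(summary(lmerTest::lmer(score ~ 1 + (1 | class), data = sim)))
```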
10 changes: 3 additions & 7 deletions code-hygiene.Rmd
@@ -2,16 +2,12 @@
title: 'Code hygiene'
---


Sometimes code has a 'smell' about it...

# Naming variables


# Using comments

You can include comments within your R code, to help others understand what your
code does. Comments start with a `#` symbol and are not processed by R when the
code runs.
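
For example (a trivial, made-up snippet):

```{r}
# this whole line is a comment and is ignored when the code runs
lengths <- c(1, 2, 3) # comments can also follow code on the same line
mean(lengths)
```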
20 changes: 5 additions & 15 deletions colours.Rmd
@@ -9,25 +9,19 @@ library(tidyverse)
library(pander)
```

## Colours {-}


### Picking colours for plots {- #picking-colours}

See https://www.perceptualedge.com/articles/b-eye/choosing_colors.pdf for an
interesting discussion on picking colours for data visualisation.

Also check the
[ggplot2 docs for colour brewer](http://ggplot2.tidyverse.org/reference/scale_brewer.html)
and the [Colour Brewer website](http://colorbrewer2.org/).

### Named colours in R {- #named-colours}

```{r, results='asis'}
print.col <- Vectorize(function(col){
rgb <- grDevices::col2rgb(col)
@@ -37,10 +31,6 @@ print.col <- Vectorize(function(col){
pandoc.p(print.col(colours()))
```
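
Any of these names can be used wherever R expects a colour. A quick sketch,
using a built-in dataset for convenience:

```{r}
# 'steelblue' is one of R's named colours
mtcars %>%
  ggplot(aes(wt, mpg)) +
  geom_point(colour = "steelblue")
```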


### ColourBrewer with ggplot {- #color-brewer}

See: http://ggplot2.tidyverse.org/reference/scale_brewer.html
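
A minimal sketch (again using a built-in dataset for convenience):

```{r}
# apply a ColorBrewer palette to a discrete colour scale
mtcars %>%
  ggplot(aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  scale_colour_brewer("Cylinders", palette = "Set1")
```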

84 changes: 51 additions & 33 deletions confidence-and-intervals.Rmd
@@ -7,77 +7,95 @@ bibliography: bibliography.bib
library(tidyverse)
```


# Confidence and Intervals {#intervals}


Some quick definitions to begin. Let's say we have made an estimate from a
model. To keep things simple, it could just be the sample mean.

<!-- TODO: EXPAND ON THESE DEFINITIONS AND USE GRAPHICS AND PLOTS TO ILLUSTRATE -->

A _Confidence interval_ is the range within which we would expect the 'true'
value to fall, 95% of the time, if we replicated the study.

A _Prediction interval_ is the range within which we expect 95% of new
observations to fall. If we're considering the prediction interval for a
specific point prediction (i.e. where we set predictors to specific values),
then this interval would be for new observations _with the same predictor
values_.

A Bayesian _Credible interval_ is the range of values within which we are 95%
sure the true value lies, based on our prior knowledge and the data we have
collected.
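
To make the difference between the first two concrete, a minimal sketch using a
built-in dataset (`mtcars` is just a convenient example):

```{r}
fit <- lm(mpg ~ wt, data = mtcars)
new.car <- data.frame(wt = 3)

# where we expect the *mean* mpg of cars of this weight to lie
predict(fit, newdata = new.car, interval = "confidence")

# where we expect a *new individual* car of this weight to fall (wider)
predict(fit, newdata = new.car, interval = "prediction")
```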

### The problem with confidence intervals {-}

Confidence intervals are helpful when we want to think about how _precise our
estimate_ is. For example, in an RCT we will want to estimate the difference
between treatment groups, and it's conceivable we would want to know, for
example, the range within which the true effect would fall 95% of the time if we
replicated our study many times (although in reality, this isn't a question many
people would actually ask).

If we run a study with small N, intuitively we know that we have less
information about the difference between our RCT treatments, and so we'd like
the CI to expand accordingly.

So — all things being equal — the confidence interval reduces as we collect more
data.

The problem with confidence intervals comes about because many researchers and
clinicians read them incorrectly. Typically, they either:

- Forget that the CI represents only the _precision of the estimate_. The CI
  _doesn't_ reflect how good our predictions for new observations will be.

- Misinterpret the CI as the range in which we are 95% sure the true value
  lies.

### Forgetting that the CI depends on sample size {-}

By forgetting that the CI contracts as the sample size increases, researchers
can become overconfident about their ability to predict new observations.
Imagine that we sample data from two populations with the same mean, but
different variability:

```{r}
set.seed(1234)
# simulate two populations with the same mean (100) but different SDs (1 vs 3)
df <- expand.grid(v=c(1,3,3,3), i=1:1000) %>%
  as_data_frame %>%
  mutate(y = rnorm(length(.$i), 100, v)) %>%
  mutate(samp = factor(v, labels=c("Low variability", "High variability")))
```


```{r}
df %>%
  ggplot(aes(y)) +
  geom_histogram() +
  facet_grid(~samp) +
  scale_color_discrete("")
```

- If we sample 100 individuals from each population the confidence interval
  around the sample mean would be wider in the high variability group.

If we increase our sample size we would become more confident about the location
of the mean, and this confidence interval would shrink.
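
A quick sketch of both points, using a normal approximation for the width of
the 95% CI of the mean:

```{r}
# CI width is larger in the high-variability group, and shrinks with n
# (compare the first 100 rows per group with all rows per group)
df %>%
  group_by(samp) %>%
  slice(1:100) %>%
  summarise(ci.width = 2 * 1.96 * sd(y) / sqrt(n()))

df %>%
  group_by(samp) %>%
  summarise(ci.width = 2 * 1.96 * sd(y) / sqrt(n()))
```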
But imagine taking a single _new sample_ from either population. These samples
would be new grey squares, which we place on the histograms above. It does not
matter how much extra data we have collected in group B, or how sure we are
about the mean of the group: _We would always be less certain making
predictions for new observations in the high variability group_.

The important insight here is that _if our data are noisy and highly variable we
can never make firm predictions for new individuals, even if we collect so much
data that we are very certain about the location of the mean_.
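
A one-line illustration of why: the standard error of the mean shrinks with the
square root of n, but the spread of individual observations (which is what
matters when predicting a new case) does not:

```{r}
# using the high-variability SD of 3 from the simulation above
tibble(n = c(10, 100, 1000, 10000)) %>%
  mutate(se.of.mean = 3 / sqrt(n), sd.of.new.observation = 3)
```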


<!--
### But should I report the CI or not? {-}
-->
