benwhalley committed May 13, 2019
1 parent 8993f59 commit f124232
Showing 52 changed files with 3,879 additions and 3,496 deletions.
10 changes: 5 additions & 5 deletions bayes-mcmc.Rmd
@@ -5,9 +5,9 @@ title: 'Bayesian linear modelling via MCMC'
```{r, include=F}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE, message=F, warning=F)
library(tidyverse)
library(pander)
library(lmerTest)
```

@@ -84,7 +84,7 @@ http://doingbayesiandataanalysis.blogspot.co.uk/2012/04/why-to-use-highest-densi
-->

```{r}
params.of.interest <-
pain.model.mcmc %>%
@@ -94,7 +94,7 @@ params.of.interest
group_by(variable)
params.of.interest %>%
  tidybayes::mean_hdi() %>%
pander::pandoc.table(caption="Estimates and 95% credible intervals for the parameters of interest")
```

776 changes: 424 additions & 352 deletions cfa-sem.Rmd

Large diffs are not rendered by default.

9 changes: 3 additions & 6 deletions cleaning-up-your-mess.Rmd
@@ -2,18 +2,15 @@
title: 'Cleaning up the mess'
---

```{r, include=FALSE, message=F}
library(tidyverse)
library(reshape2)
library(broom)
library(pander)
```

## Cleaning up the mess: dealing with raw data {- #raw-data-mess}


XXX TODO expand on [multiple files example](#multiple-raw-data-files) and show
it worked all the way through, merging multiple files with left_join and
bind_rows/cols
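
In the meantime, a minimal sketch of the pattern the TODO describes (the file
and column names here are hypothetical):

```{r, eval=F}
# Hypothetical file names: read several raw data files, stack them with
# bind_rows (via map_df), then merge in participant-level data with left_join
raw <- c("session1.csv", "session2.csv") %>%
  map_df(read_csv, .id = "file")

participants <- read_csv("participants.csv")

combined <- raw %>%
  left_join(participants, by = "participant")
```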
32 changes: 21 additions & 11 deletions clustering.Rmd
@@ -1,29 +1,39 @@
---
title: 'Clustered data'
---

# Non-independence {#clustering}

Psychological data often contains natural _groupings_. In intervention research,
multiple patients may be treated by individual therapists, or children taught
within classes, which are further nested within schools; in experimental
research participants may respond on multiple occasions to a variety of stimuli.

Although disparate in nature, these groupings share a common characteristic:
they induce _dependency_ between the observations we make. That is, our data
points are _not independently sampled_ from one another.

What this means is that observations _within_ a particular grouping will tend,
all other things being equal, to be more alike than those from a different
group.

#### Why does this matter? {-}

Think of the last quantitative experiment you read about. If you were the author
of that study, and were offered 10 additional datapoints for 'free', which would
you choose:

1. 10 extra datapoints from existing participants.
2. 10 datapoints from 10 new participants.

In general you will gain more _new information_ from data from a new
participant. Intuitively we know this is correct because an extra observation
from someone we have already studied is _less likely to surprise us_ or be
different from the data we already have than an observation from a new
participant.

Most traditional statistical models, however, assume that data _are_ sampled
independently. And the precision of the inferences we can draw from statistical
models is based on the _amount of information we have available_. This means
that if we violate this assumption of independent sampling we will trick our
model into thinking we have more information than we really do, and our
inferences may be wrong.
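
As a minimal sketch of this problem (the data are simulated, and `lmerTest` is
assumed to be loaded, as elsewhere in these materials):

```{r}
# Children nested within classes: ignoring the clustering makes the
# standard error of the mean look smaller than it should be
set.seed(42)
sim <- expand.grid(class = 1:20, child = 1:10) %>%
  group_by(class) %>%
  mutate(class.effect = rnorm(1, 0, 2)) %>%
  ungroup() %>%
  mutate(score = 100 + class.effect + rnorm(n(), 0, 2))

# naive model, treating all 200 observations as independent
coef(summary(lm(score ~ 1, data = sim)))

# random intercept for class: a larger, more honest, standard error
coef(summary(lmerTest::lmer(score ~ 1 + (1 | class), data = sim)))
```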
10 changes: 3 additions & 7 deletions code-hygiene.Rmd
@@ -2,16 +2,12 @@
title: 'Code hygiene'
---


Sometimes code has a 'smell' about it...

# Naming variables


# Using comments

You can include comments within your R code, to help others understand what your
code does. Comments start with a `#` symbol and are not processed by R when the
code runs.
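
For example (a trivial, made-up snippet):

```{r}
# this whole line is a comment and is ignored when the code runs
lengths <- c(1, 2, 3) # comments can also follow code on the same line
mean(lengths)
```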
20 changes: 5 additions & 15 deletions colours.Rmd
@@ -9,25 +9,19 @@ library(tidyverse)
library(pander)
```

## Colours {-}


### Picking colours for plots {- #picking-colours}

See https://www.perceptualedge.com/articles/b-eye/choosing_colors.pdf for an
interesting discussion on picking colours for data visualisation.

Also check the
[ggplot2 docs for colour brewer](http://ggplot2.tidyverse.org/reference/scale_brewer.html)
and the [Colour Brewer website](http://colorbrewer2.org/).

### Named colours in R {- #named-colours}

```{r, results='asis'}
print.col <- Vectorize(function(col){
rgb <- grDevices::col2rgb(col)
@@ -37,10 +31,6 @@ print.col <- Vectorize(function(col){
pandoc.p(print.col(colours()))
```
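
Any of these names can be used wherever R expects a colour. A quick sketch,
using a built-in dataset for convenience:

```{r}
# 'steelblue' is one of R's named colours
mtcars %>%
  ggplot(aes(wt, mpg)) +
  geom_point(colour = "steelblue")
```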


### ColourBrewer with ggplot {- #color-brewer}

See: http://ggplot2.tidyverse.org/reference/scale_brewer.html
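
A minimal sketch (again using a built-in dataset for convenience):

```{r}
# apply a ColorBrewer palette to a discrete colour scale
mtcars %>%
  ggplot(aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  scale_colour_brewer("Cylinders", palette = "Set1")
```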

84 changes: 51 additions & 33 deletions confidence-and-intervals.Rmd
@@ -7,77 +7,95 @@ bibliography: bibliography.bib
library(tidyverse)
```


# Confidence and Intervals {#intervals}


Some quick definitions to begin. Let's say we have made an estimate from a
model. To keep things simple, it could just be the sample mean.

<!-- TODO: EXPAND ON THESE DEFINITIONS AND USE GRAPHICS AND PLOTS TO ILLUSTRATE -->

A _Confidence interval_ is the range within which we would expect the 'true'
value to fall, 95% of the time, if we replicated the study.

A _Prediction interval_ is the range within which we expect 95% of new
observations to fall. If we're considering the prediction interval for a
specific point prediction (i.e. where we set predictors to specific values),
then this interval would be for new observations _with the same predictor
values_.

A Bayesian _Credible interval_ is the range of values within which we are 95%
sure the true value lies, based on our prior knowledge and the data we have
collected.
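
To make the difference between the first two concrete, a minimal sketch using a
built-in dataset (`mtcars` is just a convenient example):

```{r}
fit <- lm(mpg ~ wt, data = mtcars)
new.car <- data.frame(wt = 3)

# where we expect the *mean* mpg of cars of this weight to lie
predict(fit, newdata = new.car, interval = "confidence")

# where we expect a *new individual* car of this weight to fall (wider)
predict(fit, newdata = new.car, interval = "prediction")
```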

### The problem with confidence intervals {-}

Confidence intervals are helpful when we want to think about how _precise our
estimate_ is. For example, in an RCT we will want to estimate the difference
between treatment groups, and it's conceivable we would want to know, for
example, the range within which the true effect would fall 95% of the time if we
replicated our study many times (although in reality, this isn't a question many
people would actually ask).

If we run a study with small N, intuitively we know that we have less
information about the difference between our RCT treatments, and so we'd like
the CI to expand accordingly.

So — all things being equal — the confidence interval reduces as we collect more
data.

The problem with confidence intervals comes about because many researchers and
clinicians read them incorrectly. Typically, they either:

- Forget that the CI represents only the _precision of the estimate_. The CI
  _doesn't_ reflect how good our predictions for new observations will be.

- Misinterpret the CI as the range in which we are 95% sure the true value
  lies.

### Forgetting that the CI depends on sample size {-}

By forgetting that the CI contracts as the sample size increases, researchers
can become overconfident about their ability to predict new observations.
Imagine that we sample data from two populations with the same mean, but
different variability:

```{r}
set.seed(1234)
# simulate two populations with the same mean (100) but different SDs (1 vs 3)
df <- expand.grid(v=c(1,3,3,3), i=1:1000) %>%
  as_data_frame %>%
  mutate(y = rnorm(length(.$i), 100, v)) %>%
  mutate(samp = factor(v, labels=c("Low variability", "High variability")))
```


```{r}
df %>%
  ggplot(aes(y)) +
  geom_histogram() +
  facet_grid(~samp) +
  scale_color_discrete("")
```

- If we sample 100 individuals from each population the confidence interval
  around the sample mean would be wider in the high variability group.

If we increase our sample size we would become more confident about the location
of the mean, and this confidence interval would shrink.
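
A quick sketch of both points, using a normal approximation for the width of
the 95% CI of the mean:

```{r}
# CI width is larger in the high-variability group, and shrinks with n
# (compare the first 100 rows per group with all rows per group)
df %>%
  group_by(samp) %>%
  slice(1:100) %>%
  summarise(ci.width = 2 * 1.96 * sd(y) / sqrt(n()))

df %>%
  group_by(samp) %>%
  summarise(ci.width = 2 * 1.96 * sd(y) / sqrt(n()))
```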
But imagine taking a single _new sample_ from either population. These samples
would be new grey squares, which we place on the histograms above. It does not
matter how much extra data we have collected in group B, or how sure we are
about the mean of the group: _We would always be less certain making
predictions for new observations in the high variability group_.

The important insight here is that _if our data are noisy and highly variable we
can never make firm predictions for new individuals, even if we collect so much
data that we are very certain about the location of the mean_.
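
A one-line illustration of why: the standard error of the mean shrinks with the
square root of n, but the spread of individual observations (which is what
matters when predicting a new case) does not:

```{r}
# using the high-variability SD of 3 from the simulation above
tibble(n = c(10, 100, 1000, 10000)) %>%
  mutate(se.of.mean = 3 / sqrt(n), sd.of.new.observation = 3)
```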


<!--
### But should I report the CI or not? {-}
-->
