sm11_eQTL.Rmd

---
output:
  github_document
always_allow_html: true
---

# STAT 540 - Seminar 11: eQTL analysis

## Acknowledgements

Contributors: Jasleen Grewal and Keegan Korthauer


## Learning objectives
By the end of this tutorial, you should be able to
- Identify the goal of an eQTL analysis
- Explain the importance of log transforming the expression data in an eQTL analysis
- List the basic steps of an eQTL analysis
- Undertake a linear regression analysis for an eQTL study

## Dependencies

We will be using the following packages today. Follow instructions from previous tutorials to install these if you don't have them already.

```{r, echo=TRUE, message=F, warning=F}
library(GEOquery)
library(Biobase)
library(tidyverse)
theme_set(theme_bw())
library(knitr)
library(kableExtra)
```

## Introduction

As its name suggests (albeit in a roundabout way) Expression Quantitative Trait Loci analysis, or eQTL analysis, aims to identify associations between gene expression and genetic variation. eQTL analysis is equivalent to a linear regression problem where we are fitting our covariates (genetic variation data such as genotype) to explain our outcome, the observed changes in gene expression.

Today, we will be working with genotype data (SNPs) and gene expression values (counts) from a recently released dataset of melanoma cell lines. In this [study](https://www.nature.com/articles/ng.3927), authors were interested in identifying genes whose expression levels are correlated with common variants in melanomas.

## Explore GSE99221

In this seminar, we are going to perform cis-eQTL analysis for 59 melanoma cell line samples, hence we will only be concerned with variants within 1 Mb (1e6 base-pairs) away from the Transcription Start Site (TSS) of each gene, where the vast majority of such associations are found. The gene expression data we will be using is from GEO accession [GSE78995](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE78995), and the genotype data will be from GEO accession GSE99193. These two datasets are from the same study, and can also be jointly downloaded from the accession GSE99221.

The datasets we are going to use are:

- GSE78995, which contains expression data (54675 features, i.e. rows), for 59 melanoma cell line samples. Note that the expression data is measured by Affymetrix Human Genome U133 Plus 2.0 Array, and has already been normalized with RMA (robust microarray average) which involves a log2 transformation as well.

- GSE99193, which contains genotype data (733202 SNPs, i.e. rows) for 67 melanoma cell lines, measured by the Illumina Infinium human OmniExpress array

First, let's retrieve our datasets from GEO with `getGEO` from `GEOquery` package. Note: this step may take a few minutes!

We will extract our data and metadata from the objects we get from GEO. We will also subset our genotype data to only include cell lines for which we have expression data. And, **critically** make sure that the samples are in the same order in both the expression data and SNP data.


```{r}
# If you run into an issue while downloading the following data
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 5)
```

```{r fetchGEO, message = F}
exprs_dat <- getGEO('GSE78995', getGPL = FALSE)
show(exprs_dat) ## 59 samples
genotype_dat <- getGEO('GSE99193', getGPL = FALSE)
show(genotype_dat) ## 67 samples

# Extract expression matrices (turn into data frames at once)
exprsdat <- as.data.frame(exprs(exprs_dat[[1]]))
genodat <- as.data.frame(exprs(genotype_dat[[1]]))

# Obtain the meta-data for the samples
exprs.meta <- pData(phenoData(exprs_dat[[1]]))
geno.meta <- pData(phenoData(genotype_dat[[1]]))

# Get the cell line labels for both types of data
exprs.meta$cell_line = gsub(".*UACC) melanoma ", "", exprs.meta$title)
geno.meta$cell_line = gsub(".*UACC ", "", geno.meta$title)

# Subset genotype data to keep cases overlapping with expression data
geno.meta.clean = geno.meta[geno.meta$cell_line %in% exprs.meta$cell_line,]
genodat.clean = genodat[,geno.meta.clean$geo_accession]

# reorder genotype data to match expression data (CRITICAL STEP)
x <- match(exprs.meta$cell_line, geno.meta.clean$cell_line)
geno.meta.clean <- geno.meta.clean[x,]
genodat.clean <- genodat.clean[,x]
```

Let us update the column names so that the samples are defined by the cell line, and not by the GSM IDs (the latter are specific to the submission sample and data type, so will not overlap between the SNP and expression data).

```{r}
colnames(exprsdat) = exprs.meta[exprs.meta$geo_accession == colnames(exprsdat),
                                c("cell_line")]
colnames(genodat.clean) = geno.meta.clean[geno.meta.clean$geo_accession == colnames(genodat.clean),
                                          c("cell_line")]
head(exprsdat) %>% kable() %>% kable_styling()
head(genodat.clean) %>% kable() %>% kable_styling()
```

**Exercise: Pause and Review**
- Consider the two dataframes we have - `exprsdat`, and `genodat.clean`. How many columns are in each dataframe? How many rows? Are the columns the features, or the samples?
- What type(s) of data are we using? Is the SNP data numeric?

```{r}
# YOUR CODE HERE
```

## Preprocess Data
### Check for missing values
As a first step, we will check for any missing values in the data.

> Hint: we can use the `complete.cases` function to test if there are any rows that contain a missing value.

```{r, echo = T}
# For our expression data
# Are the number of rows without any NA value, the same as total number of rows?
dim(exprsdat) == dim(exprsdat[complete.cases(exprsdat),])

# For our genotype data
# Are the number of rows without any NA value, the same as total number of rows?
dim(genodat.clean) == dim(genodat.clean[complete.cases(genodat.clean),])
```

### Map SNP data to numeric

Secondly, we will convert our SNP data to numeric. We will do so by 'converting' our genotype levels to numeric.
The genotypes in our dataset are defined as 'AA', 'BB', 'AB', 'NC'. (NC = No Call).

**Exercise: Pause and Review**
- Which of these genotype labels is/are *homozygous*?
- Which of these is/are *heterozygous*?

We will be setting the 'NC' values to NA, and AA, AB, and BB to 0, 1, and 2 respectively (representing the average count of the alternative allele, B, in the allele pair).

```{r}
# AA = 0, AB = 1, BB = 2
genodat.numeric = genodat.clean
genodat.numeric[genodat.numeric == "NC"] = NA
genodat.numeric = data.frame(sapply(genodat.numeric, function(x) as.numeric(as.factor(x)) - 1))
rownames(genodat.numeric) = rownames(genodat.clean)
```

### Transformation of expression data

Before we go ahead and attempt to fit a linear model to our samples, let us review the assumptions of a linear model.
A linear model assumes that the samples being fit are *independent* and with a *constant variance*.
- Are our samples independent?
- Do our values have constant variance?

Count data from RNA-seq experiments generally do not meet the assumption of variance-mean independence. Our data here is microarray data that has been normalized and log2 transformed. This helps stabilize the variance of the data by reducing the 'range' of variability in cases where the gene expression can range from very high to very low. We can evaluate the assumptions in our data by plotting the mean vs variance of every gene in the dataset, as follows:

```{r}
probe_means = rowMeans(exprsdat)
probe_vars = apply(exprsdat, 1 , sd)

data.frame(Mean = probe_means, Variance = probe_vars) %>%
  ggplot(aes(x = Mean, y = Variance)) +
  geom_point(alpha = 0.1) +
  geom_smooth()
```

Hmmm, do you think this looks symmetric, or is there an association between mean and variance?

Now let us look at the data on the original scale by undoing the log2 transformation.

```{r}
exprsdat_raw = 2^(exprsdat)
probe_means = rowMeans(exprsdat_raw)
probe_vars = apply(exprsdat_raw, 1 , sd)

data.frame(Mean = probe_means, Variance = probe_vars) %>%
  ggplot(aes(x = Mean, y = Variance)) +
  geom_point(alpha = 0.1) +
  geom_smooth()
```

Compare this to our previous plot. Clearly there is a strong dependence between mean and variance.

For the subsequent analysis, we will be using the log transformed expression data as downloaded from GEO to fit our linear models.

**Exercise: Pause and Review**
For RNA-seq data, you can also use the 'rlog' transformation using the rlog function in DESeq2, which can be used for differential expression analysis. How might this transformation be better than the log2 transformation? (i.e. what factors does it account for that log2 transform might not?)

## Fitting a single linear model

In a simple test for linear association between a SNP and a particular gene's expression, we can fit a linear regression model for a given gene/SNP pair. We'll do this by carrying out the following steps:
- Let's pick a random gene probe from the `exprsdat` dataframe (Need to consider: Are the rows genes? Or are the columns genes?)
- Pick a random SNP from the `genodat.numeric` dataframe
- Fit a linear model for the gene/SNP data

```{r}
set.seed(10)

random_gene_ix = sample(nrow(exprsdat), 1)
random_snp_ix = sample(nrow(genodat.numeric), 1)

#What random gene and SNP did we pick?
print(paste("My random gene probe from the expression data is", rownames(exprsdat)[random_gene_ix]))
print(paste("My random SNP from the genotype data is", rownames(genodat.numeric)[random_snp_ix]))

exprs_random = slice(exprsdat,random_gene_ix) %>% as.numeric()
snp_random = slice(genodat.numeric,random_snp_ix) %>% as.numeric()
lm_random = lm(exprs_random ~ snp_random)

summary(lm_random)
```

> How do we interpret this?

Let us try to visualize the expression values (across our 59 cell lines), for our random probe. We will colour the data points by the genotype of the SNP (across our 59 cell lines) that we are testing for association with our probe.
```{r}
plot(exprs_random ~ jitter(snp_random), col=(snp_random + 1), xaxt="n",
     xlab="Genotype", ylab="Expression") ; axis(1, at=c(0:2),
                                                labels=c("AA", "AB","BB"))
```

Now let us overlay this plot with the model we have fit.
```{r}
plot(exprs_random ~ jitter(snp_random), col=(snp_random + 1), xaxt="n",
     xlab="Genotype", ylab="Expression")
axis(1, at=c(0:2), labels=c("AA", "AB","BB"))
lines(lm_random$fitted ~ snp_random, type="b", pch=15, col="black")
```


## Fitting multiple eQTL models

What if we want to fit multiple eQTL models. If we fit a model for every pairwise combination of SNP and gene probe, we have way too many combinations to do this one by one (54,675 expression probes times 733,202 SNPs = 40 Billion models). We can undertake eQTL analysis for large datasets using specialized software such as the [MatrixEQTL](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/) package. This software can restrict to local (cis) comparisons, to only include the SNPs that are nearby an expression probe. If you are interested in learning more about this sort of analysis, follow the link to the package. Also check out [Jeff Leek's tutorial on eQTL analysis using MatrixEQTL](http://jtleek.com/genstats/inst/doc/04_10_eQTL.html).

## Exercises

1. Fill in the blanks using 3 of the 5 options listed below:
An eQTL analysis usually involves treating ____________ as covariates and fitting a _______ model, where the outcome is _____________.

> OPTIONS: gene expression values, indels, SNPs, normal, linear


2. In the [publication](https://pubmed.ncbi.nlm.nih.gov/28759004/) associated with the data used in this demo, the authors demonstrated an eQTL relationship between the SNP rs3219090 and the expression of the PARP1 gene (which is measured by probe 208644_at). Can you reproduce this finding?

Fit a linear model for the following probe and SNP:
  - Probe: *208644_at* (this corresponds to PARP1)
  - SNP: *rs3219090*

From your results, what is the p-value for the SNP of interest? What does that imply about the association with the gene of interest?

Also make a similar plot as above for this SNP/gene pair, and compare it with the published result in [Figure 1a](https://www.nature.com/articles/ng.3927/figures/1).

> HINT: You might have to remove a sample with an 'NA' genotype

Try on your own before peeking at the code chunk below

```{r}
# YOUR CODE HERE
```

```{r, eval = FALSE, echo = FALSE}
df <- data.frame(
    exprs = exprsdat[rownames(exprsdat) == "208644_at",] %>% as.numeric(),
    genotype = genodat.numeric[rownames(genodat.numeric) == "rs3219090",] %>% as.numeric()
    ) %>%
  drop_na()

lmfit = lm(exprs ~ genotype, data = df)
summary(lmfit)

plot(df$exprs ~ jitter(df$genotype),
     col=(df$genotype + 1), xaxt="n",
     xlab="Genotype", ylab="Expression", main = "PARP1")
axis(1, at=c(0:2), labels=c("AA", "AB", "BB"))
lines(lmfit$fitted ~ df$genotype, type="b", pch=15, col="black")
```