STAT 540 - Seminar 11: eQTL analysis

Acknowledgements

Contributors: Jasleen Grewal and Keegan Korthauer

Learning objectives

By the end of this tutorial, you should be able to - Identify the goal of an eQTL analysis - Explain the importance of log transforming the expression data in an eQTL analysis - List the basic steps of an eQTL analysis - Undertake a linear regression analysis for an eQTL study

Dependencies

We will be using the following packages today. Follow instructions from previous tutorials to install these if you don’t have them already.

library(GEOquery)
library(Biobase)
library(tidyverse)
theme_set(theme_bw())
library(knitr)
library(kableExtra)

Introduction

As its name suggests (albeit in a roundabout way) Expression Quantitative Trait Loci analysis, or eQTL analysis, aims to identify associations between gene expression and genetic variation. eQTL analysis is equivalent to a linear regression problem where we are fitting our covariates (genetic variation data such as genotype) to explain our outcome, the observed changes in gene expression.

Today, we will be working with genotype data (SNPs) and gene expression values (counts) from a recently released dataset of melanoma cell lines. In this study, authors were interested in identifying genes whose expression levels are correlated with common variants in melanomas.

Explore GSE99221

In this seminar, we are going to perform cis-eQTL analysis for 59 melanoma cell line samples, hence we will only be concerned with variants within 1 Mb (1e6 base-pairs) away from the Transcription Start Site (TSS) of each gene, where the vast majority of such associations are found. The gene expression data we will be using is from GEO accession GSE78995, and the genotype data will be from GEO accession GSE99193. These two datasets are from the same study, and can also be jointly downloaded from the accession GSE99221.

The datasets we are going to use are:

GSE78995, which contains expression data (54675 features, i.e. rows), for 59 melanoma cell line samples. Note that the expression data is measured by Affymetrix Human Genome U133 Plus 2.0 Array, and has already been normalized with RMA (robust microarray average) which involves a log2 transformation as well.
GSE99193, which contains genotype data (733202 SNPs, i.e. rows) for 67 melanoma cell lines, measured by the Illumina Infinium human OmniExpress array

First, let’s retrieve our datasets from GEO with getGEO from GEOquery package. Note: this step may take a few minutes!

We will extract our data and metadata from the objects we get from GEO. We will also subset our genotype data to only include cell lines for which we have expression data. And, critically make sure that the samples are in the same order in both the expression data and SNP data.

# If you run into an issue while downloading the following data
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 5)

exprs_dat <- getGEO('GSE78995', getGPL = FALSE)
show(exprs_dat) ## 59 samples

## $GSE78995_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54675 features, 59 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM2083256 GSM2083257 ... GSM2083314 (59 total)
##   varLabels: title geo_accession ... tissue:ch1 (34 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 28759004 
## Annotation: GPL570

genotype_dat <- getGEO('GSE99193', getGPL = FALSE)
show(genotype_dat) ## 67 samples

## $GSE99193_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 733202 features, 67 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM2635493 GSM2635494 ... GSM2635559 (67 total)
##   varLabels: title geo_accession ... tissue:ch1 (36 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 28759004 
## Annotation: GPL13135

# Extract expression matrices (turn into data frames at once)
exprsdat <- as.data.frame(exprs(exprs_dat[[1]]))
genodat <- as.data.frame(exprs(genotype_dat[[1]]))

# Obtain the meta-data for the samples
exprs.meta <- pData(phenoData(exprs_dat[[1]]))
geno.meta <- pData(phenoData(genotype_dat[[1]]))

# Get the cell line labels for both types of data
exprs.meta$cell_line = gsub(".*UACC) melanoma ", "", exprs.meta$title)
geno.meta$cell_line = gsub(".*UACC ", "", geno.meta$title)

# Subset genotype data to keep cases overlapping with expression data
geno.meta.clean = geno.meta[geno.meta$cell_line %in% exprs.meta$cell_line,]
genodat.clean = genodat[,geno.meta.clean$geo_accession]

# reorder genotype data to match expression data (CRITICAL STEP)
x <- match(exprs.meta$cell_line, geno.meta.clean$cell_line)
geno.meta.clean <- geno.meta.clean[x,]
genodat.clean <- genodat.clean[,x]

Let us update the column names so that the samples are defined by the cell line, and not by the GSM IDs (the latter are specific to the submission sample and data type, so will not overlap between the SNP and expression data).

colnames(exprsdat) = exprs.meta[exprs.meta$geo_accession == colnames(exprsdat),
                                c("cell_line")]
colnames(genodat.clean) = geno.meta.clean[geno.meta.clean$geo_accession == colnames(genodat.clean),
                                          c("cell_line")]
head(exprsdat) %>% kable() %>% kable_styling()

	cell line 1093	cell line 1097	cell line 1113	cell line 1118	cell line 1120	cell line 1190	cell line 1237	cell line 1241	cell line 1265	cell line 1308	cell line 1316	cell line 1389	cell line 1451	cell line 1509	cell line 1592	cell line 1649	cell line 1729	cell line 1816	cell line 1940	cell line 2331	cell line 2404	cell line 2427	cell line 2496	cell line 2512	cell line 2528	cell line 2534	cell line 2545	cell line 257	cell line 2610	cell line 2641	cell line 2659	cell line 2837	cell line 285	cell line 2851	cell line 2876	cell line 2972	cell line 2993	cell line 2994	cell line 3074	cell line 3090	cell line 3093	cell line 3211	cell line 3291	cell line 3312	cell line 3337	cell line 3361	cell line 3379	cell line 3395	cell line 383	cell line 457	cell line 502	cell line 558	cell line 612	cell line 62	cell line 827	cell line 903	cell line 91	cell line 929	cell line 952
1007_s_at	8.638036	9.569502	8.888012	8.997548	8.178419	9.859501	9.899436	8.958769	9.981231	9.221468	9.122782	10.083248	9.527601	9.352663	10.391349	9.354233	8.488048	8.700209	8.724595	11.229636	9.484369	8.242250	7.333083	9.041875	9.019678	10.563816	7.535070	8.589149	8.666304	10.145057	11.024141	9.821123	9.266064	9.292188	8.882724	10.528187	8.468762	9.566095	8.948761	9.549295	9.317864	8.611362	9.429071	9.057645	10.177045	8.626289	9.654319	9.946739	9.475134	10.056217	8.908891	9.289914	8.095805	9.475180	9.815099	9.532179	9.156111	8.128645	8.760415
1053_at	6.511082	6.892053	8.098704	7.318858	8.104367	6.097134	8.022223	7.538899	7.382145	6.880304	7.634117	7.819483	7.079510	6.808387	6.723331	7.317578	6.516934	7.164454	7.694599	7.490678	6.644928	7.748965	7.904377	7.353012	7.516102	6.507049	6.705707	7.289198	8.208734	7.020544	6.743984	7.272101	8.113513	7.254123	7.835481	6.879287	8.003466	7.152403	7.180104	9.171832	6.407972	6.962079	7.729556	7.253038	7.756037	7.754817	7.812666	7.568846	8.156207	7.818405	7.735215	6.696510	8.567920	6.729650	7.401750	7.134959	8.330609	7.626840	7.060035
117_at	4.752166	3.686001	4.510261	5.359558	4.185034	4.959753	3.914128	3.913718	4.688048	4.955991	3.910185	4.336408	4.490547	4.775587	4.089210	4.427098	3.797345	4.715663	5.148917	3.828924	3.713938	6.269204	3.730159	4.536183	4.103633	4.191025	4.603317	4.881486	4.873805	3.955486	4.291414	4.266460	4.735696	4.414866	3.681968	4.723171	4.283983	4.178583	3.934642	6.120346	4.916537	4.229331	5.039042	4.035960	5.249714	5.744336	3.888232	4.437059	4.400191	4.417484	4.033891	4.179866	3.844249	4.551650	4.900716	4.330514	4.796757	6.677047	3.989919
121_at	7.280487	6.960408	7.167364	7.196955	6.918797	7.088626	6.761861	7.450579	7.065914	6.822917	6.759347	7.096096	7.640705	7.068972	6.526180	6.708067	7.109364	7.210015	6.337690	7.372803	6.780501	7.556065	7.471491	6.992312	7.629332	7.217674	7.154515	6.836663	7.323042	7.231007	7.002526	6.777624	7.242775	7.603851	6.653988	7.324272	6.940816	6.624049	6.879138	7.187866	7.087906	6.828678	6.994734	6.815146	7.243580	7.250607	6.636049	6.977882	6.795325	7.071736	7.133528	6.642631	6.685315	6.757389	6.831284	7.469521	6.715281	7.027035	7.648609
1255_g_at	2.259491	2.371965	2.283499	2.186442	2.455879	2.296577	2.333483	2.229012	2.468397	2.530879	2.300941	2.558433	2.378130	2.354706	2.276999	2.288279	2.305138	2.326907	2.256450	2.436730	2.426354	2.398398	2.214872	2.288419	2.201610	2.594016	2.590422	2.349674	2.370590	2.183665	2.458047	2.335073	2.265239	2.175599	2.254122	2.452293	2.496869	2.209466	2.486782	2.334564	2.888266	2.285354	2.257714	2.269391	2.325922	2.352795	2.402064	2.307650	2.339831	2.344689	2.345543	2.279138	2.111807	2.485640	2.263299	2.255623	2.328615	2.374157	2.480321
1294_at	5.551776	5.444600	6.927450	4.722336	5.955897	6.184568	7.191890	6.336652	6.620428	7.708175	6.677614	6.157961	6.710778	7.012948	6.223337	6.707393	7.905926	4.735737	5.643904	7.199911	7.790508	5.140462	6.090006	7.240944	7.357003	6.679847	4.981623	5.547245	6.045961	6.541726	6.511393	8.520112	6.335275	7.653637	6.733077	5.250051	7.434904	6.133121	6.782157	6.521575	6.669553	6.330201	6.020764	5.926245	5.954995	6.999129	5.529070	6.338303	6.457639	5.197230	7.151159	6.776002	6.211269	6.069893	5.939545	6.155424	4.762139	5.274493	7.095326

head(genodat.clean) %>% kable() %>% kable_styling()

	cell line 1093	cell line 1097	cell line 1113	cell line 1118	cell line 1120	cell line 1190	cell line 1237	cell line 1241	cell line 1265	cell line 1308	cell line 1316	cell line 1389	cell line 1451	cell line 1509	cell line 1592	cell line 1649	cell line 1729	cell line 1816	cell line 1940	cell line 2331	cell line 2404	cell line 2427	cell line 2496	cell line 2512	cell line 2528	cell line 2534	cell line 2545	cell line 257	cell line 2610	cell line 2641	cell line 2659	cell line 2837	cell line 285	cell line 2851	cell line 2876	cell line 2972	cell line 2993	cell line 2994	cell line 3074	cell line 3090	cell line 3093	cell line 3211	cell line 3291	cell line 3312	cell line 3337	cell line 3361	cell line 3379	cell line 3395	cell line 383	cell line 457	cell line 502	cell line 558	cell line 612	cell line 62	cell line 827	cell line 903	cell line 91	cell line 929	cell line 952
rs1000000	AB	BB	BB	BB	BB	BB	AA	AB	AA	AA	BB	AB	AB	BB	AB	BB	AB	BB	AB	BB	BB	BB	BB	BB	BB	AB	BB	BB	BB	BB	BB	AB	AB	BB	AB	BB	BB	AA	BB	BB	BB	BB	BB	BB	BB	BB	AB	BB	BB	BB	AA	BB	BB	BB	BB	BB	BB	AA	AB
rs1000002	BB	BB	BB	AA	AB	NC	AB	AB	BB	AA	AA	AB	AB	AB	AB	AB	AA	AA	AB	AB	AB	BB	BB	AB	AB	AB	BB	AA	AB	AA	BB	BB	AA	BB	BB	BB	AB	AA	AA	AA	AB	AB	AA	AA	AA	BB	BB	AA	BB	AB	AB	AA	BB	AB	BB	BB	BB	AB	AB
rs10000023	AA	AB	AA	AB	AB	BB	AA	BB	AA	AB	AB	BB	AB	AB	AB	AB	AA	AA	AA	AA	AB	BB	BB	AB	BB	AA	AA	AA	BB	BB	BB	AA	AA	BB	AA	AA	AB	AA	AA	AA	AB	AA	AA	AB	AA	AB	NC	AA	BB	AA	AB	AB	AA	AA	AB	BB	BB	AB	AA
rs1000003	AA	AA	AA	AA	AA	AB	AB	AB	AA	AB	AA	AA	AA	BB	AB	AA	AA	AA	AB	AA	AA	AA	AA	AB	AA	AB	AA	AA	AA	AA	AB	AA	AB	AA	AA	AB	AA	AA	AB	BB	AA	AB	AA	AA	AB	AA	AB	BB	AA	AA	BB	AA	AA	AA	AA	AA	AA	AA	AA
rs10000030	BB	BB	BB	BB	BB	BB	NC	BB	BB	BB	BB	BB	NC	BB	BB	BB	BB	NC	BB	BB	BB	AB	BB	BB	BB	AB	BB	NC	BB	BB	BB	BB	BB	BB	BB	BB	BB	AB	BB	NC	BB	BB	BB	BB	BB	BB	BB	BB	BB	BB	BB	AB	NC	BB	AB	BB	BB	BB	BB
rs10000037	BB	AB	BB	AB	BB	BB	AA	BB	BB	BB	BB	AB	AB	AA	BB	AA	AB	BB	BB	BB	BB	AB	BB	AA	AB	AA	BB	BB	BB	BB	BB	BB	BB	AB	BB	AB	BB	AA	BB	AB	BB	BB	AB	AA	BB	BB	BB	AA	BB	BB	BB	AB	BB	BB	BB	BB	AB	AB	BB

Exercise: Pause and Review - Consider the two dataframes we have - exprsdat, and genodat.clean. How many columns are in each dataframe? How many rows? Are the columns the features, or the samples? - What type(s) of data are we using? Is the SNP data numeric?

# YOUR CODE HERE

Preprocess Data

Check for missing values

As a first step, we will check for any missing values in the data.

Hint: we can use the complete.cases function to test if there are any rows that contain a missing value.

# For our expression data
# Are the number of rows without any NA value, the same as total number of rows?
dim(exprsdat) == dim(exprsdat[complete.cases(exprsdat),])

## [1] TRUE TRUE

# For our genotype data
# Are the number of rows without any NA value, the same as total number of rows?
dim(genodat.clean) == dim(genodat.clean[complete.cases(genodat.clean),])

## [1] TRUE TRUE

Map SNP data to numeric

Secondly, we will convert our SNP data to numeric. We will do so by ‘converting’ our genotype levels to numeric. The genotypes in our dataset are defined as ‘AA’, ‘BB’, ‘AB’, ‘NC’. (NC = No Call).

Exercise: Pause and Review - Which of these genotype labels is/are homozygous? - Which of these is/are heterozygous?

We will be setting the ‘NC’ values to NA, and AA, AB, and BB to 0, 1, and 2 respectively (representing the average count of the alternative allele, B, in the allele pair).

# AA = 0, AB = 1, BB = 2
genodat.numeric = genodat.clean
genodat.numeric[genodat.numeric == "NC"] = NA
genodat.numeric = data.frame(sapply(genodat.numeric, function(x) as.numeric(as.factor(x)) - 1))
rownames(genodat.numeric) = rownames(genodat.clean)

Transformation of expression data

Before we go ahead and attempt to fit a linear model to our samples, let us review the assumptions of a linear model. A linear model assumes that the samples being fit are independent and with a constant variance.

Are our samples independent? - Do our values have constant variance?

Count data from RNA-seq experiments generally do not meet the assumption of variance-mean independence. Our data here is microarray data that has been normalized and log2 transformed. This helps stabilize the variance of the data by reducing the ‘range’ of variability in cases where the gene expression can range from very high to very low. We can evaluate the assumptions in our data by plotting the mean vs variance of every gene in the dataset, as follows:

probe_means = rowMeans(exprsdat)
probe_vars = apply(exprsdat, 1 , sd)

data.frame(Mean = probe_means, Variance = probe_vars) %>%
  ggplot(aes(x = Mean, y = Variance)) +
  geom_point(alpha = 0.1) +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Hmmm, do you think this looks symmetric, or is there an association between mean and variance?

Now let us look at the data on the original scale by undoing the log2 transformation.

exprsdat_raw = 2^(exprsdat)
probe_means = rowMeans(exprsdat_raw)
probe_vars = apply(exprsdat_raw, 1 , sd)

data.frame(Mean = probe_means, Variance = probe_vars) %>%
  ggplot(aes(x = Mean, y = Variance)) +
  geom_point(alpha = 0.1) +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Compare this to our previous plot. Clearly there is a strong dependence between mean and variance.

For the subsequent analysis, we will be using the log transformed expression data as downloaded from GEO to fit our linear models.

Exercise: Pause and Review For RNA-seq data, you can also use the ‘rlog’ transformation using the rlog function in DESeq2, which can be used for differential expression analysis. How might this transformation be better than the log2 transformation? (i.e. what factors does it account for that log2 transform might not?)

Fitting a single linear model

In a simple test for linear association between a SNP and a particular gene’s expression, we can fit a linear regression model for a given gene/SNP pair. We’ll do this by carrying out the following steps: - Let’s pick a random gene probe from the exprsdat dataframe (Need to consider: Are the rows genes? Or are the columns genes?) - Pick a random SNP from the genodat.numeric dataframe - Fit a linear model for the gene/SNP data

set.seed(10)

random_gene_ix = sample(nrow(exprsdat), 1)
random_snp_ix = sample(nrow(genodat.numeric), 1)

#What random gene and SNP did we pick?
print(paste("My random gene probe from the expression data is", rownames(exprsdat)[random_gene_ix]))

## [1] "My random gene probe from the expression data is 210722_at"

print(paste("My random SNP from the genotype data is", rownames(genodat.numeric)[random_snp_ix]))

## [1] "My random SNP from the genotype data is rs7727450"

exprs_random = slice(exprsdat,random_gene_ix) %>% as.numeric()
snp_random = slice(genodat.numeric,random_snp_ix) %>% as.numeric()
lm_random = lm(exprs_random ~ snp_random)

summary(lm_random)

## 
## Call:
## lm(formula = exprs_random ~ snp_random)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3722 -0.1079 -0.0183  0.1241  0.5143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.59736    0.04272  60.807   <2e-16 ***
## snp_random   0.01019    0.03143   0.324    0.747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1922 on 57 degrees of freedom
## Multiple R-squared:  0.001839,   Adjusted R-squared:  -0.01567 
## F-statistic: 0.105 on 1 and 57 DF,  p-value: 0.7471

How do we interpret this?

Let us try to visualize the expression values (across our 59 cell lines), for our random probe. We will colour the data points by the genotype of the SNP (across our 59 cell lines) that we are testing for association with our probe.

plot(exprs_random ~ jitter(snp_random), col=(snp_random + 1), xaxt="n",
     xlab="Genotype", ylab="Expression") ; axis(1, at=c(0:2),
                                                labels=c("AA", "AB","BB"))

Now let us overlay this plot with the model we have fit.

plot(exprs_random ~ jitter(snp_random), col=(snp_random + 1), xaxt="n",
     xlab="Genotype", ylab="Expression")
axis(1, at=c(0:2), labels=c("AA", "AB","BB"))
lines(lm_random$fitted ~ snp_random, type="b", pch=15, col="black")

Fitting multiple eQTL models

What if we want to fit multiple eQTL models. If we fit a model for every pairwise combination of SNP and gene probe, we have way too many combinations to do this one by one (54,675 expression probes times 733,202 SNPs = 40 Billion models). We can undertake eQTL analysis for large datasets using specialized software such as the MatrixEQTL package. This software can restrict to local (cis) comparisons, to only include the SNPs that are nearby an expression probe. If you are interested in learning more about this sort of analysis, follow the link to the package. Also check out Jeff Leek’s tutorial on eQTL analysis using MatrixEQTL.

Exercises

Fill in the blanks using 3 of the 5 options listed below: An eQTL analysis usually involves treating ____________ as covariates and fitting a _______ model, where the outcome is _____________.

OPTIONS: gene expression values, indels, SNPs, normal, linear

In the publication associated with the data used in this demo, the authors demonstrated an eQTL relationship between the SNP rs3219090 and the expression of the PARP1 gene (which is measured by probe 208644_at). Can you reproduce this finding?

Fit a linear model for the following probe and SNP: - Probe: 208644_at (this corresponds to PARP1) - SNP: rs3219090

From your results, what is the p-value for the SNP of interest? What does that imply about the association with the gene of interest?

Also make a similar plot as above for this SNP/gene pair, and compare it with the published result in Figure 1a.

HINT: You might have to remove a sample with an ‘NA’ genotype

Try on your own before peeking at the code chunk below

# YOUR CODE HERE

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sm11_eQTL.md

sm11_eQTL.md

STAT 540 - Seminar 11: eQTL analysis

Acknowledgements

Learning objectives

Dependencies

Introduction

Explore GSE99221

Preprocess Data

Check for missing values

Map SNP data to numeric

Transformation of expression data

Fitting a single linear model

Fitting multiple eQTL models

Exercises

Files

sm11_eQTL.md

Latest commit

History

sm11_eQTL.md

File metadata and controls

STAT 540 - Seminar 11: eQTL analysis

Acknowledgements

Learning objectives

Dependencies

Introduction

Explore GSE99221

Preprocess Data

Check for missing values

Map SNP data to numeric

Transformation of expression data

Fitting a single linear model

Fitting multiple eQTL models

Exercises