Skip to content

Latest commit

 

History

History
5519 lines (2869 loc) · 53.1 KB

sm11_eQTL.md

File metadata and controls

5519 lines (2869 loc) · 53.1 KB

STAT 540 - Seminar 11: eQTL analysis

Acknowledgements

Contributors: Jasleen Grewal and Keegan Korthauer

Learning objectives

By the end of this tutorial, you should be able to - Identify the goal of an eQTL analysis - Explain the importance of log transforming the expression data in an eQTL analysis - List the basic steps of an eQTL analysis - Undertake a linear regression analysis for an eQTL study

Dependencies

We will be using the following packages today. Follow instructions from previous tutorials to install these if you don’t have them already.

library(GEOquery)
library(Biobase)
library(tidyverse)
theme_set(theme_bw())
library(knitr)
library(kableExtra)

Introduction

As its name suggests (albeit in a roundabout way) Expression Quantitative Trait Loci analysis, or eQTL analysis, aims to identify associations between gene expression and genetic variation. eQTL analysis is equivalent to a linear regression problem where we are fitting our covariates (genetic variation data such as genotype) to explain our outcome, the observed changes in gene expression.

Today, we will be working with genotype data (SNPs) and gene expression values (counts) from a recently released dataset of melanoma cell lines. In this study, authors were interested in identifying genes whose expression levels are correlated with common variants in melanomas.

Explore GSE99221

In this seminar, we are going to perform cis-eQTL analysis for 59 melanoma cell line samples, hence we will only be concerned with variants within 1 Mb (1e6 base-pairs) away from the Transcription Start Site (TSS) of each gene, where the vast majority of such associations are found. The gene expression data we will be using is from GEO accession GSE78995, and the genotype data will be from GEO accession GSE99193. These two datasets are from the same study, and can also be jointly downloaded from the accession GSE99221.

The datasets we are going to use are:

  • GSE78995, which contains expression data (54675 features, i.e. rows), for 59 melanoma cell line samples. Note that the expression data is measured by Affymetrix Human Genome U133 Plus 2.0 Array, and has already been normalized with RMA (robust microarray average) which involves a log2 transformation as well.

  • GSE99193, which contains genotype data (733202 SNPs, i.e. rows) for 67 melanoma cell lines, measured by the Illumina Infinium human OmniExpress array

First, let’s retrieve our datasets from GEO with getGEO from GEOquery package. Note: this step may take a few minutes!

We will extract our data and metadata from the objects we get from GEO. We will also subset our genotype data to only include cell lines for which we have expression data. And, critically make sure that the samples are in the same order in both the expression data and SNP data.

# If you run into an issue while downloading the following data
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 5)
exprs_dat <- getGEO('GSE78995', getGPL = FALSE)
show(exprs_dat) ## 59 samples
## $GSE78995_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54675 features, 59 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM2083256 GSM2083257 ... GSM2083314 (59 total)
##   varLabels: title geo_accession ... tissue:ch1 (34 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 28759004 
## Annotation: GPL570
genotype_dat <- getGEO('GSE99193', getGPL = FALSE)
show(genotype_dat) ## 67 samples
## $GSE99193_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 733202 features, 67 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM2635493 GSM2635494 ... GSM2635559 (67 total)
##   varLabels: title geo_accession ... tissue:ch1 (36 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 28759004 
## Annotation: GPL13135
# Extract expression matrices (turn into data frames at once)
exprsdat <- as.data.frame(exprs(exprs_dat[[1]]))
genodat <- as.data.frame(exprs(genotype_dat[[1]]))

# Obtain the meta-data for the samples
exprs.meta <- pData(phenoData(exprs_dat[[1]]))
geno.meta <- pData(phenoData(genotype_dat[[1]]))

# Get the cell line labels for both types of data
exprs.meta$cell_line = gsub(".*UACC) melanoma ", "", exprs.meta$title)
geno.meta$cell_line = gsub(".*UACC ", "", geno.meta$title)

# Subset genotype data to keep cases overlapping with expression data
geno.meta.clean = geno.meta[geno.meta$cell_line %in% exprs.meta$cell_line,]
genodat.clean = genodat[,geno.meta.clean$geo_accession]

# reorder genotype data to match expression data (CRITICAL STEP)
x <- match(exprs.meta$cell_line, geno.meta.clean$cell_line)
geno.meta.clean <- geno.meta.clean[x,]
genodat.clean <- genodat.clean[,x]

Let us update the column names so that the samples are defined by the cell line, and not by the GSM IDs (the latter are specific to the submission sample and data type, so will not overlap between the SNP and expression data).

colnames(exprsdat) = exprs.meta[exprs.meta$geo_accession == colnames(exprsdat),
                                c("cell_line")]
colnames(genodat.clean) = geno.meta.clean[geno.meta.clean$geo_accession == colnames(genodat.clean),
                                          c("cell_line")]
head(exprsdat) %>% kable() %>% kable_styling()

cell line 1093

cell line 1097

cell line 1113

cell line 1118

cell line 1120

cell line 1190

cell line 1237

cell line 1241

cell line 1265

cell line 1308

cell line 1316

cell line 1389

cell line 1451

cell line 1509

cell line 1592

cell line 1649

cell line 1729

cell line 1816

cell line 1940

cell line 2331

cell line 2404

cell line 2427

cell line 2496

cell line 2512

cell line 2528

cell line 2534

cell line 2545

cell line 257

cell line 2610

cell line 2641

cell line 2659

cell line 2837

cell line 285

cell line 2851

cell line 2876

cell line 2972

cell line 2993

cell line 2994

cell line 3074

cell line 3090

cell line 3093

cell line 3211

cell line 3291

cell line 3312

cell line 3337

cell line 3361

cell line 3379

cell line 3395

cell line 383

cell line 457

cell line 502

cell line 558

cell line 612

cell line 62

cell line 827

cell line 903

cell line 91

cell line 929

cell line 952

1007_s_at

8.638036

9.569502

8.888012

8.997548

8.178419

9.859501

9.899436

8.958769

9.981231

9.221468

9.122782

10.083248

9.527601

9.352663

10.391349

9.354233

8.488048

8.700209

8.724595

11.229636

9.484369

8.242250

7.333083

9.041875

9.019678

10.563816

7.535070

8.589149

8.666304

10.145057

11.024141

9.821123

9.266064

9.292188

8.882724

10.528187

8.468762

9.566095

8.948761

9.549295

9.317864

8.611362

9.429071

9.057645

10.177045

8.626289

9.654319

9.946739

9.475134

10.056217

8.908891

9.289914

8.095805

9.475180

9.815099

9.532179

9.156111

8.128645

8.760415

1053_at

6.511082

6.892053

8.098704

7.318858

8.104367

6.097134

8.022223

7.538899

7.382145

6.880304

7.634117

7.819483

7.079510

6.808387

6.723331

7.317578

6.516934

7.164454

7.694599

7.490678

6.644928

7.748965

7.904377

7.353012

7.516102

6.507049

6.705707

7.289198

8.208734

7.020544

6.743984

7.272101

8.113513

7.254123

7.835481

6.879287

8.003466

7.152403

7.180104

9.171832

6.407972

6.962079

7.729556

7.253038

7.756037

7.754817

7.812666

7.568846

8.156207

7.818405

7.735215

6.696510

8.567920

6.729650

7.401750

7.134959

8.330609

7.626840

7.060035

117_at

4.752166

3.686001

4.510261

5.359558

4.185034

4.959753

3.914128

3.913718

4.688048

4.955991

3.910185

4.336408

4.490547

4.775587

4.089210

4.427098

3.797345

4.715663

5.148917

3.828924

3.713938

6.269204

3.730159

4.536183

4.103633

4.191025

4.603317

4.881486

4.873805

3.955486

4.291414

4.266460

4.735696

4.414866

3.681968

4.723171

4.283983

4.178583

3.934642

6.120346

4.916537

4.229331

5.039042

4.035960

5.249714

5.744336

3.888232

4.437059

4.400191

4.417484

4.033891

4.179866

3.844249

4.551650

4.900716

4.330514

4.796757

6.677047

3.989919

121_at

7.280487

6.960408

7.167364

7.196955

6.918797

7.088626

6.761861

7.450579

7.065914

6.822917

6.759347

7.096096

7.640705

7.068972

6.526180

6.708067

7.109364

7.210015

6.337690

7.372803

6.780501

7.556065

7.471491

6.992312

7.629332

7.217674

7.154515

6.836663

7.323042

7.231007

7.002526

6.777624

7.242775

7.603851

6.653988

7.324272

6.940816

6.624049

6.879138

7.187866

7.087906

6.828678

6.994734

6.815146

7.243580

7.250607

6.636049

6.977882

6.795325

7.071736

7.133528

6.642631

6.685315

6.757389

6.831284

7.469521

6.715281

7.027035

7.648609

1255_g_at

2.259491

2.371965

2.283499

2.186442

2.455879

2.296577

2.333483

2.229012

2.468397

2.530879

2.300941

2.558433

2.378130

2.354706

2.276999

2.288279

2.305138

2.326907

2.256450

2.436730

2.426354

2.398398

2.214872

2.288419

2.201610

2.594016

2.590422

2.349674

2.370590

2.183665

2.458047

2.335073

2.265239

2.175599

2.254122

2.452293

2.496869

2.209466

2.486782

2.334564

2.888266

2.285354

2.257714

2.269391

2.325922

2.352795

2.402064

2.307650

2.339831

2.344689

2.345543

2.279138

2.111807

2.485640

2.263299

2.255623

2.328615

2.374157

2.480321

1294_at

5.551776

5.444600

6.927450

4.722336

5.955897

6.184568

7.191890

6.336652

6.620428

7.708175

6.677614

6.157961

6.710778

7.012948

6.223337

6.707393

7.905926

4.735737

5.643904

7.199911

7.790508

5.140462

6.090006

7.240944

7.357003

6.679847

4.981623

5.547245

6.045961

6.541726

6.511393

8.520112

6.335275

7.653637

6.733077

5.250051

7.434904

6.133121

6.782157

6.521575

6.669553

6.330201

6.020764

5.926245

5.954995

6.999129

5.529070

6.338303

6.457639

5.197230

7.151159

6.776002

6.211269

6.069893

5.939545

6.155424

4.762139

5.274493

7.095326

head(genodat.clean) %>% kable() %>% kable_styling()

cell line 1093

cell line 1097

cell line 1113

cell line 1118

cell line 1120

cell line 1190

cell line 1237

cell line 1241

cell line 1265

cell line 1308

cell line 1316

cell line 1389

cell line 1451

cell line 1509

cell line 1592

cell line 1649

cell line 1729

cell line 1816

cell line 1940

cell line 2331

cell line 2404

cell line 2427

cell line 2496

cell line 2512

cell line 2528

cell line 2534

cell line 2545

cell line 257

cell line 2610

cell line 2641

cell line 2659

cell line 2837

cell line 285

cell line 2851

cell line 2876

cell line 2972

cell line 2993

cell line 2994

cell line 3074

cell line 3090

cell line 3093

cell line 3211

cell line 3291

cell line 3312

cell line 3337

cell line 3361

cell line 3379

cell line 3395

cell line 383

cell line 457

cell line 502

cell line 558

cell line 612

cell line 62

cell line 827

cell line 903

cell line 91

cell line 929

cell line 952

rs1000000

AB

BB

BB

BB

BB

BB

AA

AB

AA

AA

BB

AB

AB

BB

AB

BB

AB

BB

AB

BB

BB

BB

BB

BB

BB

AB

BB

BB

BB

BB

BB

AB

AB

BB

AB

BB

BB

AA

BB

BB

BB

BB

BB

BB

BB

BB

AB

BB

BB

BB

AA

BB

BB

BB

BB

BB

BB

AA

AB

rs1000002

BB

BB

BB

AA

AB

NC

AB

AB

BB

AA

AA

AB

AB

AB

AB

AB

AA

AA

AB

AB

AB

BB

BB

AB

AB

AB

BB

AA

AB

AA

BB

BB

AA

BB

BB

BB

AB

AA

AA

AA

AB

AB

AA

AA

AA

BB

BB

AA

BB

AB

AB

AA

BB

AB

BB

BB

BB

AB

AB

rs10000023

AA

AB

AA

AB

AB

BB

AA

BB

AA

AB

AB

BB

AB

AB

AB

AB

AA

AA

AA

AA

AB

BB

BB

AB

BB

AA

AA

AA

BB

BB

BB

AA

AA

BB

AA

AA

AB

AA

AA

AA

AB

AA

AA

AB

AA

AB

NC

AA

BB

AA

AB

AB

AA

AA

AB

BB

BB

AB

AA

rs1000003

AA

AA

AA

AA

AA

AB

AB

AB

AA

AB

AA

AA

AA

BB

AB

AA

AA

AA

AB

AA

AA

AA

AA

AB

AA

AB

AA

AA

AA

AA

AB

AA

AB

AA

AA

AB

AA

AA

AB

BB

AA

AB

AA

AA

AB

AA

AB

BB

AA

AA

BB

AA

AA

AA

AA

AA

AA

AA

AA

rs10000030

BB

BB

BB

BB

BB

BB

NC

BB

BB

BB

BB

BB

NC

BB

BB

BB

BB

NC

BB

BB

BB

AB

BB

BB

BB

AB

BB

NC

BB

BB

BB

BB

BB

BB

BB

BB

BB

AB

BB

NC

BB

BB

BB

BB

BB

BB

BB

BB

BB

BB

BB

AB

NC

BB

AB

BB

BB

BB

BB

rs10000037

BB

AB

BB

AB

BB

BB

AA

BB

BB

BB

BB

AB

AB

AA

BB

AA

AB

BB

BB

BB

BB

AB

BB

AA

AB

AA

BB

BB

BB

BB

BB

BB

BB

AB

BB

AB

BB

AA

BB

AB

BB

BB

AB

AA

BB

BB

BB

AA

BB

BB

BB

AB

BB

BB

BB

BB

AB

AB

BB

Exercise: Pause and Review - Consider the two dataframes we have - exprsdat, and genodat.clean. How many columns are in each dataframe? How many rows? Are the columns the features, or the samples? - What type(s) of data are we using? Is the SNP data numeric?

# YOUR CODE HERE

Preprocess Data

Check for missing values

As a first step, we will check for any missing values in the data.

Hint: we can use the complete.cases function to test if there are any rows that contain a missing value.

# For our expression data
# Are the number of rows without any NA value, the same as total number of rows?
dim(exprsdat) == dim(exprsdat[complete.cases(exprsdat),])
## [1] TRUE TRUE
# For our genotype data
# Are the number of rows without any NA value, the same as total number of rows?
dim(genodat.clean) == dim(genodat.clean[complete.cases(genodat.clean),])
## [1] TRUE TRUE

Map SNP data to numeric

Secondly, we will convert our SNP data to numeric. We will do so by ‘converting’ our genotype levels to numeric. The genotypes in our dataset are defined as ‘AA’, ‘BB’, ‘AB’, ‘NC’. (NC = No Call).

Exercise: Pause and Review - Which of these genotype labels is/are homozygous? - Which of these is/are heterozygous?

We will be setting the ‘NC’ values to NA, and AA, AB, and BB to 0, 1, and 2 respectively (representing the average count of the alternative allele, B, in the allele pair).

# AA = 0, AB = 1, BB = 2
genodat.numeric = genodat.clean
genodat.numeric[genodat.numeric == "NC"] = NA
genodat.numeric = data.frame(sapply(genodat.numeric, function(x) as.numeric(as.factor(x)) - 1))
rownames(genodat.numeric) = rownames(genodat.clean)

Transformation of expression data

Before we go ahead and attempt to fit a linear model to our samples, let us review the assumptions of a linear model. A linear model assumes that the samples being fit are independent and with a constant variance.

  • Are our samples independent? - Do our values have constant variance?

Count data from RNA-seq experiments generally do not meet the assumption of variance-mean independence. Our data here is microarray data that has been normalized and log2 transformed. This helps stabilize the variance of the data by reducing the ‘range’ of variability in cases where the gene expression can range from very high to very low. We can evaluate the assumptions in our data by plotting the mean vs variance of every gene in the dataset, as follows:

probe_means = rowMeans(exprsdat)
probe_vars = apply(exprsdat, 1 , sd)

data.frame(Mean = probe_means, Variance = probe_vars) %>%
  ggplot(aes(x = Mean, y = Variance)) +
  geom_point(alpha = 0.1) +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Hmmm, do you think this looks symmetric, or is there an association between mean and variance?

Now let us look at the data on the original scale by undoing the log2 transformation.

exprsdat_raw = 2^(exprsdat)
probe_means = rowMeans(exprsdat_raw)
probe_vars = apply(exprsdat_raw, 1 , sd)

data.frame(Mean = probe_means, Variance = probe_vars) %>%
  ggplot(aes(x = Mean, y = Variance)) +
  geom_point(alpha = 0.1) +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Compare this to our previous plot. Clearly there is a strong dependence between mean and variance.

For the subsequent analysis, we will be using the log transformed expression data as downloaded from GEO to fit our linear models.

Exercise: Pause and Review For RNA-seq data, you can also use the ‘rlog’ transformation using the rlog function in DESeq2, which can be used for differential expression analysis. How might this transformation be better than the log2 transformation? (i.e. what factors does it account for that log2 transform might not?)

Fitting a single linear model

In a simple test for linear association between a SNP and a particular gene’s expression, we can fit a linear regression model for a given gene/SNP pair. We’ll do this by carrying out the following steps: - Let’s pick a random gene probe from the exprsdat dataframe (Need to consider: Are the rows genes? Or are the columns genes?) - Pick a random SNP from the genodat.numeric dataframe - Fit a linear model for the gene/SNP data

set.seed(10)

random_gene_ix = sample(nrow(exprsdat), 1)
random_snp_ix = sample(nrow(genodat.numeric), 1)

#What random gene and SNP did we pick?
print(paste("My random gene probe from the expression data is", rownames(exprsdat)[random_gene_ix]))
## [1] "My random gene probe from the expression data is 210722_at"
print(paste("My random SNP from the genotype data is", rownames(genodat.numeric)[random_snp_ix]))
## [1] "My random SNP from the genotype data is rs7727450"
exprs_random = slice(exprsdat,random_gene_ix) %>% as.numeric()
snp_random = slice(genodat.numeric,random_snp_ix) %>% as.numeric()
lm_random = lm(exprs_random ~ snp_random)

summary(lm_random)
## 
## Call:
## lm(formula = exprs_random ~ snp_random)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3722 -0.1079 -0.0183  0.1241  0.5143 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.59736    0.04272  60.807   <2e-16 ***
## snp_random   0.01019    0.03143   0.324    0.747    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1922 on 57 degrees of freedom
## Multiple R-squared:  0.001839,   Adjusted R-squared:  -0.01567 
## F-statistic: 0.105 on 1 and 57 DF,  p-value: 0.7471

How do we interpret this?

Let us try to visualize the expression values (across our 59 cell lines), for our random probe. We will colour the data points by the genotype of the SNP (across our 59 cell lines) that we are testing for association with our probe.

plot(exprs_random ~ jitter(snp_random), col=(snp_random + 1), xaxt="n",
     xlab="Genotype", ylab="Expression") ; axis(1, at=c(0:2),
                                                labels=c("AA", "AB","BB"))

Now let us overlay this plot with the model we have fit.

plot(exprs_random ~ jitter(snp_random), col=(snp_random + 1), xaxt="n",
     xlab="Genotype", ylab="Expression")
axis(1, at=c(0:2), labels=c("AA", "AB","BB"))
lines(lm_random$fitted ~ snp_random, type="b", pch=15, col="black")

Fitting multiple eQTL models

What if we want to fit multiple eQTL models. If we fit a model for every pairwise combination of SNP and gene probe, we have way too many combinations to do this one by one (54,675 expression probes times 733,202 SNPs = 40 Billion models). We can undertake eQTL analysis for large datasets using specialized software such as the MatrixEQTL package. This software can restrict to local (cis) comparisons, to only include the SNPs that are nearby an expression probe. If you are interested in learning more about this sort of analysis, follow the link to the package. Also check out Jeff Leek’s tutorial on eQTL analysis using MatrixEQTL.

Exercises

  1. Fill in the blanks using 3 of the 5 options listed below: An eQTL analysis usually involves treating ____________ as covariates and fitting a _______ model, where the outcome is _____________.

OPTIONS: gene expression values, indels, SNPs, normal, linear

  1. In the publication associated with the data used in this demo, the authors demonstrated an eQTL relationship between the SNP rs3219090 and the expression of the PARP1 gene (which is measured by probe 208644_at). Can you reproduce this finding?

Fit a linear model for the following probe and SNP: - Probe: 208644_at (this corresponds to PARP1) - SNP: rs3219090

From your results, what is the p-value for the SNP of interest? What does that imply about the association with the gene of interest?

Also make a similar plot as above for this SNP/gene pair, and compare it with the published result in Figure 1a.

HINT: You might have to remove a sample with an ‘NA’ genotype

Try on your own before peeking at the code chunk below

# YOUR CODE HERE