Contributors: Jasleen Grewal and Keegan Korthauer
By the end of this tutorial, you should be able to - Identify the goal of an eQTL analysis - Explain the importance of log transforming the expression data in an eQTL analysis - List the basic steps of an eQTL analysis - Undertake a linear regression analysis for an eQTL study
We will be using the following packages today. Follow instructions from previous tutorials to install these if you don’t have them already.
library(GEOquery)
library(Biobase)
library(tidyverse)
theme_set(theme_bw())
library(knitr)
library(kableExtra)
As its name suggests (albeit in a roundabout way) Expression Quantitative Trait Loci analysis, or eQTL analysis, aims to identify associations between gene expression and genetic variation. eQTL analysis is equivalent to a linear regression problem where we are fitting our covariates (genetic variation data such as genotype) to explain our outcome, the observed changes in gene expression.
Today, we will be working with genotype data (SNPs) and gene expression values (counts) from a recently released dataset of melanoma cell lines. In this study, authors were interested in identifying genes whose expression levels are correlated with common variants in melanomas.
In this seminar, we are going to perform cis-eQTL analysis for 59 melanoma cell line samples, hence we will only be concerned with variants within 1 Mb (1e6 base-pairs) away from the Transcription Start Site (TSS) of each gene, where the vast majority of such associations are found. The gene expression data we will be using is from GEO accession GSE78995, and the genotype data will be from GEO accession GSE99193. These two datasets are from the same study, and can also be jointly downloaded from the accession GSE99221.
The datasets we are going to use are:
-
GSE78995, which contains expression data (54675 features, i.e. rows), for 59 melanoma cell line samples. Note that the expression data is measured by Affymetrix Human Genome U133 Plus 2.0 Array, and has already been normalized with RMA (robust microarray average) which involves a log2 transformation as well.
-
GSE99193, which contains genotype data (733202 SNPs, i.e. rows) for 67 melanoma cell lines, measured by the Illumina Infinium human OmniExpress array
First, let’s retrieve our datasets from GEO with getGEO
from
GEOquery
package. Note: this step may take a few minutes!
We will extract our data and metadata from the objects we get from GEO. We will also subset our genotype data to only include cell lines for which we have expression data. And, critically make sure that the samples are in the same order in both the expression data and SNP data.
# If you run into an issue while downloading the following data
Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 5)
exprs_dat <- getGEO('GSE78995', getGPL = FALSE)
show(exprs_dat) ## 59 samples
## $GSE78995_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54675 features, 59 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM2083256 GSM2083257 ... GSM2083314 (59 total)
## varLabels: title geo_accession ... tissue:ch1 (34 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## pubMedIds: 28759004
## Annotation: GPL570
genotype_dat <- getGEO('GSE99193', getGPL = FALSE)
show(genotype_dat) ## 67 samples
## $GSE99193_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 733202 features, 67 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM2635493 GSM2635494 ... GSM2635559 (67 total)
## varLabels: title geo_accession ... tissue:ch1 (36 total)
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## pubMedIds: 28759004
## Annotation: GPL13135
# Extract expression matrices (turn into data frames at once)
exprsdat <- as.data.frame(exprs(exprs_dat[[1]]))
genodat <- as.data.frame(exprs(genotype_dat[[1]]))
# Obtain the meta-data for the samples
exprs.meta <- pData(phenoData(exprs_dat[[1]]))
geno.meta <- pData(phenoData(genotype_dat[[1]]))
# Get the cell line labels for both types of data
exprs.meta$cell_line = gsub(".*UACC) melanoma ", "", exprs.meta$title)
geno.meta$cell_line = gsub(".*UACC ", "", geno.meta$title)
# Subset genotype data to keep cases overlapping with expression data
geno.meta.clean = geno.meta[geno.meta$cell_line %in% exprs.meta$cell_line,]
genodat.clean = genodat[,geno.meta.clean$geo_accession]
# reorder genotype data to match expression data (CRITICAL STEP)
x <- match(exprs.meta$cell_line, geno.meta.clean$cell_line)
geno.meta.clean <- geno.meta.clean[x,]
genodat.clean <- genodat.clean[,x]
Let us update the column names so that the samples are defined by the cell line, and not by the GSM IDs (the latter are specific to the submission sample and data type, so will not overlap between the SNP and expression data).
colnames(exprsdat) = exprs.meta[exprs.meta$geo_accession == colnames(exprsdat),
c("cell_line")]
colnames(genodat.clean) = geno.meta.clean[geno.meta.clean$geo_accession == colnames(genodat.clean),
c("cell_line")]
head(exprsdat) %>% kable() %>% kable_styling()
cell line 1093 |
cell line 1097 |
cell line 1113 |
cell line 1118 |
cell line 1120 |
cell line 1190 |
cell line 1237 |
cell line 1241 |
cell line 1265 |
cell line 1308 |
cell line 1316 |
cell line 1389 |
cell line 1451 |
cell line 1509 |
cell line 1592 |
cell line 1649 |
cell line 1729 |
cell line 1816 |
cell line 1940 |
cell line 2331 |
cell line 2404 |
cell line 2427 |
cell line 2496 |
cell line 2512 |
cell line 2528 |
cell line 2534 |
cell line 2545 |
cell line 257 |
cell line 2610 |
cell line 2641 |
cell line 2659 |
cell line 2837 |
cell line 285 |
cell line 2851 |
cell line 2876 |
cell line 2972 |
cell line 2993 |
cell line 2994 |
cell line 3074 |
cell line 3090 |
cell line 3093 |
cell line 3211 |
cell line 3291 |
cell line 3312 |
cell line 3337 |
cell line 3361 |
cell line 3379 |
cell line 3395 |
cell line 383 |
cell line 457 |
cell line 502 |
cell line 558 |
cell line 612 |
cell line 62 |
cell line 827 |
cell line 903 |
cell line 91 |
cell line 929 |
cell line 952 |
|
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1007_s_at |
8.638036 |
9.569502 |
8.888012 |
8.997548 |
8.178419 |
9.859501 |
9.899436 |
8.958769 |
9.981231 |
9.221468 |
9.122782 |
10.083248 |
9.527601 |
9.352663 |
10.391349 |
9.354233 |
8.488048 |
8.700209 |
8.724595 |
11.229636 |
9.484369 |
8.242250 |
7.333083 |
9.041875 |
9.019678 |
10.563816 |
7.535070 |
8.589149 |
8.666304 |
10.145057 |
11.024141 |
9.821123 |
9.266064 |
9.292188 |
8.882724 |
10.528187 |
8.468762 |
9.566095 |
8.948761 |
9.549295 |
9.317864 |
8.611362 |
9.429071 |
9.057645 |
10.177045 |
8.626289 |
9.654319 |
9.946739 |
9.475134 |
10.056217 |
8.908891 |
9.289914 |
8.095805 |
9.475180 |
9.815099 |
9.532179 |
9.156111 |
8.128645 |
8.760415 |
1053_at |
6.511082 |
6.892053 |
8.098704 |
7.318858 |
8.104367 |
6.097134 |
8.022223 |
7.538899 |
7.382145 |
6.880304 |
7.634117 |
7.819483 |
7.079510 |
6.808387 |
6.723331 |
7.317578 |
6.516934 |
7.164454 |
7.694599 |
7.490678 |
6.644928 |
7.748965 |
7.904377 |
7.353012 |
7.516102 |
6.507049 |
6.705707 |
7.289198 |
8.208734 |
7.020544 |
6.743984 |
7.272101 |
8.113513 |
7.254123 |
7.835481 |
6.879287 |
8.003466 |
7.152403 |
7.180104 |
9.171832 |
6.407972 |
6.962079 |
7.729556 |
7.253038 |
7.756037 |
7.754817 |
7.812666 |
7.568846 |
8.156207 |
7.818405 |
7.735215 |
6.696510 |
8.567920 |
6.729650 |
7.401750 |
7.134959 |
8.330609 |
7.626840 |
7.060035 |
117_at |
4.752166 |
3.686001 |
4.510261 |
5.359558 |
4.185034 |
4.959753 |
3.914128 |
3.913718 |
4.688048 |
4.955991 |
3.910185 |
4.336408 |
4.490547 |
4.775587 |
4.089210 |
4.427098 |
3.797345 |
4.715663 |
5.148917 |
3.828924 |
3.713938 |
6.269204 |
3.730159 |
4.536183 |
4.103633 |
4.191025 |
4.603317 |
4.881486 |
4.873805 |
3.955486 |
4.291414 |
4.266460 |
4.735696 |
4.414866 |
3.681968 |
4.723171 |
4.283983 |
4.178583 |
3.934642 |
6.120346 |
4.916537 |
4.229331 |
5.039042 |
4.035960 |
5.249714 |
5.744336 |
3.888232 |
4.437059 |
4.400191 |
4.417484 |
4.033891 |
4.179866 |
3.844249 |
4.551650 |
4.900716 |
4.330514 |
4.796757 |
6.677047 |
3.989919 |
121_at |
7.280487 |
6.960408 |
7.167364 |
7.196955 |
6.918797 |
7.088626 |
6.761861 |
7.450579 |
7.065914 |
6.822917 |
6.759347 |
7.096096 |
7.640705 |
7.068972 |
6.526180 |
6.708067 |
7.109364 |
7.210015 |
6.337690 |
7.372803 |
6.780501 |
7.556065 |
7.471491 |
6.992312 |
7.629332 |
7.217674 |
7.154515 |
6.836663 |
7.323042 |
7.231007 |
7.002526 |
6.777624 |
7.242775 |
7.603851 |
6.653988 |
7.324272 |
6.940816 |
6.624049 |
6.879138 |
7.187866 |
7.087906 |
6.828678 |
6.994734 |
6.815146 |
7.243580 |
7.250607 |
6.636049 |
6.977882 |
6.795325 |
7.071736 |
7.133528 |
6.642631 |
6.685315 |
6.757389 |
6.831284 |
7.469521 |
6.715281 |
7.027035 |
7.648609 |
1255_g_at |
2.259491 |
2.371965 |
2.283499 |
2.186442 |
2.455879 |
2.296577 |
2.333483 |
2.229012 |
2.468397 |
2.530879 |
2.300941 |
2.558433 |
2.378130 |
2.354706 |
2.276999 |
2.288279 |
2.305138 |
2.326907 |
2.256450 |
2.436730 |
2.426354 |
2.398398 |
2.214872 |
2.288419 |
2.201610 |
2.594016 |
2.590422 |
2.349674 |
2.370590 |
2.183665 |
2.458047 |
2.335073 |
2.265239 |
2.175599 |
2.254122 |
2.452293 |
2.496869 |
2.209466 |
2.486782 |
2.334564 |
2.888266 |
2.285354 |
2.257714 |
2.269391 |
2.325922 |
2.352795 |
2.402064 |
2.307650 |
2.339831 |
2.344689 |
2.345543 |
2.279138 |
2.111807 |
2.485640 |
2.263299 |
2.255623 |
2.328615 |
2.374157 |
2.480321 |
1294_at |
5.551776 |
5.444600 |
6.927450 |
4.722336 |
5.955897 |
6.184568 |
7.191890 |
6.336652 |
6.620428 |
7.708175 |
6.677614 |
6.157961 |
6.710778 |
7.012948 |
6.223337 |
6.707393 |
7.905926 |
4.735737 |
5.643904 |
7.199911 |
7.790508 |
5.140462 |
6.090006 |
7.240944 |
7.357003 |
6.679847 |
4.981623 |
5.547245 |
6.045961 |
6.541726 |
6.511393 |
8.520112 |
6.335275 |
7.653637 |
6.733077 |
5.250051 |
7.434904 |
6.133121 |
6.782157 |
6.521575 |
6.669553 |
6.330201 |
6.020764 |
5.926245 |
5.954995 |
6.999129 |
5.529070 |
6.338303 |
6.457639 |
5.197230 |
7.151159 |
6.776002 |
6.211269 |
6.069893 |
5.939545 |
6.155424 |
4.762139 |
5.274493 |
7.095326 |
head(genodat.clean) %>% kable() %>% kable_styling()
cell line 1093 |
cell line 1097 |
cell line 1113 |
cell line 1118 |
cell line 1120 |
cell line 1190 |
cell line 1237 |
cell line 1241 |
cell line 1265 |
cell line 1308 |
cell line 1316 |
cell line 1389 |
cell line 1451 |
cell line 1509 |
cell line 1592 |
cell line 1649 |
cell line 1729 |
cell line 1816 |
cell line 1940 |
cell line 2331 |
cell line 2404 |
cell line 2427 |
cell line 2496 |
cell line 2512 |
cell line 2528 |
cell line 2534 |
cell line 2545 |
cell line 257 |
cell line 2610 |
cell line 2641 |
cell line 2659 |
cell line 2837 |
cell line 285 |
cell line 2851 |
cell line 2876 |
cell line 2972 |
cell line 2993 |
cell line 2994 |
cell line 3074 |
cell line 3090 |
cell line 3093 |
cell line 3211 |
cell line 3291 |
cell line 3312 |
cell line 3337 |
cell line 3361 |
cell line 3379 |
cell line 3395 |
cell line 383 |
cell line 457 |
cell line 502 |
cell line 558 |
cell line 612 |
cell line 62 |
cell line 827 |
cell line 903 |
cell line 91 |
cell line 929 |
cell line 952 |
|
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
rs1000000 |
AB |
BB |
BB |
BB |
BB |
BB |
AA |
AB |
AA |
AA |
BB |
AB |
AB |
BB |
AB |
BB |
AB |
BB |
AB |
BB |
BB |
BB |
BB |
BB |
BB |
AB |
BB |
BB |
BB |
BB |
BB |
AB |
AB |
BB |
AB |
BB |
BB |
AA |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
AB |
BB |
BB |
BB |
AA |
BB |
BB |
BB |
BB |
BB |
BB |
AA |
AB |
rs1000002 |
BB |
BB |
BB |
AA |
AB |
NC |
AB |
AB |
BB |
AA |
AA |
AB |
AB |
AB |
AB |
AB |
AA |
AA |
AB |
AB |
AB |
BB |
BB |
AB |
AB |
AB |
BB |
AA |
AB |
AA |
BB |
BB |
AA |
BB |
BB |
BB |
AB |
AA |
AA |
AA |
AB |
AB |
AA |
AA |
AA |
BB |
BB |
AA |
BB |
AB |
AB |
AA |
BB |
AB |
BB |
BB |
BB |
AB |
AB |
rs10000023 |
AA |
AB |
AA |
AB |
AB |
BB |
AA |
BB |
AA |
AB |
AB |
BB |
AB |
AB |
AB |
AB |
AA |
AA |
AA |
AA |
AB |
BB |
BB |
AB |
BB |
AA |
AA |
AA |
BB |
BB |
BB |
AA |
AA |
BB |
AA |
AA |
AB |
AA |
AA |
AA |
AB |
AA |
AA |
AB |
AA |
AB |
NC |
AA |
BB |
AA |
AB |
AB |
AA |
AA |
AB |
BB |
BB |
AB |
AA |
rs1000003 |
AA |
AA |
AA |
AA |
AA |
AB |
AB |
AB |
AA |
AB |
AA |
AA |
AA |
BB |
AB |
AA |
AA |
AA |
AB |
AA |
AA |
AA |
AA |
AB |
AA |
AB |
AA |
AA |
AA |
AA |
AB |
AA |
AB |
AA |
AA |
AB |
AA |
AA |
AB |
BB |
AA |
AB |
AA |
AA |
AB |
AA |
AB |
BB |
AA |
AA |
BB |
AA |
AA |
AA |
AA |
AA |
AA |
AA |
AA |
rs10000030 |
BB |
BB |
BB |
BB |
BB |
BB |
NC |
BB |
BB |
BB |
BB |
BB |
NC |
BB |
BB |
BB |
BB |
NC |
BB |
BB |
BB |
AB |
BB |
BB |
BB |
AB |
BB |
NC |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
AB |
BB |
NC |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
AB |
NC |
BB |
AB |
BB |
BB |
BB |
BB |
rs10000037 |
BB |
AB |
BB |
AB |
BB |
BB |
AA |
BB |
BB |
BB |
BB |
AB |
AB |
AA |
BB |
AA |
AB |
BB |
BB |
BB |
BB |
AB |
BB |
AA |
AB |
AA |
BB |
BB |
BB |
BB |
BB |
BB |
BB |
AB |
BB |
AB |
BB |
AA |
BB |
AB |
BB |
BB |
AB |
AA |
BB |
BB |
BB |
AA |
BB |
BB |
BB |
AB |
BB |
BB |
BB |
BB |
AB |
AB |
BB |
Exercise: Pause and Review - Consider the two dataframes we have -
exprsdat
, and genodat.clean
. How many columns are in each dataframe?
How many rows? Are the columns the features, or the samples? - What
type(s) of data are we using? Is the SNP data numeric?
# YOUR CODE HERE
As a first step, we will check for any missing values in the data.
Hint: we can use the
complete.cases
function to test if there are any rows that contain a missing value.
# For our expression data
# Are the number of rows without any NA value, the same as total number of rows?
dim(exprsdat) == dim(exprsdat[complete.cases(exprsdat),])
## [1] TRUE TRUE
# For our genotype data
# Are the number of rows without any NA value, the same as total number of rows?
dim(genodat.clean) == dim(genodat.clean[complete.cases(genodat.clean),])
## [1] TRUE TRUE
Secondly, we will convert our SNP data to numeric. We will do so by ‘converting’ our genotype levels to numeric. The genotypes in our dataset are defined as ‘AA’, ‘BB’, ‘AB’, ‘NC’. (NC = No Call).
Exercise: Pause and Review - Which of these genotype labels is/are homozygous? - Which of these is/are heterozygous?
We will be setting the ‘NC’ values to NA, and AA, AB, and BB to 0, 1, and 2 respectively (representing the average count of the alternative allele, B, in the allele pair).
# AA = 0, AB = 1, BB = 2
genodat.numeric = genodat.clean
genodat.numeric[genodat.numeric == "NC"] = NA
genodat.numeric = data.frame(sapply(genodat.numeric, function(x) as.numeric(as.factor(x)) - 1))
rownames(genodat.numeric) = rownames(genodat.clean)
Before we go ahead and attempt to fit a linear model to our samples, let us review the assumptions of a linear model. A linear model assumes that the samples being fit are independent and with a constant variance.
- Are our samples independent? - Do our values have constant variance?
Count data from RNA-seq experiments generally do not meet the assumption of variance-mean independence. Our data here is microarray data that has been normalized and log2 transformed. This helps stabilize the variance of the data by reducing the ‘range’ of variability in cases where the gene expression can range from very high to very low. We can evaluate the assumptions in our data by plotting the mean vs variance of every gene in the dataset, as follows:
probe_means = rowMeans(exprsdat)
probe_vars = apply(exprsdat, 1 , sd)
data.frame(Mean = probe_means, Variance = probe_vars) %>%
ggplot(aes(x = Mean, y = Variance)) +
geom_point(alpha = 0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Hmmm, do you think this looks symmetric, or is there an association between mean and variance?
Now let us look at the data on the original scale by undoing the log2 transformation.
exprsdat_raw = 2^(exprsdat)
probe_means = rowMeans(exprsdat_raw)
probe_vars = apply(exprsdat_raw, 1 , sd)
data.frame(Mean = probe_means, Variance = probe_vars) %>%
ggplot(aes(x = Mean, y = Variance)) +
geom_point(alpha = 0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Compare this to our previous plot. Clearly there is a strong dependence between mean and variance.
For the subsequent analysis, we will be using the log transformed expression data as downloaded from GEO to fit our linear models.
Exercise: Pause and Review For RNA-seq data, you can also use the ‘rlog’ transformation using the rlog function in DESeq2, which can be used for differential expression analysis. How might this transformation be better than the log2 transformation? (i.e. what factors does it account for that log2 transform might not?)
In a simple test for linear association between a SNP and a particular
gene’s expression, we can fit a linear regression model for a given
gene/SNP pair. We’ll do this by carrying out the following steps: -
Let’s pick a random gene probe from the exprsdat
dataframe (Need to
consider: Are the rows genes? Or are the columns genes?) - Pick a random
SNP from the genodat.numeric
dataframe - Fit a linear model for the
gene/SNP data
set.seed(10)
random_gene_ix = sample(nrow(exprsdat), 1)
random_snp_ix = sample(nrow(genodat.numeric), 1)
#What random gene and SNP did we pick?
print(paste("My random gene probe from the expression data is", rownames(exprsdat)[random_gene_ix]))
## [1] "My random gene probe from the expression data is 210722_at"
print(paste("My random SNP from the genotype data is", rownames(genodat.numeric)[random_snp_ix]))
## [1] "My random SNP from the genotype data is rs7727450"
exprs_random = slice(exprsdat,random_gene_ix) %>% as.numeric()
snp_random = slice(genodat.numeric,random_snp_ix) %>% as.numeric()
lm_random = lm(exprs_random ~ snp_random)
summary(lm_random)
##
## Call:
## lm(formula = exprs_random ~ snp_random)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3722 -0.1079 -0.0183 0.1241 0.5143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.59736 0.04272 60.807 <2e-16 ***
## snp_random 0.01019 0.03143 0.324 0.747
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1922 on 57 degrees of freedom
## Multiple R-squared: 0.001839, Adjusted R-squared: -0.01567
## F-statistic: 0.105 on 1 and 57 DF, p-value: 0.7471
How do we interpret this?
Let us try to visualize the expression values (across our 59 cell lines), for our random probe. We will colour the data points by the genotype of the SNP (across our 59 cell lines) that we are testing for association with our probe.
plot(exprs_random ~ jitter(snp_random), col=(snp_random + 1), xaxt="n",
xlab="Genotype", ylab="Expression") ; axis(1, at=c(0:2),
labels=c("AA", "AB","BB"))
Now let us overlay this plot with the model we have fit.
plot(exprs_random ~ jitter(snp_random), col=(snp_random + 1), xaxt="n",
xlab="Genotype", ylab="Expression")
axis(1, at=c(0:2), labels=c("AA", "AB","BB"))
lines(lm_random$fitted ~ snp_random, type="b", pch=15, col="black")
What if we want to fit multiple eQTL models. If we fit a model for every pairwise combination of SNP and gene probe, we have way too many combinations to do this one by one (54,675 expression probes times 733,202 SNPs = 40 Billion models). We can undertake eQTL analysis for large datasets using specialized software such as the MatrixEQTL package. This software can restrict to local (cis) comparisons, to only include the SNPs that are nearby an expression probe. If you are interested in learning more about this sort of analysis, follow the link to the package. Also check out Jeff Leek’s tutorial on eQTL analysis using MatrixEQTL.
- Fill in the blanks using 3 of the 5 options listed below: An eQTL analysis usually involves treating ____________ as covariates and fitting a _______ model, where the outcome is _____________.
OPTIONS: gene expression values, indels, SNPs, normal, linear
- In the publication associated with the data used in this demo, the authors demonstrated an eQTL relationship between the SNP rs3219090 and the expression of the PARP1 gene (which is measured by probe 208644_at). Can you reproduce this finding?
Fit a linear model for the following probe and SNP: - Probe: 208644_at (this corresponds to PARP1) - SNP: rs3219090
From your results, what is the p-value for the SNP of interest? What does that imply about the association with the gene of interest?
Also make a similar plot as above for this SNP/gene pair, and compare it with the published result in Figure 1a.
HINT: You might have to remove a sample with an ‘NA’ genotype
Try on your own before peeking at the code chunk below
# YOUR CODE HERE