-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathsm04_data_and_dplyr.Rmd
397 lines (270 loc) · 17.8 KB
/
sm04_data_and_dplyr.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
---
output: github_document
---
# STAT 540 - Seminar 4: Reading in data and manipulating with dplyr
# Objectives
The objectives for this lecture will be to
- Understand that some freely available genomic, transcriptomic and proteomic data can be accessed through the [Gene Expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/)
- Explore and manipulate data using `dplyr` and `tidyr` verbs
- Use `dplyr` verbs in conjunction with `ggplot2` to visualize aspects of the data
# Packages required
All of the packages you will need are listed below. If you have never used them before, you will need to install them using the commented lines above the library() command.
```{r, results='hide', message=FALSE, warning=FALSE}
library(GEOquery)
library(biomaRt)
library(tidyverse)
```
# Part 1 - Accessing data using GEOquery
A variety of freely available gene expression data is available through the Gene Expression Omnibus (GEO) server. Most of these datasets have associated papers in which they detail data acquisition and analysis methods.
To simplify things for its users, GEO has four basic entities that act as containers for different types of data. The four main types are:
**GSM** - stores data associated with a single sample, and additional info about how the data was collected
**GSE** - stores information about each sample, as well as overall experiment info
**GPL** - stores platform info (i.e the machine used to collect the data)
**GDS** - stores curated matrices that are GSM objects in an "analysis-ready" format
In this seminar, we'll read in a **GDS**, and next seminar we'll work with a **GSE**.
The first thing we are going to do is download a dataset from the Gene Expression Omnibus (GEO) repository using the GEOquery package. This experiment is exploring gene expression differences between renal cell carcinoma cells (RCC) and adjacent normal cells using an Affymetric array. We are going to download data in the GDS format, as it is already in a nice table for us. Note: you can download any type of GEO data you want using the getGEO function.
```{r}
gds <- getGEO("GDS507")
#we can use str() to peak at the structure of a data object.
str(gds)
```
You can see that the GDS object is quite complex, and has many different slots in which to put data. For example, our GDS file has a slot for information about the machine (@GPL), meta data and actual gene expression (@data.table) and information about the experiment and its authors (@header). You could work with this and find the relevant data/info, but we'll convert it to a more recognizable object using the `GEOquery::GDS2eSet()` function. This function converts the GDS data structure above to an `ExpressionSet` object.
```{r}
eset <- GDS2eSet(gds)
eset
```
This format is really handy for downstream analyses, and we'll dissect its contents in much more detail in next seminar. But for now, we'll simply pull out the expression data matrix, sample metadata, and feature metadata into separate objects - don't worry about the details of this next chunk just yet.
```{r}
# get sample metadata with the pData() function
meta_data <- pData(eset)
# get expression data with the exprs() function
expr_mat <- exprs(eset)
# get feature info (probe/gene id)
feat_data <- featureData(eset)
```
# Part 2 - Exploring a gene expression dataset
Let's peak at the data to see its structure using `head()`. This gives us the first few rows of the dataset.
```{r}
head(expr_mat)
dim(expr_mat)
dim(meta_data)
dim(feat_data)
```
In our expression matrix, the row names correspond to probe IDs. The columns contain expression values for the 17 samples. In summary, we have an array with dimensions 22645 x 17 (row x column). The sample metadata has 17 rows - one for each sample. And the feature meta data has 22645 rows - one for each probe.
Taking a look at our sample and feature metadata, we can see we have information about each sample's disease state, and that we have a lot of annotation about each of our probes (Various Gene symbols/IDs/Accessions, Chromosome, and functional processes).
```{r}
head(meta_data)
colnames(feat_data)
# how many unique gene symbols are there?
length(unique(feat_data$'Gene symbol'))
```
Note that some gene names are duplicated, because there are multiple probes that map to the same gene. We will deal with this later!
Now we can start exploring the dataset a bit. Just for fun - let's compute the average count in each sample.
We will do this using a function called `apply()` in base R.
```{r}
#We exclude the first and second columns because they hold the probe and gene names, respectively.
apply(expr_mat, 2, median)
```
`apply()` is useful, but it is limited in that it can only operate on rows, columns, or individual elements of a dataframe directly. More complex operations get cumbersome.
# Part 3: Manipulations with `dplyr` and `tidyr`
One more versatile set of tools are the **`dplyr` verbs**. These are a set of functions designed for easy manipulation of data.
They are:
* **`%>%`** - Syntactic sugar for easily piping the result of one function into another.
* **`dplyr::group_by()`** - Commonly used with summarize() to derive summarized values for multiple rows sharing certain attributes
* **`dplyr::summarize()`** - summarize certain statistics from the data (i.e mean, median, mode, number of samples)
**`dplyr::filter()`** - extract rows that meet certain criteria from data frame
**`dplyr::select()`** - extract columns that meet certain criteria from data frame
**`dplyr::mutate()`** - add a new column
**`dplyr::arrange()`** - arrange the data in descending or ascending order
**`dplyr::left_join()`**, **`dplyr::right_join()`**, etc - a set of methods to combine two tidy datasets, roughly corresponding to typical notions of database joins, see the [join page](https://dplyr.tidyverse.org/reference/mutate-joins.html) of the tidyverse reference for more information
**`tidyr::pivot_longer()`** - "lengthens" data by collapsing several columns into two
Most, if not all, of these operations are available in the `data.table` package, albeit in a less readable syntax. This package was developed to quickly read, write, and manipulate large amounts of data. If you plan to work with large sets of features, it may be helpful to consider learning this framework as well. See the [Introduction to data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) vignette for more information. The [`dtplyr` package](https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/) is a package that unifies `dplyr` syntax with a `data.table` backend.
An important thing to know about the `dplyr` verbs (and `data.table` for that matter) is that they were developed to work efficiently on data frames that meet certain structural criteria.In data science, we call this "tidy" data. For a data to be called it must meet the following criteria:
1. Each column is a variable
2. Each row is an unique observational unit
Let's look at a few small datasets that are "tidy".
```{r}
head(iris) #data describing flower parts for several species
head(band_members) #Members of the Beatles and Rolling Stones
head(band_instruments) #Instruments of the above band members
```
The iris dataset contains information about certain species of flowers.
As you can see, each variable has its own column, and each row is an instance of that variable. There are no rownames. We can now use dplyr verbs (such as `summarize()`, `mutate()`, `filter()`, etc) to manipulate the data.
These verbs can be used together in a sequence of functions with the "pipe" operator. R will interpret the output of the previous function as the input to the subsequent function when you put the "pipe" operator (`%>%`) in between the functions. It is a syntactic sugar that simply takes the object on its left and feeds it into the function call on its right. This way, you can "pipe" an input through sequence of function calls, where output of the previous function call is fed into the next function call as its input. This makes for much more readable code.
For example, this:
foo0(input)
becomes this:
input %>%
foo0()
and this:
foo2(foo1(foo0(input)))
becomes this:
input %>%
foo0() %>%
foo1() %>%
foo2()
To read more about pipes click [here](http://r4ds.had.co.nz/pipes.html).
Take a look through the following examples to check out common manipulations.
```{r}
mpg %>%
group_by(class) %>%
summarise(fuel_efficiency = mean(hwy))
```
```{r}
#select all rows with sepal length greater than 5.
iris %>%
filter(Sepal.Length > 5) %>%
head()
#group all rows of the same species together.
iris %>%
group_by(Species) %>%
head()
#select the column called "Sepal.Width"
iris %>%
dplyr::select(Sepal.Width) %>%
head()
#create another column with the species name capitalized.
iris %>%
group_by(Species) %>%
mutate(Capitalized_names = toupper(Species)) %>%
head()
#summarize the average sepal length and number of rows belonging to each species.
iris %>%
group_by(Species) %>%
summarize(average_sepal_length = mean(Sepal.Length), n = n()) %>%
head()
#arrange the species in alphabetical order
iris %>%
arrange(desc(Species)) %>%
head()
#join band members with their instruments by "name"
band_members %>% left_join(band_instruments)
band_members %>% right_join(band_instruments)
band_members %>% full_join(band_instruments)
```
## Back to our gene expression dataset
Now let's apply these functions to our gene expression dataset!
One problem: our dataset is not "tidy". Rather, it's arranged like an excel spreadsheet. While intuitive for us to read, dplyr does not like this very much. So, we have to change it so that each row contains only one expression value. Luckily, the `pivot_longer()` function from the [`tidyr` package](https://raw.githubusercontent.com/rstudio/cheatsheets/main/tidyr.pdf) helps out with that. This function "lengthens" data by collapsing several columns into two. Column names move to a new `names_to` column and values to a new `values_to` column. Our data starts out with each sample in a separate column, so we will collapse all of those into two columns: one for the expression values themselves and one for the sample name.
We'll first convert our expression matrix to a data.frame add the probe IDs as their own column, since the `pivot_longer` expects a data.frame or tibble and will not keep information stored in row names.
```{r}
expr_data <- expr_mat %>%
as.data.frame() %>%
rownames_to_column(var = "ID")
long_data <- pivot_longer(expr_data,
cols = 2:ncol(expr_data),
values_to = "Expression",
names_to = "sample")
long_data
```
You can see that the first ~20,000 rows will correspond to data from the first probe identifier, and the next group of rows will correspond to data from the second probe identifier. You can think of this function as stretching out a dataset into its long form, so that each row has only one expression value. If that's not clear, I would suggest reading [this](https://tidyr.tidyverse.org/reference/pivot_longer.html) for more information about what the `pivot_longer()` function does.
Now we have three columns, each one corresponding to a variable: the probe name, the sample name and the expression value.
We can do a lot of stuff with this setup! Let's calculate the mean gene expression per sample.
```{r}
long_data %>%
group_by(sample) %>%
summarize(mean = mean(Expression))
```
## Summarizing by gene
Right now our gene expression data frame has probe IDs. But we would like to know which gene each probe maps to. We saw above that we already have this information in the feature metadata. Let's add it to the long expression data object we just created using a join operation.
```{r}
(long_data_annot <- long_data %>%
left_join(feat_data@data %>% select(ID, 'Gene symbol'),
by = "ID"))
```
Another thing we note is that there are multiple probes that map to a specific gene. In a real life analysis workflow, there are multiple ways to deal with this. Some popular options include picking the probe with the highest expression, or taking the mean/median of all probes' expression. For simplicity, we will use `summarize()` to take the mean of each probe's expression.
```{r}
(long_data_gene <- long_data_annot %>%
group_by(sample, `Gene symbol`) %>%
summarize(Expression = mean(Expression)))
```
Now, every gene will only have one value per sample.
## Adding annotations using `biomaRt`
Let's explore another way to annotate our data with extra information about genes. This time we'll show how to retrieve annotation that isn't already stored in our object.
The `biomaRt` package is very useful in this regard. It can access the ensembl database (or other databases) of gene names and annotations (ensembl.org). `biomaRt` can help us convert ensemble ids (e.g. ENSGXXXXX) into HGNC symbols (e.g. BRCA1), for example, along with a host of other things.
Say we want to learn more about the gene expression on a particular chromosome, across all samples (this info actually happens to already be in our object, but for illustration let's pretend we don't know that). We can use `biomaRt` to look up the chromosomal location of each gene. Read the [`biomaRt` vignettes and manual](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) for more detailed explanation of the following bit of code.
```{r, biomart}
#open connection between biomaRt and R.
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
#function that takes in data frame, and outputs same data frame with associated chromosome annotations.
identify_gene_names <- function(df){
names(df) <- c("sample", "hgnc_symbol", "Expression")
names <- getBM( attributes=c("hgnc_symbol", "chromosome_name"),
filters= "hgnc_symbol",
values = unique(df$hgnc_symbol),
mart = human, useCache = FALSE)
left_join(df, names, by = "hgnc_symbol")
}
#There's a lot of variation in how the chromosomal location is annotated. To simplify things, let's keep only the autosomes (chr 1:22), and convert the chromosome names to a factor with levels ordered numerically
(data_with_chromosome <- identify_gene_names(long_data_gene) %>%
filter(chromosome_name %in% c(1:22)) %>%
mutate(chromosome_name = as.factor(as.numeric(chromosome_name))))
```
Here we've only added chromosome name - you can use `listAttributes(human)` to get a list of many many different attributes available to add.
Let's say we're interested in how the average expression of genes on chromosome 19 changes between RCC and normal cells.
The first thing we will do is combine information from the meta data file (`meta_data`) with our expression table (`data_with_chromosome`). Then we will use dplyr verbs to first group all samples by disease status, pull out genes on chromosome 19, and then calculate the mean using `summarize()`.
```{r}
full_data <- left_join(data_with_chromosome, meta_data, by = "sample")
full_data %>%
group_by(disease.state) %>%
filter(chromosome_name == "19") %>%
summarize(mean = mean(Expression))
```
## Exercise (not graded)
Modify the code chunk labeled 'biomart' above to also identify the length of each gene captured in the dataset we have been working with in the above exercises. This can be done by adding "transcript_length" as an attribute in `getBM` function. You should end up with an extra column for transcript length.
```{r}
# YOUR CODE HERE
```
# Part 4: Plotting expression data
What if we want to graph our expression data? Time for `ggplot`!
Let's plot the expression values by sample
```{r}
full_data %>%
ggplot(aes(x = sample, y = Expression)) +
geom_boxplot()
```
The data is quite right-skewed, so let's try a log-transformed version. We'll also rotate the axis labels so they are legible and colour by disease status.
```{r}
full_data %>%
ggplot(aes(x = sample, y = Expression, fill = disease.state)) +
geom_boxplot() +
scale_y_log10() +
theme(axis.text.x = element_text(angle = 90))
```
Now, let's look at the values for a single sample by chromosome.
```{r}
full_data %>%
filter(sample == "GSM11810") %>%
ggplot(aes(x = chromosome_name, y = Expression)) +
geom_boxplot() +
scale_y_log10()
```
# Part 5 - Applying statistical tests
Being able to graph these results is useful, but what we really want to do is run statistical tests on the data. There are a variety of ways to do that which will be explored in subsequent lectures. But in this seminar we will focus on doing this using `dplyr`
In this case, we want to identify the genes that are differentially expressed between the normal and RCC samples. We will use `summarize()` to perform a t-test for each gene.
```{r}
genetests <- full_data %>%
group_by(hgnc_symbol) %>%
summarize( pvalue = t.test(Expression ~ disease.state)$p.value)
genetests
```
We can plot the p-value distributions of the above t-test.
```{r}
ggplot(genetests) +
geom_histogram(aes(x = pvalue))
```
We can see there is a high density of small p-values. This is just a small preview of a genome-wide analysis and we'll learn much more next time!
## Deliverable
Complete the following tasks to earn full marks for this seminar. Make sure to commit push all files to the repo generated when you accept this seminar assignment before the deadline.
Starting with the `full_data` object above, complete the following steps.
1. Calculate the sum of all expression values in each sample, and add it to the `full_data` object (hint: use `group_by()`, `summarize()` and `left_join()`)
```{r}
# YOUR CODE HERE
```
2. Add another column to the resulting object which is the Expression value divided by the sum values contained in the column just added in (1) (hint: use `mutate()`)
```{r}
# YOUR CODE HERE
```
3. Create boxplots by sample of the log-transformed values in the column added in (2)
```{r}
# YOUR CODE HERE
```