sm04_data_and_dplyr.Rmd

---
output: github_document
---

# STAT 540 - Seminar 4: Reading in data and manipulating with dplyr

# Objectives

The objectives for this lecture will be to

-  Understand that some freely available genomic, transcriptomic and proteomic data can be accessed through the [Gene Expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/)
-  Explore and manipulate data using `dplyr` and `tidyr` verbs
-  Use `dplyr` verbs in conjunction with `ggplot2` to visualize aspects of the data

# Packages required

All of the packages you will need are listed below. If you have never used them before, you will need to install them using the commented lines above the library() command. 

```{r, results='hide', message=FALSE, warning=FALSE}
library(GEOquery)
library(biomaRt)
library(tidyverse)
```

# Part 1 - Accessing data using GEOquery

A variety of freely available gene expression data is available through the Gene Expression Omnibus (GEO) server. Most of these datasets have associated papers in which they detail data acquisition and analysis methods. 

To simplify things for its users, GEO has four basic entities that act as containers for different types of data. The four main types are:

**GSM** - stores data associated with a single sample, and additional info about how the data was collected

**GSE** - stores information about each sample, as well as overall experiment info 

**GPL** - stores platform info (i.e the machine used to collect the data)

**GDS** - stores curated matrices that are GSM objects in an "analysis-ready" format

In this seminar, we'll read in a **GDS**, and next seminar we'll work with a **GSE**.

The first thing we are going to do is download a dataset from the Gene Expression Omnibus (GEO) repository using the GEOquery package. This experiment is exploring gene expression differences between renal cell carcinoma cells (RCC) and adjacent normal cells using an Affymetric array. We are going to download data in the GDS format, as it is already in a nice table for us. Note: you can download any type of GEO data you want using the getGEO function. 

```{r}
gds <- getGEO("GDS507")
#we can use str() to peak at the structure of a data object. 
str(gds)
```

You can see that the GDS object is quite complex, and has many different slots in which to put data. For example, our GDS file has a slot for information about the machine (@GPL), meta data and actual gene expression (@data.table) and information about the experiment and its authors (@header). You could work with this and find the relevant data/info, but we'll convert it to a more recognizable object using the `GEOquery::GDS2eSet()` function. This function converts the GDS data structure above to an `ExpressionSet` object. 

```{r}
eset <- GDS2eSet(gds)
eset
```

This format is really handy for downstream analyses, and we'll dissect its contents in much more detail in next seminar. But for now, we'll simply pull out the expression data matrix, sample metadata, and feature metadata into separate objects - don't worry about the details of this next chunk just yet.

```{r}
# get sample metadata with the pData() function
meta_data <- pData(eset)

# get expression data with the exprs() function
expr_mat <- exprs(eset)

# get feature info (probe/gene id)
feat_data <- featureData(eset)
```

# Part 2 - Exploring a gene expression dataset

Let's peak at the data to see its structure using `head()`. This gives us the first few rows of the dataset. 

```{r}
head(expr_mat)
dim(expr_mat)

dim(meta_data)
dim(feat_data)
```

In our expression matrix, the row names correspond to probe IDs. The columns contain expression values for the 17 samples. In summary, we have an array with dimensions 22645 x 17 (row x column). The sample metadata has 17 rows - one for each sample. And the feature meta data has 22645 rows - one for each probe. 

Taking a look at our sample and feature metadata, we can see we have information about each sample's disease state, and that we have a lot of annotation about each of our probes (Various Gene symbols/IDs/Accessions, Chromosome, and functional processes).

```{r}
head(meta_data)
colnames(feat_data)

# how many unique gene symbols are there?
length(unique(feat_data$'Gene symbol'))
```

Note that some gene names are duplicated, because there are multiple probes that map to the same gene. We will deal with this later! 

Now we can start exploring the dataset a bit. Just for fun - let's compute the average count in each sample. 

We will do this using a function called `apply()` in base R. 

```{r}
#We exclude the first and second columns because they hold the probe and gene names, respectively. 
apply(expr_mat, 2, median)
```

`apply()` is useful, but it is limited in that it can only operate on rows, columns, or individual elements of a dataframe directly. More complex operations get cumbersome. 

# Part 3: Manipulations with `dplyr` and `tidyr`

One more versatile set of tools are the **`dplyr` verbs**. These are a set of functions designed for easy manipulation of data. 

They are: 

* **`%>%`** - Syntactic sugar for easily piping the result of one function into another. 
* **`dplyr::group_by()`** - Commonly used with summarize() to derive summarized values for multiple rows sharing certain attributes  
* **`dplyr::summarize()`** - summarize certain statistics from the data (i.e mean, median, mode, number of samples) 

**`dplyr::filter()`** - extract rows that meet certain criteria from data frame

**`dplyr::select()`** - extract columns that meet certain criteria from data frame

**`dplyr::mutate()`** - add a new column 

**`dplyr::arrange()`** - arrange the data in descending or ascending order

**`dplyr::left_join()`**, **`dplyr::right_join()`**, etc - a set of methods to combine two tidy datasets, roughly corresponding to typical notions of database joins, see the [join page](https://dplyr.tidyverse.org/reference/mutate-joins.html) of the tidyverse reference for more information

**`tidyr::pivot_longer()`** - "lengthens" data by collapsing several columns into two

Most, if not all, of these operations are available in the `data.table` package, albeit in a less readable syntax. This package was developed to quickly read, write, and manipulate large amounts of data. If you plan to work with large sets of features, it may be helpful to consider learning this framework as well. See the [Introduction to data.table](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.html) vignette for more information. The [`dtplyr` package](https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/) is a package that unifies `dplyr` syntax with a `data.table` backend.

An important thing to know about the `dplyr` verbs (and `data.table` for that matter) is that they were developed to work efficiently on data frames that meet certain structural criteria.In data science, we call this "tidy" data. For a data to be called it must meet the following criteria:

1. Each column is a variable
2. Each row is an unique observational unit

Let's look at a few small datasets that are "tidy".

```{r}
head(iris) #data describing flower parts for several species
head(band_members) #Members of the Beatles and Rolling Stones
head(band_instruments) #Instruments of the above band members
```

The iris dataset contains information about certain species of flowers. 

As you can see, each variable has its own column, and each row is an instance of that variable. There are no rownames. We can now use dplyr verbs (such as `summarize()`, `mutate()`, `filter()`, etc) to manipulate the data. 

These verbs can be used together in a sequence of functions with the "pipe" operator. R will interpret the output of the previous function as the input to the subsequent function when you put the "pipe" operator (`%>%`) in between the functions. It is a syntactic sugar that simply takes the object on its left and feeds it into the function call on its right. This way, you can "pipe" an input through sequence of function calls, where output of the previous function call is fed into the next function call as its input. This makes for much more readable code.

For example, this:

    foo0(input)
    
becomes this:

    input %>% 
      foo0()
    
and this: 

    foo2(foo1(foo0(input)))
    
becomes this:

    input %>% 
      foo0() %>% 
      foo1() %>% 
      foo2()

To read more about pipes click [here](http://r4ds.had.co.nz/pipes.html).

Take a look through the following examples to check out common manipulations.

```{r} 
mpg %>% 
  group_by(class) %>% 
  summarise(fuel_efficiency = mean(hwy))
```

```{r}
#select all rows with sepal length greater than 5. 
iris %>% 
	filter(Sepal.Length > 5) %>% 
	head()

#group all rows of the same species together.
iris %>% 
	group_by(Species) %>% 
	head()

#select the column called "Sepal.Width"
iris %>% 
	dplyr::select(Sepal.Width) %>% 
	head()

#create another column with the species name capitalized. 
iris %>%
	group_by(Species) %>% 
	mutate(Capitalized_names = toupper(Species)) %>% 
	head()

#summarize the average sepal length and number of rows belonging to each species.
iris %>% 
	group_by(Species) %>% 
	summarize(average_sepal_length = mean(Sepal.Length), n = n()) %>% 
	head()

#arrange the species in alphabetical order
iris %>% 
	arrange(desc(Species)) %>% 
	head()

#join band members with their instruments by "name"
band_members %>% left_join(band_instruments) 

band_members %>% right_join(band_instruments)

band_members %>% full_join(band_instruments)
```

## Back to our gene expression dataset

Now let's apply these functions to our gene expression dataset!

One problem: our dataset is not "tidy". Rather, it's arranged like an excel spreadsheet. While intuitive for us to read, dplyr does not like this very much. So, we have to change it so that each row contains only one expression value. Luckily, the `pivot_longer()` function from the [`tidyr` package](https://raw.githubusercontent.com/rstudio/cheatsheets/main/tidyr.pdf) helps out with that. This function "lengthens" data by collapsing several columns into two. Column names move to a new `names_to` column and values to a new `values_to` column. Our data starts out with each sample in a separate column, so we will collapse all of those into two columns: one for the expression values themselves and one for the sample name. 

We'll first convert our expression matrix to a data.frame add the probe IDs as their own column, since the `pivot_longer` expects a data.frame or tibble and will not keep information stored in row names.

```{r}
expr_data <- expr_mat %>% 
  as.data.frame() %>%
  rownames_to_column(var = "ID")

long_data <- pivot_longer(expr_data, 
                          cols = 2:ncol(expr_data),
                          values_to = "Expression",
                          names_to = "sample")
long_data
```

You can see that the first ~20,000 rows will correspond to data from the first probe identifier, and the next group of rows will correspond to data from the second probe identifier. You can think of this function as stretching out a dataset into its long form, so that each row has only one expression value. If that's not clear, I would suggest reading [this](https://tidyr.tidyverse.org/reference/pivot_longer.html) for more information about what the `pivot_longer()` function does. 

Now we have three columns, each one corresponding to a variable: the probe name, the sample name and the expression value. 

We can do a lot of stuff with this setup! Let's calculate the mean gene expression per sample. 

```{r}
long_data %>% 
	group_by(sample) %>% 
	summarize(mean = mean(Expression))
```
## Summarizing by gene

Right now our gene expression data frame has probe IDs. But we would like to know which gene each probe maps to. We saw above that we already have this information in the feature metadata. Let's add it to the long expression data object we just created using a join operation. 

```{r}
(long_data_annot <- long_data %>% 
	left_join(feat_data@data %>% select(ID, 'Gene symbol'),
	          by = "ID"))
```

Another thing we note is that there are multiple probes that map to a specific gene. In a real life analysis workflow, there are multiple ways to deal with this. Some popular options include picking the probe with the highest expression, or taking the mean/median of all probes' expression. For simplicity, we will use `summarize()` to take the mean of each probe's expression. 

```{r}
(long_data_gene <- long_data_annot %>% 
	group_by(sample, `Gene symbol`) %>% 
	summarize(Expression = mean(Expression)))
```

Now, every gene will only have one value per sample. 

## Adding annotations using `biomaRt`

Let's explore another way to annotate our data with extra information about genes. This time we'll show how to retrieve annotation that isn't already stored in our object. 

The `biomaRt` package is very useful in this regard. It can access the ensembl database (or other databases) of gene names and annotations (ensembl.org). `biomaRt` can help us convert ensemble ids (e.g. ENSGXXXXX) into HGNC symbols (e.g. BRCA1), for example, along with a host of other things. 

Say we want to learn more about the gene expression on a particular chromosome, across all samples (this info actually happens to already be in our object, but for illustration let's pretend we don't know that). We can use `biomaRt` to look up the chromosomal location of each gene. Read the [`biomaRt` vignettes and manual](https://bioconductor.org/packages/release/bioc/html/biomaRt.html) for more detailed explanation of the following bit of code.

```{r, biomart}
#open connection between biomaRt and R. 
human = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

#function that takes in data frame, and outputs same data frame with associated chromosome annotations.
identify_gene_names <- function(df){
	names(df) <- c("sample", "hgnc_symbol", "Expression")
	names <- getBM( attributes=c("hgnc_symbol", "chromosome_name"), 
	                filters= "hgnc_symbol", 
	                values = unique(df$hgnc_symbol), 
	                mart = human, useCache = FALSE)
	left_join(df, names, by = "hgnc_symbol")
}

#There's a lot of variation in how the chromosomal location is annotated. To simplify things, let's keep only the autosomes (chr 1:22), and convert the chromosome names to a factor with levels ordered numerically
(data_with_chromosome <- identify_gene_names(long_data_gene) %>% 
	filter(chromosome_name %in% c(1:22)) %>%
  mutate(chromosome_name = as.factor(as.numeric(chromosome_name))))
```

Here we've only added chromosome name - you can use `listAttributes(human)` to get a list of many many different attributes available to add.

Let's say we're interested in how the average expression of genes on chromosome 19 changes between RCC and normal cells. 

The first thing we will do is combine information from the meta data file (`meta_data`) with our expression table (`data_with_chromosome`). Then we will use dplyr verbs to first group all samples by disease status, pull out genes on chromosome 19, and then calculate the mean using `summarize()`. 

```{r}
full_data <- left_join(data_with_chromosome, meta_data, by = "sample")

full_data %>% 
	group_by(disease.state) %>% 
  filter(chromosome_name == "19") %>%
	summarize(mean = mean(Expression))
```

## Exercise (not graded)

Modify the code chunk labeled 'biomart' above to also identify the length of each gene captured in the dataset we have been working with in the above exercises. This can be done by adding  "transcript_length" as an attribute in `getBM` function. You should end up with an extra column for transcript length. 

```{r}
# YOUR CODE HERE
```


# Part 4: Plotting expression data

What if we want to graph our expression data? Time for `ggplot`! 

Let's plot the expression values by sample

```{r}
full_data %>% 
	ggplot(aes(x = sample, y = Expression)) + 
    geom_boxplot()
```
The data is quite right-skewed, so let's try a log-transformed version. We'll also rotate the axis labels so they are legible and colour by disease status.

```{r}
full_data %>% 
	ggplot(aes(x = sample, y = Expression, fill = disease.state)) + 
    geom_boxplot() +
    scale_y_log10() +
    theme(axis.text.x = element_text(angle = 90))
```
Now, let's look at the values for a single sample by chromosome.

```{r}
full_data %>% 
  filter(sample == "GSM11810") %>%
	ggplot(aes(x = chromosome_name, y = Expression)) + 
    geom_boxplot() +
    scale_y_log10()
```

# Part 5 - Applying statistical tests

Being able to graph these results is useful, but what we really want to do is run statistical tests on the data. There are a variety of ways to do that which will be explored in subsequent lectures. But in this seminar we will focus on doing this using `dplyr` 

In this case, we want to identify the genes that are differentially expressed between the normal and RCC samples. We will use `summarize()` to perform a t-test for each gene. 

```{r}
genetests <- full_data %>% 
	group_by(hgnc_symbol) %>% 
	summarize( pvalue = t.test(Expression ~ disease.state)$p.value)
genetests
```

We can plot the p-value distributions of the above t-test. 

```{r}
ggplot(genetests) +
  geom_histogram(aes(x = pvalue))
```

We can see there is a high density of small p-values. This is just a small preview of a genome-wide analysis and we'll learn much more next time!

##  Deliverable

Complete the following tasks to earn full marks for this seminar. Make sure to commit push all files to the repo generated when you accept this seminar assignment before the deadline.

Starting with the `full_data` object above, complete the following steps.

  1. Calculate the sum of all expression values in each sample, and add it to the `full_data` object (hint: use `group_by()`, `summarize()` and `left_join()`)
  
```{r}
# YOUR CODE HERE
```

  2. Add another column to the resulting object which is the Expression value divided by the sum values contained in the column just added in (1) (hint: use `mutate()`)

```{r}
# YOUR CODE HERE
```

  3. Create boxplots by sample of the log-transformed values in the column added in (2) 

```{r}
# YOUR CODE HERE
```