exercise-sheet-7.Rmd

```{r, include=FALSE}
source("custom_functions.R")
library(flextable)
library(officer)

```

---
title: "Exercise sheet 7: BLAT"
---

---------------------------------

In the first step of the BLAT (Blast-like Alignment Tool) algorithm, regions that are likely to be homologous are detected.
In this exercise sheet, we will investigate the search stage of the BLAT algorithm on the example of the mouse genome.

We want to determine whether a region from the human genome aligns to a part of the mouse genome.
Therefore, a homologous region to our query sequence will be detected during the search stage.

We assume that:

* the human genome $G$ is approximately $2.9$ billion bases long.
* the mouse genome $G$ is approximately $2.5$ billion bases long.
* the match ratio $M$ between homologous areas of both species is $98\%$ for DNA and $89\%$ for protein alignments.
* in our example we assume that homologous areas $H$ are typically $50$ bases long. 
* our query sequence $Q$ is GTCCTCGGAACCAGGACCTCGGCGTGGCCTAGCG.

In the following exercises, we will focus on the DNA sequences.

# Exercise 1

For the $K$-mer sizes $K=7$ and $K=14$, respectively. 


### 1a)

::: {.question data-latex=""}
What is the probability of having a perfect match between a specific $K$-mer in a homologous region and a $K$-mer in the query sequence?
:::

#### {.tabset}

##### Hide

##### Formula
::: {.answer data-latex=""}
$$
p_{1} = M^K
$$
:::
##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
p_{1} = 0.98^{7} = 0.8681
$$
For $K=14$:
$$
p_{1} = 0.98^{14} = 0.7536
$$
:::
#### {-}


### 1b)

::: {.question data-latex=""}
What is the probability that at least one non-overlapping $K$-mer in the homologous region matches perfectly with the corresponding $K$-mer in the query sequence?
:::

#### {.tabset}

##### Hide

##### Formula
::: {.answer data-latex=""}
$$
P = 1 - (1 - p_{1})^{T} = 1 - (1 - M^{K})^{T},\\
\text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor
$$
:::

##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
P = 1 - (1 - 0.98^7)^{\left\lfloor\frac{50}{7}\right\rfloor} = 0.9999... 
$$
For $K=14$:
$$
P = 1 - (1 - 0.98^{14})^{\left\lfloor\frac{50}{14}\right\rfloor} = 0.9850... 
$$
:::

#### {-}

### 1c)


::: {.question data-latex=""}
Calculate the number of False Positives (FPs), i.e. the number of $K$-mers that are expected to match by chance.
:::

#### {.tabset}

##### Hide

##### Formula
::: {.answer data-latex=""}
\begin{align*}
F &= (Q-K+1) \times \bigg(\frac{G}{K}\bigg) \times \bigg(\frac{1}{A}\bigg)^{K},\\
&A: \text{the alphabet size}\\
&Q: \text{the query length in bases}\\
&G: \text{the genome size in bases}
\end{align*}

:::: {#explaining .warning-box }

::: {#note-exp .warning-header}
```{r, include=knitr::is_html_output(), echo=FALSE,}
knitr::include_graphics("figures/warningicon.svg")
```
**Warning**
:::
::: {#note-exp .note-body}

This only holds under the assumption that all letters in the alphabet are equally likely!

:::
::::

:::

##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
F = (34-7+1) \times \frac{2500000000}{7} \times \frac{1}{4}^{7} = 610351.5625
$$
For $K=14$:
$$
F = (34-14+1) \times \frac{2500000000}{14} \times \frac{1}{4}^{14} = 13.9698
$$
:::
#### {-}

### 1d)

::: {.question data-latex=""}
Observe the True positive rate (TPR) and the number of False Positives for the $7$-mers and $14$-mers that you computed in part 1B and 1C.
What observation do you make?
:::

#### {.tabset}

##### Hide

##### Solution
::: {.answer data-latex=""}
When increasing the $K$-mer size both the TPR and the number of False Positives are reduced. 
But the number of False Positives reduces drastically, compared to the TPR.
:::
#### {-}


# Exercise 2

In order to increase the True Positive Rate, we want to allow single mismatches when checking for exact matches.

For the $K$-mer sizes $K=7$ and $K=14$, respectively.


### 2a)
::: {.question data-latex=""}
What is the probability that at least one non-overlapping $K$-mer in the homologous region matches perfectly with the corresponding $K$-mer in the query sequence? Given that we allow one mismatch.
:::

#### {.tabset}

##### Hide

##### Formula
::: {.answer data-latex=""}
$$
P = 1 - (1 - (M^{K}+K\times M^{K-1}\times (1-M)))^{T},\\
\text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor
$$
:::

##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
P = 1 - (1 - (0.98^{7} + 7*0.98^{6}*0.02))^{\left\lfloor\frac{50}{7}\right\rfloor} \approx 1
$$
For $K=14$:
$$
P = 1 - (1 - (0.98^{14} + 14*0.98^{13}*0.02))^{\left\lfloor\frac{50}{14}\right\rfloor} \approx 1
$$
:::
#### {-}

### 2b)
::: {.question data-latex=""}
Calculate the number of False Positives (FPs), i.e. the number of $K$-mers that are expected to match by chance. Given that we allow one mismatch.

:::

#### {.tabset}

##### Hide

##### Formula
::: {.answer data-latex=""}
\begin{align*}
F &= (Q-K+1) \times \bigg(\frac{G}{K}\bigg) \times \Bigg[\bigg(\frac{1}{A}\bigg)^{K} + K \times \bigg(\frac{1}{A}\bigg)^{K-1} \times \bigg(1-\frac{1}{A}\bigg)\Bigg],\\
&A: \text{the alphabet size}\\
&Q: \text{the query length in bases}\\
&G: \text{the genome size in bases}
\end{align*}
:::

##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
F = (34-7+1) \times \frac{2500000000}{7} \times (\frac{1}{4}^{7} + 7 \times \frac{1}{4}^6 \times (1-\frac{1}{4})) = 1.3427... * 10^{7}  
$$
For $K=14$:
$$
F = (34-14+1) \times \frac{2500000000}{14} \times (\frac{1}{4}^{14} + 14 \times \frac{1}{4}^{13} \times (1-\frac{1}{4})) = 600.7031  
$$
:::

#### {-}

### 2c)
::: {.question data-latex=""}
What development do we see when observing the TPR and FPs for the updated formulae?
:::

#### {.tabset}

##### Hide

##### Solution
::: {.answer data-latex=""}
We observe an increase in both the TPR and the number of False positives.
:::
#### {-}

# Exercise 3

Finally, we want to reduce the number of False Positive results. To that end, instead of requiring single perfect matches, we now want at least $n$ perfect matches.

For the $K$-mer sizes $K=7$ and $K=14$, respectively.


### 3a)
::: {.question data-latex=""}
What is the probability that at least $2$ non-overlapping $K$-mer in the homologous region match perfectly with corresponding $K$-mers in the query sequence? 
:::

#### {.tabset}

##### Hide

##### Formula
::: {.answer data-latex=""}
$$
P_{n} = (M^{K})^{n} \times (1-M^{K})^{T-n} \times \frac{T!}{n! \times (T-n)!}\\
\text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor\\\\
P = P_{n} + P_{n+1} + ... + P_{T}
$$
:::

##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
P_{2} = 0.98^{7 \times 2} \times (1-0.98^{7})^{\left\lfloor\frac{50}{7}\right\rfloor-2} \times \frac{\frac{50}{7}\text{!}}{2\text{!}\times(\frac{50}{7}-2)\text{!}}\\
\text{...}\\
\text{...}\\
$$
\begin{align*}
P &= P_{2} + P_{3} + P_{4} + P_{5} + P_{6} + P_{7}\\
  &= 0.000631 + 0.006925 + 0.045591 + 0.180074 + 0.395142 + 0.371601\\
  &= 0.999967...
\end{align*}

For $K=14$:
$$
P_{2} = 0.98^{14 \times 2} \times (1-0.98^{14})^{\left\lfloor\frac{50}{14}\right\rfloor-2} \times \frac{\frac{50}{14}!}{2!\times(\frac{50}{14}-2)!}\\
\text{...}\\
\text{...}\\
$$
\begin{align*}
P &= P_{2} + P_{3}\\
  &= 0.419776 + 0.428050\\
  &= 0.847827...
\end{align*}
:::
#### {-}

### 3b)
::: {.question data-latex=""}
We can observe a decrease in the TPR. Does it still make sense to use this method? Why?
:::

#### {.tabset}

##### Hide

##### Hint

```{r, echo=FALSE, out.width="100%", fig.align='center'}
knitr::include_graphics("figures/sheet-7/blat-multiple.png")
```

##### Solution
::: {.answer data-latex=""}
While it is true that this also decreases the TPR. It drastically decreases the amount of False Positives, which greatly improves the overall results.
In real examples, the size of the homologous regions is typically higher than 50 nucleotides that we chose for this example. When considering our example with $H=100$, we would end up with a TPR close to 1.0, while having nearly no False Positives.
:::
#### {-}