-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathexercise-sheet-7.Rmd
329 lines (247 loc) · 7.8 KB
/
exercise-sheet-7.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
```{r, include=FALSE}
source("custom_functions.R")
library(flextable)
library(officer)
```
---
title: "Exercise sheet 7: BLAT"
---
---------------------------------
In the first step of the BLAT (Blast-like Alignment Tool) algorithm, regions that are likely to be homologous are detected.
In this exercise sheet, we will investigate the search stage of the BLAT algorithm on the example of the mouse genome.
We want to determine whether a region from the human genome aligns to a part of the mouse genome.
Therefore, a homologous region to our query sequence will be detected during the search stage.
We assume that:
* the human genome $G$ is approximately $2.9$ billion bases long.
* the mouse genome $G$ is approximately $2.5$ billion bases long.
* the match ratio $M$ between homologous areas of both species is $98\%$ for DNA and $89\%$ for protein alignments.
* in our example we assume that homologous areas $H$ are typically $50$ bases long.
* our query sequence $Q$ is GTCCTCGGAACCAGGACCTCGGCGTGGCCTAGCG.
In the following exercises, we will focus on the DNA sequences.
# Exercise 1
For the $K$-mer sizes $K=7$ and $K=14$, respectively.
### 1a)
::: {.question data-latex=""}
What is the probability of having a perfect match between a specific $K$-mer in a homologous region and a $K$-mer in the query sequence?
:::
#### {.tabset}
##### Hide
##### Formula
::: {.answer data-latex=""}
$$
p_{1} = M^K
$$
:::
##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
p_{1} = 0.98^{7} = 0.8681
$$
For $K=14$:
$$
p_{1} = 0.98^{14} = 0.7536
$$
:::
#### {-}
### 1b)
::: {.question data-latex=""}
What is the probability that at least one non-overlapping $K$-mer in the homologous region matches perfectly with the corresponding $K$-mer in the query sequence?
:::
#### {.tabset}
##### Hide
##### Formula
::: {.answer data-latex=""}
$$
P = 1 - (1 - p_{1})^{T} = 1 - (1 - M^{K})^{T},\\
\text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor
$$
:::
##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
P = 1 - (1 - 0.98^7)^{\left\lfloor\frac{50}{7}\right\rfloor} = 0.9999...
$$
For $K=14$:
$$
P = 1 - (1 - 0.98^{14})^{\left\lfloor\frac{50}{14}\right\rfloor} = 0.9850...
$$
:::
#### {-}
### 1c)
::: {.question data-latex=""}
Calculate the number of False Positives (FPs), i.e. the number of $K$-mers that are expected to match by chance.
:::
#### {.tabset}
##### Hide
##### Formula
::: {.answer data-latex=""}
\begin{align*}
F &= (Q-K+1) \times \bigg(\frac{G}{K}\bigg) \times \bigg(\frac{1}{A}\bigg)^{K},\\
&A: \text{the alphabet size}\\
&Q: \text{the query length in bases}\\
&G: \text{the genome size in bases}
\end{align*}
:::: {#explaining .warning-box }
::: {#note-exp .warning-header}
```{r, include=knitr::is_html_output(), echo=FALSE,}
knitr::include_graphics("figures/warningicon.svg")
```
**Warning**
:::
::: {#note-exp .note-body}
This only holds under the assumption that all letters in the alphabet are equally likely!
:::
::::
:::
##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
F = (34-7+1) \times \frac{2500000000}{7} \times \frac{1}{4}^{7} = 610351.5625
$$
For $K=14$:
$$
F = (34-14+1) \times \frac{2500000000}{14} \times \frac{1}{4}^{14} = 13.9698
$$
:::
#### {-}
### 1d)
::: {.question data-latex=""}
Observe the True positive rate (TPR) and the number of False Positives for the $7$-mers and $14$-mers that you computed in part 1B and 1C.
What observation do you make?
:::
#### {.tabset}
##### Hide
##### Solution
::: {.answer data-latex=""}
When increasing the $K$-mer size both the TPR and the number of False Positives are reduced.
But the number of False Positives reduces drastically, compared to the TPR.
:::
#### {-}
# Exercise 2
In order to increase the True Positive Rate, we want to allow single mismatches when checking for exact matches.
For the $K$-mer sizes $K=7$ and $K=14$, respectively.
### 2a)
::: {.question data-latex=""}
What is the probability that at least one non-overlapping $K$-mer in the homologous region matches perfectly with the corresponding $K$-mer in the query sequence? Given that we allow one mismatch.
:::
#### {.tabset}
##### Hide
##### Formula
::: {.answer data-latex=""}
$$
P = 1 - (1 - (M^{K}+K\times M^{K-1}\times (1-M)))^{T},\\
\text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor
$$
:::
##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
P = 1 - (1 - (0.98^{7} + 7*0.98^{6}*0.02))^{\left\lfloor\frac{50}{7}\right\rfloor} \approx 1
$$
For $K=14$:
$$
P = 1 - (1 - (0.98^{14} + 14*0.98^{13}*0.02))^{\left\lfloor\frac{50}{14}\right\rfloor} \approx 1
$$
:::
#### {-}
### 2b)
::: {.question data-latex=""}
Calculate the number of False Positives (FPs), i.e. the number of $K$-mers that are expected to match by chance. Given that we allow one mismatch.
:::
#### {.tabset}
##### Hide
##### Formula
::: {.answer data-latex=""}
\begin{align*}
F &= (Q-K+1) \times \bigg(\frac{G}{K}\bigg) \times \Bigg[\bigg(\frac{1}{A}\bigg)^{K} + K \times \bigg(\frac{1}{A}\bigg)^{K-1} \times \bigg(1-\frac{1}{A}\bigg)\Bigg],\\
&A: \text{the alphabet size}\\
&Q: \text{the query length in bases}\\
&G: \text{the genome size in bases}
\end{align*}
:::
##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
F = (34-7+1) \times \frac{2500000000}{7} \times (\frac{1}{4}^{7} + 7 \times \frac{1}{4}^6 \times (1-\frac{1}{4})) = 1.3427... * 10^{7}
$$
For $K=14$:
$$
F = (34-14+1) \times \frac{2500000000}{14} \times (\frac{1}{4}^{14} + 14 \times \frac{1}{4}^{13} \times (1-\frac{1}{4})) = 600.7031
$$
:::
#### {-}
### 2c)
::: {.question data-latex=""}
What development do we see when observing the TPR and FPs for the updated formulae?
:::
#### {.tabset}
##### Hide
##### Solution
::: {.answer data-latex=""}
We observe an increase in both the TPR and the number of False positives.
:::
#### {-}
# Exercise 3
Finally, we want to reduce the number of False Positive results. To that end, instead of requiring single perfect matches, we now want at least $n$ perfect matches.
For the $K$-mer sizes $K=7$ and $K=14$, respectively.
### 3a)
::: {.question data-latex=""}
What is the probability that at least $2$ non-overlapping $K$-mer in the homologous region match perfectly with corresponding $K$-mers in the query sequence?
:::
#### {.tabset}
##### Hide
##### Formula
::: {.answer data-latex=""}
$$
P_{n} = (M^{K})^{n} \times (1-M^{K})^{T-n} \times \frac{T!}{n! \times (T-n)!}\\
\text{with}\quad T= \left\lfloor\frac{H}{K}\right\rfloor\\\\
P = P_{n} + P_{n+1} + ... + P_{T}
$$
:::
##### Solution
::: {.answer data-latex=""}
For $K=7$:
$$
P_{2} = 0.98^{7 \times 2} \times (1-0.98^{7})^{\left\lfloor\frac{50}{7}\right\rfloor-2} \times \frac{\frac{50}{7}\text{!}}{2\text{!}\times(\frac{50}{7}-2)\text{!}}\\
\text{...}\\
\text{...}\\
$$
\begin{align*}
P &= P_{2} + P_{3} + P_{4} + P_{5} + P_{6} + P_{7}\\
&= 0.000631 + 0.006925 + 0.045591 + 0.180074 + 0.395142 + 0.371601\\
&= 0.999967...
\end{align*}
For $K=14$:
$$
P_{2} = 0.98^{14 \times 2} \times (1-0.98^{14})^{\left\lfloor\frac{50}{14}\right\rfloor-2} \times \frac{\frac{50}{14}!}{2!\times(\frac{50}{14}-2)!}\\
\text{...}\\
\text{...}\\
$$
\begin{align*}
P &= P_{2} + P_{3}\\
&= 0.419776 + 0.428050\\
&= 0.847827...
\end{align*}
:::
#### {-}
### 3b)
::: {.question data-latex=""}
We can observe a decrease in the TPR. Does it still make sense to use this method? Why?
:::
#### {.tabset}
##### Hide
##### Hint
```{r, echo=FALSE, out.width="100%", fig.align='center'}
knitr::include_graphics("figures/sheet-7/blat-multiple.png")
```
##### Solution
::: {.answer data-latex=""}
While it is true that this also decreases the TPR. It drastically decreases the amount of False Positives, which greatly improves the overall results.
In real examples, the size of the homologous regions is typically higher than 50 nucleotides that we chose for this example. When considering our example with $H=100$, we would end up with a TPR close to 1.0, while having nearly no False Positives.
:::
#### {-}