-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathexercise-sheet-10.Rmd
199 lines (132 loc) · 3.85 KB
/
exercise-sheet-10.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
```{r, include=FALSE}
source("custom_functions.R")
library(flextable)
library(officer)
```
---
title: "Exercise sheet 10: Introduction to Mapping"
---
---------------------------------
# Exercise 1
### 1a)
::: {.question data-latex=""}
Extract the Burrows-Wheeler Transform B(S) of $S = TGGTGGTTGA\$$.
:::
#### {.tabset}
##### Hide
##### Solution
::: {.answer data-latex=""}
$B(S) = AGTTTGGT\$GG$
```{r, echo=FALSE, out.width="100%", fig.align='center'}
knitr::include_graphics("figures/sheet-10/exercise_01.png")
```
:::
#### {-}
### 1b)
::: {.question data-latex=""}
Invert the Burrows-Wheeler-Transform $B(S) = TCAACT\$AA$ to obtain S.
:::
#### {.tabset}
##### Hide
##### Hint 1
::: {.answer data-latex=""}
You can get the first column $F(S)$ of the Burrows Wheeler Matrix via
counting and sorting the letters since it has a very predictably structure.
$F(S) = \$AAAACCTT$
:::
##### Hint 2
::: {.answer data-latex=""}
Use the last-first mapping to assign indices to the corresponging letters
in the first $F(S)$ and the last column $B(S)$
$F(S) = \$_1A_1A_2A_3A_4C_1C_2T_3T_4$
$B(S) = T_1C_1A_1A_2C_2T_2\$_1A_3A_4$
:::
##### Solution 1
::: {.answer data-latex=""}
```{r, echo=FALSE, out.width="80%", fig.align='center'}
knitr::include_graphics("figures/sheet-10/exercise_01_solution.png")
```
:::
##### Solution 2
::: {.answer data-latex=""}
Step-by-step method:
```{r, echo=FALSE, out.width="100%", fig.align='center'}
knitr::include_graphics("figures/sheet-10/exercise_01_details.png")
```
:::
#### {-}
### 1c)
::: {.question data-latex=""}
Search for $CAT$ in $B(S) = TCAACT\$AA$.
:::
#### {.tabset}
##### Hide
##### Solution
::: {.answer data-latex=""}
```{r, echo=FALSE, out.width="80%", fig.align='center'}
knitr::include_graphics("figures/sheet-10/exercise_01C.png")
```
:::
#### {-}
# Exercise 2
::: {.question data-latex=""}
Read the publication on the bwa-mem aligner at https://arxiv.org/abs/1303.3997 and pay
particular attention to the re-seeding and chaining features of the algorithm. Now consider a read
$R = CCCCGTTTT$ and a reference genome $T = ...CCCCATTTT...CCCCGA...AGTTTT...$ and
explain step-by-step how re-seeding and chaining let bwa-mem let recover the correct best alignment
of $R$ to $T$.
:::
### 2a)
::: {.question data-latex=""}
What are the original SMEMs that get generated?
:::
#### {.tabset}
##### Hide
##### Solution
::: {.question data-latex=""}
The original MEMs are CCCCG and GTTTT.
:::
#### {-}
### 2b)
::: {.question data-latex=""}
Would these SMEMs lead to discovery of the best possible alignment?
:::
#### {.tabset}
##### Hide
##### Solution
::: {.answer data-latex=""}
No, because it would not lead to a best possible match in the reference genome. The current
seeds are too specific and we may miss the seeds that lead to the best mapping (CCC-
CATTTT), therefore we need to reseed.
:::
#### {-}
### 2c)
::: {.question data-latex=""}
Which shortened new SMEMs are discovered with re-seeding (assume re-seeding gets perfor-
med despite the below-threshold length of the SMEMs)?
:::
#### {.tabset}
##### Hide
##### Solution
::: {.answer data-latex=""}
Re-seeding around the central base of each MEM leads to discovery of CCCC and TTTT
(both occur once more often than the originals).
:::
#### {-}
### 2d)
::: {.question data-latex=""}
What is the effect of chaining of colinear seeds?
:::
#### {.tabset}
##### Hide
##### Solution
::: {.answer data-latex=""}
The two new seeds are colinear on R and T , and can be merged into a chain, so only one
local alignment has to be performed.
:::
#### {-}
---------------------------------
# Exercise 3 - Programming assignment
For the programming tasks, please follow the instructions given in GitHub Classroom under the following link.
[https://classroom.github.com/a/7TbHYZaw](https://classroom.github.com/a/7TbHYZaw)
-------------------------------------------