@@ -20,44 +20,49 @@ Note, to prevent large SVs which span regions from altering our counts, the vari
within the region's boundaries.

This creates two files:
- - `counts_variants_to_regions.txt` - input regions.bed entry annotated with number of variants and number of variant bases
- - `filtered_variants_to_regions.txt` - the counts file filtered to only regions containing any non-SNP variants
+ - `counts_<output>` - input region entries annotated with number of variants and number of variant bases
+ - `filtered_<output>` - the counts file filtered to only regions containing non-SNP variants

And reports:
```
                       v0.1               v0.3-dev
  statistic      count    percent    count    percent
  total regions  2232565  1          2170271  1
  no variant     448124   0.2007     431781   0.1990
- only a SNP     372144   0.1667     112869   0.0520
- only SNPs      474209   0.2124     163636   0.0754
- remaining      938088   0.4202     1461985  0.6736
+ only a SNP     372144   0.1667     242294   0.1116
+ only SNPs      474209   0.2124     135780   0.0626
+ remaining      938088   0.4202     1360416  0.6268
```
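
As a sanity check on the report, the `percent` column appears to be each category's fraction of `total regions` (a value between 0 and 1, not a percentage). A minimal sketch that recomputes it from the v0.1 counts in the first table; the names `V01_COUNTS` and `fractions` are made up for illustration and are not part of the repository's scripts:

```python
# Counts copied from the v0.1 column of the first report.
# V01_COUNTS and fractions() are illustrative names, not repository code.
V01_COUNTS = {
    "no variant": 448124,
    "only a SNP": 372144,
    "only SNPs": 474209,
    "remaining": 938088,
}


def fractions(counts):
    """Return each category's share of the total, rounded to 4 places."""
    total = sum(counts.values())
    return {name: round(n / total, 4) for name, n in counts.items()}


if __name__ == "__main__":
    print("total regions", sum(V01_COUNTS.values()))
    for name, frac in fractions(V01_COUNTS).items():
        print(name, frac)
```

The four categories sum to the reported total, which suggests they are mutually exclusive bins over the input regions.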

Let's repeat this with the annotations we made previously
```
                       v0.1               v0.3-dev
- statistic      count    percent
- total regions  3298925  1
- no variant     1600118  0.4850
- only a SNP     505514   0.1532
- only SNPs      389598   0.1181
- remaining      803695   0.2436
+ statistic      count    percent    count    percent
+ total regions  3298925  1          3503876  1
+ no variant     1600118  0.4850     1716435  0.4899
+ only a SNP     505514   0.1532     332505   0.0949
+ only SNPs      389598   0.1181     160201   0.0457
+ remaining      803695   0.2436     1294735  0.3695
```

And again with the unannotated regions
```
                       v0.1               v0.3-dev
- statistic      count    percent
- total regions  439538   1
- no variant     128123   0.2915
- only a SNP     102119   0.2323
- only SNPs      126488   0.2878
- remaining      82808    0.1884
+ statistic      count    percent    count    percent
+ total regions  439538   1          428642   1
+ no variant     128123   0.2915     126221   0.2945
+ only a SNP     102119   0.2323     61672    0.1439
+ only SNPs      126488   0.2878     28007    0.0653
+ remaining      82808    0.1884     212742   0.4963
```

So it's interesting (promising) that our unannotated regions less frequently contain variants.

+ v0.3-dev ... We have a lot more regions 'remaining' in the unannotated. I gotta figure out what's happening here.
+
+ 1. Adding these new regions (namely pbsv, trgt, and usc) is expanding the boundaries.
+ Collect these stats for the first slide... Actually hold off at this point.
+
Question 2:
===========
Of the candidate regions with variation, what percent of the variants by count and bases affected are contained
@@ -99,17 +104,17 @@ Question 3
Can we find expansions/contractions of the tr_annotations inside the variants?

The `filtered_variants_to_regions.txt` is now our new version of the tr_regions.bed. We'll use that to repeat the
- 'Defining Repeats' steps described in `../README.md`
-
-
- ```bash
- samtools faidx -r <(zcat tr_regions.bed.gz | awk '{print $1 ":" $2 "-" $3}')
- ~/scratch/insertion_ref/msru/data/reference/grch38/GRCh38_1kg_mainchrs.fa > tr_regions.fasta
- ```
-
+ 'Defining Repeats' steps described in `../README.md`
Then run TRF on the reference sequence of regions:
+
```bash
- trf409.linux64 data/tr_regions.fasta 3 7 7 80 5 5 500 -h -ngs > data/grch38.tandemrepeatfinder.txt
+ samtools faidx -r <(cat filtered_variants_to_regions.txt | awk '{print $1 ":" $2 "-" $3}') \
+ ~/scratch/insertion_ref/msru/data/reference/grch38/GRCh38_1kg_mainchrs.fa > tr_regions.fasta
+ trf409.linux64 tr_regions.fasta 3 7 7 80 5 5 500 -h -ngs > grch38.tandemrepeatfinder.txt
+ python ../scripts/trf_reformatter.py grch38.tandemrepeatfinder.txt final_something
+ bedtools sort -i final_something.bed | bgzip > final_something.bed.gz
+ tabix final_something.bed.gz
+ python ../scripts/tr_reganno_maker.py filtered_variants_to_regions.txt final_something.bed.gz > candidate_v0.3_anno.bed
```
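
For readers less familiar with awk, the region-string step in the `samtools faidx -r` call can be sketched in Python. This is only an illustration of what `awk '{print $1 ":" $2 "-" $3}'` produces; `bed_to_region` and `bed_to_regions` are hypothetical helper names, and the example coordinates are invented:

```python
def bed_to_region(line):
    """Build 'chrom:start-end' from the first three whitespace-separated
    fields of a BED-like line, mirroring awk '{print $1 ":" $2 "-" $3}'."""
    chrom, start, end = line.split()[:3]
    return f"{chrom}:{start}-{end}"


def bed_to_regions(lines):
    """Convert an iterable of lines, skipping blank ones (an addition here;
    the awk one-liner does not skip them)."""
    return [bed_to_region(line) for line in lines if line.strip()]


if __name__ == "__main__":
    # Invented example rows; extra columns are ignored, as in the awk step.
    example = ["chr1\t100\t200\t5\t12\n", "chr2\t3000\t3500\t1\t1\n"]
    for region in bed_to_regions(example):
        print(region)
```

Note that, like the awk one-liner, this copies the coordinates unchanged. BED intervals are 0-based and half-open while samtools region strings are 1-based and inclusive, so each extracted sequence may include one extra base on the left. Whether that matters here is not stated in the notes.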

Because we're going to be using the variants to filter these repeat annotations, we lower the min-score to 5 from 40