Skip to content

Commit b49b00f

Browse files
committed
adding v0.3 tr annotations
1 parent 6247152 commit b49b00f

8 files changed

+110
-38
lines changed

codis/get_coverage_counts.py

+4-4
Original file line numberDiff line numberDiff line change
@@ -5,12 +5,12 @@
55

66

77
#cols = ['HG002', "NA24385", "li:NA24385"]
8-
cols = [_.strip() for _ in open('males.txt', 'r')] # males
9-
looking_for = ['1,1,0,0', '0,0,1,1']
8+
#cols = [_.strip() for _ in open('males.txt', 'r')] # males
9+
#looking_for = ['1,1,0,0', '0,0,1,1']
1010
#v = (~d[cols].isin(looking_for)).sum()
1111

12-
#cols = d.columns[4:]
13-
#looking_for = ['1,1,1,1']
12+
cols = d.columns[4:]
13+
looking_for = ['1,1,1,1']
1414
v = (~d[cols].isin(looking_for)).sum()
1515
# sample, inadequate counts
1616
print(v.value_counts())

regions/.gitignore

+3
Original file line numberDiff line numberDiff line change
@@ -20,3 +20,6 @@ data/tr_annotated.jl
2020
data/tr_regions.fasta.fai
2121
data/trf_annos_df.jl
2222
data/tr_regions.bed.gz
23+
data/usc/
24+
data/pbsv/
25+
data/trgt/

regions/DataDescription.md

+4-2
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,17 @@
11
# Versions:
22

33
## v0.3 - More Regions
4+
(Click the badge to go to the download page)
5+
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.7226352.svg)](https://doi.org/10.5281/zenodo.7226352)
46

57
### CHANGES:
68
* Added new annotations sources from:
79
* [TRGT](https://github.com/PacificBiosciences/trgt/tree/main/repeats) - Both full regions and pathogenic
810
* [pbsv](https://github.com/PacificBiosciences/pbsv/blob/master/annotations/human_GRCh38_no_alt_analysis_set.trf.bed)
911
* [Vamos](https://zenodo.org/record/7155334/)
10-
* Same File structure as v0.2..?
12+
* See [slides](https://github.com/ACEnglish/adotto/blob/main/slides/GIABTR_English_October172022.pdf) for details
13+
* Same file structure as v0.2
1114

12-
KnownPathogenic name - Gene Name (could be non) Gene feature (intron/exon/etc)
1315

1416
## v0.2 - Useable version
1517
(Click the badge to go to the download page)

regions/intersection/README.md

+31-26
Original file line numberDiff line numberDiff line change
@@ -20,44 +20,49 @@ Note, to prevent large SVs which span regions from altering our counts, the vari
2020
within the region's boundaries.
2121

2222
This creates two files:
23-
- `counts_variants_to_regions.txt` - input regions.bed entry annotated with number of variants and number of variant bases
24-
- `filtered_variants_to_regions.txt` - the counts file filtered to only regions containing any non-SNP variants
23+
- `counts_<output>` - input region entries annotated with number of variants and number of variant bases
24+
- `filtered_<output>` - the counts file filtered to only regions containing non-SNP variants
2525

2626
And reports:
2727
```
2828
v0.1 v0.3-dev
2929
statistic count percent count percent
3030
total regions 2232565 1 2170271 1
3131
no variant 448124 0.2007 431781 0.1990
32-
only a SNP 372144 0.1667 112869 0.0520
33-
only SNPs 474209 0.2124 163636 0.0754
34-
remaining 938088 0.4202 1461985 0.6736
32+
only a SNP 372144 0.1667 242294 0.1116
33+
only SNPs 474209 0.2124 135780 0.0626
34+
remaining 938088 0.4202 1360416 0.6268
3535
```
3636

3737
Let's repeat this with the annotations we made previously
3838
```
3939
v0.1 v0.3-dev
40-
statistic count percent
41-
total regions 3298925 1
42-
no variant 1600118 0.4850
43-
only a SNP 505514 0.1532
44-
only SNPs 389598 0.1181
45-
remaining 803695 0.2436
40+
statistic count percent count percent
41+
total regions 3298925 1 3503876 1
42+
no variant 1600118 0.4850 1716435 0.4899
43+
only a SNP 505514 0.1532 332505 0.0949
44+
only SNPs 389598 0.1181 160201 0.0457
45+
remaining 803695 0.2436 1294735 0.3695
4646
```
4747

4848
And again with the unannotated regions
4949
```
5050
v0.1 v0.3-dev
51-
statistic count percent
52-
total regions 439538 1
53-
no variant 128123 0.2915
54-
only a SNP 102119 0.2323
55-
only SNPs 126488 0.2878
56-
remaining 82808 0.1884
51+
statistic count percent count percent
52+
total regions 439538 1 428642 1
53+
no variant 128123 0.2915 126221 0.2945
54+
only a SNP 102119 0.2323 61672 0.1439
55+
only SNPs 126488 0.2878 28007 0.0653
56+
remaining 82808 0.1884 212742 0.4963
5757
```
5858

5959
So it's interesting (promising) that our unannotated regions less frequently contain variants.
6060

61+
v0.3-dev ... We have a lot more regions 'remaining' in the unannotated. I gotta figure out what's happening here.
62+
63+
1. Adding these new regions (namely pbsv, trgt, and usc are expanding the boundaries.
64+
Collect these stats for the first slide... Actually hold off at this point.
65+
6166
Question 2:
6267
===========
6368
Of the candidate regions with variation, what percent of the variants by count and bases effected are contained
@@ -99,17 +104,17 @@ Question 3
99104
Can we find expansions/contractions of the tr_annotations inside the variants?
100105

101106
The `filtered_variants_to_regions.txt` is now our new version of the tr_regions.bed. We'll use that to repeat the
102-
'Defining Repeats' steps described in `../README.md`
103-
104-
105-
```bash
106-
samtools faidx -r <(zcat tr_regions.bed.gz | awk '{print $1 ":" $2 "-" $3}')
107-
~/scratch/insertion_ref/msru/data/reference/grch38/GRCh38_1kg_mainchrs.fa > tr_regions.fasta
108-
```
109-
107+
'Defining Repeats' steps described in `../README.md`
110108
Then run TRF on the reference sequence of regions:
109+
111110
```bash
112-
trf409.linux64 data/tr_regions.fasta 3 7 7 80 5 5 500 -h -ngs > data/grch38.tandemrepeatfinder.txt
111+
samtools faidx -r <(cat filtered_variants_to_regions.txt | awk '{print $1 ":" $2 "-" $3}') \
112+
~/scratch/insertion_ref/msru/data/reference/grch38/GRCh38_1kg_mainchrs.fa > tr_regions.fasta
113+
trf409.linux64 tr_regions.fasta 3 7 7 80 5 5 500 -h -ngs > grch38.tandemrepeatfinder.txt
114+
python ../scripts/trf_reformatter.py grch38.tandemrepeatfinder.txt final_something
115+
bedtools sort -i final_something.bed | bgzip > final_something.bed.gz
116+
tabix final_something.bed.gz
117+
python ../scripts/tr_reganno_maker.py filtered_variants_to_regions.txt final_something.bed.gz > candidate_v0.3_anno.bed
113118
```
114119

115120
Because we're going to be using the variants to filter these repeat annotations, we lower the min-score to 5 from 40
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
import sys
2+
from truvari.annotations.trf import iter_tr_regions
3+
from intervaltree import IntervalTree
4+
5+
def tree_and_merge(line):
6+
7+
m_tree = IntervalTree()
8+
for i in line['annos']:
9+
m_tree.addi(i['start'], i['end'])
10+
line['regcnt'] = len(m_tree)
11+
m_tree.merge_overlaps()
12+
line['mrgcnt'] = len(m_tree)
13+
return line
14+
15+
parts = []
16+
for entry in iter_tr_regions(sys.argv[1]):
17+
entry = tree_and_merge(entry)
18+
parts.append(entry)
19+
#entry['mrgcnt'])
20+
import joblib
21+
joblib.dump(parts, 'tr_regmrgcnts.jl')

regions/intersection/variant_region_intersection.py

+10-6
Original file line numberDiff line numberDiff line change
@@ -33,17 +33,21 @@ def main(in_bed, in_vcf, out_name):
3333
start = int(start)
3434
end = int(end)
3535
cnt = 0
36-
bases = 0
36+
snps = 0
37+
non_snp = 0
3738
for i in variants.fetch(chrom, int(start), int(end)):
3839
# check only svs.. take this out for core analysis but keep in for extra analysis
3940
#if 'SVLEN' not in i.info or i.info["SVLEN"] < 50:
4041
#continue
4142
vs, ve = truvari.entry_boundaries(i)
4243
if start <= vs and ve <= end:
4344
cnt += 1
44-
bases += truvari.entry_size(i)
45-
fout.write(f"{line}\t{cnt}\t{bases}\n")
46-
data = pd.read_csv(f"counts_{out_name}", sep='\t', header=None, names=['chrom', 'start', 'end', 'num_vars', 'num_bases'])
45+
if truvari.entry_variant_type(i) != truvari.SV.SNP:
46+
non_snp += 1
47+
else:
48+
snps += 1
49+
fout.write(f"{line}\t{cnt}\t{snps}\t{non_snp}\n")
50+
data = pd.read_csv(f"counts_{out_name}", sep='\t', header=None, names=['chrom', 'start', 'end', 'num_vars', 'snps', 'non_snp'])
4751
print("statistic\tcount\tpercent")
4852

4953
tot = len(data)
@@ -53,11 +57,11 @@ def main(in_bed, in_vcf, out_name):
5357
i = no_var.sum()
5458
print("no variant\t%d\t%.4f" % (i, i / tot))
5559

56-
single_snp = (data['num_vars'] == 1) & (data['num_bases'] == 1)
60+
single_snp = (data['snps'] == 1) & (data['non_snp'] == 0)
5761
i = single_snp.sum()
5862
print("only a SNP\t%d\t%.4f" % (i, i / tot))
5963

60-
only_snps = (data['num_vars'] > 1) & (data['num_vars'] == data['num_bases'])
64+
only_snps = (data['snps'] > 1) & (data['non_snp'] == 0)
6165
i = only_snps.sum()
6266
print("only SNPs\t%d\t%.4f" % (i, i / tot))
6367

+23
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
echo v0.1
2+
echo pbsv
3+
bedtools intersect -u -a data/pbsv/merged.bed.gz -b delme_v0.1/data/tr_regions.bed.gz | wc -l
4+
echo usc
5+
bedtools intersect -u -a data/usc/merged.bed.gz -b delme_v0.1/data/tr_regions.bed.gz | wc -l
6+
echo trgt
7+
bedtools intersect -u -a data/trgt/merged.bed.gz -b delme_v0.1/data/tr_regions.bed.gz | wc -l
8+
9+
echo v0.2
10+
echo pbsv
11+
bedtools intersect -u -a data/pbsv/merged.bed.gz -b adotto_TRannotations_v0.2.bed.gz | wc -l
12+
echo usc
13+
bedtools intersect -u -a data/usc/merged.bed.gz -b adotto_TRannotations_v0.2.bed.gz | wc -l
14+
echo trgt
15+
bedtools intersect -u -a data/trgt/merged.bed.gz -b adotto_TRannotations_v0.2.bed.gz | wc -l
16+
17+
echo v0.3
18+
echo pbsv
19+
bedtools intersect -u -a data/pbsv/merged.bed.gz -b adotto_TRannotations_v0.3.bed.gz | wc -l
20+
echo usc
21+
bedtools intersect -u -a data/usc/merged.bed.gz -b adotto_TRannotations_v0.3.bed.gz | wc -l
22+
echo trgt
23+
bedtools intersect -u -a data/trgt/merged.bed.gz -b adotto_TRannotations_v0.3.bed.gz | wc -l

variants/.gitignore

+14
Original file line numberDiff line numberDiff line change
@@ -1 +1,15 @@
11
data/adotto_variants.grch38.sqoff.vcf.gz
2+
benchmarking/cmrg_comb/
3+
benchmarking/cmrg_region_phab/
4+
benchmarking/cmrg_svsmall/
5+
benchmarking/delme.txt
6+
benchmarking/no_bed/
7+
benchmarking/phab.raw.jl
8+
benchmarking/redos.bed
9+
benchmarking/redos/
10+
benchmarking/test/
11+
benchmarking/trash/
12+
benchmarking/x.sh
13+
data/adotto_variants.grch38.sqoff.vcf.gz.tbi
14+
data/delme/
15+
x.sh

0 commit comments

Comments
 (0)