Skip to content

Commit 85c5cdc

Browse files
committed
Benchmarking and v0.2 regions progress
Made a new version of the tr regions that's a single, consolidated file Working on the benchmarking (this will need to be redone) updating a lot of .gitignores because I have data littered throughout as I work on this
1 parent 306db7c commit 85c5cdc

File tree

10 files changed

+86
-7
lines changed

10 files changed

+86
-7
lines changed

realignment/.gitignore

+2
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
simulation/
2+
v0.1variants_v0.2regions_realignment/

regions/.gitignore

+1
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
adotto_TRannotations_v0.2.bed.gz
12
*.tbi
23
data/merged.slop25.bed.gz
34
data/tr_regions.bed.gz

regions/README.md

+26-2
Original file line numberDiff line numberDiff line change
@@ -119,7 +119,31 @@ bcftools view -r chr:start-end some.vcf.gz # Query variants
119119
Note `samtools faidx` is not the same coordinate system as `tabix`. But the bed files are 0-based and vcfs are 1-based,
120120
both half-open(?).
121121

122-
Next steps
123-
==========
122+
Variants
123+
========
124124
After creating variants, go to `intersection/` to use the variants to perform more filtering and analysis on the
125125
tr_regions/annotations.
126+
127+
Creating the final annotation
128+
=============================
129+
As of right now, we have tr regions (simple chr:start-end) and annotations (coordinates + other info). Since there can
130+
be multiple annotations per-region, it makes sense to combine these two files into one to facilitate analysis. We'll
131+
design this file for use with `truvari anno trf`, but it can be adopted to other uses. The final file format will be a
132+
tab-delimited file with columns:
133+
134+
* chrom - tr_region chromosome
135+
* start - tr_region 0-based start position
136+
* end - tr_region 0-based end position
137+
* annos - json of annotations
138+
139+
The json of annotations will be a list of dicts with key:values listed above in "Defining Repeats".
140+
141+
To create this file, simply run:
142+
143+
```bash
144+
python scripts/tr_reganno_maker.py intersection/data/tr_regions.bed intersection/data/tr_annotations.bed.gz \
145+
| bedtools sort -i - | bgzip > adotto_TRannotations_v0.2.bed.gz
146+
```
147+
148+
The tr_regions can be `.bed` or `.bed.gz` but the tr_annotations need to be `.bed.gz` with a `.tbi` index.
149+

regions/intersection/.gitignore

+3
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
counts_tr_annotations_vinter.bed.gz
2+
data/
3+
filtered_tr_annotations_vinter.bed

regions/scripts/tr_reganno_maker.py

+29
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,29 @@
1+
import sys
2+
import json
3+
import tabix
4+
import truvari
5+
6+
def main(reg_fn, anno_fn):
7+
"""
8+
Combine regions and annotations into a file
9+
"""
10+
header = [('chrom', str), ('start', int), ('end', int), ('period', float),
11+
('copies', float), ('score', int), ('entropy', float), ('repeat', str)]
12+
13+
tb = tabix.open(anno_fn)
14+
for line in truvari.opt_gz_open(reg_fn):
15+
chrom, start, end = line.strip().split('\t')[:3]
16+
start = int(start)
17+
end = int(end)
18+
m_data = []
19+
for i in tb.query(chrom, start, end):
20+
try:
21+
m_data.append({fmt[0]: fmt[1](x) for x, fmt in zip(i, header)})
22+
except tabix.TabixError as e:
23+
pass # allow empty regions
24+
data_str = json.dumps(m_data)
25+
print(f"{chrom}\t{start}\t{end}\t{data_str}")
26+
27+
if __name__ == '__main__':
28+
reg, anno = sys.argv[1:]
29+
main(reg, anno)

variants/benchmarking/.gitignore

+1
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,5 @@
11
inputs/merged.vcf.gz
2+
inputs/merged.vcf.gz.tbi
23
inputs/GRCh38.variants.sqoff.vcf.gz
34
inputs/HG002_comp.vcf.gz
45
inputs/HG002_comp.vcf.gz.tbi
+12
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,12 @@
1+
HG002_comp.jl
2+
NA24385_comp.jl
3+
base_cnt.jl
4+
base_sv.jl
5+
by_size_performance_maker.py
6+
deliver.tgz
7+
li:NA24385_comp.jl
8+
multi_vcfeval/
9+
old_regions/
10+
test.jl
11+
test2.jl
12+
x.sh

variants/benchmarking/scripts/run_bench.sh

+7-5
Original file line numberDiff line numberDiff line change
@@ -10,6 +10,8 @@ comp_vcfs[eichler]=/users/u233287/scratch/code/adotto/variants/benchmarking/inpu
1010
bdir=/users/u233287/scratch/code/adotto/variants/benchmarking/
1111
thfa_vcf=$bdir/truthsets/HPRC-cur.20211005-align2-GRCh38.dip.singlealleles.vcf.gz
1212
thfa_bed=$bdir/truthsets/HPRC-cur.20211005-align2-GRCh38.dip.bed
13+
thfa_sm_bed=$bdir/truthsets/temp_hc_regions/GRCh38_HG2-HPRC-20211005_dipcall-z2k.smallvar.benchmark.bed
14+
thfa_lg_bed=$bdir/truthsets/temp_hc_regions/GRCh38_HG2-HPRC-20211005_dipcall-z2k.SV.benchmark.bed
1315
cmrg_vcf=$bdir/truthsets/HG002_GRCh38_CMRG_SV_v1.00.vcf.gz
1416
cmrg_bed=$bdir/truthsets/HG002_GRCh38_CMRG_SV_v1.00.bed
1517

@@ -21,13 +23,13 @@ mkdir -p results/
2123
for samp_name in "${!comp_vcfs[@]}"
2224
do
2325
m_vcf="${comp_vcfs[$samp_name]}"
24-
#$rtg vcfeval -t $rtg_ref --squash-ploidy -e $cmrg_sm_bed \
25-
#-b $cmrg_sm_vcf -c $m_vcf -o results/rtg_cmrg_${samp_name}
26-
#$rtg vcfeval -t $rtg_ref --squash-ploidy -e $thfa_bed \
27-
#-b $thfa_ma_vcf -c $m_vcf -o results/rtg_thfa_${samp_name}
26+
$rtg vcfeval -t $rtg_ref --squash-ploidy -e $cmrg_sm_bed \
27+
-b $cmrg_sm_vcf -c $m_vcf -o results/rtg_cmrg_${samp_name}
28+
$rtg vcfeval -t $rtg_ref --squash-ploidy -e $thfa_sm_bed \
29+
-b $thfa_vcf -c $m_vcf -o results/rtg_thfa_${samp_name}
2830

2931
truvari bench -b $cmrg_vcf -c $m_vcf -f $ref --includebed $cmrg_bed \
3032
--passonly -o results/truvari_cmrg_${samp_name}
31-
truvari bench -b $thfa_vcf -c $m_vcf -f $ref --includebed $thfa_bed \
33+
truvari bench -b $thfa_vcf -c $m_vcf -f $ref --includebed $thfa_lg_bed \
3234
--passonly -o results/truvari_thfa_${samp_name}
3335
done

variants/data/.gitignore

+4
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
GRCh38.variants.sqoff.vcf.gz
2+
GRCh38.variants.sqoff.vcf.gz.tbi
3+
adotto_variants.grch38.sqoff.bcf.gz
4+
adotto_variants.grch38.sqoff.bcf.gz.csi

variants/diversity/.gitignore

+1
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
results/

0 commit comments

Comments
 (0)