- BAM file(s) (or FastQ files).
- BED file (optional).
summary.html
– sample-level coverage statistics and plots.summary.tsv
– sample-level coverage, parsable versionregions.tsv
– region-level coverage statistics.
To view regions.tsv
contents in a nice aligned table, you can use tsv
command, e.g.:
tsv regions.tsv
chr start end size gene exon strand feature biotype transcript trx_overlap exome_overlap cds_overlap avg_depth at1x at5x at10x at20x at50x at100x at250x at500x at1000x at5000x at10000x at50000x
chr21 9907173 9907501 328 TEKT4P2 3 - capture transcribed_unprocessed_pseudogene ENST00000400754 95.1% 95.1% 0% 115.137 100 100 100 100 100 66.7683 0 0 0 0 0 0
chr21 10863011 10863111 100 IGHV1OR21-1 2 + capture IG_V_gene ENST00000559480 56.0% 56.0% 53.0% 106.93 100 100 100 100 100 100 0 0 0 0 0 0
chr21 10910285 10910401 116 TPTE 22 - capture protein_coding ENST00000361285 100.0% 80.2% 80.2% 36.3621 100 100 100 100 0 0 0 0 0 0 0 0
Or you can use bioawk (conda install -c bioconda bioawk
) to query certain columns with:
cat ./tests/gold_standard/bed3/syn3-tumor/regions.tsv | bioawk -tc hdr '{ print $chr, $start, $end, $gene, $avg_depth, $at100x }' | tsv
chr start end gene avg_depth at100x
chr21 9907173 9907501 TEKT4P2 115.137 66.7683
chr21 10863011 10863111 IGHV1OR21-1 106.93 100
chr21 10910285 10910401 TPTE 36.3621 0
-
trx_overlap
,exome_overlap
,cds_overlap
- percentage of the region that overlaps transcripts, exons, or CDS (coding regions) in Ensembl database, correspondingly. -
at1x
,at5x
, etc - percentage of the region that is covered at least at this depth (1x, 5x) by reads, excluding duplicated, quality-failed, secondary and supplemetrary aligned reads.
Install miniconda if not already:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh # linux
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh # macos
bash miniconda.sh -b -p ./miniconda
unset PYTHONPATH
export PATH=$(pwd)/miniconda/bin:$PATH
conda activate base # or create an environment with `conda env create -n targqc`
Install targqc:
conda install -c vladsaveliev -c bioconda -c conda-forge targqc
This approach assumes you have dependencies like qualimap, bedtools, sambamba already installed on your system and available in PATH.
git clone --recursive https://github.com/vladsaveliev/TargQC
cd TargQC
virtualenv venv_targqc && source venv_targqc/bin/activate # optional, but recommended if you are not an admin
pip install --upgrade pip setuptools
pip install -r requirements.txt
python setup.py install
targqc *.bam --bed target.bed -g hg19 -o targqc_results
The results will be written to targqc_results
folder.
The BED file may be omitted. In this case statistics reported will be based of off the whole genome.
The accepted values for -g
are hg19
, hg38
, or a full path to any indexed reference fasta file:
targqc *.bam --bed target.bed -g /path/to/genomes/some_genome.fa -o targqc_results
When running from BAMs, only the .fai
index is used, and the fasta file itself can be non-existent.
Instead of the BAM files, input FastQ are also allowed. The reads will be aligned by BWA to the reference
genome specified by --bwa-prefix
(unless -g
is already a fasta path bwa-indexed).
targqc *.fastq --bed target.bed -g hg19 -o targqc_results --bwa-prefix /path/to/ref.bwa
Option --downsample-to <N>
(default value 5e5
) specifies the number of
read pairs will be randomly selected from each input set. This feature allows to quickly estimate approximate
coverage quality before full alignment. To turn downsampling off and align all reads, set --downsample-to off
.
Run using 3 threads:
targqc *.bam --bed target.bed -g hg19 -o targqc_results -t 3
Run using 3 jobs, using SGE scheduler, and queue "queue":
targqc *.bam --bed target.bed -g hg19 -o targqc_results -t 3 -s sge -q batch.q -r pename=smp
If the number of samples is higher than the requested number of jobs, the processes within job will be additionally parallelized using threads, so the full number of occupied cores will equal the number of requested threads (-t)
Other supported schedulers: Platform LSF ("lsf"), Sun Grid Engine ("sge"), Torque ("torque"), SLURM ("slurm") (see details at https://github.com/roryk/ipython-cluster-helper)
A tool that assings gene names to regions in a BED file based on Ensembl genomic features overlap.
UPD: Moved into a separate repository https://github.com/vladsaveliev/bed_annotation
Build a web-page with size-proportional Venn diagrams for an unlimited set of BED files:
UPD: Moved into a separate repository https://github.com/vladsaveliev/Venn