hairpin2 – read-aware artefactual variant flagging
hairpin2 is designed to flag variants that are likely artefactual via a series of tests performed upon the read data associated with each variant. Initially, it was conceived to flag possible cruciform artefacts for LCM sequence data, but the concept has been extended and can detect a variety of potentially spurious variants (including indels). The tool operates on a VCF file containing one or more samples, and alignment files for all samples to be tested.
Given a VCF, and BAM files for the samples of that VCF, return a VCF with variants flagged with ADF if variants have anomalous distributions indicating that they are likely to be artefactual, ALF if relevant reads have lower median alignment score per base than a specified threshold, DVF if variants appear to be the result of PCR error, and LQF if the variant is largely supported by low quality reads.
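As a rough illustration of working with the flagged output, the sketch below tallies how many records carry each flag. It assumes the flags are written to the FILTER column of the returned VCF, which is the usual VCF convention but is not stated above. The function name `count_flags` and the example records are hypothetical.

```python
from collections import Counter

# Flag names are taken from the description above. That they appear in the
# FILTER column (7th field) of the output VCF is an assumption of this sketch.
HAIRPIN2_FLAGS = {"ADF", "ALF", "DVF", "LQF"}


def count_flags(vcf_lines):
    """Count how many records carry each hairpin2 flag in the FILTER column."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete VCF record
        for name in fields[6].split(";"):  # FILTER entries are ';'-separated
            if name in HAIRPIN2_FLAGS:
                counts[name] += 1
    return counts


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t200\t.\tG\tC\t.\tADF;ALF\t.",
    "chr2\t300\t.\tC\tCA\t.\tDVF;LQF\t.",
    "chr2\t400\t.\tT\tG\t.\tALF\t.",
]
flag_counts = count_flags(example)
```

Plain string parsing is used here so the sketch has no dependencies; on real data a VCF library such as pysam would be a more robust choice.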
- ADF; The ADF flag is an implementation of the artefact detection algorithm described in Ellis et al, 2020. It detects variants which appear with anomalously regular positional distribution in supporting reads.
- ALF; The ALF flag indicates variants which are supported by reads with poor signal-to-noise, per the alignment score. It is complementary to the ADF flag – artefacts with anomalous distributions often cause a marked decrease in alignment score.
- DVF; The DVF flag is a naive but effective algorithm for detecting variants which are the result of PCR error. In regions of low complexity, short repeats and homopolymer tracts can cause PCR stuttering. PCR stuttering can lead to, for example, an erroneous additional A on the read when amplifying a tract of As. If duplicated reads contain stutter, this can lead to variation of read length and alignment to reference between reads that are in fact duplicates. Because of this, these duplicates both evade dupmarking and give rise to spurious variants when calling. The DVF flag attempts to catch these variants by examining the regularity of the start and end coordinates of collections of supporting reads and their mates.
- LQF; The LQF flag is a superset of the DVF flag. It tests whether a variant is largely supported by both low quality reads and stutter duplicate reads (which are also considered low quality). Note that because the parameters for LQF and DVF are independent, you can set the sensitivity of each independently, so the result of LQF is not necessarily a complete overlap with DVF (and usually is not).
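To make two of the ideas above concrete, here is a minimal sketch of an ALF-style check (median alignment score per base against a threshold) and a DVF-style check (reads whose start and end coordinates, and those of their mates, nearly coincide). This is a simplified illustration and not hairpin2's implementation: the function names, the plain-tuple inputs, and the default values `threshold=0.93` and `max_span=6` are all hypothetical.

```python
from statistics import median


def median_score_per_base(reads):
    """reads: iterable of (alignment_score, aligned_length) pairs."""
    per_base = [score / length for score, length in reads if length > 0]
    return median(per_base) if per_base else None


def would_flag_alf(reads, threshold=0.93):
    """True if the median per-base alignment score falls below the threshold.

    The threshold value is a placeholder, not a documented default.
    """
    m = median_score_per_base(reads)
    return m is not None and m < threshold


def count_near_duplicates(endpoints, max_span=6):
    """Count reads that look like stutter duplicates of an earlier read.

    endpoints: list of (start, end, mate_start, mate_end) tuples.
    Reads are sorted, and a read counts as a near-duplicate when every one
    of its four coordinates lies within max_span bases of the current
    cluster's first read. This greedy grouping is an assumption of the
    sketch, chosen for brevity.
    """
    duplicates = 0
    anchor = None
    for coords in sorted(endpoints):
        if anchor is not None and all(
            abs(a - b) <= max_span for a, b in zip(anchor, coords)
        ):
            duplicates += 1
        else:
            anchor = coords  # start a new cluster
    return duplicates


# Hypothetical supporting reads for one variant.
good_reads = [(150, 150), (148, 150), (145, 150)]
poor_reads = [(120, 150), (100, 150), (140, 150)]
read_ends = [
    (100, 250, 400, 550),
    (101, 251, 400, 550),
    (100, 249, 401, 551),
    (300, 450, 600, 750),
]
n_duplicates = count_near_duplicates(read_ends)
```

With real data the scores and coordinates would come from the alignment records (for example the AS tag and the MC tag for the mate), rather than from hand-written tuples.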
All flags are tunable such that their parameters can be configured to a variety of use cases and sequencing methods.
Python >= 3.12
Further dependencies (pysam, pydantic, and optionally pytest and pytest-cov) are detailed in pyproject.toml, and will be downloaded automatically if following the recommended install process.
The easiest end-user approach is to install into a virtual environment:
python -m venv .env
source .env/bin/activate
pip install .
hairpin -h
For development, substitute:
pip install -e ".[dev,doc]"
hairpin2 is designed for paired data where alignment records have the MC tag and the complete CIGAR string is present in the CIGAR field (rather than the CG:B,I tag). If the MC tag is not present in your data, it can be added using samtools fixmate or biobambam2 bamsormadup. The tool can handle substitutions, insertions, and deletions formatted per the VCF specification. At this time, the tool will not investigate mutations notated with angle brackets, e.g. <DEL>, complex mutations, or monomorphic reference. No further assumptions are made – other alignment tags and VCF fields are used, but they are mandatory per the relevant format specifications. If these requirements are limiting and you need the tool to be extended in some way, please request it.
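The variant types listed above can be told apart from the REF and ALT fields alone. The sketch below is one way to pre-screen records before running the tool. The function name `classify_alt` and the category labels are hypothetical, and treating every record that is not a simple substitution, insertion or deletion as "complex" is an assumption of this sketch; hairpin2's own boundaries may differ.

```python
def classify_alt(ref, alt):
    """Classify a single REF/ALT pair from a VCF record.

    Returns one of: "monomorphic", "symbolic", "substitution",
    "insertion", "deletion", "complex".
    """
    if alt in (".", ""):
        return "monomorphic"  # no alternate allele given
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"  # angle-bracket notation, e.g. <DEL>
    if len(ref) == 1 and len(alt) == 1:
        return "substitution"
    # Per the VCF specification, simple indels share a leading anchor base.
    if len(ref) == 1 and len(alt) > 1 and alt.startswith(ref):
        return "insertion"
    if len(alt) == 1 and len(ref) > 1 and ref.startswith(alt):
        return "deletion"
    return "complex"  # everything else, by assumption of this sketch


# Categories the text above says the tool handles.
SUPPORTED = {"substitution", "insertion", "deletion"}

examples = [("A", "T"), ("A", "AT"), ("AT", "A"), ("A", "<DEL>"), ("A", "."), ("AT", "GC")]
labels = [classify_alt(ref, alt) for ref, alt in examples]
investigated = [label in SUPPORTED for label in labels]
```

A pre-screen like this only reports which records fall outside the supported categories; it does not change how the tool itself treats them.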
- automated regression testing
- discussions to be had on multisample VCF support, and multiallelic variant support
hairpin2 was developed at the Wellcome Sanger Institute in collaboration between CASM Informatics and CASM faculty. We thank all contributors and scientific collaborators for their input and expertise.
hairpin2 is chiefly the work of:
Alex Byrne - CASM Informatics - blex.bio - Lead Developer/Contact
Anh Phuong Le - CASM - GitHub
Peter Campbell - Quotient Therapeutics