Calling Duplex Consensus Reads
Calling consensus reads from duplex data is the process of taking reads that were generated in a way that allows post-hoc identification of which sequencing reads derive from each of the paired strands of an original duplex (double-stranded) source molecule of DNA, and combining them into consensus reads for that molecule. One example is the process outlined by Kennedy et al., which attaches UMIs to each end of a source molecule.
Mathematically the process is very similar to the one outlined for calling consensus reads from single-UMI data, though the mechanics are somewhat different.
The process outlined below is implemented in the CallDuplexConsensusReads program in fgbio, which is run after first grouping reads with GroupReadsByUmi --strategy=paired.
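For orientation, the paired strategy of GroupReadsByUmi records the molecule and strand assignment in each read's MI tag, using values of the form `{id}/A` or `{id}/B`. The snippet below is a minimal sketch of splitting such a value into its two parts; the function name `parse_mi_tag` is hypothetical and is not part of fgbio.

```python
from typing import Tuple


def parse_mi_tag(mi: str) -> Tuple[str, str]:
    """Split a paired-strategy MI value such as '12/A' into (molecule_id, strand)."""
    # Split on the last '/' so that the strand suffix is always the final field.
    molecule_id, sep, strand = mi.rpartition("/")
    if not sep or not molecule_id or strand not in ("A", "B"):
        raise ValueError(f"not a paired-strategy MI value: {mi!r}")
    return molecule_id, strand


if __name__ == "__main__":
    print(parse_mi_tag("12/A"))
    print(parse_mi_tag("12/B"))
```

Reads that share a molecule id but differ in the suffix come from opposite strands of the same source molecule.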
The high-level process starts with a group of reads identified as originating from the same double-stranded source molecule. The two strands of the original molecule are labeled, arbitrarily, as A and B, and each read is known to have originated from either the A strand or the B strand. The process proceeds through the following steps:
- Reads are split into four sub-groups:
  - Strand A and read 1 (A1s)
  - Strand A and read 2 (A2s)
  - Strand B and read 1 (B1s)
  - Strand B and read 2 (B2s)
- Reads are unmapped and, if necessary, reverted to sequencing order
- Quality trimming is applied, if enabled
- Remaining low-quality bases are masked (i.e. converted to Ns)
- Reads are further trimmed to the length of the insert if the insert is shorter than the read length
- Reads are filtered based on their Cigar (alignment structure) to ensure reads are always in phase
- Four single-strand consensus reads are generated, one each for A1s, A2s, B1s, and B2s
- Two duplex consensus reads are generated by combining the A1 and B2 consensus reads, and the A2 and B1 consensus reads
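
The grouping, masking, and pairing steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not fgbio's implementation: the `Read` class, the function names, and the `MIN_QUAL` threshold are all hypothetical; a per-column majority vote stands in for the likelihood-based consensus model; disagreement between strands is simply reported as `N`; and the insert-length trimming and Cigar filtering steps are omitted because they need alignment information. Reads are assumed to already be in sequencing order.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, List

MIN_QUAL = 20  # hypothetical masking threshold (Phred scale)


@dataclass
class Read:
    strand: str       # "A" or "B"
    read_num: int     # 1 or 2
    bases: str        # bases in sequencing order
    quals: List[int]  # Phred qualities, one per base


def split_into_subgroups(reads: List[Read]) -> Dict[str, List[Read]]:
    """Group reads into the sub-groups A1, A2, B1, and B2."""
    groups: Dict[str, List[Read]] = defaultdict(list)
    for read in reads:
        groups[f"{read.strand}{read.read_num}"].append(read)
    return groups


def mask_low_quality(read: Read, min_qual: int = MIN_QUAL) -> str:
    """Convert bases below the quality threshold to N."""
    return "".join(b if q >= min_qual else "N"
                   for b, q in zip(read.bases, read.quals))


def majority_consensus(seqs: List[str]) -> str:
    """Simplified stand-in for single-strand consensus: per-column majority vote."""
    if not seqs:
        return ""
    out = []
    for i in range(min(len(s) for s in seqs)):
        counts = Counter(s[i] for s in seqs if s[i] != "N")
        ranked = counts.most_common(2)
        # No usable bases, or a tie between the top two bases, yields N.
        if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
            out.append("N")
        else:
            out.append(ranked[0][0])
    return "".join(out)


def duplex_consensus(ss_x: str, ss_y: str) -> str:
    """Simplified duplex call: keep a base only where both strands agree."""
    return "".join(x if x == y else "N" for x, y in zip(ss_x, ss_y))


def call_duplex(reads: List[Read]) -> Dict[str, str]:
    groups = split_into_subgroups(reads)
    ss = {name: majority_consensus([mask_low_quality(r) for r in groups.get(name, [])])
          for name in ("A1", "A2", "B1", "B2")}
    # Read 1 of one strand covers the same end of the molecule as read 2 of the other.
    return {"R1": duplex_consensus(ss["A1"], ss["B2"]),
            "R2": duplex_consensus(ss["A2"], ss["B1"])}


if __name__ == "__main__":
    example = [
        Read("A", 1, "ACGT", [30, 30, 30, 30]),
        Read("A", 1, "ACGT", [30, 30, 30, 10]),
        Read("B", 2, "ACGA", [30, 30, 30, 30]),
        Read("A", 2, "TTGC", [30, 30, 30, 30]),
        Read("B", 1, "TTGC", [30, 30, 30, 30]),
    ]
    print(call_duplex(example))  # {'R1': 'ACGN', 'R2': 'TTGC'}
```

In the example, the two strands disagree at the last base of the first duplex read, so that position is reported as `N`, while the second duplex read is supported by both strands at every position.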