Calling Duplex Consensus Reads
Calling consensus reads from duplex data is the process of taking reads that were generated in a way that allows post-hoc identification of which sequencing reads derive from each of the paired strands of an original duplex (double-stranded) source molecule of DNA, and combining them into consensus reads. One such example is the process outlined by Kennedy et al., which attaches UMIs to each end of a source molecule.
Mathematically the process is very similar to the one outlined for calling consensus reads from single-UMI data, though the mechanics are somewhat different.
The process outlined below is implemented in the CallDuplexConsensusReads program in fgbio, which is run after first grouping reads with GroupReadsByUmi --strategy=paired.
The high-level process starts with a group of reads identified as originating from the same double-stranded source molecule. The two strands of the original molecule are labeled, arbitrarily, as A and B, and each read is known to have originated from either the A strand or the B strand. The process proceeds through the following steps:
- Reads are split into four sub-groups:
  - Strand A and read 1 (A1s)
  - Strand A and read 2 (A2s)
  - Strand B and read 1 (B1s)
  - Strand B and read 2 (B2s)
- Reads are unmapped and, if necessary, reverted to sequencing order
- Quality trimming, if enabled
- Remaining low-quality bases are masked (i.e. converted to Ns)
- Reads are further trimmed to the length of the insert if the insert is shorter than the read length
- Reads are filtered based on their Cigar (alignment structure) to ensure reads are always in phase
- Four single-strand consensus reads are generated, one each for A1s, A2s, B1s, and B2s
- Two duplex consensus reads are generated by combining the A1 and B2 consensus reads, and the A2 and B1 consensus reads
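
The grouping, masking and pairing steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not fgbio's implementation: the `Read` record, the function names and the `min_qual` default are all hypothetical, the per-column majority vote is a simplified stand-in for the likelihood-based single-strand model, and the strand-combining rule (keep a base only where both strands agree) is an assumption made for brevity. It also assumes reads are already in sequencing order and of equal length, so that bases at the same index are comparable.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical minimal read record (not an fgbio or htsjdk type)."""
    strand: str        # "A" or "B": source strand of the duplex molecule
    read_number: int   # 1 or 2: first or second read of the pair
    bases: str
    quals: list        # one Phred score per base


def split_into_subgroups(reads):
    """Bucket reads into the four sub-groups A1, A2, B1 and B2."""
    groups = defaultdict(list)
    for r in reads:
        groups[f"{r.strand}{r.read_number}"].append(r)
    return groups


def mask_low_quality(read, min_qual):
    """Replace bases whose quality is below min_qual with N."""
    return "".join(b if q >= min_qual else "N"
                   for b, q in zip(read.bases, read.quals))


def majority_consensus(seqs):
    """Per-column majority vote, ignoring N; ties and empty columns give N.

    Assumption: a simplified stand-in for a likelihood-based model.
    """
    out = []
    for column in zip(*seqs):
        counts = Counter(b for b in column if b != "N").most_common(2)
        if not counts or (len(counts) == 2 and counts[0][1] == counts[1][1]):
            out.append("N")
        else:
            out.append(counts[0][0])
    return "".join(out)


def combine_strands(x, y):
    """Keep a base only where both single-strand consensus reads agree."""
    return "".join(a if a == b else "N" for a, b in zip(x, y))


def call_duplex(reads, min_qual=20):
    """Return the two duplex consensus sequences for one source molecule."""
    groups = split_into_subgroups(reads)
    ss = {
        name: majority_consensus(
            [mask_low_quality(r, min_qual) for r in groups[name]]
        )
        for name in ("A1", "A2", "B1", "B2")
    }
    # A1 is paired with B2, and A2 is paired with B1.
    return combine_strands(ss["A1"], ss["B2"]), combine_strands(ss["A2"], ss["B1"])
```

Calling `call_duplex` on the reads of a single source molecule returns a pair of strings, one per duplex consensus read. Positions where the two strands disagree, or where no unambiguous single-strand call could be made, come back as N in this sketch.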