Calling Duplex Consensus Reads

Tim Fennell edited this page Jun 7, 2017 · 6 revisions

Overview

Calling consensus reads from duplex data is the process of taking reads that were generated in a way that allows post-hoc identification of which sequencing reads derive from each of the paired strands of an original duplex (double-stranded) source molecule of DNA, and combining them into consensus reads. One example of such data is the process outlined by Kennedy et al., which attaches UMIs to each end of a source molecule.

Mathematically the process is very similar to the one outlined for calling consensus reads from single-UMI data, though the mechanics are somewhat different.

The process outlined below is implemented in the CallDuplexConsensusReads program in fgbio, which is run after first grouping reads with GroupReadsByUmi --strategy=paired.

Process

The high-level process starts with a group of reads identified as originating from the same double-stranded source molecule. The two strands of the original molecule are labeled, arbitrarily, as A and B, and each read is known to have originated from either the A strand or the B strand. The process proceeds through the following steps:

  1. Reads are split into four sub-groups:
    • Strand A and read 1 (A1s)
    • Strand A and read 2 (A2s)
    • Strand B and read 1 (B1s)
    • Strand B and read 2 (B2s)
  2. Reads are unmapped and, if necessary, reverted to sequencing order
  3. Reads are quality trimmed, if enabled
  4. Remaining low-quality bases are masked (i.e. converted to Ns)
  5. Reads are further trimmed to the length of the insert if the insert is shorter than the read length
  6. Reads are filtered based on their CIGAR (alignment structure) to ensure reads are always in phase
  7. Four single-strand consensus reads are generated, one each for A1s, A2s, B1s, and B2s
  8. Two duplex consensus reads are generated by combining the A1 and B2 consensus reads, and the A2 and B1 consensus reads
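
Steps 1, 4 and 8 above can be sketched in a few lines of Python. This is a minimal illustration, not the fgbio implementation: the `Read` type, the function names, the `min_quality` parameter and the `DUPLEX_PAIRINGS` table are all hypothetical, and the sketch assumes each read already carries its strand label (A or B) and read number (1 or 2). The pairing table assumes that the second duplex read is built from the two sub-groups not used for the first (A2 and B1), since read 1 from one strand and read 2 from the other cover the same end of the source molecule.

```python
from typing import Dict, List, NamedTuple, Sequence


class Read(NamedTuple):
    """Hypothetical, simplified read record used only for this sketch."""
    name: str
    strand: str             # "A" or "B", the arbitrary strand label
    read_number: int        # 1 or 2
    bases: str
    quals: Sequence[int]    # one base quality per base


def split_into_subgroups(reads: Sequence[Read]) -> Dict[str, List[Read]]:
    """Step 1: split reads into the A1, A2, B1 and B2 sub-groups."""
    groups: Dict[str, List[Read]] = {"A1": [], "A2": [], "B1": [], "B2": []}
    for read in reads:
        key = f"{read.strand}{read.read_number}"
        if key not in groups:
            raise ValueError(f"unexpected strand/read number: {key}")
        groups[key].append(read)
    return groups


def mask_low_quality(read: Read, min_quality: int) -> Read:
    """Step 4: convert bases below min_quality to N (min_quality is an assumed parameter)."""
    masked = "".join(
        base if qual >= min_quality else "N"
        for base, qual in zip(read.bases, read.quals)
    )
    return read._replace(bases=masked)


# Step 8: which single-strand consensus reads are combined into each duplex read.
# Assumption: duplex read 1 uses A1 + B2 and duplex read 2 uses A2 + B1.
DUPLEX_PAIRINGS = {"R1": ("A1", "B2"), "R2": ("A2", "B1")}
```

For example, a read from the A strand with bases `ACGT` and qualities `[30, 30, 5, 30]` would be placed in the A1 or A2 sub-group according to its read number, and masking with `min_quality=10` would turn its third base into `N`.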