Description
A pileup of uncorrected Guppy v3.6 reads produces a better guess at the true sequence than a pileup of Canu-corrected reads. The errors introduced by correction are all deletion variants, and they occur only at specific locations; for the most part, reads are more accurately called after correction (especially at SNPs). I acknowledge that this difference may be purely due to the mapper (I used minimap for the correction overlaps, rather than the default).
I've started re-assembling my January 2017 reads from Nippostrongylus brasiliensis, re-basecalled using Guppy v3.6, using Canu v2.0. We've got a pretty accurate mitochondrial genome that was previously assembled from Illumina-corrected nanopore reads (confirmed by our own Illumina cDNA reads, and by the Sanger Institute's own Illumina reads), so I've been using that as a yardstick to measure the accuracy of basecalled unmethylated reads. I use LAST with a trained alignment matrix for this mapping, because it seems to have a better mapping error profile than minimap2.
Here is a combined coverage / variant plot showing Canu-corrected reads:
And here is a combined coverage / variant plot showing the uncorrected version of those same reads:
The identified variants with frequency >40% on both the forward and reverse strand are indicated on the outer portion of the plot. Only deletion variants exist in both plots. Note that there are more identified variants for the corrected reads (11) than for the uncorrected reads (1). Based on Illumina reads that I've looked at, I believe that uncorrected variant to be a true reflection of polymorphic mitochondrial sequence, rather than an error.
This is not an issue, as such. More of an observation, just in case it's helpful for improving and/or speeding up Canu.
In case anyone wants to have a look at these reads, the uncorrected Guppy v3.6-called reads that map to the mitochondrial genome can be found here. Coverage for the mitochondrial genome is about 650X.
In order to map reads to the mitochondrial genome I use LAST. Here's an example command sequence:
lastal -P 10 -p ~/db/last/bc.mat ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta uncorrectedReads_vs_Nb_mtDNA.fa.gz | last-split | maf-convert sam | samtools view -h --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta | samtools sort > mtDNA_called_uncorrected_vs_nippo_mtDNA.bam
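As a rough sanity check on the coverage figure quoted above (about 650X), the mean depth of the resulting BAM can be estimated from `samtools depth -a` output, which is tab-separated with the depth in the third column. This is only a minimal sketch; the helper name `mean_depth` is made up for illustration and is not part of my scripts:

```shell
# Hypothetical helper: reads `samtools depth -a` output on stdin
# (reference<TAB>position<TAB>depth) and prints the mean depth.
mean_depth() {
    awk -F'\t' '
        { total += $3; n++ }
        END {
            if (n > 0) printf "%.1f\n", total / n
            else print "NA"   # no positions were read
        }
    '
}

# Example usage (requires samtools; BAM name taken from the mapping command):
# samtools depth -a mtDNA_called_uncorrected_vs_nippo_mtDNA.bam | mean_depth
```

The `-a` option makes samtools report zero-depth positions as well, so the mean is taken over the whole reference rather than only the covered bases.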
I have a semi-automated visualisation script that shows me coverage and variant frequencies, and determines any observed variants that are consistent on forward-mapped and reverse-mapped reads. I use this consistency check to exclude methylation-related variant signals (because it's been my experience that typically only one strand is methylated in the same place):
samtools view mtDNA_called_uncorrected_vs_nippo_mtDNA.bam -F 0x10 -b | samtools mpileup --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta -d 100000 -Q 0 - | ~/scripts/readstomper.pl -c > stompedCounts_mtDNA_called_uncorrected_vs_nippo_mtDNA_fwd.csv
samtools view mtDNA_called_uncorrected_vs_nippo_mtDNA.bam -f 0x10 -b | samtools mpileup --reference ~/db/fasta/nippo/Nb_mtDNA_MRSR_corrected.fasta -d 100000 -Q 0 - | ~/scripts/readstomper.pl -c > stompedCounts_mtDNA_called_uncorrected_vs_nippo_mtDNA_rev.csv
~/scripts/stomp_plotter.r -f stompedCounts_fwd_4T1_ρ0SC_vs_chrM_9457.csv -r stompedCounts_rev_4T1_ρ0SC_vs_chrM_9457.csv -circular 16299 -adj 9457 -log -type png -scale -max 60000 -nodels -mindepth 2
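The strand-consistency check described above (keep a variant only if its frequency is above 40% on both the forward-mapped and the reverse-mapped reads) can be sketched roughly as follows. This is not what readstomper.pl or stomp_plotter.r actually do internally; the function name and the three-column `position,variant,frequency` CSV layout are assumptions made purely for illustration, and the real stompedCounts files are laid out differently:

```shell
# Hypothetical input format (NOT the real readstomper.pl output):
#   position,variant,frequency   (frequency as a fraction between 0 and 1)
# Prints position,variant,fwd_frequency,rev_frequency for every variant
# whose frequency exceeds the threshold on BOTH strands.
strand_consistent_variants() {
    fwd_csv=$1
    rev_csv=$2
    min_freq=${3:-0.4}   # default threshold: 40%
    awk -F, -v OFS=, -v thr="$min_freq" '
        # First file (forward strand): remember variants above the threshold
        NR == FNR { if ($3 + 0 > thr) fwd[$1 FS $2] = $3; next }
        # Second file (reverse strand): report variants also kept on the forward strand
        ($3 + 0 > thr) && (($1 FS $2) in fwd) { print $1, $2, fwd[$1 FS $2], $3 }
    ' "$fwd_csv" "$rev_csv"
}

# Example usage with hypothetical per-strand frequency tables:
# strand_consistent_variants fwd_freqs.csv rev_freqs.csv 0.4
```

A variant that is frequent on only one strand (the pattern I'd expect from a methylation-related signal) is dropped, because it never passes the threshold in both files.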
command:
~/install/canu/canu-2.0/Linux-amd64/bin/canu -nanopore-raw called_Nb_CFED_65bptrim_guppy_3.6.0.fq.gz -p Nb_ONTCFED_guppy360_65bpTrim -d Nb_ONTCFED_guppy360_65bpTrim genomeSize=400M corOverlapper=minimap
version: Canu 2.0
system: Debian Linux desktop
Linux elegans 5.6.0-1-amd64 #1 SMP Debian 5.6.7-1 (2020-04-29) x86_64 GNU/Linux