You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Dear Oxford Nanopore team, thank you very much for providing awesome tool and technology.
I'm running dorado v. 0.8.3+98456f7 on a Direct RNA sequencing sample (SQK-RNA004, mouse).
I'm utilizing the model [email protected] and [email protected].
I'm running the tool on an Ubuntu HPC cluster with NVIDIA A100-SXM4-40GB, Driver Version: 555.42.02.
I'm primarily interested in the estimation of polyA-tail lengths.
To test, I tool one read from the sample having high-quality mapping to the mt-Nd2 gene in GRCm38 reference.
I used the information from "Move Table" to see which samples in the raw signal emit bases, which get trimmed, and which contribute to polyA-tail length estimation.
I noticed that in default HAC model, many samples from presumable polyA tail region emit bases, resulting in the polyA-stretch in the 3'end of the basecalled sequence, which then yields a soft-clipped region after mapping with minimap2 (within Dorado).
With the SUP model, there was one strided block with emission of the base right at the edge of the presumable adapter and polyA-tail region, and, in addition, there were no artifactual indels in the basecalled transcript sequence, in comparison to default HAC model.
For comparison, I also utilized Nanopolish v. 0.14.0 on a .pod5 file converted to .fast5.
Please download the file toy.pod5 and polya_config.toml from the mentioned above google drive.
Please download reference genome fasta file from GENCODE website: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz
Please run Dorado and samtools with the following commands: dorado basecaller [email protected] toy.pod5 --emit-moves --estimate-poly-a --poly-a-config polya_config.toml --mm2-opts "-x splice -Y" --reference GRCm38.primary_assembly.genome.fa > toy.dorado.sup.bam; samtools sort -@ 12 -o toy.dorado.sup.sorted.bam toy.dorado.sup.bam && samtools index -@ 12 toy.dorado.sup.sorted.bam;
Samtools commands are not necessary, but indexing might be useful to visualize in IGV genomic browser.
Thanks for the interesting analysis! None of this looks very surprising to me - long homopolymers like the polyA are known to generate an artificially short sequence with few moves. This is why dorado uses a different method for its own polyA estimation. Note that this performs the calculation and puts the result in the pt:i tag in the BAM file, but it does not adjust the basecalled sequence.
Issue Report
Please describe the issue:
Dear Oxford Nanopore team, thank you very much for providing awesome tool and technology.
I'm running dorado v. 0.8.3+98456f7 on a Direct RNA sequencing sample (SQK-RNA004, mouse).
I'm utilizing the model [email protected] and [email protected].
I'm running the tool on an Ubuntu HPC cluster with NVIDIA A100-SXM4-40GB, Driver Version: 555.42.02.
I'm primarily interested in the estimation of polyA-tail lengths.
To test, I tool one read from the sample having high-quality mapping to the mt-Nd2 gene in GRCm38 reference.
I used the information from "Move Table" to see which samples in the raw signal emit bases, which get trimmed, and which contribute to polyA-tail length estimation.
I noticed that in default HAC model, many samples from presumable polyA tail region emit bases, resulting in the polyA-stretch in the 3'end of the basecalled sequence, which then yields a soft-clipped region after mapping with minimap2 (within Dorado).
With the SUP model, there was one strided block with emission of the base right at the edge of the presumable adapter and polyA-tail region, and, in addition, there were no artifactual indels in the basecalled transcript sequence, in comparison to default HAC model.
For comparison, I also utilized Nanopolish v. 0.14.0 on a .pod5 file converted to .fast5.
Steps to reproduce the issue:
All the necessary materials are available in this google drive directory:
https://drive.google.com/drive/folders/1aehNJwQqoiglqUhDFCl1BKYMXszEV39g?usp=sharing
Please download the file toy.pod5 and polya_config.toml from the mentioned above google drive.
Please download reference genome fasta file from GENCODE website:
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz
Please run Dorado and samtools with the following commands:
dorado basecaller [email protected] toy.pod5 --emit-moves --estimate-poly-a --poly-a-config polya_config.toml --mm2-opts "-x splice -Y" --reference GRCm38.primary_assembly.genome.fa > toy.dorado.sup.bam; samtools sort -@ 12 -o toy.dorado.sup.sorted.bam toy.dorado.sup.bam && samtools index -@ 12 toy.dorado.sup.sorted.bam;
Samtools commands are not necessary, but indexing might be useful to visualize in IGV genomic browser.
Run environment:
Logs
dorado.log
The text was updated successfully, but these errors were encountered: