Bases are emitted from strided blocks within polyA region (SQK-RNA004) #1131

Open
magmir71 opened this issue Nov 13, 2024 · 1 comment
Labels
polyA Issue related to polyA tail estimation

Comments

@magmir71

Issue Report

Please describe the issue:

Dear Oxford Nanopore team, thank you very much for providing such an awesome tool and technology.

I'm running dorado v. 0.8.3+98456f7 on a Direct RNA sequencing sample (SQK-RNA004, mouse).
I'm using the models [email protected] and [email protected].
I'm running the tool on an Ubuntu HPC cluster with NVIDIA A100-SXM4-40GB, Driver Version: 555.42.02.

I'm primarily interested in the estimation of polyA-tail lengths.
To test, I took one read from the sample with a high-quality mapping to the mt-Nd2 gene in the GRCm38 reference.

I used the information from the move table to see which samples in the raw signal emit bases, which get trimmed, and which contribute to the polyA-tail length estimate.
I noticed that with the default HAC model, many samples from the presumed polyA-tail region emit bases. This produces a polyA stretch at the 3' end of the basecalled sequence, which then becomes a soft-clipped region after mapping with minimap2 (within Dorado).
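
For readers who want to reproduce this kind of annotation, here is a minimal Python sketch of how a move table can be turned into per-base sample ranges. It assumes the layout Dorado writes with --emit-moves: an `mv` tag whose first value is the model stride, followed by one 0/1 move per strided block, and a `ts` tag giving the number of samples trimmed from the start of the signal. The function name `moves_to_base_spans` and the example values are made up for illustration. The spans are in signal order, and how they line up with the reported sequence for RNA reads is not handled here.

```python
from typing import List, Sequence, Tuple


def moves_to_base_spans(mv: Sequence[int], ts: int = 0) -> List[Tuple[int, int]]:
    """Map a move table to (start_sample, end_sample) spans, one per emitted base.

    Assumption: mv[0] is the stride and mv[1:] holds one 0/1 move per strided
    block; ts is the number of samples trimmed before the first block.
    """
    if not mv:
        return []
    stride, moves = mv[0], mv[1:]
    spans: List[List[int]] = []
    for block, move in enumerate(moves):
        start = ts + block * stride
        end = start + stride
        if move == 1:
            # A move of 1 means this strided block emits a new base.
            spans.append([start, end])
        elif spans:
            # A move of 0 extends the dwell of the most recent base.
            spans[-1][1] = end
        # Blocks before the first emitted base are not assigned to any base.
    return [(s, e) for s, e in spans]


# Made-up example: stride 5, six strided blocks, 10 trimmed samples.
example_mv = [5, 1, 0, 0, 1, 1, 0]
example_spans = moves_to_base_spans(example_mv, ts=10)
print(example_spans)
```

Long runs of 0 moves inside the tail show up here as a single base with a very wide sample span, which is one way to see how few bases a long homopolymer emits.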

With the SUP model, only one strided block emitted a base, right at the edge between the presumed adapter and polyA-tail region. In addition, unlike with the default HAC model, there were no artifactual indels in the basecalled transcript sequence.

For comparison, I also utilized Nanopolish v. 0.14.0 on a .pod5 file converted to .fast5.

SUP_model_Move_Table_annotation
HAC_model_Move_Table_annotation
Nanopolish_annotation

Steps to reproduce the issue:

All the necessary materials are available in this Google Drive folder:
https://drive.google.com/drive/folders/1aehNJwQqoiglqUhDFCl1BKYMXszEV39g?usp=sharing

Please download the files toy.pod5 and polya_config.toml from the Google Drive folder above.
Please download the reference genome FASTA file from the GENCODE website:
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz
Please run Dorado and samtools with the following commands:
dorado basecaller [email protected] toy.pod5 --emit-moves --estimate-poly-a --poly-a-config polya_config.toml --mm2-opts "-x splice -Y" --reference GRCm38.primary_assembly.genome.fa > toy.dorado.sup.bam
samtools sort -@ 12 -o toy.dorado.sup.sorted.bam toy.dorado.sup.bam && samtools index -@ 12 toy.dorado.sup.sorted.bam

The samtools commands are optional, but sorting and indexing are useful for visualizing the alignment in the IGV genome browser.

Run environment:

  • Dorado version: 0.8.3+98456f7
  • Dorado command: dorado basecaller [email protected] toy.pod5 --emit-moves --estimate-poly-a --poly-a-config polya_config.toml --mm2-opts "-x splice -Y" --reference GRCm38.primary_assembly.genome.fa > toy.dorado.sup.bam
  • Operating system: Ubuntu
  • Hardware (CPUs, Memory, GPUs): NVIDIA A100-SXM4-40GB
  • Source data type: pod5
  • Source data location: google drive
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): FLO-MIN004RA, SQK-RNA004
  • Dataset to reproduce, if applicable: see https://drive.google.com/drive/folders/1aehNJwQqoiglqUhDFCl1BKYMXszEV39g?usp=sharing

Logs

dorado.log

@malton-ont
Collaborator

Hi @magmir71,

Thanks for the interesting analysis! None of this looks very surprising to me - long homopolymers like the polyA are known to generate an artificially short sequence with few moves. This is why dorado uses a different method for its own polyA estimation. Note that this performs the calculation and puts the result in the pt:i tag in the BAM file, but it does not adjust the basecalled sequence.
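
As a rough illustration of where that estimate ends up, the sketch below pulls the pt:i value out of a single SAM text record (for example, one line of `samtools view` output). The helper name `poly_a_length` and the example record are hypothetical; the only thing taken from the thread is that the estimate is stored in the pt:i tag. The code relies on the standard SAM layout, where optional tags start at the twelfth tab-separated field.

```python
from typing import Optional


def poly_a_length(sam_line: str) -> Optional[int]:
    """Return the integer stored in the pt:i tag of one SAM record, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM fields are mandatory; optional TAG:TYPE:VALUE fields follow.
    for field in fields[11:]:
        if field.startswith("pt:i:"):
            return int(field[len("pt:i:"):])
    return None


# Made-up record for illustration only (not taken from toy.pod5).
example_line = "\t".join(
    ["read1", "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "IIII", "RG:Z:demo", "pt:i:87"]
)
print(poly_a_length(example_line))
```

Comparing this value with the length of the soft-clipped A stretch in the same record makes the point above concrete: the tag and the basecalled sequence are produced independently, so the two numbers need not agree.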

@malton-ont malton-ont added the polyA Issue related to polyA tail estimation label Nov 14, 2024