Plasmid poly(A) Disagreement Between 0.8.0 and 0.9.1 #1233

Open
VBHerrenC opened this issue Jan 28, 2025 · 6 comments
Labels
polyA Issue related to polyA tail estimation

Comments

@VBHerrenC

Issue Report

Please describe the issue:

We ran basecalling on an SQK-RBK114-24 plasmid dataset with --estimate-poly-a and a config file. We initially ran basecalling with 0.9.0 and [email protected], and got the following results:

[Image: distribution of poly(A) length estimates from Dorado 0.9.0]

Although these were somewhat unexpected, they were plausible, so we did not question the results. However, after receiving poly(A) data from another instrument and method, we became suspicious that these results were not accurate - nanopore and the alternate method usually agree quite closely. I re-ran the same dataset on v0.9.1 and [email protected] and got the same results. Still suspicious, I moved back down to Dorado 0.8.0 and [email protected] and then got this distribution of poly(A) estimates:

[Image: distribution of poly(A) length estimates from Dorado 0.8.0]

This distribution matches the alternate method much more closely, and its overall shape matches our historical data much better.

Steps to reproduce the issue:

Basecalling with same dataset, model, parameters, and config file. Only difference is dorado 0.8.0 vs 0.9.X.
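A minimal sketch of how the two runs could be kept identical apart from the Dorado install. The flags are the ones from the command in this report; the install directories, model path, POD5 path, and config path are hypothetical placeholders, and the helper only builds the argument list without running anything.

```python
def build_basecall_cmd(dorado_root, model_dir, pod5_dir, polya_config):
    """Return the argv list for one Dorado install; nothing is executed here."""
    return [
        f"{dorado_root}/bin/dorado",
        "basecaller",
        str(model_dir),
        str(pod5_dir),
        "--min-qscore", "14",
        "--estimate-poly-a",
        "--poly-a-config", str(polya_config),
        "--no-trim",
        "--device", "cuda:all",
        "--verbose",
    ]


# Hypothetical paths: substitute the real install, model, and data locations.
shared = dict(
    model_dir="/path/model",
    pod5_dir="/path/pod5",
    polya_config="/path/poly_a_config.toml",
)
cmd_old = build_basecall_cmd("/path/dorado-0.8.0-linux-x64", **shared)
cmd_new = build_basecall_cmd("/path/dorado-0.9.1-linux-x64", **shared)

# Everything after the executable should match between the two runs.
print(cmd_old[1:] == cmd_new[1:])
```

Each list can be passed to `subprocess.run` with stdout redirected to a BAM file. Note that in the runs described above the model version also changed with the Dorado version, so pinning the same model for both would isolate the effect of the basecaller itself.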

Run environment:

  • Dorado version: 0.9.X
  • Dorado command: ~/packages/dorado-0.8.0-linux-x64/bin/dorado basecaller ~/packages/dorado-0.8.0-linux-x64/models/[email protected]
    /path/pod5
    --min-qscore 14
    --estimate-poly-a
    --poly-a-config /path/poly_a_config.toml
    --no-trim
    --device 'cuda:all' --verbose > dorado_sup.bam
  • Operating system: WSL
  • Hardware (CPUs, Memory, GPUs): NVIDIA A5000
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): POD5
  • Source data location (on device or networked drive - NFS, etc.): On device
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): FLO-MIN114, SQK-RBK114-24, 21.21 k reads
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

  • Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
@malton-ont
Collaborator

Hi @VBHerrenC,

PolyA estimation is under continuous review, and there were some changes between those versions. Does your polyA transcript have a non-A linker section? dorado-0.9.0 is a bit stricter about breaking at non-A sections unless the appropriate tail.tail_interrupt_length value is specified in the --poly-a-config file.

malton-ont added the polyA label on Jan 28, 2025
@VBHerrenC
Author

Hi @malton-ont,

There aren't any non-A linkers - tail_interrupt_length in the config file was set to 0.

Thanks,
Calleigh

@malton-ont
Collaborator

Hi @VBHerrenC,

Are you able to share any data? I think this will be very hard to diagnose without.

One thing to note is that dorado expects the polyA section to be somewhere within the sequence - i.e. the cleave point for the plasmid can't be within the polyA sequence or flanks.

You can get some useful insights into how the region is determined by adding the -vv flag - I'd suggest only running this on a small set of reads as it generates a lot of output.

@VBHerrenC
Author

Hi @malton-ont,

Unfortunately we can't share the data, but we're happy to do some testing and report back. Since we input circular plasmid to the library prep and it cleaves randomly, I would expect the number of times it happens to cut in the poly(A) or flanks to be relatively low.
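As a rough back-of-the-envelope check on that expectation: if each molecule is cut once at a uniformly random position, the fraction of reads cut inside the poly(A) or its flanks is simply the length of that region divided by the plasmid length. The single-cut, uniform-position model is an assumption, and the sizes below are hypothetical, not taken from this dataset.

```python
def fraction_cut_in_region(region_len, plasmid_len):
    """Fraction of single, uniformly placed cuts that land inside a region."""
    if plasmid_len <= 0 or not 0 <= region_len <= plasmid_len:
        raise ValueError("need 0 <= region_len <= plasmid_len and plasmid_len > 0")
    return region_len / plasmid_len


# Hypothetical sizes for illustration only.
polya_len = 120      # poly(A) tract
flank_len = 50       # flank needed on each side
plasmid_len = 7000   # whole plasmid

affected = fraction_cut_in_region(polya_len + 2 * flank_len, plasmid_len)
print(f"{affected:.1%}")
```

With these made-up numbers about 3% of reads would be affected, which is small but would not by itself explain a shift in the whole distribution.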

Calleigh

@malton-ont
Collaborator

malton-ont commented Jan 28, 2025

Hi @VBHerrenC,

In that case, could you gather a small dataset (~20 reads) of reads that report significantly differently between the two versions and run these with -vv? We can then attempt to parse the logs and see if there's anything obviously different. If you'd rather not share these publicly, please raise a ticket with our support team and reference this issue and my name, and they should pass it on to me.
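One possible way to pick those reads, sketched under a few assumptions: both BAMs have been dumped to SAM text (for example with `samtools view`), and the poly(A) estimate is carried in the `pt:i` tag. The function names and the toy records are hypothetical.

```python
def polya_lengths(sam_lines):
    """Map read ID -> pt:i value for SAM records that carry the tag."""
    lengths = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        for tag in fields[11:]:  # optional tags follow the 11 mandatory columns
            if tag.startswith("pt:i:"):
                lengths[fields[0]] = int(tag[len("pt:i:"):])
                break
    return lengths


def most_discordant(old, new, n=20):
    """Return up to n (read_id, old_len, new_len) tuples, largest difference first."""
    shared = old.keys() & new.keys()
    ranked = sorted(shared, key=lambda rid: (-abs(old[rid] - new[rid]), rid))
    return [(rid, old[rid], new[rid]) for rid in ranked[:n]]


def _record(read_id, tags):
    """Build a toy unaligned SAM record for the demonstration below."""
    return "\t".join([read_id, "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "!!!!", *tags])


# Toy records standing in for the two real outputs.
old_sam = [_record("r1", ["qs:f:20.1", "pt:i:100"]), _record("r2", ["pt:i:60"]), _record("r3", [])]
new_sam = [_record("r1", ["pt:i:35"]), _record("r2", ["pt:i:58"]), _record("r3", ["pt:i:40"])]

for row in most_discordant(polya_lengths(old_sam), polya_lengths(new_sam), n=2):
    print(row)
```

Reads that carry an estimate in only one of the two outputs are skipped by this comparison and may be worth listing separately. The selected read IDs can then be used to subset the POD5 input before re-running with -vv.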

@VBHerrenC
Author

Hi @malton-ont,

Thanks so much! Just opened a ticket and attached the requested logs. Let me know if you need anything else.

Calleigh
