Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

estimate polyA doesn't find tails well with custom primer #1108

Open
shaohuishi opened this issue Oct 29, 2024 · 3 comments
Open

estimate polyA doesn't find tails well with custom primer #1108

shaohuishi opened this issue Oct 29, 2024 · 3 comments
Labels
polyA Issue related to polyA tail estimation

Comments

@shaohuishi
Copy link

shaohuishi commented Oct 29, 2024

Dear developers,

Recently I am trying some customized methods to complete the full-length measurement of the polya tail on nanopore.
However, I don’t know whether it is because the same primer is used at both ends of the cDNA or for other reasons. Many polya that can be recognized by basecaller cannot be counted by estimate-poly-a function.

Thank you for your time,
shaohui

My Library Diagram

5' ---- ADAPTER ---- AAGCAGTGGTATCAACGCAGAGTAC----ATGGG ---- cDNA ---- poly(A) ---- GTACTCTGCGTTGATACCACTGCTT ---- 3'

As sample

read’s signal

屏幕截图 2024-10-29 193444

bam output

fd4fc9af-6ca3-49d0-9699-2af7174cd85f 4 * 0 0 * * 0 0 TGTGTACGTACTTCGTTCAGCGTATTGCTGCGCACGCACTACAGAAAGCAGTGGTATCAACGCAGAGTACATGGGCTTGTTCTGGGGACATTTCGCGATTGCGGACGTTGAGAGGCCGCTGCCAAAATGCCAGAACGAGATAGTGAGCCCTTCTCTAACCCTTTGGCTCCAGATGGCCACGATGTGGATGATCCTCATTCCTTCCACCAATCAAAACTTACCAATGAAGACTTCAGGAAACTTCTTATGACCCCAAGAGCTGCACCTACTTCTGCGCCACCTTCTAAGTCACGTCACCATGAGATGCCAAGGGAGTACAATGAGGATGAAGACCCAGCTGCACGAAGGAGGAAAAAGAAAAGTTATTATGCCAAGCTTCGCCAGCAAGAAATTGAGAGAGAGAGAACTCGCAGAGAAATACCGGGACCGTGCCAAGGAACGGAGAGATGGTGTGAACAAAGACTATGAGGAAACTGAGCTGATAAGTACCACAGCCAACTACAGGGCTGTGGGCCCCACTGCTGAGGCGGACAAATCAGCAGCAGAGAAGAGAAGACAGTTGATTCAGGAGTCCAAATTCTTGGGTGGTGATATGGAACACGCCCATTTGGTGAAAGGTTTGGATTTTGCGTTGCTTCAAAAGGTGCGCGCTGAGATTGCCAGCAAAGAGAAGGAGGAAGAGGAACTCATGGAAAAGCCCCAAAAGGAAACCAAGAAAGATGAGGATCCTGAGAACAAAATTGAATTTAAAACACGCCTTGGCCGGAATGTGTATCGGATGCTTTTCAAGAGTAAATCATATGAGCGAAATGAGCTGTTCTTACCAGGACGTATGGCCTATGTAGTAGACCTGGATGATGAGTACGCAGACACAGATATCCCCACCACTCTCATACGCAGCAAAGCTGATTGCCCCACTATGGAGGCCCAGACTACACTGACTACAAATGACATTGTTATTAGCAAGCTCACCCAGATTTTGTCATACCTGAGGCAGGGGACCCGAAACAAGAAGCTCAAGAAGAAGGATAAAGGAAAACTGGAAGAGAAGAAACCTCCTGAGGCTGACATGAACATTTTTGAAGACATTGGGGATTACGTTCCTTCTACAACCAAGACACCTCGGGACAAGGAACGTGAGAGATACCGGGAACGTGAACGTGATCGGGAACGGGACAGAGACAGGGAGCGAGACAGGGAGCGAGACCGTGAGAGAGAGGGAGAGAGAGCGAGACCGGGAACGGGAACGAGGAGGAAAAGAAAAGGCACAGCTACTTTGAGAAGCCAAAAGTGGATGATGAGCCCATGGATGTTGACAAAGGACCTGGATCTGCAAAAGAGTTGATCAAGTCCATCAATGAAAAATTCGCTGGGTCTGCTGGCTGGGAAGGCACTGAATCGTTGAAGAAGCCAGAAGATAAGAAGCAGCTGGGCGATTTCTTTGGCATGTCCAACAGTTACGCAGAATGCTATCCAGCCACGATGGATGACATGGCTGTAGATAGTGATGAAGAGGTAGATTATAGCAAAATGGACCAGGGTAACAAGAAGGGTCCCTTAGGCCGCTGGGACTTCGATACTCAGGAGGAAACAGCGAGTACATGAACAACAAGGAGGCTCTGCCCAAGGCTGCATTCCAGTATGGCATCAAGATGTCTGAAGGACGGAAAACCAGACGATTCAAAGAAACCAATGATAAGGCAGAGCTTGATCGACAGTGGAAGAAAATAAGTGCAATCATTGAGAAGAGGAAGAGGATGGAAGCAGATGGGGTCGAAGTGAAAAGACCAAAGTACTAATCTCTAGTTCCAGCTGTCACCACGTGGCTGTTCTTAGTTGCTTGCTTCTACAATTCCTCAGACGGTTGCAAACTGTTGTTGTTTGTGAAAGTTTATAAATGTTTATTGTATAACTCTTTATAGATCTGTGTCCCACATGCTAAGATTAATGGCAATGCAACACCATGTCCAGCATGTTCCATTAAATGTAGTTAAACCTGCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGTACTCTGCGTTGATACCACTGCTTTCTGTAGTGCGTGCGCAGCAATA %&&&'&&%%&'''''(*)++&&&'7888HSSNSQSSSSSSIHSSLKPSSSKSMSSLSSNSSNS::::S:::::NOMLRC@;.-,,,FKLSSMSRSSSKSSSSNSSSKSSSSSSSSLISLSKD@FLMKJMLISSLSSLOSSSPLISRSSSSSSSSSSOSSJNSSSSOIISSGHIISMJSSSLKLPIIJIMMSSSLHKKRSDHMGHHHIIPSSSSSSSSSSPIS::::96=;<7/.(()3DMLSSB@@@@JLOMNSSMMSSSG@@>>10001ABCEFDDCFDJI9333DEJRNSSSOSLSSSMSSSSSSSSSSSSSSMPP===MSSSNKSSSSMSSSMSSSNSSS<;713:;C===FHILQSSSSSSMJSSRLKRSSSSSQSSSSSPJODSSSSSG>=<?3B8HSNSSSSQSSLSJFSSSSMQHGGKNSSPSSSSSSSSSSSOSSSSSSSSSSQSQSSSOSSSSSSSSOSSSSSSSSSSSSSSSSSSSSNSSSSSSSSNSSSSSQSSSSSSSRRLOJLFAAA@?LJLOSSNPNKSSSSMSMKNSSSKGGNJOIHSSLOOJLSSJEFF;;HIHSSOJKLSSSOSSSSSSSSSSQSSSPSSSSLSSSQSNSSSSSSSSSSSSSSSKONJSSSSSSSSSQMMSSSSSSSSNSMSSQNLSGSMMKLSKSPOSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSJSSSSSSSSSSPLRSSSSSSKJJSSSOSSKSSSSSSSSSSSKQSMJKKNSSSSSIOSSSQSSSLRSSSSSLSSSSMNSSNMNRMEDIFFSSSOSSSSSSSSSSMISSSSSSSSSSSSSSSSQSSSSSSSSSKSSSSSSSSOSQSSRSSSSSNSSSSSSSSSSQSSSOSNJSSSSLHKSSSSSSSSSSSQSPSSSSSSOSSSSSIMSNSSSSSSSSSSSJISSSSSSSSSGGGHSSSSSSSSSSSHKSSQSSSLIKLHQMJSSSNSSSSSSSSSSSSSSSSNSSSSKSHSSMLJGSSHSHJHJJILSSSMSSSSSSSRSSSSSSPSJHHIFFFSSSFFGFIGGFHHFSLILSSSSOSJSSNSSSSSSKHQSHOKSKKMPSSSSLLJGJ6334CDB>:43100./3:;<:96666GKOSSFEA@ABASKSHHGGHKNKHOKLHSSSPFOJSPOMIOMSMIK>JSMMHGFJFNSGSIISFCCECHHAIEHJPKSILSF>HDGSSIFFDD7333463429;CA?:9?HE@@@?9998<=IIPSSNSHSKIBB@;@GFOMHKMSSSSSSSSSSMSSSPNMMIOSSSSSSSOMSSSSNSSSSSSSSSPOMJSSSSSPSSQSSLMMSSSSSSSSSSQSMSSSLHIISDBCDDJPSSSQSQSSSSSSSSSSSSSJMSSSSSSRSQSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSRSSSSSSSSSSSSSRSSSSSSSNSSSSSSSHGHSQSQSLSSSSNSLISSSSSPSSSSSSSSSSOSSSOSSSSSSSSSSSPSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSILSSSSSSSSSPMMKSSSSQNSSSSSSSSSOSSSSSSSSSCCCDDSSSMJSSSSSSLMMLSSSSSSSSKSSJFGFHSNSSSNKSSSSMISSSSSSSSSSSSSSSSSSSSNSSSSIHD42436888999:==><>=GSSSSSMSSSSSSSRSSSSSQMSHHKSSSSSSPSSSSSSSSSSSSSJMSSMJISEIE1000007@?CCIMKNSSSSNSSSSPMPSSPMSSSSSSSSSSSSNSSSSSSSSSNQSSSNJHHGIILSSSJSKLSSSSSRSSSSSNSSSSSNLSSOSSSQSSSSSSSLSSSSSNLISSSSSSSOSSSMPSSSSOSRSSSOSSSSSSSNSSSNSRSSSSSSSSSSSSSSSSSSSSSSSPSSSSSSIMSSOSSSSOSSSSSSSSSSSSSSSSSSSSSSSSNSSSSSSSSSSSSJMSSSSLSSSSSSSSSSSSSSSSSLIQLAHSLHMSSSSSSSSQSMJSSSSSS6122374555556667789;=?@BDCGGNKISSPHIHIIHHJJCEDEFSSSFGFHGSSSSSSSSSKGFGHGFJIHIFDCDCBFGIG::92 qs:f:29.0782 du:f:5.587 ns:i:27935 ts:i:10 mx:i:1 ch:i:174 st:Z:2024-08-20T13:54:18.740+00:00 rn:i:22003 fn:Z:PAY63012_pass_2d2e6b65_9300ab27_1974.pod5 sm:f:720.842 sd:f:125.658 sv:Z:pa dx:i:0 RG:Z:9300ab2755c6080067ea07aae5b884a17ba2bb21_dna_r10.4.1_e8.2_400bps_sup@v5.0.0

poly-a-config

[anchors]
front_primer = "ATGGG"
rear_primer = "AAGCAGTGGTATCAACGCAGAGTAC"

[threshold]
flank_threshold = 0.6

[tail]
tail_interrupt_length = 10

Run environment:

  • Dorado version: 0.8.2+6b413c9
  • Dorado command: dorado basecaller --barcode-both-ends --device cuda:1 --no-trim --estimate-poly-a --poly-a-config ./conf/polya_conf.tmol --verbose ./[email protected] ./pod5s > ./call.bam
  • Operating system: Linux
  • Hardware (CPUs, Memory, GPUs): A100 80G
  • Source data type: pod5
@HalfPhoton
Copy link
Collaborator

Hi @shaohuishi,

I believe the issue could be that your front_primer definition is incomplete.

Dorado expects cDNA reads to take the form:

 5' --- ADAPTER --- FRONT_PRIMER
... --- cDNA
... --- poly(A) --- RC(REAR_PRIMER) --- 3'

But you have described this format

 5' --- ADAPTER --- ????? --- FRONT_PRIMER 
... --- cDNA 
... --- poly(A) --- RC(REAR_PRIMER) --- 3'

I believe the solution is to extend the definition of your front primer to include the additional part.

[anchors]
front_primer = "AAGCAGTGGTATCAACGCAGAGTACATGGG"
rear_primer = "AAGCAGTGGTATCAACGCAGAGTAC"

Kind regards,
Rich

@HalfPhoton HalfPhoton added the polyA Issue related to polyA tail estimation label Oct 31, 2024
@shaohuishi
Copy link
Author

shaohuishi commented Oct 31, 2024

Hi @shaohuishi,

I believe the issue could be that your front_primer definition is incomplete.

Dorado expects cDNA reads to take the form:

 5' --- ADAPTER --- FRONT_PRIMER
... --- cDNA
... --- poly(A) --- RC(REAR_PRIMER) --- 3'

But you have described this format

 5' --- ADAPTER --- ????? --- FRONT_PRIMER 
... --- cDNA 
... --- poly(A) --- RC(REAR_PRIMER) --- 3'

I believe the solution is to extend the definition of your front primer to include the additional part.

[anchors]
front_primer = "AAGCAGTGGTATCAACGCAGAGTACATGGG"
rear_primer = "AAGCAGTGGTATCAACGCAGAGTAC"

Kind regards, Rich

Hi Rich,

Thank you for your quick reply,

However, we tried providing the complete FRONT_PRIMER and it still did not work, we review the script dna_poly_tail_calculator.cpp and found

const bool proceed = flank_score >= threshold && std::abs(dist_v1 - dist_v2) > kMinSeparation

in L59, The default threshold kMinSeparation=10 would cause normal reads to be filtered out due to the presence of the same 25bp primer. As an example dist_v1 = 0 & dist_v2 = 5.

So, we changed this threshold kMinSeparation=2 and recompiled the program. This operation gave us a seemingly normal number of reads with polyA tails, however, it caused some poor quality reads, and even some reads that lacked an adapter at one end, to have unexpectedly long polyA tails.

To eliminate these reads, we set

flank_threshold=0.8 
tail_interrupt_length=2

to match the estimation of nanopore error rate at this primer length, and some biological assumptions.

After this change, we found that in extreme cases, based on the calculation method of flank_score, a primer mismatch on one end will benefit from a perfect match on the other end and make flank_score unable to filter the read.

Therefore, in order to use this function more flexibly and effectively, we recommend changing the calculation logic of flank_score to this:

    float flank_score = 0;
    const float top_flank_score_v1 = 1.f - static_cast<float>(top_v1.editDistance) / (front_primer.length())
    const float bottom_flank_score_v1 = 1.f - static_cast<float>(bottom_v1.editDistance) / (rear_primer.length())
    const float top_flank_score_v2 = 1.f - static_cast<float>(top_v2.editDistance) / (rear_primer.length())
    const float bottom_flank_score_v2 = 1.f - static_cast<float>(bottom_v2.editDistance) / (front_primer.length())
    if (fwd) {
        flank_score = std::min(top_flank_score_v1, bottom_flank_score_v1);
    } else {
        flank_score = std::min(top_flank_score_v2, bottom_flank_score_v2);
    }

We also hope that kMinSeparation can be an additional features that can be customized through a configuration file.

Thank you for your time and suggestions,
shaohui

@malton-ont
Copy link
Collaborator

@shaohuishi,

Thank you for your detailed examination! It appears that the cDNA polyA detection does not anticipate that the front and rear primers are so similar, and this arrangement makes distinguishing between a forward and a reverse strand much more difficult. We make ongoing efforts to improve the polyA detection, and we will investigate your suggestions as we move forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
polyA Issue related to polyA tail estimation
Projects
None yet
Development

No branches or pull requests

3 participants