Feature request: ONT Chimeric read splitting #747
Comments
I did some thinking and research. The best way to approach this is as follows:
3 is quite challenging, but by following the steps, cutadapt will already be useful for nanopore with chimeric reads at step 1, without requiring extra code.
@marcelm, could you help me a bit with this one? I am currently investigating how best to do this. I did find the sequence that dorado uses: https://github.com/nanoporetech/dorado/blob/acec121e438099741b690d49c7bff4bf25e1851c/dorado/splitter/ReadSplitter.h#L66 It uses a string-matching library to get the position, so in theory cutadapt can leverage its existing alignment algorithm as well. Since the chimeric read content for R10 chemistry is supposedly around 10%, the easiest approach for now is just to discard these reads rather than deal with splitting. The sequence is
That should only match sequences that are fully contained within the read due to the overlap setting. Is that correct?
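To illustrate the "fully contained" idea: if the required overlap equals the full adapter length, a partial adapter hanging off the end of a read can never count as a match. Here is a minimal Python sketch of that behaviour. It is not cutadapt's code or API: the adapter string is a made-up placeholder (not the dorado sequence), the function names are hypothetical, and it only counts substitutions, whereas cutadapt's aligner also handles indels.

```python
from typing import Iterable, List, Optional

# Placeholder adapter for illustration only; NOT the sequence dorado uses.
ADAPTER = "CCTGTACTTCGTTC"


def find_full_adapter(read: str, adapter: str, max_errors: int) -> Optional[int]:
    """Return the first position where the whole adapter matches the read
    with at most max_errors substitutions, or None if there is no such
    position. Only full-length windows are compared, which mimics a minimum
    overlap equal to the adapter length."""
    n = len(adapter)
    for pos in range(len(read) - n + 1):
        window = read[pos:pos + n]
        errors = sum(a != b for a, b in zip(window, adapter))
        if errors <= max_errors:
            return pos
    return None


def discard_chimeric(reads: Iterable[str], adapter: str = ADAPTER,
                     max_errors: int = 1) -> List[str]:
    """Keep only reads that do not contain a full-length adapter match."""
    return [r for r in reads if find_full_adapter(r, adapter, max_errors) is None]


internal = "AAAAAAAA" + ADAPTER + "TTTTTTTT"   # adapter fully inside the read
partial = "AAAAAAAAAAAA" + ADAPTER[:7]         # only half the adapter at the 3' end

print(find_full_adapter(internal, ADAPTER, 1))
print(find_full_adapter(partial, ADAPTER, 1))
print(discard_chimeric([internal, partial]))
```

With these toy reads, only the one with the internal adapter is reported and discarded; the read that merely ends with half an adapter is kept.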
Sure!
Looks good, just quoting those options below for which I have comments.
You could use
You can use
Keep in mind this will "normalize" read orientation (the reads for which the adapter was found on the reverse complement will be output reverse-complemented); I’m not sure this is necessary or appropriate. Run your command once with
Yes!
So I tried this, and from the output it seems most likely that only false positives are found (matches with 5 errors were massively overrepresented). Turns out there is a I am glad this is in the metadata. Makes my job a whole lot easier.
Currently I am researching ONT possibilities with cutadapt, and it seems that the most basic functionality can be achieved. Unfortunately after the adapters have been adequately cut, sequali still finds adapter sequences.
These are most likely due to chimeric reads, where reads are joined by adapter sequences. These reads should be split. With the newest chemistry the proportion of chimeric reads is estimated at 10% (previously around 2%). These chimeric reads are not always split by the sequence provider, and historic data may also contain the roughly 2% of chimeric reads because splitting was not available back then.
Since cutadapt already has a decent alignment algorithm that can detect sequences anywhere in the read, it should be possible to write a routine that detects chimeric reads.
The hard part, I guess, will be the actual splitting, where one read becomes two or more reads, and feeding those back into the pipeline. I can imagine that wasn't a consideration when cutadapt was designed.
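To make the splitting idea concrete, here is a rough Python sketch of what "one read becomes two or more reads" could look like. Everything in it is an assumption for illustration: the `Read` tuple, the function names, the `_1`/`_2` name suffixes and the placeholder adapter are all hypothetical, and the matching is a simple substitution-only scan rather than cutadapt's alignment algorithm.

```python
from typing import List, NamedTuple


class Read(NamedTuple):
    name: str
    sequence: str
    qualities: str


# Placeholder adapter for illustration only; NOT the sequence dorado uses.
ADAPTER = "CCTGTACTTCGTTC"


def find_hits(sequence: str, adapter: str, max_errors: int) -> List[int]:
    """Return start positions of non-overlapping full-length adapter matches,
    allowing at most max_errors substitutions (no indels in this sketch)."""
    hits = []
    n = len(adapter)
    pos = 0
    while pos + n <= len(sequence):
        errors = sum(a != b for a, b in zip(sequence[pos:pos + n], adapter))
        if errors <= max_errors:
            hits.append(pos)
            pos += n  # skip past this adapter occurrence
        else:
            pos += 1
    return hits


def split_read(read: Read, adapter: str = ADAPTER, max_errors: int = 1,
               min_length: int = 1) -> List[Read]:
    """Split a read at every internal adapter match. The adapter itself is
    removed, and fragments shorter than min_length are dropped. Reads
    without a match are returned unchanged."""
    hits = find_hits(read.sequence, adapter, max_errors)
    if not hits:
        return [read]
    spans = []
    start = 0
    for hit in hits:
        spans.append((start, hit))
        start = hit + len(adapter)
    spans.append((start, len(read.sequence)))
    fragments: List[Read] = []
    for begin, end in spans:
        if end - begin >= min_length:
            fragments.append(Read(f"{read.name}_{len(fragments) + 1}",
                                  read.sequence[begin:end],
                                  read.qualities[begin:end]))
    return fragments


chimeric = Read("r1", "AAAAAAAA" + ADAPTER + "GGGGGG", "I" * 28)
for fragment in split_read(chimeric):
    print(fragment.name, fragment.sequence, fragment.qualities)
```

The sequence and quality strings are sliced with the same coordinates so they stay in sync. In a real pipeline the open question remains how the resulting fragments would be fed back through the remaining processing steps.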