I have 5' and 3' adapter sequences for the paired read sequencing data? how should I remove them #813
Comments
What type of library is this? Is this from whole-genome sequencing? What you're describing is the Illumina read layout for a TruSeq dual-index library; see https://teichlab.github.io/scg_lib_structs/methods_html/Illumina.html. You can try the commands listed in the documentation: https://cutadapt.readthedocs.io/en/stable/guide.html#truseq
Thank you for the quick reply, I really appreciate it. Yes, this is pooled WGS data. I checked the FastQC report, and about 2 percent of the reads (around 500,000 reads) contain adapter sequences. This is paired-end data with 150 bp reads. I tried out the commands from the docs, but they are a bit confusing.
Illumina paired-end is pretty standard. The command provided in the TruSeq section of the documentation linked above should do the trick.
Afterwards you can check your trimmed reads with FastQC or Sequali to see whether the adapters have been removed.
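The command itself is not reproduced in this thread. As a rough sketch, the paired-end TruSeq invocation from the cutadapt guide can be assembled as below. The two adapter sequences are the ones given in that guide; the file names and the helper `build_cutadapt_command` are placeholders made up for illustration, so substitute your own paths.

```python
import shlex

# Trimming sequences from the TruSeq section of the cutadapt guide.
ADAPTER_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # -a: 3' adapter found in read 1
ADAPTER_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"  # -A: 3' adapter found in read 2


def build_cutadapt_command(r1, r2, out1, out2):
    """Return the paired-end trimming command as an argument list.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    return [
        "cutadapt",
        "-a", ADAPTER_R1,   # adapter to remove from the 3' end of read 1
        "-A", ADAPTER_R2,   # adapter to remove from the 3' end of read 2
        "-o", out1,         # trimmed read 1 output
        "-p", out2,         # trimmed read 2 output (paired)
        r1, r2,
    ]


if __name__ == "__main__":
    # Placeholder file names; replace with your own FASTQ files.
    cmd = build_cutadapt_command(
        "reads.R1.fastq.gz", "reads.R2.fastq.gz",
        "trimmed.R1.fastq.gz", "trimmed.R2.fastq.gz",
    )
    print(shlex.join(cmd))  # copy-pasteable shell command
```

Printing the command rather than running it keeps the sketch independent of whether cutadapt is installed; the printed line can be pasted into a shell inside the conda environment.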
Thank you! I understand the command line, but I am confused about which adapter sequence to use for which end.
Just use the command exactly as written. The adapters are already the correct ones.
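To see why no manual reverse-complementing is needed, here is a small check, offered as an illustration rather than anything cutadapt does internally. It takes the parts of the two adapters from the original question that follow the index (named `P5_TAIL` and `P7_TAIL` here purely for convenience) and compares their reverse complements with the trimming sequences from the cutadapt guide. The assumption is the usual paired-end picture: when the insert is shorter than the read, read 1 runs into the reverse complement of the 3' adapter and read 2 runs into the reverse complement of the 5' adapter.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Adapter parts after the index, copied from the sequences in the question.
P5_TAIL = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"  # end of the 5' adapter
P7_TAIL = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC"  # end of the 3' adapter as posted

# Trimming sequences from the TruSeq section of the cutadapt guide.
ADAPTER_R1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # used with -a
ADAPTER_R2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"  # used with -A

# Read 2 reads into the reverse complement of the 5' adapter.
r2_matches = reverse_complement(P5_TAIL) == ADAPTER_R2

# Read 1 reads into the reverse complement of the 3' adapter. The sequence
# as posted ends in ...GATC, so the leading A of ADAPTER_R1 has no partner
# here; compare everything after that first base.
r1_matches = reverse_complement(P7_TAIL).startswith(ADAPTER_R1[1:])

print("read 2 adapter is revcomp of 5' adapter tail:", r2_matches)
print("read 1 adapter (after leading A) is revcomp of 3' adapter tail:", r1_matches)
```

In other words, the sequences passed to `-a` and `-A` are already written the way they appear in the reads, which is why the documented command can be used unchanged.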
I finally understand them, thank you so much! I really appreciate the software and the work :) Have a great day. I will close this issue.
I have some small queries.
(a) Some of my samples were prepared with different adapters. Should I use the same command as before for adapter removal? When I looked at their MultiQC report, it showed polyA. I will use the polyA command for it.
(b) This is another small follow-up question. I have some 100 bp Illumina single-read data obtained online. After checking it with MultiQC, I found that it has 50 percent adapter content. (I used the adapter sequence on this page.) I have observed that the trimming varies a lot depending on the length of the adapter sequence provided. When I used the initial 15 bp of the adapter, the adapter content went down to 1 percent. Should I go with the whole adapter sequence or the initial 15 bp? What length of adapter sequence should be used in cutadapt?
a) PolyA is a commonly recurring motif in the human genome, so it is not odd to see polyA. No need to trim it.
b) The more information cutadapt is given, the more accurately it can cut. This document contains all the adapters and the recommended trimming sequences for them: https://support-docs.illumina.com/SHARE/AdapterSequences/Content/SHARE/FrontPages/AdapterSeq.htm
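One way to make point b) concrete, as a back-of-the-envelope sketch and not cutadapt's actual code: as far as I understand the cutadapt guide, the number of errors tolerated in a match is the matched length times the maximum error rate (0.1 by default), rounded down. The chance that a sequence of a given length occurs at a fixed position purely by chance is roughly 0.25 to the power of that length, assuming uniformly random bases. Both function names below are made up for illustration.

```python
import math


def allowed_errors(match_length, max_error_rate=0.1):
    """Errors tolerated for a match of this length (rounded down).

    Assumes the rule described in the cutadapt guide; 0.1 is the
    default maximum error rate there.
    """
    return math.floor(match_length * max_error_rate)


def random_match_probability(length):
    """Chance that `length` uniformly random bases match exactly."""
    return 0.25 ** length


for name, length in [("first 15 bp", 15), ("full 33 bp adapter", 33)]:
    print(
        f"{name}: up to {allowed_errors(length)} error(s) tolerated, "
        f"random exact match ~ {random_match_probability(length):.1e}"
    )
```

A longer adapter sequence therefore tolerates more sequencing errors within a full-length match while being far less likely to match by chance. Partial adapter occurrences at the 3' end of a read can still be found when the full sequence is supplied, so truncating the adapter to 15 bp should not be necessary.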
Could you try Sequali (https://github.com/rhpvorderman/sequali) on one of the read pairs? It will do an overlap analysis on the read pair and report the adapter sequence, including an identification of the most likely candidate.
Hi, the FastQC reports are from the 100 bp single-read data. I will check out Sequali.
I forgot to mention that all the FastQC reports above are MultiQC reports covering 180 samples in total. Yes, I tried Sequali, and it reported the Illumina TruSeq adapter as overrepresented among the samples. I used that same sequence for trimming, but I wanted to compare a few variants: the whole adapter; the adapter without the leading A (Illumina always adds an A base before attaching the adapter sequences); the first 15 bases of the adapter; or a combination of multiple adapters. What about the polyA adapter identified by Illumina: does it cause any problems during mapping?
PolyG is not technically an adapter. It is a result of the sequencing machine not being able to see luminescence at that particular spot in the flowcell. Due to the way the chemistry works, there should be yellow, red, yellow and red, or nothing, and nothing codes for G. A repeated G is probably a result of something being broken (hence no lights) rather than an actual G repeat. I wouldn't worry about it. Soft clipping is a thing that modern read mappers do, so this is not likely to throw off your results. FastQC (and Sequali as well) look for 12 bp probes, so what is detected is not really polyG but rather 12 consecutive G's. These do not have to be at the end of the sequence for the detection to be triggered.
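The probe idea can be sketched in a few lines. This is only an illustration of "12 consecutive G's anywhere in the read", not the actual FastQC or Sequali implementation; the function name and the example reads are made up.

```python
def contains_probe(read, probe="G" * 12):
    """Return True if the probe occurs anywhere in the read."""
    return probe in read.upper()


# Made-up example reads.
tail_polyg = "ACGTACGTTAGC" + "G" * 20          # G run at the 3' end
internal_run = "ACGT" + "G" * 12 + "TTACGATC"   # G run in the middle of the read
short_run = "ACGT" + "G" * 11 + "TTACGATC"      # only 11 G's: below the probe length

for name, read in [
    ("tail_polyg", tail_polyg),
    ("internal_run", internal_run),
    ("short_run", short_run),
]:
    print(name, contains_probe(read))
```

The second read is flagged even though its G run is internal, which is why such a report can show polyG content for reads that do not end in a G tail.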
Hi, I am using Cutadapt 4.9, installed in a conda environment. I have some follow-up questions, and I am confused about how to trim the adapters. I have the adapter information for my paired-end sequencing data, for which two adapter sequences were used. The sequences are given below. I am using the Illumina platform for paired-end DNA sequencing.
Sequences of adapter
5' Adapter:
5’-AATGATACGGCGACCACCGAGATCTACAC(i5Index)ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’
3' Adapter:
5’-CAAGCAGAAGACGGCATACGAGAT(Reverse complementary sequence of i7Index)GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC-3’
These are the adapter sequences used for my paired-end sequencing data. I want to know how to remove the adapters using cutadapt.
I have some questions about how to remove them:
(a) Should I use the concept of linked adapters?
where, for simplicity, let's call the 5' adapter 5A and the 3' adapter 3A.
Here I am confused: should I use the reverse complement of one adapter for the reverse read?
(b) Or should I trim them individually using cutadapt -a -g -A -G? Should I use the reverse complement of one of the sequences? I am really confused about this. Looking forward to your support.