I have 5' and 3' adapter sequences for paired-end sequencing data. How should I remove them? #813

Open
sayeraselvan opened this issue Oct 14, 2024 · 14 comments

Comments

@sayeraselvan

sayeraselvan commented Oct 14, 2024

Hi, I am using Cutadapt 4.9, installed in a conda environment. I have some follow-up questions because I am confused about how to trim the adapters. I have adapter information for paired-end sequencing data in which two adapter sequences were used; the sequences are listed below. The data are paired-end DNA sequencing reads from an Illumina platform.

Sequences of adapter
5' Adapter:
5’-AATGATACGGCGACCACCGAGATCTACAC(i5Index)ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3’
3' Adapter:
5’-CAAGCAGAAGACGGCATACGAGAT(Reverse complementary sequence of i7Index)GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC - 3’

These are the adapter sequences used in my paired-end sequencing data. How do I remove them using cutadapt?

I have some questions about how to remove them:

(a) Should I use the concept of linked adapters?
For simplicity, let's call the 5' adapter 5A and the 3' adapter 3A:

  cutadapt -a 5A...3A -A 5A...3A -o out1.fq.gz -p out2.fq.gz input1.fq.gz input2.fq.gz

Here I am confused: should I use the reverse complement of one of the adapters for the reverse read?

(b) Or should I trim them individually using cutadapt -a -g -A -G? Should I use the reverse complement of one of the sequences? I am really confused about this and look forward to your support.

@marcelm
Owner

marcelm commented Oct 14, 2024

What type of library is this? Is this from whole-genome sequencing?

What you’re describing is the Illumina read layout for a TruSeq dual index library, see https://teichlab.github.io/scg_lib_structs/methods_html/Illumina.html.
Maybe you can get away with not removing adapters? If the inserts are large enough, there’s typically no need to remove adapters because you shouldn’t encounter them (or only in very few reads).

You can try the commands listed in the documentation: https://cutadapt.readthedocs.io/en/stable/guide.html#truseq
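
To make the insert-size argument concrete: adapter sequence appears in a read only when the sequenced fragment (the insert) is shorter than the read, so that the read runs past the end of the insert. A minimal sketch of that arithmetic follows; the function name and the example numbers are illustrative only and are not part of Cutadapt.

```python
def bases_past_insert(read_length: int, insert_length: int) -> int:
    """Number of bases at the 3' end of a read that lie beyond the insert.

    These bases are adapter sequence (plus anything sequenced after it).
    Idealized: assumes the read starts exactly at the first insert base.
    """
    return max(0, read_length - insert_length)


# 150 bp reads, as mentioned later in this thread
print(bases_past_insert(150, 120))  # short insert: 30 bases of read-through
print(bases_past_insert(150, 300))  # long insert: 0, no adapter in the read
```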

@sayeraselvan
Author

Thank you for the quick reply, I really appreciate it. Yes, this is pooled WGS data. I checked FastQC, and 2 percent of the reads (around 500,000 reads) contain adapter sequences. These are paired-end data with 150 bp reads. I tried the commands from the docs, but I find them a bit confusing.
(a) Should I remove the 3' adapter from both reads, using the 3' adapter sequence directly for the forward read and the reverse complement of the forward adapter sequence for the reverse read? Or do I need a linked adapter set to remove them? Or should I also consider removing the 5' adapter?

@rhpvorderman
Collaborator

rhpvorderman commented Oct 15, 2024

Illumina paired-end is pretty standard. The command provided in the documentation should do the trick:

cutadapt \
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz \
    reads.R1.fastq.gz reads.R2.fastq.gz

Then afterwards you can check your trimmed reads with FastQC or Sequali to see if the adapters are removed.

@sayeraselvan
Author

Thank you! I understand the command line, but I am confused about which adapter sequence to use for which end.

@marcelm
Owner

marcelm commented Oct 15, 2024

Just use the command exactly as written. The adapters are already the correct ones.
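
For readers who want to see why the documented sequences already match the adapters listed in the first post, here is a small sketch. It takes the part of each adapter that follows the index (copied from the first post) and compares its reverse complement with the `-a` and `-A` sequences from the command above. The helper `reverse_complement` and the constant names are made up for this illustration.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Portions of the adapters after the index, as written in the first post
ADAPTER_5P_TAIL = "ACACTCTTTCCCTACACGACGCTCTTCCGATCT"
ADAPTER_3P_TAIL = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC"

# Sequences from the documented command
R1_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # given to -a
R2_ADAPTER = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"  # given to -A

# The -A sequence is exactly the reverse complement of the 5' adapter tail.
print(reverse_complement(ADAPTER_5P_TAIL) == R2_ADAPTER)

# The -a sequence, apart from its leading A, is the start of the
# reverse complement of the 3' adapter tail.
print(reverse_complement(ADAPTER_3P_TAIL).startswith(R1_ADAPTER[1:]))
```

Both comparisons print `True`, so no further reverse-complementing is needed when using the command as written.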

@sayeraselvan
Author

I finally understand them. Thank you so much! I really appreciate the software and the work :) Have a great day, and I will close this issue.

@sayeraselvan
Author

(a) I have some small queries. Some of my samples were prepared with different adapters (listed below). Should I use the same command as before for adapter removal? Also, the MultiQC report for them shows polyA, so I plan to use the polyA command for it.

Sequences of adapter
5' Adapter:
5'-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT-3'
3' Adapter:
5'-GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGATGACTATCTCGTATGCCGTCTTCTGCTTG-3'

(b) Another small follow-up question: I have some 100 bp Illumina single-read data downloaded from a public source. MultiQC showed 50 percent adapter content (I used the adapter sequence from this page). I have observed that the trimming varies a lot depending on the length of the adapter sequence provided. When I used the initial 15 bp of the adapter, the adapter content went down to 1 percent. Should I go with the whole adapter sequence or the initial 15 bp? What length of adapter sequence should be used with cutadapt?

@sayeraselvan sayeraselvan reopened this Oct 15, 2024
@rhpvorderman
Collaborator

a) PolyA is a commonly recurring motif in the human genome. So it is not odd to see polyA. No need to trim it.

b) The more information cutadapt is given, the more accurate it can cut. This document contains all the adapters and the recommended trimming sequences for them: https://support-docs.illumina.com/SHARE/AdapterSequences/Content/SHARE/FrontPages/AdapterSeq.htm
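
As a quick sanity check on the second adapter pair listed in question (a) above, the sketch below compares those sequences with the `-a`/`-A` sequences from the documented command. It only tests string prefixes; the constant names are made up for this illustration. Since a 3' adapter given with `-a` or `-A` is removed together with everything that follows it, a shared prefix suggests, but does not prove, that the same command would also work for these samples.

```python
# Sequences from the documented paired-end command
R1_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # given to -a
R2_ADAPTER = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"  # given to -A

# Second adapter pair, copied from question (a) above
SECOND_5P = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"
SECOND_3P = "GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGATGACTATCTCGTATGCCGTCTTCTGCTTG"

# The listed 5' adapter begins with the full -A sequence.
print(SECOND_5P.startswith(R2_ADAPTER))

# The listed 3' adapter begins with the -a sequence minus its leading A.
print(SECOND_3P.startswith(R1_ADAPTER[1:]))
```

Both checks print `True`.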

@sayeraselvan
Author

Hi, I found some discrepancies in the FastQC reports after trimming with cutadapt.

(a) FastQC report before cutadapt: I downloaded sequences from a paper (178 fruit fly samples; Illumina 2500, 100 bp single-end reads). The samples have high adapter content.

image

(b) Since I have no information about the adapter sequences, I used the Illumina TruSeq single index adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCA from the above link (because I found it among the overrepresented sequences). After running cutadapt with it, the adapters were trimmed, but there were still a few overrepresented-sequence warnings for a universal adapter, though fewer than before.

image

(c) I wanted to see what happens if I use only the first 16 bp of the above, cutadapt -a AGATCGGAAGAGCACA. The result showed polyA, and there was an overrepresented-sequence warning for some universal adapters.

image

(d) I then used cutadapt -a GATCGGAAGAGCACA, i.e. (c) without the leading A (I got an incomplete-adapter warning for 100 samples), but the adapter content was very low and there was no overrepresented-sequence warning.

image

(e) Finally, I combined (c) and (d) as multiple adapters, cutadapt -a AGATCGGAAGAGCAC -a GATCGGAAGAGCAC, and got low adapter content and no overrepresented-sequence warning.

image

Question: Based on these results, is it ideal to go with the multiple-adapter case (e), cutadapt -a AGATCGGAAGAGCAC -a GATCGGAAGAGCAC, i.e. the adapter with and without the leading A? And what about the polyA adapters that are present: should I remove them?
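
To illustrate what the leading A changes in the trimmed output, here is a deliberately simplified toy trimmer. It is not Cutadapt's algorithm: Cutadapt aligns the adapter and tolerates errors, whereas this sketch uses exact string matching only. The function name, the made-up insert sequence, and the test read are all hypothetical; the default `min_overlap=3` mirrors Cutadapt's default minimum overlap. The sketch does not explain the differences seen in the MultiQC reports; it only shows that dropping the leading A moves the cut position by one base.

```python
def trim_3prime_exact(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Toy 3' adapter trimming using exact matches only (no mismatches/indels)."""
    # Case 1: the whole adapter occurs in the read; cut at its first occurrence.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: only a prefix of the adapter fits at the 3' end of the read.
    # Try the longest possible prefix first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


INSERT = "ACGTTGCAAC"                   # made-up insert sequence
READ = INSERT + "AGATCGGAAGAGCACACGTC"  # insert followed by the start of the adapter

print(trim_3prime_exact(READ, "AGATCGGAAGAGCACA"))  # ACGTTGCAAC
print(trim_3prime_exact(READ, "GATCGGAAGAGCACA"))   # ACGTTGCAACA (one extra A kept)
```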

@sayeraselvan sayeraselvan reopened this Oct 16, 2024
@rhpvorderman
Collaborator

rhpvorderman commented Oct 16, 2024

Could you try sequali (https://github.com/rhpvorderman/sequali) on one of the read pairs? That will do an overlap analysis on the read pair and will report the adapter sequence including an identification of the most likely candidate.

EDIT: pip install sequali

@sayeraselvan
Author

sayeraselvan commented Oct 16, 2024

Hi, the FastQC reports are from the 100 bp single-read data. I will check out Sequali.

@sayeraselvan
Author

I forgot to mention that all the FastQC reports above are MultiQC reports of 180 samples in total. Yes, I tried Sequali, and it reported the Illumina TruSeq adapter as overrepresented among the samples. I used that sequence for trimming, but I wanted to compare the options: the whole adapter, the adapter without the leading A (since Illumina always adds an A base before attaching the adapter sequences), the first 15 bases of the adapter, or a combination of multiple adapters. And what about the polyA adapter that was identified: does it cause any problems during mapping?

@sayeraselvan
Author

sayeraselvan commented Oct 17, 2024

Hi, this is a new dataset of Illumina paired-end reads, and I have a short follow-up question. For one batch of samples, this was the MultiQC report of 34 samples before trimming with cutadapt.

image

I then used the following command:

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o trimmedR1.fq.gz -p trimmedR2.fq.gz R1.fq.gz R2.fq.gz --cores=12 --nextseq-trim=13 --minimum-length 35 --no-indels --pair-filter=any --action=trim

Screenshot 2024-10-17 at 16 35 18

For some of the samples, the polyG content is a bit less than 0.5 percent. Would it be a problem to proceed with mapping, given that in the end I am going to use samtools -q 20 to filter for high-quality reads?

Is it fine to continue with samples that have low adapter content like this (less than 1 percent), or do I have to filter it out by all means?

@rhpvorderman
Collaborator

PolyG is not technically an adapter. It is a result of the sequencing machine not being able to see luminescence at that particular spot in the flowcell. Due to the way the chemistry works, there should be yellow, red, yellow and red, or nothing. And nothing codes for G. A repeated G is probably a result of something being broken (hence no lights) rather than an actual G repeat.

I wouldn't worry about it. Soft clipping is a thing that modern read mappers do. It is not something that is likely to throw off your results. FastQC (and Sequali as well) look for 12 bp probes. So the detection is not actually PolyG but more like 12 continuous G's. These do not necessarily have to be placed at the end of the sequence for the detection to be triggered.
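
The difference between "a 12 bp G probe occurs anywhere in the read" and "the read ends in a run of G's" can be sketched as follows. This is only an illustration of the point made above, not the actual FastQC or Sequali code; the function names and example reads are made up.

```python
PROBE = "G" * 12  # 12 bp probe, following the description above


def has_polyg_probe(read: str) -> bool:
    """True if 12 consecutive G's occur anywhere in the read."""
    return PROBE in read


def trailing_g_run(read: str) -> int:
    """Length of the run of G's at the 3' end of the read."""
    return len(read) - len(read.rstrip("G"))


internal = "ACGT" + "G" * 12 + "ACGT"  # G run in the middle of the read
tail = "ACGTAC" + "G" * 20             # G run at the 3' end

# Both reads trigger the probe, but only the second ends in a G run.
print(has_polyg_probe(internal), trailing_g_run(internal))  # True 0
print(has_polyg_probe(tail), trailing_g_run(tail))          # True 20
```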

3 participants