I have 5' and 3' adapter sequences for paired-end sequencing data. How should I remove them? #813

sayeraselvan · 2024-10-14T14:41:24Z

Hi, I am using Cutadapt 4.9, installed in a conda environment. I have some questions because I am confused about how to trim the adapters. My data are paired-end DNA sequencing reads from an Illumina platform, and two adapter sequences were used. I have listed the adapter sequences below.

Adapter sequences:
5' Adapter:
5'-AATGATACGGCGACCACCGAGATCTACAC(i5Index)ACACTCTTTCCCTACACGACGCTCTTCCGATCT-3'
3' Adapter:
5'-CAAGCAGAAGACGGCATACGAGAT(Reverse complementary sequence of i7Index)GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC-3'

These are the adapter sequences used for my paired-end sequencing data. How do I remove them with Cutadapt? Specifically:

(a) Should I use linked adapters?
For simplicity, let's call the 5' adapter 5A and the 3' adapter 3A:

  cutadapt -a 5A...3A -A 5A...3A -o out1.fq.gz -p out2.fq.gz input1.fq.gz input2.fq.gz

Here I am confused: should I use the reverse complement of one of the adapters for the reverse read?

(b) Or should I trim them individually using cutadapt -a -g -A -G? Should I use the reverse complement of one of the sequences? I am really confused about this and look forward to your support.

marcelm · 2024-10-14T19:49:45Z

What type of library is this? Is this from whole-genome sequencing?

What you’re describing is the Illumina read layout for a TruSeq dual index library, see https://teichlab.github.io/scg_lib_structs/methods_html/Illumina.html.
Maybe you can get away with not removing adapters? If the inserts are large enough, there’s typically no need to remove adapters because you shouldn’t encounter them (or only in very few reads).

You can try the commands listed in the documentation: https://cutadapt.readthedocs.io/en/stable/guide.html#truseq

sayeraselvan · 2024-10-15T08:17:29Z

Thank you for the quick reply, I really appreciate it. Yes, this is pooled WGS data. I checked with FastQC, and about 2 percent of the reads (around 500,000) contain adapter sequences. The data are paired-end with 150 bp reads. I tried out the commands from the docs, but I find them a bit confusing.
(a) Should I remove the 3' adapter from both reads, using the 3' adapter sequence directly for the forward read and the reverse complement of the forward adapter sequence for the reverse read? Or do I need to use a linked adapter to remove them? Should I also consider removing the 5' adapter?

rhpvorderman · 2024-10-15T08:25:43Z

Illumina paired-end is pretty standard. The command provided in the documentation should do the trick:

cutadapt \
    -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
    -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
    -o trimmed.R1.fastq.gz -p trimmed.R2.fastq.gz \
    reads.R1.fastq.gz reads.R2.fastq.gz

Then afterwards you can check your trimmed reads with FastQC or Sequali to see if the adapters are removed.

sayeraselvan · 2024-10-15T08:39:05Z

Thank you! I understand the command line, but I am confused about which adapter sequence to use for which end.

marcelm · 2024-10-15T08:40:35Z

Just use the command exactly as written. The adapters are already the correct ones.
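
A note for readers wondering why the documented sequences are already correct: a read that runs past the insert continues into the adapter on the far end of the fragment, so what appears at the 3' end of the read is the reverse complement of the insert-adjacent part of that adapter. The sketch below checks this against the sequences posted at the top of this thread. `revcomp` is a helper name made up for this illustration, not part of Cutadapt, and it assumes uppercase A/C/G/T input.

```shell
#!/bin/sh
# revcomp: hypothetical helper (not part of Cutadapt) that prints the
# reverse complement of the uppercase DNA sequence given as $1.
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGT' 'TGCA'
}

# Part of the 5' adapter that follows the i5 index, as posted above.
revcomp ACACTCTTTCCCTACACGACGCTCTTCCGATCT
# prints AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT (identical to the -A sequence)

# Part of the 3' adapter that follows the i7 index, as posted above.
revcomp GTGACTGGAGTTCAGACGTGTGCTCTTCCGATC
# prints GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
# (the -a sequence without its leading A, plus one further base)
```

The -A sequence matches exactly. The -a sequence additionally starts with an A, which presumably pairs with a final T that the 3' adapter as posted here does not include. Either way, no manual reverse-complementing is needed before passing the documented sequences to -a and -A.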

sayeraselvan · 2024-10-15T08:46:04Z

I finally understand, thank you so much! I really appreciate the software and the work :) Have a great day. I will close this issue.

sayeraselvan · 2024-10-15T09:34:16Z

(a) I have some small queries. For some samples, different adapters were used (sequences below). Should I use the same command as before for adapter removal? The MultiQC report for these samples also shows polyA. I will use the polyA command for it.

Adapter sequences:
5' Adapter:
5'-AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT-3'
3' Adapter:
5'-GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGATGACTATCTCGTATGCCGTCTTCTGCTTG-3'

(b) Another small follow-up question: I have some 100 bp Illumina single-end data from an online source. MultiQC shows about 50 percent adapter content. (I used the adapter sequence from this page.) I have observed that the trimming result varies a lot with the length of the adapter sequence provided. When I used the first 15 bp of the adapter, the adapter content went down to 1 percent. Should I use the whole adapter sequence or only the first 15 bp? What adapter sequence length is recommended for Cutadapt?

rhpvorderman · 2024-10-15T09:42:53Z

a) PolyA is a commonly recurring motif in the human genome. So it is not odd to see polyA. No need to trim it.

b) The more information cutadapt is given, the more accurate it can cut. This document contains all the adapters and the recommended trimming sequences for them: https://support-docs.illumina.com/SHARE/AdapterSequences/Content/SHARE/FrontPages/AdapterSeq.htm

sayeraselvan · 2024-10-16T09:44:19Z

Hi, I found some discrepancies in the FastQC reports after trimming with Cutadapt.

(a) FastQC report before Cutadapt: I downloaded the sequences from a paper (178 fruit fly samples; Illumina 2500, 100 bp single-end reads). The samples have high adapter content.

(b) Since I have no information about the adapter sequences, I used the Illumina TruSeq single index adapter AGATCGGAAGAGCACACGTCTGAACTCCAGTCA from the above link (because I found it among the overrepresented sequences). After running Cutadapt with it, the adapters were trimmed, but there were still a few overrepresented-sequence warnings for some universal adapter.

(c) I wanted to see what happens with only the first 16 bp of the above, cutadapt -a AGATCGGAAGAGCACA. The result showed polyA, and there was an overrepresented-sequence warning for some universal adapters.

(d) I then used cutadapt -a GATCGGAAGAGCACA, i.e. the sequence from (c) without the A in front. (I got an incomplete-adapter warning for 100 samples.) The adapter content was very low and there was no overrepresented-sequence warning.

(e) Finally, I used the multiple-adapter combination of (c) and (d) together, cutadapt -a AGATCGGAAGAGCAC -a GATCGGAAGAGCAC, and got low adapter content and no overrepresented-sequence warning.

Question: Based on these observations, is it best to go with the multiple-adapter approach from (e), cutadapt -a AGATCGGAAGAGCAC -a GATCGGAAGAGCAC, i.e. with and without the leading A? And what about the polyA sequences that are present? Should I remove them?

rhpvorderman · 2024-10-16T10:01:47Z

Could you try sequali (https://github.com/rhpvorderman/sequali) on one of the read pairs? That will do an overlap analysis on the read pair and will report the adapter sequence including an identification of the most likely candidate.

EDIT: pip install sequali

sayeraselvan · 2024-10-16T10:20:25Z

Hi, the FastQC reports are from the 100 bp single-end data. I will check out Sequali.

sayeraselvan · 2024-10-16T10:49:20Z

I forgot to mention that all the FastQC reports above are MultiQC reports of 180 samples in total. Yes, I tried out Sequali, and it told me that the Illumina TruSeq adapter is overrepresented among the samples. I used the same sequence for trimming, but I wanted to compare the options: the whole adapter, the adapter without the leading A (as Illumina always adds an A base before attaching the adapter sequences), the first 15 bases of the adapter, or a combination of multiple adapters. And what about the polyA adapter identified by Illumina: does it cause any problems during mapping?

sayeraselvan · 2024-10-17T14:43:05Z

Hi, this is a new dataset of Illumina paired-end reads, and I have a short follow-up question. For one batch, this was the MultiQC report of 34 samples before trimming with Cutadapt.

I then used the following command:

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o trimmedR1.fq.gz -p trimmedR2.fq.gz R1.fq.gz R2.fq.gz --cores=12 --nextseq-trim=13 --minimum-length 35 --no-indels --pair-filter=any --action=trim

For some of the samples, the polyG content is a bit less than 0.5 percent. Would it be a problem to proceed with mapping? In the end, I am going to use samtools -q 20 to keep only the high-quality reads.

Is it fine to continue with samples that have low adapter content like this (less than 1 percent), or do I have to filter it out by all means?

rhpvorderman · 2024-10-18T07:26:43Z

PolyG is not technically adapter. It is a result of the sequencing machine not being able to see luminescence at that particular spot in the flowcell. Due to the way the chemistry works, there should be yellow, red, yellow and red, or nothing. And nothing codes for G. A repeated G is probably a result of something being broken (hence no lights) rather than an actual G repeat.

I wouldn't worry about it. Soft clipping is a thing that modern read mappers do. It is not something that is likely to throw off your results. FastQC (and Sequali as well) look for 12 bp probes. So the detection is not actually PolyG but more like 12 contiguous G's. These do not necessarily have to be placed at the end of the sequence for the detection to be triggered.
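
To get a rough feel for how many reads would trigger such a probe, one can count the reads that contain a run of 12 G's anywhere in the sequence. This is only a sketch under stated assumptions: `count_polyg_reads` is a name made up here, it assumes standard four-line FASTQ records, and it is not how FastQC or Sequali compute their reported percentages.

```shell
#!/bin/sh
# count_polyg_reads: hypothetical helper that reads FASTQ from stdin (or from
# the files given as arguments) and prints the number of reads whose sequence
# contains at least 12 consecutive G's at any position.
count_polyg_reads() {
    # Assumes 4 lines per record; the sequence is line 2 of each record.
    awk 'NR % 4 == 2 && /GGGGGGGGGGGG/ { n++ } END { print n + 0 }' "$@"
}

# Usage with gzipped input (the file name is a placeholder):
#   gzip -dc trimmedR1.fq.gz | count_polyg_reads
```

Because the pattern is not anchored to the end of the read, a G run in the middle of a read is counted as well, which mirrors the point above that the probe can match anywhere.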

sayeraselvan closed this as completed Oct 15, 2024

sayeraselvan reopened this Oct 15, 2024

sayeraselvan closed this as completed Oct 15, 2024

sayeraselvan reopened this Oct 16, 2024
