Primer sequence trimming. Cutadapt does not cut all of them #798

Guillermouceda · 2024-08-08T16:01:21Z

Hello,

I am relatively new to Cutadapt and therefore this question might have been asked before or it may seem stupid.

I am using Cutadapt to trim the primer sequences from several metabarcoding libraries produced with Illumina MiSeq 600 v3. The sequences are already demultiplexed and split into the original samples. I am running Cutadapt with the following settings:

Windows subsystem for Linux (WSL v. 2)
Python 3.9
Cutadapt 4.3
Cutadapt was installed via pip
Laptop 16 GB of RAM
Processor: Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz 2.30 GHz

My libraries consist of sequences coming from pollen collected from bees. I have used the following primers:

Fwd: ATGCGATACTTGGTGTGAAT
Rev: GACGCTTCTCCAGACTACAAT

I have 175 libraries and therefore I have 350 pair-end files (R1 and R2 files). Thus, I have run Cutadapt in loop like:

for i in *_R1.fastq.gz
do
      FILENAME=`basename $i _R1.fastq.gz`;
      echo $FILENAME;
      cutadapt -g ATGCGATACTTGGTGTGAAT -G GACGCTTCTCCAGACTACAAT \
      -m 100 -M 350 \
      --match-read-wildcards \
      --pair-filter=both -q 10 \
      -o ${FILENAME}_R1_trimmed.fastq.gz -p ${FILENAME}_R2_trimmed.fastq.gz\
      ${FILENAME}_R1.fastq.gz ${FILENAME}_R2.fastq.gz
      done

Cutadapt runs smoothly and I do not get any issues. However, I still find some of my primer sequences after counting them with:

zgrep -c ATGCGATACTTGGTGTGAAT *_R1_trimmed.fastq.gz >> primers_Fw_in_reads_r1_trimmed.txt
zgrep -c GACGCTTCTCCAGACTACAAT *_R2_trimmed.fastq.gz >> primers_Rev_in_reads_r2_trimmed.txt
zgrep -c ATGCGATACTTGGTGTGAAT *_R2_trimmed.fastq.gz >> primers_fw_in_reads_r2_trimmed.txt
zgrep -c GACGCTTCTCCAGACTACAAT *_R1_trimmed.fastq.gz >> primers_rev_in_reads_r1_trimmed.txt

In the forward read (R1) I still find a few Fwd primer sequences, as well as I find a few Rev primer sequences in the R2 reads. Is this normal? I am talking about a few 100s in some samples. But what is most striking is that all the Fwd primer sequences in the R2 files are still there, and the same happens for the Rev primer sequences in the R1 reads. **Is there something that I am not understanding and this is normal?**Is it OK to run again Cutadapt over the output of the first run with the same settings?

I provide the .txt files were the count of each primer sequences was done for each library before/after running Cutadapt and the read count of each library.

Read counts per library file:
read.counts.txt

Primer sequence count before Cutadapt:
primers_fw_in_reads_r2.txt
primers_rev_in_reads_r1.txt
primers_Rev_in_reads_r2.txt
primers_Fw_in_reads_r1.txt

Primer sequence count after Cutadapt:
primers_fw_in_reads_r2_trimmed.txt
primers_rev_in_reads_r1_trimmed.txt
primers_Rev_in_reads_r2_trimmed.txt
primers_Fw_in_reads_r1_trimmed.txt

Thank you very much in advance.

The text was updated successfully, but these errors were encountered:

marcelm · 2024-08-11T16:37:24Z

In the forward read (R1) I still find a few Fwd primer sequences, as well as I find a few Rev primer sequences in the R2 reads. Is this normal? I am talking about a few 100s in some samples.

This is probably normal. I believe the extra occurrences are from reads that contain two primers. Cutadapt by default trims the leftmost one, so additional occurrences remain in the read.

But what is most striking is that all the Fwd primer sequences in the R2 files are still there, and the same happens for the Rev primer sequences in the R1 reads. Is there something that I am not understanding and this is normal?

This is expected: The sequence that you provide with the (lowercase) -g option is searched only in the R1 reads, and the sequence that you provided with the (uppercase) -G option is searched only in the R2 reads.

If you want to search the respective other read as well, you need to use the --revcomp option. Use this only if you are sure that you can have inserts in both directions.

Is it OK to run again Cutadapt over the output of the first run with the same settings?

It does not make that much sense to me. I have not worked with metabarcoding data, so I’m not entirely sure how your data is actually structured, but I would probably run Cutadapt in a slightly different way. For both the -g and the -G options, I would use anchored adapters, that is, write them like this: -g ^ATGCGATACTTGGTGTGAAT -G ^GACGCTTCTCCAGACTACAAT because I would expect them to be exactly at the 5' end of the read. With the ^ character, you allow partial matches and also matches within the read. Also, I would probably use --discard-untrimmed in order to ensure I only get the reads that actually start with a primer.

If you do that, you don’t need the -m 100 -M 350 --pair-filter=both options. Also, using the --match-read-wildcards option is a bit unusual, you may want to leave it out unless you have a reason for needing it.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Primer sequence trimming. Cutadapt does not cut all of them #798

Primer sequence trimming. Cutadapt does not cut all of them #798

Guillermouceda commented Aug 8, 2024 •

edited by marcelm

Loading

marcelm commented Aug 11, 2024

Primer sequence trimming. Cutadapt does not cut all of them #798

Primer sequence trimming. Cutadapt does not cut all of them #798

Comments

Guillermouceda commented Aug 8, 2024 • edited by marcelm Loading

marcelm commented Aug 11, 2024

Guillermouceda commented Aug 8, 2024 •

edited by marcelm

Loading