patscan_seq
patscan_seq is a wrapper around the sequence pattern scanner patscan (or scan_for_matches) and can be used to scan for patterns in sequences from the stream. There are two modes for running patscan_seq: the default mode, where all pattern matches are output as separate records following each sequence, and the inline mode, where the pattern matches are output along with the sequence record. The default mode is good for searching large sequences (like genomes), and the inline mode is handy for searching for single matches in short sequence reads.
Advanced pattern syntax is covered in the scan_for_matches README.
The default record type looks like this:
REC_TYPE: PATSCAN
S_ID: test
Q_ID: GACT
MATCH: GACT
S_BEG: 1
S_END: 4
STRAND: +
SCORE: 100
MATCH_LEN: 4
---
while the inline record type looks like this:
SEQ_NAME: test
SEQ: GGACTACNNGGGTATCTAATAGTC
SEQ_LEN: 24
PATTERN: GACT
MATCH: GACT
S_BEG: 1
S_END: 4
STRAND: +
MATCH_LEN: 4
---
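The two record types above share the same layout: one `KEY: value` pair per line, with `---` closing each record. A minimal Python sketch for reading such a stream into dictionaries could look like the following. It assumes the stream is plain text exactly as printed above; the function name `parse_records` and the sample data are illustrative only and are not part of Biopieces.

```python
from typing import Dict, Iterable, Iterator


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per record from 'KEY: value' lines closed by '---'."""
    record: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "---":
            # End of the current record; skip empty records.
            if record:
                yield record
            record = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            record[key.strip()] = value
    if record:
        # Tolerate a final record that lacks the closing '---'.
        yield record


# Sample stream (assumed text form): a sequence record followed by one match.
example = """\
SEQ_NAME: test
SEQ: GGACTACNNGGGTATCTAATAGTC
SEQ_LEN: 24
---
REC_TYPE: PATSCAN
S_ID: test
Q_ID: GACT
MATCH: GACT
S_BEG: 1
S_END: 4
STRAND: +
SCORE: 100
MATCH_LEN: 4
---
"""

records = list(parse_records(example.splitlines()))
# Default-mode matches can be picked out by their REC_TYPE key.
matches = [r for r in records if r.get("REC_TYPE") == "PATSCAN"]
print(len(records), matches[0]["MATCH"])
```

All values are kept as strings; convert fields such as `S_BEG` and `S_END` with `int()` where numbers are needed.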
patscan_seq requires scan_for_matches to be installed. Read more here:
http://blog.theseed.org/servers/2010/07/scan-for-matches.html
... | patscan_seq [options]
[-? | --help] # Print full usage description.
[-p <string> | --pattern=<string>] # Pattern to scan for.
[-P <file!> | --pattern=<file!>] # File with patterns - one per line.
[-c | --comp] # Also search reverse complement strand.
[-i | --inline] # Output matches inline.
[-o | --overlap] # Allow overlapping matches.
[-h <uint> | --max_hits=<uint>] # Stop scanning after max hits.
[-m <uint> | --max_misses=<uint>] # Stop scanning after max misses.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the sequence in the FASTA file test.fna:
>test
GGACTACNNGGGTATCTAATAGTC
To search the sequence for a simple pattern consisting of the sequence GGACTA allowing for 3 mismatches, 2 insertions and 1 deletion:
read_fasta -i test.fna | patscan_seq -p 'GGACTA[3,2,1]'
SEQ_NAME: test
SEQ: GGACTACNNGGGTATCTAATAGTC
SEQ_LEN: 24
---
REC_TYPE: PATSCAN
S_ID: test
Q_ID: GGACTA[3,2,1]
MATCH: GGACTA
S_BEG: 0
S_END: 5
STRAND: +
SCORE: 100
MATCH_LEN: 6
---
REC_TYPE: PATSCAN
S_ID: test
Q_ID: GGACTA[3,2,1]
MATCH: NGGGTA
S_BEG: 8
S_END: 13
STRAND: +
SCORE: 100
MATCH_LEN: 6
---
REC_TYPE: PATSCAN
S_ID: test
Q_ID: GGACTA[3,2,1]
MATCH: TCTA
S_BEG: 14
S_END: 17
STRAND: +
SCORE: 100
MATCH_LEN: 4
---
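Judging from the example output above, S_BEG and S_END appear to be 0-based and inclusive, so MATCH_LEN equals S_END - S_BEG + 1. The short Python sketch below checks that reading against the three hits; the helper name `slice_match` is hypothetical, and the coordinate convention is inferred from this one example rather than taken from the scan_for_matches documentation.

```python
seq = "GGACTACNNGGGTATCTAATAGTC"

# (S_BEG, S_END, MATCH) copied from the example output.
hits = [(0, 5, "GGACTA"), (8, 13, "NGGGTA"), (14, 17, "TCTA")]


def slice_match(sequence: str, s_beg: int, s_end: int) -> str:
    """Return the subsequence for 0-based, inclusive coordinates."""
    return sequence[s_beg:s_end + 1]


for s_beg, s_end, match in hits:
    # The slice reproduces the reported MATCH string.
    assert slice_match(seq, s_beg, s_end) == match
    # The span length agrees with MATCH_LEN.
    assert s_end - s_beg + 1 == len(match)
```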
To get the matches inline use the -i switch:
read_fasta -i test.fna | patscan_seq -i -p 'GGACTA[3,2,1]'
SEQ_NAME: test
SEQ: GGACTACNNGGGTATCTAATAGTC
SEQ_LEN: 24
PATTERN: GGACTA[3,2,1]
MATCH: GGACTA
S_BEG: 0
S_END: 5
STRAND: +
MATCH_LEN: 6
---
SEQ_NAME: test
SEQ: GGACTACNNGGGTATCTAATAGTC
SEQ_LEN: 24
PATTERN: GGACTA[3,2,1]
MATCH: NGGGTA
S_BEG: 8
S_END: 13
STRAND: +
MATCH_LEN: 6
---
SEQ_NAME: test
SEQ: GGACTACNNGGGTATCTAATAGTC
SEQ_LEN: 24
PATTERN: GGACTA[3,2,1]
MATCH: TCTA
S_BEG: 14
S_END: 17
STRAND: +
MATCH_LEN: 4
---
To also scan the complementary strand in nucleotide sequences (patscan_seq automagically determines the sequence type), add the -c switch:
... | patscan_seq -p <pattern> -c
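To illustrate why the -c switch can turn up additional hits, the Python sketch below computes the reverse complement of the test sequence and shows an exact pattern that occurs only on that strand. It is a conceptual illustration, not how patscan_seq is implemented; the function name `reverse_complement` is hypothetical, and how patscan_seq reports coordinates for minus-strand matches is not covered here.

```python
# Translation table for DNA bases; N is kept as N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


seq = "GGACTACNNGGGTATCTAATAGTC"
rc = reverse_complement(seq)

pattern = "GACTAT"
on_forward = pattern in seq   # exact match on the given strand
on_reverse = pattern in rc    # exact match on the complementary strand
print(on_forward, on_reverse)
```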
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007 - rewritten September 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
patscan_seq is part of the Biopieces framework.