-
Notifications
You must be signed in to change notification settings - Fork 23
tile_seq
tile_seq can create an alignment of several sequences based on pairwise alignments of both the sequence and the reverse complement sequence using the best matching sequence. This is useful for e.g. matching short sequences such as ESTs or deep sequencing reads against a reference sequence. tile_seq is more precise than a multiple alignment, where the introduction of indels in the reference sequence will most likely ruin the alignment. Also, tile_seq is capable of dealing with thousands of sequences.
tile_seq currently uses Muscle as the alignment engine, and Muscle must be installed in order for tile_seq to work.
For more about Muscle:
... | tile_seq [options]
[-? | --help] # Print full usage description.
[-i <uint> | --identity=<uint>] # Minimum identity (%) for pairwise alignment - Default=70
[-s | --supress_indels] # Supress insertions in query sequence.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following file test.fna
containing these FASTA entries:
>ref
ACGACTAGCATCGACTGACA
>test1
CTAGCTTCGACT
>test2
GAATCGACT
>test3
ACGAAACTAGCATC
>test4
AGCATCGACT
>test5
TAACAGGCACT
In order to tile the test1, test2 ... test5 sequences against the reference sequence, first read in the sequences using read_fasta and then pipe through tile_seq:
read_fasta -i test.fna | tile_seq
SEQ: ACGACTAGCATCGACTGACA
SEQ_NAME: ref
---
SEQ: ACGAAACTAGCATC------
SEQ_NAME: test3_+_85.71
---
SEQ: ----CTAGCTTCGACT----
SEQ_NAME: test1_+_91.67
---
SEQ: ------AGCATCGACT----
SEQ_NAME: test4_+_100.00
---
SEQ: -------GAATCGACT----
SEQ_NAME: test2_+_88.89
---
The resulting tiled sequences show the reference sequence as the first sequence, and then
the subsequent sequences sorted alphabetically by the sequence itself, thus giving the
tiled output. Two pieces of information is added to the SEQ_NAME key, namely the orientation
of the pairwise alignment that gave the highest similarity, and a global identity score that
is calculated as the number of matches over the length of the shortest sequence in the pairwise
alignment. Use the -i
switch to change the identity cutoff for the inclusion of alignments:
read_fasta -i test.fna | tile_seq -i 60
SEQ: ACGACTAGCATCGACTGACA
SEQ_NAME: ref
---
SEQ: ACGAAACTAGCATC------
SEQ_NAME: test3_+_85.71
---
SEQ: ----CTAGCTTCGACT----
SEQ_NAME: test1_+_91.67
---
SEQ: -----TAACAGGCACT----
SEQ_NAME: test5_+_63.64
---
SEQ: ------AGCATCGACT----
SEQ_NAME: test4_+_100.00
---
SEQ: -------GAATCGACT----
SEQ_NAME: test2_+_88.89
---
Now test5 is part of the alignment, and the tiled sequences can be written using write_align:
read_fasta -i test.fna | tile_seq -i 60 | write_align -x
. .
ref ACGACTAGCATCGACTGACA
test3_+_85.71 ACGAAACTAGCATC------
test1_+_91.67 ----CTAGCTTCGACT----
test5_+_63.64 -----TAACAGGCACT----
test4_+_100.00 ------AGCATCGACT----
test2_+_88.89 -------GAATCGACT----
Consensus: 50% -------------ACT----
To better illustrate mismatches in the alignment use invert_align:
read_fasta -i test.fna | tile_seq -i 60 | invert_align | write_align -x
. .
ref ACGACTAGCATCGACTGACA
test3_+_85.71 ----AACTAGCATC______
test1_+_91.67 ____-----T------____
test5_+_63.64 _____--A--GGC---____
test4_+_100.00 ______----------____
test2_+_88.89 _______-A-------____
Consensus: 50% --------------------
Now we clearly see that an insertion in test3 offsets the alignment. This behavior can be
suppressed using the -s
switch to tile_seq:
read_fasta -i test.fna | tile_seq -i 60 -s | invert_align | write_align -x
. .
ref ACGACTAGCATCGACTGACA
test3_+_100.00 ------------________
test1_+_91.67 ____-----T------____
test5_+_63.64 _____--A--GGC---____
test4_+_100.00 ______----------____
test2_+_88.89 _______-A-------____
Consensus: 50% --------------------
Note that the identity score of test3 changes dramatically with the use of the -s
switch.
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
tile_seq is part of the Biopieces framework.