tile_seq

Biopiece: tile_seq

Description

tile_seq can create an alignment of several sequences based on pairwise alignments of both the sequence and the reverse complement sequence using the best matching sequence. This is useful for e.g. matching short sequences such as ESTs or deep sequencing reads against a reference sequence. tile_seq is more precise than a multiple alignment, where the introduction of indels in the reference sequence will most likely ruin the alignment. Also, tile_seq is capable of dealing with thousands of sequences.

tile_seq currently uses Muscle as the alignment engine, and Muscle must be installed in order for tile_seq to work.

For more about Muscle:

http://www.drive5.com/muscle/

Usage

... | tile_seq [options]

Options

[-?         | --help]               #  Print full usage description.
[-i <uint>  | --identity=<uint>]    #  Minimum identity (%) for pairwise alignment  -  Default=70
[-s         | --supress_indels]     #  Supress insertions in query sequence.
[-I <file!> | --stream_in=<file!>]  #  Read input from stream file                  -  Default=STDIN
[-O <file>  | --stream_out=<file>]  #  Write output to stream file                  -  Default=STDOUT
[-v         | --verbose]            #  Verbose output.

Examples

Consider the following file test.fna containing these FASTA entries:

>ref
ACGACTAGCATCGACTGACA
>test1
CTAGCTTCGACT
>test2
GAATCGACT
>test3
ACGAAACTAGCATC
>test4
AGCATCGACT
>test5
TAACAGGCACT

In order to tile the test1, test2 ... test5 sequences against the reference sequence, first read in the sequences using read_fasta:

read_fasta -i test.fna | tile_seq

SEQ: ACGACTAGCATCGACTGACA
SEQ_NAME: ref
---
SEQ: ACGAAACTAGCATC------
SEQ_NAME: test3_+_85.71
---
SEQ: ----CTAGCTTCGACT----
SEQ_NAME: test1_+_91.67
---
SEQ: ------AGCATCGACT----
SEQ_NAME: test4_+_100.00
---
SEQ: -------GAATCGACT----
SEQ_NAME: test2_+_88.89
---

The resulting tiled sequences show the reference sequence as the first sequence, and then the subsequent sequences sorted alphabetically by the sequence itself, thus giving the tiled output. Two pieces of information is added to the SEQ_NAME key, namely the orientation of the pairwise alignment that gave the highest similarity, and a global identity score that is calculated as the number of matches over the length of the shortest sequence in the pairwise alignment. Use the -i switch to change the identity cutoff for the inclusion of alignments:

read_fasta -i test.fna | tile_seq -i 60

SEQ: ACGACTAGCATCGACTGACA
SEQ_NAME: ref
---
SEQ: ACGAAACTAGCATC------
SEQ_NAME: test3_+_85.71
---
SEQ: ----CTAGCTTCGACT----
SEQ_NAME: test1_+_91.67
---
SEQ: -----TAACAGGCACT----
SEQ_NAME: test5_+_63.64
---
SEQ: ------AGCATCGACT----
SEQ_NAME: test4_+_100.00
---
SEQ: -------GAATCGACT----
SEQ_NAME: test2_+_88.89
---

Now test5 is part of the alignment, and the tiled sequences can be written using write_align:

read_fasta -i test.fna | tile_seq -i 60 | write_align -x

                          .         .
ref              ACGACTAGCATCGACTGACA
test3_+_85.71    ACGAAACTAGCATC------
test1_+_91.67    ----CTAGCTTCGACT----
test5_+_63.64    -----TAACAGGCACT----
test4_+_100.00   ------AGCATCGACT----
test2_+_88.89    -------GAATCGACT----
Consensus: 50%   -------------ACT----

To better illustrate mismatches in the alignment use invert_align:

read_fasta -i test.fna | tile_seq -i 60 | invert_align | write_align -x

                          .         .
ref              ACGACTAGCATCGACTGACA
test3_+_85.71    ----AACTAGCATC______
test1_+_91.67    ____-----T------____
test5_+_63.64    _____--A--GGC---____
test4_+_100.00   ______----------____
test2_+_88.89    _______-A-------____
Consensus: 50%   --------------------

Now we clearly see that an insertion in test3 offsets the alignment. This behavior can be suppressed using the -s switch to tile_seq:

read_fasta -i test.fna | tile_seq -i 60 -s | invert_align | write_align -x

                          .         .
ref              ACGACTAGCATCGACTGACA
test3_+_100.00   ------------________
test1_+_91.67    ____-----T------____
test5_+_63.64    _____--A--GGC---____
test4_+_100.00   ______----------____
test2_+_88.89    _______-A-------____
Consensus: 50%   --------------------

Note that the identity score of test3 changes dramatically with the use of the -s switch.

Author

[email protected]

August 2007

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

tile_seq is part of the Biopieces framework.

http://www.biopieces.org

Provide feedback

Saved searches

Use saved searches to filter your results more quickly