Skip to content
Martin Asser Hansen edited this page Oct 2, 2015 · 5 revisions

Biopiece: get_genome_seq

Description

get_genome_seq can be used to get subsequences from a specified genome that have been indexed with format_genome. The subsequence can be obtained explicitly or from BED/PSL/BLAST entries in the stream.

Use list_genomes to see available genome sequences.

Usage

get_genome_seq [options] <-g genome> <-c chr> <-b beg> <-e end|-l len>

or

... | get_genome_seq [options] <-g genome>

Options

[-?          | --help]               #  Print full usage description.
[-g <genome> | --genome=<genome>]    #  Genome to get subsequence from.
[-c <string> | --chr=<string>]       #  Chromosome with requested subsequence.
[-b <uint>   | --beg=<uint>]         #  Begin position of subsequence (first residue=1).
[-e <uint>   | --end=<uint>]         #  End position of subsequence.
[-l <uint>   | --len=<uint>]         #  Length of subsequence.
[-f <uint>   | --flank=<uint>]       #  Include flanking sequence.
[-m          | --mask]               #  Softmask non-exonic sequence.
[-s          | --splice]             #  Splice sequence BLOCKS in BED/PSL type records.
[-I <file!>  | --stream_in=<file!>]  #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]  #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]            #  Verbose output.

Examples

To get an explicit subsequence from the human genome (currently hg18) do:

get_genome_seq -g hg18 -c chrX -b 12000 -l 20

CHR_END: 12018
SEQ: GGTGCAGTAACACCTGCCGT
CHR_BEG: 11999
SEQ_LEN: 20
CHR: chrX
---

If you have a stream with BED, PSL or BLAST records, you obtain the subsequence simple by piping the stream through get_genome_seq and the sequence will be added to the record. Below is an example for a BED entry:

read_bed -i <BED file> -n 1 | get_genome_seq -g hg18

STRAND: +
CHR_END: 95127728
Q_ID: gi|108087685|gb|DQ594132.1|_Homo_sapiens_piRNA_piR-60244,_complete_sequence
CHR_BEG: 95127700
SCORE: 1
REC_TYPE: BED
BED_LEN: 29
CHR: chr15
SEQ: TTCACTTCTCCCATGTAGTTCCTGAGTGC
BED_COLS: 6
---

Now consider the following 12 column BED entry in the file test.bed:

chr4    70946   71196   AA699063        0       -       70946   71196   0       2       142,55, 0,195,

To obtain the dm3 sequence for this EST, do:

read_bed -i test.bed | get_genome_seq -g dm3 | write_fasta -xw 80

>AA699063
CTTTAGTAATGCTAGTGATCGCGCAAAGCATCAAAATCGAACACACAGTAATGAGGTAAGTCTTTCATCTATATAAAAGT
AATATTTTTCATTTATAGCTATTTTTAGAAACCGTACATTTGTAAAGCACCTGGATGCACAAAACGTTACACCGACCCGA
GCTCTTTGAGAAAACATGTAAAAACAGTCCATGGAGCTGAATTTTATGCAAATAAAAAACATAAGGGGTTGCCTCTAAAT
GACGCAAACT

Now use the -m switch to soft mask the intron so that exons are in uppercase sequence and the rest are in lowercase:

read_bed -i test.bed | get_genome_seq -mg dm3 | write_fasta -xw 80

>AA699063
CTTTAGTAATGCTAGTGATCGCGCAAAGCATCAAAATCGAACACACAGTAATGAGGTAAGTCTTTCATCTATATAAAAGT
AATATTTTTCATTTATAGCTATTTTTAGAAACCGTACATTTGTAAAGCACCTGGATGCACAAaacgttacaccgacccga
gctctttgagaaaacatgtaaaaacagtccatggaGCTGAATTTTATGCAAATAAAAAACATAAGGGGTTGCCTCTAAAT
GACGCAAACT

To splice the exon blocks use the -s switch:

read_bed -i test.bed | get_genome_seq -sg dm3 | write_fasta -xw 80

>AA699063
CTTTAGTAATGCTAGTGATCGCGCAAAGCATCAAAATCGAACACACAGTAATGAGGTAAGTCTTTCATCTATATAAAAGT
AATATTTTTCATTTATAGCTATTTTTAGAAACCGTACATTTGTAAAGCACCTGGATGCACAAGCTGAATTTTATGCAAAT
AAAAAACATAAGGGGTTGCCTCTAAATGACGCAAACT

See also

format_genome

list_genomes

read_bed

extract_seq

write_fasta

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

August 2007

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

get_genome_seq is part of the Biopieces framework.

http://www.biopieces.org

Clone this wiki locally