order_pairs
#summary Order records with paired-end sequence data.
[order_pairs] orders records with paired-end sequence data where the sequence names use
either the Illumina 1.5 scheme, where names end in /1 or /2, or the Illumina 1.8 scheme,
where names contain a space followed by 1 or 2 and then a colon (:). The records are
output in interleaved order, which is required by paired-end aware assembly programs.
[order_pairs] uses a hashing scheme for this and does not sort according to sequence name.
Using [order_pairs] is important after filtering steps where one record of a pair may have been
discarded. For each record, the value of the ORDER key denotes whether the record was paired
or an orphan, and you can use [grab] to filter the records accordingly.
SEQ_NAME: HWI-ST575:107:C0HE6ACXX:5:1101:1832:2218 1:N:0:TAGCTG
SEQ: GCTTTGACATAGTCGCTCCAGAATTGCCAGCTAGGGTTAGCTTGGCAACTGCAGCGACGTAATGTGCTGTGGCAGATCAATTTATCTGTTTTGAATCA
SEQ_LEN: 98
SCORES: ^P^PJ\Y`eea`e[daYdecggadgdXJIYVbdc`efg_cdedI^aXIO^abeb\eL_daQU^_V]``]UGTZ\^BBBBBBBBBBBBBBBBBBBBBBB
ORDER: paired
---
SEQ_NAME: HWI-ST575:107:C0HE6ACXX:5:1101:1832:2218 2:N:0:TAGCTG
SEQ: GGTTATCGATCTGGAAAAAGCAACTAAACCTAAAGCTAAACCACGTAGCGCCGGGTAAATGATTCAAAACAGATAAATTGATCTGCCACAGCACATTA
SEQ_LEN: 98
SCORES: ^VYPJQ`c^JJ[b[efg^dHJ`aa`adXd_ZXXbIIIY[af_H^aWHWPZ[`gggFFZ^bd_Z]Zb_]ba\^ZGY_`TZ``cc[bbR]]^aaXQ[bbb
ORDER: paired
---
SCORES: ffffcfffffded^eddddddbdcdeedcefecfefdffecabccBB`b`
SEQ: CCNAGGAGGAGNCAATAAGAGACCATTCGTATATGATCTCTCAGGAGAGC
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/1
ORDER: orphan 1
---
SCORES: BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
SEQ: NNNNNNNNGGNNCNANNANNNNGTNNNTNGNANNNNCNNANTTGNNNNNN
SEQ_LEN: 50
SEQ_NAME: ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/2
ORDER: orphan 2
---
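The hash-based pairing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Biopieces implementation: the function names `split_name` and `order_pairs` are hypothetical, records are modelled as plain dictionaries, and emitting orphans after all pairs is an assumption. The ORDER values (`paired`, `orphan 1`, `orphan 2`) are taken from the example records above.

```python
import re

# Illumina 1.8: "<name> 1:..." or "<name> 2:...". Illumina 1.5: "<name>/1" or "<name>/2".
_NEW = re.compile(r"^(\S+) ([12]):")
_OLD = re.compile(r"^(.+)/([12])$")


def split_name(seq_name):
    """Return (pair_key, mate) with mate 1 or 2; raise ValueError otherwise."""
    m = _NEW.match(seq_name) or _OLD.match(seq_name)
    if m is None:
        raise ValueError("unrecognized pair naming: %r" % seq_name)
    return m.group(1), int(m.group(2))


def order_pairs(records):
    """Yield records interleaved (mate 1, then mate 2), followed by orphans.

    Assumption: each sequence name occurs at most once in the input; a
    repeated name with the same mate number replaces the earlier record.
    """
    pending = {}  # pair_key -> (mate, record) still waiting for its mate
    for rec in records:
        key, mate = split_name(rec["SEQ_NAME"])
        if key in pending and pending[key][0] != mate:
            _, other = pending.pop(key)
            first, second = (rec, other) if mate == 1 else (other, rec)
            yield dict(first, ORDER="paired")
            yield dict(second, ORDER="paired")
        else:
            pending[key] = (mate, rec)
    # Whatever is left never found a mate.
    for mate, rec in pending.values():
        yield dict(rec, ORDER="orphan %d" % mate)
```

Because mates are matched through a dictionary lookup on the shared part of the name, the input never needs to be sorted, and memory use grows only with the number of records still waiting for a mate.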
... | order_pairs [options]
[-? | --help] # Print full usage description.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
If you have two paired-end sequence files that use the Illumina 1.5 or 1.8 naming scheme, you can order them with [order_pairs] simply by doing:
read_fastq -i test1.fq,test2.fq | order_pairs | write_fastq -o combi.fq -x
If you filter your sequences and discard one member of a pair, you can run the data through [order_pairs] to mark the unmatched records as orphans and then use [grab] to keep only the paired records:
read_fastq -i combi.fq | # Read in Illumina data
trim_seq | # Trim ends according to quality scores
grab -e "SEQ_LEN>30" | # Keep only entries with sequences longer than 30
order_pairs | # Make sure the pairs are in order
grab -p 'pair' -k ORDER | # Grab paired records
write_fastq -o combi_trimmed.fq -x # Write to new file
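As a sanity check on a file such as combi_trimmed.fq, the following Python sketch verifies that records strictly alternate between mate 1 and mate 2 of the same pair. It is not part of Biopieces; the function names `split_name` and `count_interleaved_pairs` are hypothetical, and the code assumes four-line FASTQ records (no wrapped sequence lines) whose header lines start with `@`.

```python
import re


def split_name(name):
    """Return (pair_key, mate) for Illumina 1.8 or 1.5 style names."""
    m = re.match(r"^(\S+) ([12]):", name) or re.match(r"^(.+)/([12])$", name)
    if m is None:
        raise ValueError("unrecognized pair naming: %r" % name)
    return m.group(1), int(m.group(2))


def count_interleaved_pairs(fastq_lines):
    """Return the number of pairs; raise ValueError if interleaving is broken."""
    # Every fourth line, starting with the first, is a header; drop the "@".
    headers = [line.rstrip("\n")[1:]
               for i, line in enumerate(fastq_lines) if i % 4 == 0]
    if len(headers) % 2:
        raise ValueError("odd number of records: %d" % len(headers))
    for first, second in zip(headers[0::2], headers[1::2]):
        (key1, mate1), (key2, mate2) = split_name(first), split_name(second)
        if key1 != key2 or (mate1, mate2) != (1, 2):
            raise ValueError("not a pair: %r / %r" % (first, second))
    return len(headers) // 2
```

The function accepts any iterable of lines, so an open file handle for combi_trimmed.fq can be passed directly; a ValueError points at the first pair of records that is out of order.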
[read_fastq]
[write_fastq]
[trim_seq]
[grab]
[assemble_seq_idba]
[assemble_seq_velvet]
Martin Asser Hansen - Copyright (C) - All rights reserved.
May 2011
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
[order_pairs] is part of the Biopieces framework.