add_ident
For records that lack an identifier, or whose identifier is not unique, add_ident can add a new identifier or replace the existing one.
... | add_ident [options]
[-? | --help] # Print full usage description.
[-k <string> | --key=<string>] # Identifier key - Default=ID
[-p <string> | --prefix=<string>] # Identifier prefix - Default=ID
[-o <uint> | --offset=<uint>] # Identifier offset - Default=0
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
Consider the following table:
Organism Sequence Count
Human ATACGTCAG 23524
Dog AGCATGAC 2442
Mouse GACTG 234
Cat AAATGCA 2342
We use read_tab to get the Sequence column, and then add_ident to generate a unique identifier for each record:
read_tab -i test.tab -s 1 -c 1 -k SEQ | add_ident
ID: ID00000000
SEQ: ATACGTCAG
---
ID: ID00000001
SEQ: AGCATGAC
---
ID: ID00000002
SEQ: GACTG
---
ID: ID00000003
SEQ: AAATGCA
---
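Judging from the output above, each identifier appears to be the prefix followed by a counter zero-padded to eight digits. The following Python snippet is only a sketch of that formatting, not the add_ident implementation; the function name format_ident and the width parameter are hypothetical, and the width of 8 is inferred from the sample output.

```python
def format_ident(number: int, prefix: str = "ID", width: int = 8) -> str:
    """Return the prefix followed by `number` zero-padded to `width` digits.

    The width of 8 is an assumption read off the sample output, not a
    documented add_ident setting.
    """
    return f"{prefix}{number:0{width}d}"


# Print the identifiers for the four records in the example table.
for n in range(4):
    print(format_ident(n))
```

With the defaults this prints ID00000000 through ID00000003, matching the identifiers shown above.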
However, if you want to output the sequences with write_fasta, the default key ID can be replaced with the required SEQ_NAME using the -k switch:
read_tab -i test.tab -s 1 -c 1 -k SEQ | add_ident -k SEQ_NAME
SEQ: ATACGTCAG
SEQ_NAME: ID00000000
---
SEQ: AGCATGAC
SEQ_NAME: ID00000001
---
SEQ: GACTG
SEQ_NAME: ID00000002
---
SEQ: AAATGCA
SEQ_NAME: ID00000003
---
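To illustrate why the SEQ_NAME key matters, here is a minimal Python sketch of how a record holding SEQ_NAME and SEQ maps onto a FASTA entry. It is an illustration of the FASTA format only and makes no claim about how write_fasta is implemented; the function name fasta_entry is hypothetical, and no line wrapping is applied.

```python
def fasta_entry(record: dict) -> str:
    """Build a FASTA entry: a '>' header line followed by the sequence.

    Raises KeyError if the record lacks SEQ_NAME or SEQ, which mirrors
    the need to set the identifier key to SEQ_NAME before FASTA output.
    """
    return ">" + record["SEQ_NAME"] + "\n" + record["SEQ"]


record = {"SEQ": "ATACGTCAG", "SEQ_NAME": "ID00000000"}
print(fasta_entry(record))
```

A record that only carries the default ID key would fail in this sketch, because there is no SEQ_NAME to use as the header.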
If you want to change the format of the identifier, the prefix can be changed with the -p switch:
read_tab -i test.tab -s 1 -c 1 -k SEQ | add_ident -k SEQ_NAME -p ID_
SEQ: ATACGTCAG
SEQ_NAME: ID_00000000
---
SEQ: AGCATGAC
SEQ_NAME: ID_00000001
---
SEQ: GACTG
SEQ_NAME: ID_00000002
---
SEQ: AAATGCA
SEQ_NAME: ID_00000003
---
Finally, if you also want to change the offset of the identifier from the default 0, use the -o switch:
read_tab -i test.tab -s 1 -c 1 -k SEQ | add_ident -k SEQ_NAME -p ID_ -o 5
SEQ: ATACGTCAG
SEQ_NAME: ID_00000005
---
SEQ: AGCATGAC
SEQ_NAME: ID_00000006
---
SEQ: GACTG
SEQ_NAME: ID_00000007
---
SEQ: AAATGCA
SEQ_NAME: ID_00000008
---
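The combined effect of the key, prefix and offset options can be sketched in Python as follows. This is a rough model under stated assumptions, not the Biopieces code: records are represented as plain dictionaries, the function is a hypothetical stand-in that merely shares the tool's name, and the eight-digit padding is inferred from the output above.

```python
from typing import Dict, Iterable, Iterator


def add_ident(
    records: Iterable[Dict[str, str]],
    key: str = "ID",
    prefix: str = "ID",
    offset: int = 0,
    width: int = 8,  # assumption: padding width inferred from sample output
) -> Iterator[Dict[str, str]]:
    """Yield copies of the records with a sequential identifier under `key`.

    Numbering starts at `offset`; any existing value under `key` is replaced.
    """
    for number, record in enumerate(records, start=offset):
        updated = dict(record)  # copy so the input records stay untouched
        updated[key] = f"{prefix}{number:0{width}d}"
        yield updated


records = [{"SEQ": "ATACGTCAG"}, {"SEQ": "AGCATGAC"}, {"SEQ": "GACTG"}, {"SEQ": "AAATGCA"}]
for rec in add_ident(records, key="SEQ_NAME", prefix="ID_", offset=5):
    print(rec["SEQ"], rec["SEQ_NAME"])
```

With key SEQ_NAME, prefix ID_ and offset 5, the four records receive ID_00000005 through ID_00000008, as in the example above.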
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
add_ident is part of the Biopieces framework.