grab
grab selects records from the stream by matching keys or values using a pattern, a regular expression, or a numerical evaluation. grab is the Biopieces equivalent of Unix grep, but it is much more versatile.
Using the -v switch outputs the number of records grabbed and missed to STDERR:
Records grabbed: 11233
Records missed: 23
... | grab [options]
[-? | --help] # Print full usage description.
[-p <string> | --patterns=<string>] # Grab using comma separated list of patterns.
[-P <file!> | --patterns_in=<file!>] # Grab using patterns from a file - one pattern per line.
[-r <string> | --regex=<string>] # Grab using Perl regex.
[-e <string> | --eval=<string>] # Grab 'key,operator,value'. Operators: '>,<,>=,<=,=,!=,eq,ne'.
[-E <file!> | --exact_in=<file!>] # Grab using exact expressions from a file - one expression per line.
[-i | --invert] # Display non-matching results.
[-c | --case_insensitive] # Turn regex matching case insensitive.
[-k <string> | --keys=<string>] # Comma separated list of keys to grab the value for.
[-K | --keys_only] # Only grab for keys.
[-V | --vals_only] # Only grab for vals.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
To grab all records in the stream that mention the pattern 'human' anywhere, just pipe the data stream through grab like this:
... | grab -p human
This will search for the pattern 'human' in all keys and all values. The -p switch takes a comma separated list of patterns, so in order to match multiple patterns do:
... | grab -p human,mouse
It is also possible to use the -P switch instead of -p. -P is used to read a file with one pattern per line:
... | grab -P patterns.txt
If you want the opposite result, i.e. to find all records that do not match the patterns, add the -i switch, which works not only with the -p and -P switches, but also with -r and -e:
... | grab -p human -i
If you want to search the record keys only, e.g. to find all records containing the key SEQ, you can add the -K switch. This prevents matching of SEQ in any record value; since SEQ is a not uncommon peptide sequence, you could otherwise get unwanted records. It also gives an increase in speed, since only the keys are searched:
... | grab -p SEQ -K
However, if you are interested in finding the peptide sequence SEQ and not the SEQ key, just add the -V switch instead:
... | grab -p SEQ -V
Also, if you want to grab for certain key/value pairs, you can use the -k switch to supply a comma separated list of keys whose values will then be searched. This is handy if your records contain large genomic sequences and you don't want to search the entire sequence for e.g. the organism name. It is much faster to tell grab which keys to search the value for:
... | grab -p human -k SEQ_NAME
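The effect of the -K, -V and -k switches on what gets searched can be sketched in a few lines of Python. This is only an illustration of the matching scope described above, not the Biopieces implementation: the function name `record_matches`, its parameters, the dictionary representation of a record, and the use of plain substring matching are all assumptions made for the example.

```python
def record_matches(record, patterns, keys_only=False, vals_only=False, keys=None):
    """Return True if any pattern occurs as a substring of a searched field.

    Hypothetical helper: 'record' is a dict standing in for one stream record.
    """
    if keys is not None:
        # Like -k: search only the values stored under the named keys.
        fields = [str(record[k]) for k in keys if k in record]
    elif keys_only:
        # Like -K: search the keys and ignore the values.
        fields = list(record.keys())
    elif vals_only:
        # Like -V: search the values and ignore the keys.
        fields = [str(v) for v in record.values()]
    else:
        # Default: search both keys and values.
        fields = list(record.keys()) + [str(v) for v in record.values()]
    return any(p in field for p in patterns for field in fields)


records = [
    {"ID": "a1", "SEQ": "ATCGATCG"},     # has the key SEQ
    {"ID": "p7", "PEPTIDE": "MKSEQLL"},  # has SEQ inside a value
]

by_key = [r["ID"] for r in records if record_matches(r, ["SEQ"], keys_only=True)]
by_val = [r["ID"] for r in records if record_matches(r, ["SEQ"], vals_only=True)]
by_id = [r["ID"] for r in records if record_matches(r, ["p7"], keys=["ID"])]
print(by_key, by_val, by_id)
```

Restricting the search to keys, or to the values of a few named keys, shrinks the list of fields that has to be scanned, which is where the speed gain mentioned above would come from in this sketch.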
It is also possible to invoke flexible matching using regex (regular expressions) instead of simple pattern matching. In grab the regex engine is Perl based, and allows use of different types of wildcards, alternatives, etc. If you want to grab records with the sequence ATCG or GCTA you can do this:
... | grab -r 'ATCG|GCTA'
Or if you want to find sequences beginning with ATCG:
... | grab -r '^ATCG'
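To see how the two regexes above behave, here is a small Python sketch. Python's `re` module is not the Perl engine that grab uses, but alternation (`|`) and the start anchor (`^`) work the same way in both for these simple cases. The helper `regex_matches` and its `case_insensitive` parameter (mirroring the -c switch) are hypothetical names introduced for the illustration, and the sketch assumes the regex is tried against every key and every value.

```python
import re


def regex_matches(record, regex, case_insensitive=False):
    """Return True if the regex matches any key or value of the record."""
    flags = re.IGNORECASE if case_insensitive else 0
    compiled = re.compile(regex, flags)
    fields = list(record.keys()) + [str(v) for v in record.values()]
    # search() finds a match anywhere in the field unless the regex is anchored.
    return any(compiled.search(field) for field in fields)


starts_with = {"SEQ": "ATCGGG"}
ends_with = {"SEQ": "GGATCG"}
lower_case = {"SEQ": "ttgcta"}

print(regex_matches(starts_with, "^ATCG"))     # anchored, sequence begins with ATCG
print(regex_matches(ends_with, "^ATCG"))       # anchored, ATCG is not at the start
print(regex_matches(ends_with, "ATCG|GCTA"))   # alternation, ATCG found anywhere
print(regex_matches(lower_case, "ATCG|GCTA", case_insensitive=True))
```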
You can also use grab to locate records that fulfill a numerical property using the -e switch, which takes an expression in three parts. The first part is the key that holds the value we want to evaluate, the second part holds one of these eight operators:
- Greater than: >
- Greater than or equal to: >=
- Less than: <
- Less than or equal to: <=
- Equal to: =
- Not equal to: !=
- String wise equal to: eq
- String wise not equal to: ne
Finally comes the value used in the evaluation (a number, or a string for eq and ne). So to grab all records with a sequence length greater than 30:
... | grab -e 'SEQ_LEN > 30'
If you want to locate all records containing the pattern 'human' and where the sequence length is greater than 30, you do this by running the stream through grab twice:
... | grab -p 'human' | grab -e 'SEQ_LEN > 30'
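The three-part expression and the two-step pipeline can be sketched in Python as follows. This is an illustration under stated assumptions, not how grab is implemented: the function `eval_matches`, the whitespace-separated parsing of the expression, and the choice to skip records that lack the key or hold a non-numeric value are all hypothetical.

```python
import operator

# The six numeric operators and the two string-wise operators listed above.
NUMERIC_OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq, "!=": operator.ne,
}
STRING_OPS = {"eq": operator.eq, "ne": operator.ne}


def eval_matches(record, expression):
    """Evaluate a 'KEY OPERATOR VALUE' expression against one record."""
    key, op, value = expression.split(None, 2)
    if key not in record:
        return False  # assumption: records without the key are not grabbed
    if op in NUMERIC_OPS:
        try:
            return NUMERIC_OPS[op](float(record[key]), float(value))
        except ValueError:
            return False  # assumption: non-numeric values never match
    if op in STRING_OPS:
        return STRING_OPS[op](str(record[key]), value)
    raise ValueError(f"unknown operator: {op}")


records = [
    {"SEQ_NAME": "human_1", "SEQ_LEN": 45},
    {"SEQ_NAME": "human_2", "SEQ_LEN": 12},
    {"SEQ_NAME": "mouse_1", "SEQ_LEN": 80},
]

# First pass: keep records mentioning 'human' in any value (like grab -p human).
humans = [r for r in records if any("human" in str(v) for v in r.values())]
# Second pass: keep records with SEQ_LEN > 30 (like grab -e 'SEQ_LEN > 30').
long_humans = [r["SEQ_NAME"] for r in humans if eval_matches(r, "SEQ_LEN > 30")]
print(long_humans)
```

Chaining the two passes means a record must satisfy both conditions, which is the same effect as piping the stream through grab twice.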
Finally, it is possible to do fast matching of expressions from a file using the -E switch. Each of these expressions has to be matched exactly over the entire length, which is useful if you e.g. have a file with accession numbers that you want to locate in the stream:
... | grab -E acc_no.txt
Using -E is much faster than using -P, because with -E the expressions have to be complete matches, whereas -P looks for subpatterns.
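One plausible way to picture the difference is that complete matches can be answered with a single hash lookup per field, while subpattern matching has to test every pattern against every field. The Python sketch below illustrates that idea only. Whether grab actually uses a hash for -E is an assumption, and the names `grab_exact` and `grab_subpattern` are hypothetical.

```python
def fields_of(record):
    """All keys and values of a record as strings."""
    return list(record.keys()) + [str(v) for v in record.values()]


def grab_exact(records, expressions):
    """Keep records where some field equals an expression over its entire length."""
    wanted = set(expressions)  # one hash lookup per field
    return [r for r in records if any(f in wanted for f in fields_of(r))]


def grab_subpattern(records, patterns):
    """Keep records where some field contains a pattern as a substring."""
    # Every pattern is scanned against every field.
    return [r for r in records
            if any(p in f for p in patterns for f in fields_of(r))]


records = [
    {"ACC": "AB0001"},
    {"ACC": "AB00012"},
    {"ACC": "XY9"},
]

exact_hits = [r["ACC"] for r in grab_exact(records, ["AB0001"])]
sub_hits = [r["ACC"] for r in grab_subpattern(records, ["AB0001"])]
print(exact_hits, sub_hits)
```

Note that the two approaches also differ in what they return: the exact version grabs only AB0001, while the subpattern version also grabs AB00012 because it contains AB0001.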
NB! To get the best speed performance, use the most restrictive grab first.
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
grab is part of the Biopieces framework.