Martin Asser Hansen edited this page Oct 2, 2015 · 7 revisions

Biopiece: grab

Description

grab selects records from the stream by matching keys or values using a pattern, a regular expression, or a numerical evaluation. grab is the Biopieces equivalent of Unix grep, but it is considerably more versatile.

With the -v switch, grab writes the number of records grabbed and missed to STDERR:

Records grabbed: 11233
Records missed: 23

Usage

... | grab [options]

Options

[-?          | --help]                 #  Print full usage description.
[-p <string> | --patterns=<string>]    #  Grab using comma separated list of patterns.
[-P <file!>  | --patterns_in=<file!>]  #  Grab using patterns from a file - one pattern per line.
[-r <string> | --regex=<string>]       #  Grab using Perl regex.
[-e <string> | --eval=<string>]        #  Grab 'key,operator,value'. Operators: '>,<,>=,<=,=,!=,eq,ne'.
[-E <file!>  | --exact_in=<file!>]     #  Grab using exact expressions from a file - one expression per line.
[-i          | --invert]               #  Display non-matching results.
[-c          | --case_insensitive]     #  Turn regex matching case insensitive.
[-k <string> | --keys=<string>]        #  Comma separated list of keys to grab the value for.
[-K          | --keys_only]            #  Only grab for keys.
[-V          | --vals_only]            #  Only grab for vals.
[-I <file!>  | --stream_in=<file!>]    #  Read input from stream file  -  Default=STDIN
[-O <file>   | --stream_out=<file>]    #  Write output to stream file  -  Default=STDOUT
[-v          | --verbose]              #  Verbose output.

Examples

To grab all records in the stream that mention the pattern 'human', pipe the data stream through grab like this:

... | grab -p human

This will search for the pattern 'human' in all keys and all values. The -p switch takes a comma separated list of patterns, so in order to match multiple patterns do:

... | grab -p human,mouse
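The matching rule described above can be modelled in a few lines of Python. This is only a sketch of the semantics, not grab's actual implementation: records are represented as plain dicts, the function name grab_patterns is hypothetical, and case-sensitive substring matching is an assumption.

```python
def grab_patterns(records, patterns):
    """Yield records where any pattern is a substring of any key or value."""
    for record in records:
        # Search both the keys and the values, as -p does by default.
        fields = list(record.keys()) + [str(value) for value in record.values()]
        if any(pattern in field for pattern in patterns for field in fields):
            yield record


# Made-up sample records for illustration only.
records = [
    {"SEQ_NAME": "human_chr1", "SEQ": "ATCGATCG"},
    {"SEQ_NAME": "mouse_chr2", "SEQ": "GCTAGCTA"},
    {"SEQ_NAME": "yeast_chr3", "SEQ": "TTTTAAAA"},
]

# The -p argument is a comma separated list, so split it into patterns.
patterns = "human,mouse".split(",")
hits = list(grab_patterns(records, patterns))
print([record["SEQ_NAME"] for record in hits])
```

A record is kept as soon as one pattern matches one field, so a list of patterns behaves like a logical OR.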

It is also possible to use the -P switch instead of -p. -P is used to read a file with one pattern per line:

... | grab -P patterns.txt

If you want the opposite result, i.e. all records that do not match the patterns, add the -i switch. It works not only with the -p and -P switches, but also with -r and -e:

... | grab -p human -i
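Inverting a match simply swaps which records are kept and which are dropped. The Python sketch below illustrates this together with the grabbed/missed counts that -v reports. The function name grab and the dict representation of records are hypothetical, and it is an assumption that with -i the "grabbed" count refers to the records actually output.

```python
import sys


def grab(records, pattern, invert=False, verbose=False):
    """Return matching records, or non-matching records if invert is set."""
    grabbed, missed = [], []
    for record in records:
        fields = list(record) + list(record.values())
        hit = any(pattern in str(field) for field in fields)
        # A record is kept when it matches, or when it fails to match under invert.
        (grabbed if hit != invert else missed).append(record)
    if verbose:
        # Statistics go to STDERR so they do not mix with the record stream.
        print(f"Records grabbed: {len(grabbed)}", file=sys.stderr)
        print(f"Records missed: {len(missed)}", file=sys.stderr)
    return grabbed


# Made-up sample records for illustration only.
records = [
    {"SEQ_NAME": "human_chr1"},
    {"SEQ_NAME": "mouse_chr2"},
    {"SEQ_NAME": "yeast_chr3"},
]

normal = grab(records, "human")
inverted = grab(records, "human", invert=True, verbose=True)
print([r["SEQ_NAME"] for r in inverted])
```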

If you want to search the record keys only, e.g. to find all records containing the key SEQ, you can add the -K switch. This prevents matching of SEQ in any record value; since SEQ is a not uncommon peptide sequence, you could otherwise get unwanted records. It also increases speed, since only the keys are searched:

... | grab -p SEQ -K

However, if you are interested in finding the peptide sequence SEQ and not the SEQ key, just add the -V switch instead:

... | grab -p SEQ -V
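The difference between -K and -V can be seen in a small Python model. This is a sketch under assumptions, not grab's code: the function matches and its keys_only and vals_only parameters are hypothetical names mirroring the switches, and the sample records are made up.

```python
def matches(record, pattern, keys_only=False, vals_only=False):
    """Report whether pattern occurs in the keys, the values, or either."""
    hit_key = any(pattern in key for key in record)
    hit_val = any(pattern in str(value) for value in record.values())
    if keys_only:
        return hit_key
    if vals_only:
        return hit_val
    return hit_key or hit_val


# A DNA record that has a SEQ key, and a peptide record whose value contains 'SEQ'.
dna = {"SEQ_NAME": "gene1", "SEQ": "ATCGATCG"}
peptide = {"NAME": "pep1", "PEPTIDE": "MKSEQLLA"}

for label, record in (("dna", dna), ("peptide", peptide)):
    print(
        label,
        matches(record, "SEQ"),
        matches(record, "SEQ", keys_only=True),
        matches(record, "SEQ", vals_only=True),
    )
```

Without either switch both records match, which is exactly the unwanted overlap that -K and -V are meant to avoid.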

Also, if you want to grab for certain key/value pairs, you can use the -k switch to supply a comma separated list of keys whose values will then be searched. This is handy if your records contain large genomic sequences and you don't want to search the entire sequence for e.g. the organism name. It is much faster to tell grab which keys to search the values of:

... | grab -p human -k SEQ_NAME
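Restricting the search to named keys both saves work and avoids hits in unrelated fields. The following Python sketch models that behaviour; grab_in_keys is a hypothetical name, the records are made up, and skipping records that lack the listed keys is an assumption.

```python
def grab_in_keys(records, pattern, keys):
    """Yield records where pattern occurs in the value of one of the given keys."""
    for record in records:
        # Only the listed keys are inspected; other values (e.g. a long SEQ) are never scanned.
        if any(pattern in str(record[key]) for key in keys if key in record):
            yield record


# Made-up sample records for illustration only.
records = [
    {"SEQ_NAME": "human_gene1", "DESCRIPTION": "kinase", "SEQ": "ATCG"},
    {"SEQ_NAME": "mouse_gene7", "DESCRIPTION": "ortholog of human gene1", "SEQ": "GCTA"},
]

name_only = list(grab_in_keys(records, "human", ["SEQ_NAME"]))
name_or_desc = list(grab_in_keys(records, "human", ["SEQ_NAME", "DESCRIPTION"]))
print(len(name_only), len(name_or_desc))
```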

It is also possible to use flexible matching with regex (regular expressions) instead of simple pattern matching. In grab the regex engine is Perl based and allows the use of different types of wildcards, alternatives, etc. If you want to grab records with the sequence ATCG or GCTA you can do this:

... | grab -r 'ATCG|GCTA'

Or if you want to find sequences beginning with ATCG:

... | grab -r '^ATCG'
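The two regexes above can be tried out with Python's re module, whose syntax agrees with Perl's for simple alternation and anchors like these. This is a sketch only: grab_regex is a hypothetical name, the records are made up, and it is an assumption that the regex is applied to each key and value separately, so that ^ anchors at the start of a field.

```python
import re


def grab_regex(records, regex, case_insensitive=False):
    """Yield records where the regex matches any key or value."""
    flags = re.IGNORECASE if case_insensitive else 0  # mirrors the -c switch
    compiled = re.compile(regex, flags)
    for record in records:
        fields = list(record) + [str(value) for value in record.values()]
        if any(compiled.search(field) for field in fields):
            yield record


# Made-up sample records for illustration only.
records = [
    {"SEQ_NAME": "a", "SEQ": "ATCGTTTT"},
    {"SEQ_NAME": "b", "SEQ": "TTGCTATT"},
    {"SEQ_NAME": "c", "SEQ": "TTTTATCG"},
    {"SEQ_NAME": "d", "SEQ": "CCCCCCCC"},
]


def names(hits):
    return [record["SEQ_NAME"] for record in hits]


either = names(grab_regex(records, "ATCG|GCTA"))
anchored = names(grab_regex(records, "^ATCG"))
lowercase = names(grab_regex(records, "atcg", case_insensitive=True))
print(either, anchored, lowercase)
```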

You can also use grab to locate records that fulfill a numerical property using the -e switch, which takes an expression in three parts. The first part is the key that holds the value we want to evaluate, and the second part is one of these eight operators:

  1. Greater than: >
  2. Greater than or equal to: >=
  3. Less than: <
  4. Less than or equal to: <=
  5. Equal to: =
  6. Not equal to: !=
  7. String wise equal to: eq
  8. String wise not equal to: ne

The third part is the value used in the evaluation. So to grab all records with a sequence length greater than 30:

... | grab -e 'SEQ_LEN > 30'
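The three-part expression can be modelled in Python with a lookup table of operators. This is a sketch, not grab's implementation: grab_eval is a hypothetical name, splitting the expression on whitespace follows the example above, and skipping records that lack the key is an assumption.

```python
import operator

# The six numeric operators compare numbers; eq and ne compare strings.
NUMERIC_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
}
STRING_OPS = {"eq": operator.eq, "ne": operator.ne}


def grab_eval(records, expression):
    """Yield records for which 'key operator value' evaluates to true."""
    key, op, value = expression.split(None, 2)
    for record in records:
        if key not in record:
            continue  # assumption: records without the key are never grabbed
        if op in NUMERIC_OPS:
            keep = NUMERIC_OPS[op](float(record[key]), float(value))
        elif op in STRING_OPS:
            keep = STRING_OPS[op](str(record[key]), value)
        else:
            raise ValueError(f"unknown operator: {op}")
        if keep:
            yield record


# Made-up sample records for illustration only.
records = [
    {"SEQ_NAME": "short", "SEQ_LEN": 25},
    {"SEQ_NAME": "edge", "SEQ_LEN": 30},
    {"SEQ_NAME": "long", "SEQ_LEN": 45},
]

longer = [r["SEQ_NAME"] for r in grab_eval(records, "SEQ_LEN > 30")]
at_least = [r["SEQ_NAME"] for r in grab_eval(records, "SEQ_LEN >= 30")]
not_edge = [r["SEQ_NAME"] for r in grab_eval(records, "SEQ_NAME ne edge")]
print(longer, at_least, not_edge)
```

Note the boundary case: a record with SEQ_LEN equal to 30 is excluded by > but included by >=.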

If you want to locate all records containing the pattern 'human' and where the sequence length is greater than 30, you do this by running the stream through grab twice:

... | grab -p 'human' | grab -e 'SEQ_LEN > 30'

Finally, it is possible to do fast matching of expressions from a file using the -E switch. Each of these expressions has to match exactly over its entire length, which is useful if you e.g. have a file with accession numbers that you want to locate in the stream:

... | grab -E acc_no.txt

Using -E is much faster than using -P, because with -E each expression has to match completely, whereas -P looks for subpatterns.
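One plausible reason exact matching can be faster is that whole-string comparison allows a hash lookup per field, whereas substring matching has to scan every field for every pattern. The Python sketch below contrasts the two strategies. It is an illustration under that assumption and does not claim to reflect how grab is implemented; the function names and accession numbers are made up.

```python
def grab_exact(records, expressions):
    """Keep records where a key or value equals one expression in full."""
    wanted = set(expressions)  # built once; each lookup is then a single hash probe
    for record in records:
        fields = list(record) + [str(value) for value in record.values()]
        if any(field in wanted for field in fields):
            yield record


def grab_substring(records, patterns):
    """Keep records where a key or value contains one pattern as a substring."""
    for record in records:
        fields = list(record) + [str(value) for value in record.values()]
        # Every pattern is scanned against every field.
        if any(pattern in field for pattern in patterns for field in fields):
            yield record


# Made-up accession numbers for illustration only.
records = [
    {"ACC": "AB000123"},
    {"ACC": "AB0001234"},
    {"ACC": "XY999999"},
]
accessions = ["AB000123"]

exact_hits = [r["ACC"] for r in grab_exact(records, accessions)]
substring_hits = [r["ACC"] for r in grab_substring(records, accessions)]
print(exact_hits, substring_hits)
```

Besides speed, the two modes differ in result: the longer accession is only caught by the substring search.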

NB! To get the best speed performance, use the most restrictive grab first.
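The effect of ordering can be illustrated by counting how many records each stage of a two-step pipeline has to inspect. The Python sketch below uses generators in place of piped grab commands. All names and the sample data are hypothetical, and the counts only model the number of records examined, not real run time.

```python
def stage(records, predicate, seen, name):
    """Filter records with predicate, counting how many this stage inspects."""
    for record in records:
        seen[name] = seen.get(name, 0) + 1
        if predicate(record):
            yield record


# 100 made-up records: even numbers are 'human', odd numbers are 'mouse'.
records = [
    {"SEQ_NAME": f"{'human' if i % 2 == 0 else 'mouse'}_{i}", "SEQ_LEN": i}
    for i in range(100)
]


def is_human(record):
    return "human" in record["SEQ_NAME"]  # keeps 50 of 100 records


def is_long(record):
    return record["SEQ_LEN"] > 89  # keeps 10 of 100 records


def run(first, second):
    """Chain two stages; return (records output, total records inspected)."""
    seen = {}
    output = list(stage(stage(records, first, seen, "first"), second, seen, "second"))
    return len(output), seen["first"] + seen["second"]


print(run(is_human, is_long))  # less restrictive filter first
print(run(is_long, is_human))  # most restrictive filter first
```

Both orders output the same records, but putting the most restrictive filter first leaves far fewer records for the second stage to inspect.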

See also

Author

Martin Asser Hansen - Copyright (C) - All rights reserved.

[email protected]

August 2007

License

GNU General Public License version 2

http://www.gnu.org/copyleft/gpl.html

Help

grab is part of the Biopieces framework.

http://www.biopieces.org#
