count_vals
Given a comma-separated list of keys, count_vals counts, for each of these keys, the number of records with an identical value. Since the count depends on one hash per key, count_vals can easily exhaust memory. This is countered by caching the counts to disk every 5 million records; however, the disk caching may be slow.
... | count_vals [options]
[-? | --help] # Print full usage description.
[-k <string> | --keys=<string>] # Comma-separated list of keys.
[-I <file!> | --stream_in=<file!>] # Read input from stream file - Default=STDIN
[-O <file> | --stream_out=<file>] # Write output to stream file - Default=STDOUT
[-v | --verbose] # Verbose output.
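The disk caching mentioned in the description can be illustrated with a minimal Python sketch. This is not the Biopieces implementation: the function name `count_with_spill`, the use of SQLite as the on-disk store, and the table layout are all assumptions made for illustration. Counts are held in one in-memory hash per key and moved to disk at a fixed record interval, which bounds memory use at the price of slower disk writes.

```python
import sqlite3
from collections import Counter

# Assumed to mirror the 5 million record interval mentioned in the description.
FLUSH_EVERY = 5_000_000


def count_with_spill(records, keys, db_path=":memory:", flush_every=FLUSH_EVERY):
    """Count values per key, moving in-memory counts to SQLite at intervals.

    Hypothetical helper: records is an iterable of dicts, keys a list of
    key names. Returns the open SQLite connection holding the totals.
    """
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        "key TEXT, value TEXT, n INTEGER, PRIMARY KEY (key, value))"
    )
    # One in-memory hash per key, as in the description.
    pending = {key: Counter() for key in keys}

    def flush():
        for key, counter in pending.items():
            items = list(counter.items())
            # Make sure a row exists for each (key, value), then add to it.
            db.executemany(
                "INSERT OR IGNORE INTO counts VALUES (?, ?, 0)",
                [(key, value) for value, _ in items],
            )
            db.executemany(
                "UPDATE counts SET n = n + ? WHERE key = ? AND value = ?",
                [(n, key, value) for value, n in items],
            )
            counter.clear()  # free the memory held by this hash
        db.commit()

    for i, record in enumerate(records, start=1):
        for key in keys:
            if key in record:
                pending[key][record[key]] += 1
        if i % flush_every == 0:
            flush()
    flush()  # write whatever is left after the last full interval
    return db
```

Passing a file path as `db_path` keeps the totals on disk; the default `:memory:` is only convenient for trying the sketch out.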
Consider the following two-column table in the file test.tab:
Human H1
Human H2
Human H3
Dog D1
Dog D2
Mouse M1
To count the values of both columns we first read the table with read_tab:
read_tab -i test.tab | count_vals -k V0,V1
V0: Human
V1_COUNT: 1
V1: H1
V0_COUNT: 3
---
V0: Human
V1_COUNT: 1
V1: H2
V0_COUNT: 3
---
V0: Human
V1_COUNT: 1
V1: H3
V0_COUNT: 3
---
V0: Dog
V1_COUNT: 1
V1: D1
V0_COUNT: 2
---
V0: Dog
V1_COUNT: 1
V1: D2
V0_COUNT: 2
---
V0: Mouse
V1_COUNT: 1
V1: M1
V0_COUNT: 1
---
The result is that for each of the specified keys (V0 and V1) a new key with the suffix _COUNT is added, whose value is the global count of that key's value across all records. The result is better displayed after piping through write_tab:
read_tab -i test.tab | count_vals -k V0,V1 | write_tab -xck V0,V0_COUNT,V1,V1_COUNT
#V0 V0_COUNT V1 V1_COUNT
Human 3 H1 1
Human 3 H2 1
Human 3 H3 1
Dog 2 D1 1
Dog 2 D2 1
Mouse 1 M1 1
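The behaviour shown in the example can be approximated with a short Python sketch. It is an illustration only, not the Biopieces code: the function `count_vals` below and the representation of records as dictionaries are assumptions. The point it makes is that two passes are needed: the first builds one hash of value counts per key, the second adds a `<key>_COUNT` entry to every record.

```python
from collections import Counter


def count_vals(records, keys):
    """Return copies of the records with a <key>_COUNT entry per key.

    Hypothetical stand-in for the command-line tool; records is an
    iterable of dicts and keys a list of key names.
    """
    records = list(records)  # kept in memory because two passes are needed

    # Pass 1: one hash (Counter) per key, counting identical values.
    counts = {key: Counter() for key in keys}
    for record in records:
        for key in keys:
            if key in record:
                counts[key][record[key]] += 1

    # Pass 2: attach the global count of each value to every record.
    result = []
    for record in records:
        annotated = dict(record)
        for key in keys:
            if key in record:
                annotated[key + "_COUNT"] = counts[key][record[key]]
        result.append(annotated)
    return result


# The table from test.tab, with columns named V0 and V1 as read_tab does.
rows = [("Human", "H1"), ("Human", "H2"), ("Human", "H3"),
        ("Dog", "D1"), ("Dog", "D2"), ("Mouse", "M1")]
records = [{"V0": v0, "V1": v1} for v0, v1 in rows]

for rec in count_vals(records, ["V0", "V1"]):
    print(rec["V0"], rec["V0_COUNT"], rec["V1"], rec["V1_COUNT"], sep="\t")
```

Because every count is global, the first record cannot be written out until the last one has been read, which is why the records have to be held somewhere between the two passes.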
Martin Asser Hansen - Copyright (C) - All rights reserved.
August 2007
GNU General Public License version 2
http://www.gnu.org/copyleft/gpl.html
count_vals is part of the Biopieces framework.