MestreLion/topuniq
Sort input by count, printing totals and percentages.
Think of it as sort | uniq -c | sort -nr on steroids ;)
Sample output:
$ topuniq --min-count=100 examples/2-icon-types.txt
39564 100.0% Total (8)
25373 64.1% png
12128 30.7% svg
1290 3.3% xpm
685 1.7% icon
88 0.2% Other (4)
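For readers curious how totals, percentages and the "Other" line can be layered on top of sort | uniq -c | sort -nr, here is a minimal sketch. It is not topuniq's actual code: the function name count_with_perc and its single MIN argument are made up for illustration, and it only mimics the effect of --min-count.

```shell
# Hypothetical helper, NOT part of topuniq: count unique lines from stdin,
# then print a total line, each entry with its percentage, and one "Other"
# line grouping every entry whose count is below MIN (first argument).
count_with_perc() {
    min=${1:-0}
    LC_ALL=C sort | LC_ALL=C uniq -c | LC_ALL=C sort -nr |
    LC_ALL=C awk -v min="$min" '
        {
            count = $1
            sub(/^ *[0-9]+ /, "")       # strip the count column added by uniq -c
            counts[NR] = count; lines[NR] = $0
            total += count
        }
        END {
            if (NR == 0) exit           # empty input: print nothing
            printf "%7d %5.1f%% Total (%d)\n", total, 100, NR
            for (i = 1; i <= NR; i++) {
                if (counts[i] >= min)
                    printf "%7d %5.1f%% %s\n", counts[i], 100 * counts[i] / total, lines[i]
                else {
                    other += counts[i]; nother++
                }
            }
            if (nother)
                printf "%7d %5.1f%% Other (%d)\n", other, 100 * other / total, nother
        }'
}
```

Something like `count_with_perc 100 < examples/2-icon-types.txt` should then give a listing of the same shape as the sample above, although column widths and tie ordering may differ from what topuniq prints.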
A more complex example:
$ topuniq --min-perc=1 examples/3-shebangs.txt \
--total-last --label-total="TOTAL: %d unique shebangs" \
--sort-other --label-other="(other %d unique shebangs)"
330 26.7% #!/bin/sh
148 12.0% #!/usr/bin/perl -w
145 11.7% #!/usr/bin/python
143 11.6% #!/usr/bin/perl
117 9.5% (other 35 unique shebangs)
90 7.3% #! /bin/sh
80 6.5% #!/bin/bash
42 3.4% #!/usr/bin/env python
39 3.2% #! /usr/bin/perl -w
25 2.0% #! /usr/bin/python
22 1.8% #! /usr/bin/perl
21 1.7% #! /bin/bash
20 1.6% #!/bin/sh -e
14 1.1% #! /usr/bin/env perl
1236 100.0% TOTAL: 48 unique shebangs
As a drop-in replacement for cmd | sort | uniq -c | sort -nr
(using cat just to show pipeline usage, I know it is redundant)
$ cat examples/2-icon-types.txt | topuniq --no-total --no-perc
25373 png
12128 svg
1290 xpm
685 icon
53 theme
33 cache
1 txt
1 svgz
"Enhancing" previously saved data generated by cmd | sort | uniq -c | sort -nr
(yes, lame and cheesy option name, but I could not think of a better one...)
$ topuniq --enhance-uniq --top=10 examples/4-shebangs-preprocessed.txt
1236 100.0% Total (53)
328 26.5% #!/bin/sh
146 11.8% #!/usr/bin/perl -w
145 11.7% #!/usr/bin/python
141 11.4% #!/usr/bin/perl
90 7.3% #! /bin/sh
80 6.5% #!/bin/bash
42 3.4% #!/usr/bin/env python
39 3.2% #! /usr/bin/perl -w
25 2.0% #! /usr/bin/python
21 1.7% #! /usr/bin/perl
179 14.5% Other (43)
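To illustrate what working from already counted data involves, the sketch below reads uniq -c style lines (a count, a space, then the text) that are already sorted by count, keeps the first TOP entries and folds the rest into one line. The function name top_from_uniq is an assumption for this example, and it is only an approximation of what --enhance-uniq --top=NUM does, not the tool's implementation.

```shell
# Hypothetical helper, NOT topuniq's code: read "uniq -c"-style lines that
# are already sorted by count, keep the first TOP entries (first argument)
# and fold the remaining ones into a single "Other" line.
top_from_uniq() {
    top=${1:-10}
    LC_ALL=C awk -v top="$top" '
        {
            count = $1
            sub(/^ *[0-9]+ /, "")       # drop the leading count column
            total += count
            if (NR <= top) { counts[NR] = count; lines[NR] = $0 }
            else           { other += count; nother++ }
        }
        END {
            if (NR == 0) exit
            shown = (NR < top) ? NR : top
            printf "%d %.1f%% Total (%d)\n", total, 100, NR
            for (i = 1; i <= shown; i++)
                printf "%d %.1f%% %s\n", counts[i], 100 * counts[i] / total, lines[i]
            if (nother)
                printf "%d %.1f%% Other (%d)\n", other, 100 * other / total, nother
        }'
}
```

A file saved earlier with `cmd | sort | uniq -c | sort -nr > saved.txt` could then be fed in as `top_from_uniq 10 < saved.txt`.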
Performance comparisons with sort | uniq -c | sort -nr
(always using the 41277-line, 235KB examples/1-man-bash-words.txt; average of
3 runs of 'time' on a 100-iteration loop)
Reference:
sort | uniq -c | sort -nr: real 0m10.042s
Worst case scenario - no min-* or top-* filter
topuniq real 0m14.360s (gawk)
real 0m13.294s (mawk)
Direct comparison - no-op same output as reference
(no, I didn't optimize for that... yet ;)
topuniq --no-total --no-perc real 0m14.201s (gawk)
real 0m13.252s (mawk)
Best case scenario - using min-count > total
(not cheating with --stop-after-*, of course)
topuniq --min-count=3000 real 0m11.797s (gawk)
real 0m11.739s (mawk)
Not bad, not bad at all ;)
... and soon to be hugely improved.
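For anyone who wants to repeat this kind of measurement, a loop along these lines would do. It is only a sketch of the method described above, not the script used for the numbers: the helper names repeat and reference are invented here, and timing a shell function this way assumes bash, where time is a keyword.

```shell
# Hypothetical benchmark helpers, not shipped with topuniq.

# Run the given command ITER times, discarding its standard output, so the
# whole loop can be wrapped in a single "time".
repeat() {
    iter=$1; shift
    i=0
    while [ "$i" -lt "$iter" ]; do
        "$@" > /dev/null
        i=$((i + 1))
    done
}

# The reference pipeline, reading from the file given as first argument.
reference() { sort "$1" | uniq -c | sort -nr; }

# Usage (bash), with the file mentioned in the text:
#   time repeat 100 reference examples/1-man-bash-words.txt
#   time repeat 100 topuniq --no-total --no-perc examples/1-man-bash-words.txt
```

Running each line three times and averaging the "real" values gives figures comparable in method to the ones listed; absolute times will of course depend on the machine and locale.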
Wishlist:
(A.K.A. "Things I would add if I did not fear bloat and feature-creep")
- Optimize for some common option combinations:
--no-perc + no --min-perc : do not calculate percentages at all
--no-other: do not update *['other'] arrays
--no-total + --no-perc + no filters: skip awk entirely ;)
--enhance-uniq: skip last sort -nr
- Add position column, and --no-pos option. Very useful for long lists, but
nothing grep -n or pasting to an editor can't do. Position would be blank
for <other>, even if sorted.
- Add yet another percentage: position %, the same value --top-perc uses to
filter. It answers the question "what does being #15 in this list mean?".
Besides, I already calculate it, so why not show it? ;)
Options: --no-pos-perc / --show-pos-perc
- Add 2 more percentages: cumulative % of lines above (Up) and below (Down).
Useful for analyzing thresholds. --no-perc-up and --no-perc-down to disable
(maybe --no-percsum-*? Anyway, --show-* to enable if not default)
% down would of course also count lines filtered in <other> and not printed.
Example: 40: 145 0.4% 56.2% 43.4% bash
- This is starting to look like a spreadsheet, so I'd better add headers.
Optional (--show-header) and customizable, of course.
- Request this sweet, useful tool to be included in Debian?
So, do you think any of these features are worth having? Leave a comment, or ask
for them in "Issues". I would gladly add them in the next release!
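Until then, the position and cumulative percentage columns from the wishlist can be prototyped outside topuniq. The sketch below is purely illustrative: the name cumulative_perc and the exact column layout are assumptions modelled on the "40: 145 0.4% 56.2% 43.4% bash" example, and it expects input already in sort | uniq -c | sort -nr form.

```shell
# Hypothetical prototype of the wishlist columns, NOT a topuniq feature.
# Input: "uniq -c"-style lines sorted by count. Output columns:
#   position: count  own%  %-of-lines-above (Up)  %-of-lines-below (Down)  text
cumulative_perc() {
    LC_ALL=C awk '
        {
            count = $1
            sub(/^ *[0-9]+ /, "")       # drop the leading count column
            counts[NR] = count; lines[NR] = $0
            total += count
        }
        END {
            for (i = 1; i <= NR; i++) {
                below = total - above - counts[i]
                printf "%d: %d %.1f%% %.1f%% %.1f%% %s\n", i, counts[i],
                       100 * counts[i] / total,
                       100 * above / total,
                       100 * below / total,
                       lines[i]
                above += counts[i]      # running sum of counts printed so far
            }
        }'
}
```

For example, `sort file | uniq -c | sort -nr | cumulative_perc` prints one line per unique entry, and the three percentages on each line add up to 100 (within rounding).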
Full manual, from --help:
Usage: topuniq [options] [FILE...]
If FILE is not given, read from standard input. For numeric input
options, NUM must be a positive integer (digits only). All options
requiring arguments accept both --option=ARG or --option ARG forms
Options not listed here, if any, are appended to uniq -c
Options:
-h|--help show this page.
--min-count=NUM only print lines with count >= NUM
--min-perc=NUM only print lines with count percent >= NUM%
--top=NUM only print the top NUM lines. 0 = all lines
--top-perc=NUM only print the top NUM% lines
All lines with count less than any of the above options will be
grouped together as a single <other> line, printed last by default.
Setting a minimum higher than total, either count or percentage,
will effectively disable printing the <total> line. For --top-*
options, NUM does not include the total.
--stop-after-top=NUM stop reading after NUM top unique lines
--stop-after-count=NUM stop reading after lines with count < NUM
Unlike --min-* and --top-* options, the above will discard lines,
thus affecting <total>, <other> and all percentages.
--stop-after-top is equivalent to 'head -nNUM' after sort -nr and
before topuniq's enhancements. For both, NUM=0 disables the option
--precision=NUM use NUM decimal digits for the percentages,
default 1
--no-perc do not print percentages
--no-total do not print <total> line
--no-other do not print <other> line
--total-last print <total> line last instead of first
--sort-other print <other> line in sorted position
--label-total=LABEL use LABEL for <total> line, default "Total (%d)"
--label-other=LABEL use LABEL for <other> line, default "Other (%d)"
For the --label-* options, optional "%d" prints the number of unique
lines that <total> or <other> represents
--enhance-uniq consider input as already processed by
sort | uniq -c, skip it and process from there.
Useful for enhancing previously saved data
Environment Variables:
topuniq uses sort and uniq, so the user locale, particularly
LC_COLLATE, affects ordering and unique matching, as well as sort
performance. LC_NUMERIC affects decimal separator when printing
percentages. Use LC_ALL=C for the fastest and locale-independent
results.
Examples:
# Ignore lines with count < 10%, using case-insensitive uniq
topuniq --min-perc=10 --no-other --ignore-case
# Top 20, sorting <others> within the list, and customizing its label
topuniq --top=20 --sort-other --label-other="Other %d unique lines"
# Enhance an existing input, discarding lines with count < 10
topuniq my_uniq_data.txt --enhance-uniq --stop-after-count=10
# Behaves exactly like sort | uniq -c | sort -nr
topuniq --no-total --no-perc
For input data, some examples you may pipe directly to topuniq:
# Words in Bash's manual page
man bash | tr '[:punct:][:blank:]' '\n' | sed '/^$/d'
# Icon types in /usr/share/icons
find /usr/share/icons -type f -name "*.*" | awk -F. '{print $NF}'
# Shebangs from /usr/bin scripts
for f in /usr/bin/*; do [ -f "$f" ] && head -n1 "$f" | grep '^#!'; done
Copyright (C) 2012 Rodrigo Silva (MestreLion) <[email protected]>
License: GPLv3 or later. See <http://www.gnu.org/licenses/gpl.html>