Library routines
================
Some of these functions use the notion of an array's "height". This
abbreviates the following notion: it is the largest number n >=1 such that all
indices from 1...n are "in" the array. If even 1 is not "in" the array, then
the array's "height" is 0.
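As an illustration of this definition only, here is a minimal sketch of how an array's "height" could be computed. The function name height is hypothetical; this document does not say whether the library exposes such a helper.

```awk
# Hypothetical helper, not part of the documented interface:
# returns the largest n >= 1 such that 1..n are all keys of A, else 0.
function height(A,    n) {
    n = 0
    while ((n + 1) in A)
        n++
    return n
}

BEGIN {
    A[1] = "a"; A[2] = "b"; A[4] = "d"    # index 3 is missing
    print height(A)                        # prints 2
    print height(Empty)                    # prints 0
}
```

Note that the "in" test does not create the missing element, so measuring the height leaves the array unchanged.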
Some of these functions are adapted from
http://www.cs.bell-labs.com/cm/cs/who/bwk/awkcode.txt. Some others from
Aleksey Cheusov's Runawk package at http://runawk.sourceforge.net/.
Errors and debugging
--------------------
die(msg)
Prints msg to stderr and tries to exit with status 1.
Note that if you have an END block, and die was called before reaching the
END block, the END block will be run before the awk program exits. If you don't
want the END block to run when you're dying prematurely, insert the following
line at the start of your END block:
END {
if (EXITCODE) exit EXITCODE
...
}
assert(expr, msg)
If expr evaluates to false (0 or ""), then die using msg.
dump(A, [prefix])
Print out "[key]=value" lines for all keys present in A.
idump(A, [[start], stop], [prefix])
Print out "[key]=value" lines for all keys between start...stop in A.
Type-checking
-------------
ismissing(u)
Returns true if its argument is uninitialized.
isnull(s)
Returns true if its argument is the empty string "".
Note that isnull(s) with an uninitialized argument s will return false;
however isnull accepts an optional second argument, and isnull(s, 1) will
return true when s is either "" or uninitialized.
isnum(n)
Returns true if its argument is a number, expressed either numerically
(123) or as a string ("123").
Awk will also treat values like "xyz" and " 1" and "2xyz" as convertible to
numbers (to 0, 1, and 2, respectively); however, isnum(...) will return false
for those values. (This is deliberate.)
Note that isnum(n) with an uninitialized argument n will return false;
however isnum accepts an optional second argument, and isnum(n, 1) will return
true when n is either a number or uninitialized.
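To show the difference between this strict test and awk's lenient conversion, here is a rough sketch. The name strict_isnum and its regular expression are assumptions made for illustration; the library's own notion of "a number" may accept or reject other spellings.

```awk
# Hypothetical stand-in for isnum(); the pattern is an assumption,
# not the actual definition used by the library.
function strict_isnum(n) {
    return n ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/
}

BEGIN {
    print strict_isnum(123)      # prints 1
    print strict_isnum("123")    # prints 1
    print strict_isnum(" 1")     # prints 0
    print strict_isnum("2xyz")   # prints 0
    print ("2xyz" + 0)           # prints 2, the lenient conversion
}
```

Because the pattern is anchored at both ends, leading space and trailing text make the test fail even though awk would still convert such values.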
iszero(n)
Returns true if its argument is 0 or "0".
Awk will also treat values like "xyz" and " 0" and "0xyz" as convertible to
0; however, iszero(...) will return false for those values. (This is
deliberate.)
Note that iszero(n) with an uninitialized argument n will return false;
however iszero accepts an optional second argument, and iszero(n, 1) will
return true when n is either 0 or "0" or uninitialized.
isint(n)
Returns true if its argument is n or "n", for any positive, zero, or
negative integer n.
Note that isint(n) with an uninitialized argument n will return false;
however isint accepts an optional second argument, and isint(n, 1) will return
true when n is either an integer or uninitialized.
isnat(n)
Returns true if its argument is n or "n", for any positive or zero integer
n.
Note that isnat(n) with an uninitialized argument n will return false;
however isnat accepts an optional second argument, and isnat(n, 1) will return
true when n is either a natural number or uninitialized.
ispos(n)
Returns true if its argument is n or "n", for any positive integer n. (Only
integers are accepted.)
Note that ispos(n) with an uninitialized argument n will return false;
however ispos accepts an optional second argument, and ispos(n, 1) will return
true when n is either a positive integer or uninitialized.
isneg(n)
Returns true if its argument is n or "n", for any negative integer n. (Only
integers are accepted.)
Note that isneg(n) with an uninitialized argument n will return false;
however isneg accepts an optional second argument, and isneg(n, 1) will return
true when n is either a negative integer or uninitialized.
check(u, missing)
If u is uninitialized, sets the global MISSING and returns missing; else
clears MISSING and returns u.
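The behavior described for check can be sketched as follows. The names my_ismissing and my_check are hypothetical, and using 1 and 0 as the "set" and "cleared" values of MISSING is an assumption. The sketch relies on the fact that an uninitialized awk value compares equal to both 0 and "", which no initialized value does.

```awk
# Hypothetical stand-ins, not the actual library code.
function my_ismissing(u) {
    return u == 0 && u == ""    # true only for an uninitialized value
}

function my_check(u, missing) {
    if (my_ismissing(u)) {
        MISSING = 1             # assumed "set" value
        return missing
    }
    MISSING = 0                 # assumed "cleared" value
    return u
}

BEGIN {
    v = my_check(never_assigned, "default")
    print v, MISSING            # prints: default 1
    v = my_check(0, "default")
    print v, MISSING            # prints: 0 0
}
```

A caller can then test MISSING right after the call to learn whether the fallback value was used.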
checknum(n, missing, msg)
If n is a number (according to isnum()), clears the global MISSING and
returns n. Else if n is uninitialized, sets the global MISSING and returns
missing (which is not itself required to be a number). Else die(msg).
checknat(n, missing, msg)
If n is a natural number (according to isnat()), clears the global MISSING
and returns n. Else if n is uninitialized, sets the global MISSING and returns
missing (which is not itself required to be a natural number). Else die(msg).
checkpos(n, missing, msg)
If n is a positive integer (according to ispos()), clears the global
MISSING and returns n. Else if n is uninitialized, sets the global MISSING and
returns missing (which is not itself required to be a positive integer). Else
die(msg).
Numbers and randoms
-------------------
max(m,n)
min(m,n)
These functions are trivial to inline and may be removed.
choose(n, [k=1], [A])
Choose k distinct random elements from A[1]...A[n]. Return those k elements
separated by SUBSEP. (They will be ordered with A[i] before A[j] when i<j.)
If no initialized value is provided for k, it defaults to 1. If no array is
provided, k random integers from the inclusive range 1...n are returned
instead. If an array is provided but n is uninitialized, A's "height" is used
for n (see explanation at the start of this document).
permute(n, [k=n])
Return a permutation of k random elements from 1..n, separated by SUBSEP.
If no initialized value is provided for k, it defaults to n.
shuffle(A, [n])
Shuffles the elements at A[1]...A[n] in place.
If no initialized value is provided for n, it defaults to A's "height" (see
explanation at the start of this document).
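For readers unfamiliar with in-place shuffling, here is a minimal sketch of the usual Fisher-Yates approach. This document does not say which algorithm shuffle uses, so the technique is an assumption; the name fy_shuffle is hypothetical, and n must be supplied explicitly.

```awk
# Hypothetical stand-in for shuffle(A, n); n is required here.
function fy_shuffle(A, n,    i, j, tmp) {
    for (i = n; i > 1; i--) {
        j = 1 + int(rand() * i)     # random index in 1..i
        tmp = A[i]; A[i] = A[j]; A[j] = tmp
    }
}

BEGIN {
    srand()
    n = split("a b c d e", A, " ")
    fy_shuffle(A, n)
    for (i = 1; i <= n; i++)
        printf "%s%s", A[i], (i < n ? " " : "\n")
}
```

Without the call to srand(), most awk implementations produce the same sequence on every run.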
Array utilities
---------------
isempty(A)
Return true if array A has no members.
sort(A, [n])
Sort elements A[1]...A[n] in place.
If no initialized value is provided for n, it defaults to A's "height" (see
explanation at the start of this document).
The sort performed is an "insertion sort", which is O(n^2), but is fast for
small arrays. Insertion sort is also a stable sorting algorithm, but no
facilities are provided here for matching array elements by anything but their
entire text; so stability is not especially useful.
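A minimal sketch of insertion sort over A[1]...A[n] is given below for reference. The name isort is hypothetical and n must be passed explicitly; this is not the library's own code.

```awk
# Hypothetical stand-in for sort(A, n); n is required here.
function isort(A, n,    i, j, tmp) {
    for (i = 2; i <= n; i++) {
        tmp = A[i]
        # Shift larger elements up; the strict ">" keeps equal items in order.
        for (j = i - 1; j >= 1 && A[j] > tmp; j--)
            A[j + 1] = A[j]
        A[j + 1] = tmp
    }
}

BEGIN {
    n = split("pear apple fig", A, " ")
    isort(A, n)
    print A[1], A[2], A[3]      # prints: apple fig pear
}
```

Elements are compared with the ordinary ">" operator, so values that look numeric compare as numbers and everything else compares as strings.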
qsort(A, 1, n)
Sort elements A[1]...A[n] in place using quicksort. For efficiency, the
values 1 and n must be explicitly supplied.
Quicksort averages O(n log n), but has O(n^2) worst-case. Here we choose
pivots randomly to avoid hitting worst-case behavior on the common case of
already-sorted arrays.
Quicksort is an unstable sorting algorithm; though see the comments on sort
about the limited scope here for stability to be useful.
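The random-pivot idea can be sketched as follows, in the style of the quicksort from the awk book code mentioned at the start of this document. The names rqsort and swap are hypothetical, and the library's actual implementation may differ.

```awk
function swap(A, i, j,    t) {
    t = A[i]; A[i] = A[j]; A[j] = t
}

# Hypothetical stand-in for qsort(A, 1, n).
function rqsort(A, left, right,    i, last) {
    if (left >= right)
        return
    # Move a randomly chosen pivot to the front of the range.
    swap(A, left, left + int((right - left + 1) * rand()))
    last = left
    for (i = left + 1; i <= right; i++)
        if (A[i] < A[left])
            swap(A, ++last, i)
    swap(A, left, last)
    rqsort(A, left, last - 1)
    rqsort(A, last + 1, right)
}

BEGIN {
    srand()
    n = split("pear apple fig banana", A, " ")
    rqsort(A, 1, n)
    print A[1], A[2], A[3], A[4]    # prints: apple banana fig pear
}
```

With a fixed pivot such as A[left], an already-sorted input would recurse n levels deep; the random choice makes that case unlikely.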
hsort(A, [n])
Sort elements A[1]...A[n] in place using heapsort.
If no initialized value is provided for n, it defaults to A's "height" (see
explanation at the start of this document).
Heapsort has average and worst-case O(n log n). It's thought to be slower
than quicksort in average cases. Heapsort relies essentially on random
access, so it has poorer cache performance than some other algorithms.
Heapsort is an unstable sorting algorithm; though see the comments on sort
about the limited scope here for stability to be useful.
pop([start], [len], [A])
Pop elements A[start]...A[start+len-1] from array A, mutating that array,
and return the popped elements separated by SUBSEP.
If no array is provided, the fields $start...$(start+len-1) are popped
instead.
If no initialized values are provided for start or len, they default to 1
and the array's "height" (or to NF, when no array is provided).
Note that when fields are popped, $0 will be recomputed using OFS, and
details about the input separators may be discarded. If you want to preserve
the input separators (including leading or trailing space), use gsplit.
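The loss of input separators is standard awk behavior and is easy to reproduce without this library. In the sketch below the helper name rebuilt is hypothetical, and the default FS and OFS are assumed.

```awk
# Hypothetical helper: returns line as awk rebuilds it after a field assignment.
function rebuilt(line) {
    $0 = line
    $1 = $1        # assigning any field forces $0 to be rejoined with OFS
    return $0
}

BEGIN {
    print "[" rebuilt("  alpha   beta\tgamma  ") "]"    # prints [alpha beta gamma]
}
```

The leading and trailing spaces, the run of spaces, and the tab are all replaced by single OFS characters.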
insert(value, [start], [A])
Insert value as the new element A[start] in array A, pushing any existing
elements between A[start] and A[A's "height"] upwards.
If no array is provided, value is instead inserted at field $start.
If no initialized value is provided for start, it defaults to (A's
"height")+1 (or to NF+1, when no array is provided).
extend(V, [start], [A])
Like insert, but instead of inserting a single value it inserts all the
values between V[1]...V[V's "height"].
reverse(A)
Reverses the elements A[1]...A[A's "height"] in place.
If no array is provided, the fields $1...$NF are reversed instead.
concat([start=1], [len], [sep=OFS], [A])
Returns the elements A[start]...A[start+len-1], separated by sep.
If no array is provided, instead provide the concatenation of
$start...$(start+len-1), separated by sep.
If no initialized value is provided for start or len, they default to 1 and
A's "height" (or to 1 and NF, when no array is provided). If no value is
provided for sep, it defaults to OFS.
has_value(A, value)
Returns true if A[k] == value for any k already in A.
includes(A1, A2, [onlykeys=0])
Returns true if for every k in A2, k is also in A1 and A1[k] == A2[k]. If
the optional argument onlykeys is true, then the last equality is ignored.
union(A1, A2, [conflicts=1])
Mutates A1 to contain the union of A1 and A2. The conflicts argument
specifies how to handle cases where k is in A1 and A2 but A1[k] != A2[k]. If
conflicts == 0, neither A1[k] nor A2[k] is preserved in the result array. If
conflicts == 1 (the default), then A1[k] is used. If conflicts == 2, then
A2[k] is.
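The three conflict modes can be illustrated with the following sketch. The name my_union is hypothetical and the code is only an interpretation of the description above, not the library's implementation.

```awk
# Hypothetical stand-in for union(A1, A2, conflicts).
function my_union(A1, A2, conflicts,    k) {
    if (conflicts == 0 && conflicts == "")
        conflicts = 1                   # argument omitted: use the default
    for (k in A2) {
        if (!(k in A1))
            A1[k] = A2[k]
        else if (A1[k] != A2[k]) {
            if (conflicts == 0)
                delete A1[k]            # keep neither value
            else if (conflicts == 2)
                A1[k] = A2[k]           # prefer the value from A2
        }
    }
}

BEGIN {
    X["a"] = 1; X["b"] = 2
    Y["b"] = 20; Y["c"] = 3
    my_union(X, Y)
    print X["a"], X["b"], X["c"]        # prints: 1 2 3
}
```

Calling my_union(X, Y, 2) on the same inputs would instead leave X["b"] equal to 20, and my_union(X, Y, 0) would remove the key "b" from X.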
intersect(A1, A2, [conflicts=1])
Mutates A1 to the intersection of A1 and A2. The conflicts argument
specifies how to handle cases where k is in A1 and A2 but A1[k] != A2[k]. If
conflicts == 0, neither A1[k] nor A2[k] is preserved in the result array. If
conflicts == 1 (the default), then A1[k] is used. If conflicts == 2, then
A2[k] is.
subtract(A1, A2, [conflicts=1])
Mutates A1 to the subtraction of A2 from A1. The conflicts argument
specifies how to handle cases where k is in A1 and A2 but A1[k] != A2[k]. If
conflicts == 0, such keys are removed from A1 anyway. If conflicts == 1 (the
default), such keys are left in A1.
String utilities
----------------
quote(str)
Returns a version of str quoted for use by a shell. Example:
if (system("test -e " quote(filename)) == 0) {
print filename " exists"
}
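For readers curious how shell quoting can be done in awk, here is a minimal sketch of one common technique: wrap the string in single quotes and rewrite each embedded single quote. The name shquote is hypothetical, and this document does not say that quote works this way.

```awk
# Hypothetical stand-in for quote(); not the actual library code.
function shquote(str,    q) {
    q = "\047"                      # one single-quote character
    gsub(q, q "\\" q q, str)        # quote becomes: quote, backslash, quote, quote
    return q str q
}

BEGIN {
    cmd = "test -e " shquote("/tmp/some file")
    print cmd
}
```

Inside single quotes a POSIX shell treats every character literally, so the only character that needs rewriting is the single quote itself.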
delete_quoted(str, [repl=""])
Deletes all "spans like this" from str. Knows how to handle \" and \\.
Optionally permits substituting a different replacement string (rather than "").
parse_json(str, T, V, [slack=""])
Parses JSON str. A successful parse will fill the T and V arrays like this:
parse of: {"one": 10, "two": [20,30,[40,[50,60]]], "three": {}}
return value = 10
T[1]=string V[1]=one
T[2]=number V[2]=10
T[3]=string V[3]=two
T[4]=number V[4]=20
T[5]=number V[5]=30
T[6]=number V[6]=40
T[7]=number V[7]=50
T[8]=number V[8]=60
T[9]=string V[9]=three
T[10]=object V[10]=1:2,3:11,9:14
T[11]=array V[11]=4,5,12
T[12]=array V[12]=6,13
T[13]=array V[13]=7,8
T[14]=object V[14]=
The return value of 10 indicates that the object at T[10] and V[10] is the root.
The caller has to split up the contents of V[10] manually (use split or asplit),
and retrieve the values from the specified indices in T and V.
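The manual splitting step can be sketched with plain split. The T and V entries below are copied from the example above. Reading each "k:v" pair in an object's V entry as "index of the key : index of the value" is an inference from that example, and the helper name members is hypothetical.

```awk
# Hypothetical helper: lists "key=type" for each member of the object at root.
function members(T, V, root,    pairs, kv, n, i, out) {
    out = ""
    n = split(V[root], pairs, ",")
    for (i = 1; i <= n; i++) {
        split(pairs[i], kv, ":")    # kv[1] = key index, kv[2] = value index
        out = out (i > 1 ? ";" : "") V[kv[1]] "=" T[kv[2]]
    }
    return out
}

BEGIN {
    T[1] = "string";  V[1] = "one"
    T[2] = "number";  V[2] = 10
    T[3] = "string";  V[3] = "two"
    T[9] = "string";  V[9] = "three"
    T[10] = "object"; V[10] = "1:2,3:11,9:14"
    T[11] = "array";  V[11] = "4,5,12"
    T[14] = "object"; V[14] = ""
    print members(T, V, 10)     # prints one=number;two=array;three=object
}
```

Nested arrays and objects can be followed the same way, by splitting the V entry at the value index again.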
An unsuccessful parse will leave T and V in an indeterminate state, and give
a return value < 1.
The optional slack argument permits parsing some common invalid formats:
* if slack contains ",", arrays are permitted to have a trailing ","
* if slack contains ":", object keys are permitted to be unquoted when
they conform to JavaScript identifier syntax.
query_json(str, A, [root, [slack]])
Parses JSON str with supplied slack, if any. Unpacks the returned object or
array into A. Optionally permits specifying a different root to unpack.
Example:
query of: {"one": 1, "two": {"alpha":0, "beta": {"uno": { "id":1, "child": {
"legal": true, "id": 11 }}, "dos": {"id": 2, "child": {"legal": true,
"id": 21 }}}}}
with root="two.beta"
A["uno","id"] = 1
A["uno","child", "legal"] = 1
A["uno","child", "id"] = 11
A["dos","id"] = 2
A["dos","child", "legal"] = 1
A["dos","child", "id"] = 21
Arrays are handled like this:
query of: {"one": 1, "two": {"alpha":0, "beta": {"uno": { "id":[1,101,1001], "child": {
"legal": true, "id": [11,-11] }}, "dos": {"id": 2, "child": {"legal": true,
"id": 21 }}}}}
with root="two.beta"
A["uno","id", 0] = 3
A["uno","id", 1] = 1
A["uno","id", 2] = 101
A["uno","id", 3] = 1001
A["uno","child", "legal"] = 1
A["uno","child", "id", 0] = 2
A["uno","child", "id", 1] = 11
A["uno","child", "id", 2] = -11
A["dos","id"] = 2
A["dos","child", "legal"] = 1
A["dos","child", "id"] = 21
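The resulting keys are ordinary awk multi-part subscripts, joined by SUBSEP. Here is a minimal sketch of reading such an array back, using entries copied from the example above. Treating index 0 as the length of a JSON array is inferred from that example, and the helper name dotted is hypothetical.

```awk
# Hypothetical helper: renders a SUBSEP-joined key as a dotted path.
function dotted(key,    parts, n, i, out) {
    n = split(key, parts, SUBSEP)
    out = parts[1]
    for (i = 2; i <= n; i++)
        out = out "." parts[i]
    return out
}

BEGIN {
    A["uno", "id", 0] = 3
    A["uno", "id", 1] = 1
    A["uno", "id", 2] = 101
    A["uno", "id", 3] = 1001
    A["dos", "id"] = 2

    # Walk the JSON array stored under ("uno", "id").
    len = A["uno", "id", 0]
    for (i = 1; i <= len; i++)
        print dotted("uno" SUBSEP "id" SUBSEP i) " = " A["uno", "id", i]

    # Membership test on a multi-part key.
    if (("dos", "id") in A)
        print "dos.id = " A["dos", "id"]
}
```

This prints uno.id.1 = 1, uno.id.2 = 101, uno.id.3 = 1001 and dos.id = 2, one per line.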
rep(str, n, [sep=""])
Returns the concatenation of n copies of str, optionally joined by sep.
trim(str)
Returns a copy of str with leading and trailing space removed.
trimleft(str)
Returns a copy of str with leading space removed.
trimright(str)
Returns a copy of str with trailing space removed.
has_prefix(str, pre)
Returns true if str begins with pre.
has_suffix(str, suf)
Returns true if str ends with suf.
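Both tests amount to comparing a substring of str against the given text. A minimal sketch follows; the names my_has_prefix and my_has_suffix are hypothetical, and treating an empty prefix or suffix as always matching is an assumption.

```awk
# Hypothetical stand-ins; not the actual library code.
function my_has_prefix(str, pre) {
    return substr(str, 1, length(pre)) == pre
}

function my_has_suffix(str, suf,    n) {
    n = length(str) - length(suf)
    return n >= 0 && substr(str, n + 1) == suf
}

BEGIN {
    print my_has_prefix("README.lib", "README")    # prints 1
    print my_has_suffix("README.lib", ".lib")      # prints 1
    print my_has_suffix("lib", "README.lib")       # prints 0
}
```

Since substr always returns a string, the comparisons are literal string comparisons rather than regex matches, so characters such as "." need no escaping.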
detab(str, [siz=8])
Returns str with tabs replaced by spaces. Assumes tabstops at siz
characters (defaults to 8).
entab(str, [siz=8])
Returns str with sequences of two or more spaces replaced by tabs, where
doing so preserves layout. Assumes tabstops at siz characters (defaults to 8).
Regex utilities
---------------
Note: /re/ literals can only be used with built-in functions like match, sub,
gsub, and gensub. If /re/ is passed to a user-defined function, like the ones
below, it's evaluated as the boolean expression "$0 ~ /re/". Use "re" instead.
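This pitfall can be reproduced with a one-line user-defined function. In the sketch below, the function name arg is hypothetical and exists only to show what value actually arrives.

```awk
# Hypothetical helper: returns its argument unchanged.
function arg(x) {
    return x
}

BEGIN {
    $0 = "xyz"
    print arg(/y/)      # prints 1, the result of $0 ~ /y/
    print arg(/q/)      # prints 0, the result of $0 ~ /q/
    print arg("y")      # prints y, the regex text itself
}
```

Only the string form delivers the pattern to the function, which can then use it as a dynamic regex with ~ or match.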
asplit(str, A, [" "], ["="])
Deletes any existing content in A and fills it with the key=value pairs
specified in str. For example:
asplit("key1 key2=value2 key3=value3", A)
will result in an array where
A["key1"]=""
A["key2"]="value2"
A["key3"]="value3"
Optionally permits substituting different parsing characters than " " and "=".
bmatch(str, opener, closer)
Like Lua's string.find with the pattern "%b()", for opener="(" and closer=")". Finds text
inside (and including) balanced pairs of opener and closer. Sets RSTART and
RLENGTH.
tail(str, [re=SUBSEP], [nth=1], [none=""])
If str="abc<SUBSEP>def<SUBSEP>ghi", then:
tail(str,SUBSEP,1) returns "def<SUBSEP>ghi"
and:
tail(str,SUBSEP,2) returns "ghi"
Asking for a tail beyond the last one present returns none (which defaults to
""). Sets RSTART and RLENGTH.
head(str, [re=SUBSEP], [nth=1], [none=""])
If str="abc<SUBSEP>def<SUBSEP>ghi", then:
head(str,SUBSEP,1) returns "abc"
head(str,SUBSEP,2) returns "def"
and:
head(str,SUBSEP,3) returns "ghi"
Asking for a head beyond the last one present returns none (which defaults to
""). As a special case, asking for head(str, re, -1) returns the last head in
str: in the examples above, "ghi". Sets RSTART and RLENGTH.
If you are going to extract many heads, it will be more efficient to use
split (or gsplit) on str instead, and just access the destination ITEMS array.
matchstr(str, re, [nth=1], [none=""])
Whereas head returns elements _between_ the text that matches re, matchstr
returns the items that _match_ re. As with head, asking for the fourth match
when only three are present returns none (which defaults to ""). Also, as a
special case, asking for matchstr(str, re, -1) returns the last match in str.
Sets RSTART and RLENGTH. The simplest invocation:
res = matchstr(str, re, 1, none)
is equivalent to:
if (match(str, re))
res = substr(str, RSTART, RLENGTH)
else res = none
If you are going to extract many matches, it will be more efficient to use
gmatch on str instead, and just access the destination MATCH array. Or use
gsplit and access the destination SEPS array.
nthindex(str, needle, nth)
A generalization of index that also permits searching for the 2nd, 3rd, ...
occurrences of needle in str. As with index, needle is interpreted literally
rather than as a regex; and the return value is a position in str >= 1 when
needle is found, or 0 when it is not found. As a special case, asking for
nthindex(str, needle, -1) returns the position of the last occurrence. Does not
set RSTART or RLENGTH.
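The positive-nth case can be sketched with repeated calls to index. The name my_nthindex is hypothetical; the sketch does not handle the nth=-1 special case, and counting overlapping occurrences is an assumption about the intended behavior.

```awk
# Hypothetical stand-in for nthindex(); handles nth >= 1 only.
function my_nthindex(str, needle, nth,    pos, off, i) {
    off = 0
    for (i = 1; i <= nth; i++) {
        # Search the text that follows the start of the previous occurrence.
        pos = index(substr(str, off + 1), needle)
        if (pos == 0)
            return 0
        off += pos          # absolute position of this occurrence
    }
    return off
}

BEGIN {
    print my_nthindex("a,b,c,d", ",", 1)    # prints 2
    print my_nthindex("a,b,c,d", ",", 3)    # prints 6
    print my_nthindex("a,b,c,d", ",", 4)    # prints 0
}
```

Because only index and substr are used, RSTART and RLENGTH are left untouched, as described above.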
gmatch(str, re, MATCHES, STARTS)
Fills MATCHES with all of re's non-overlapping matches in str; also fills
STARTS with the corresponding indexes where these matches are located. Either
or both of these arrays may be omitted.
This function differs from gawk's version of match, which also accepts an
optional array argument. Gawk's match finds only the first match of re in str,
and returns \\0 (the whole match) as well as \\1...\\n (the matching groups),
as well as the starts and lengths of each of these, in the destination array.
gmatch on the other hand finds multiple matches, but only returns \\0 for each,
and returns the matching text and its starting position in different arrays.
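The general idea of collecting every match can be sketched with a loop around the built-in match. The name my_gmatch is hypothetical, and the sketch is deliberately simplified: both arrays are required, it returns the number of matches (an assumption), it stops at a zero-width match, and a regex anchored with "^" would be re-applied to each remainder.

```awk
# Hypothetical, simplified stand-in for gmatch(); see the caveats above.
function my_gmatch(str, re, MATCHES, STARTS,    n, off) {
    n = 0
    off = 0                         # characters already consumed from str
    while (str != "" && match(str, re) && RLENGTH > 0) {
        n++
        MATCHES[n] = substr(str, RSTART, RLENGTH)
        STARTS[n] = off + RSTART
        off += RSTART + RLENGTH - 1
        str = substr(str, RSTART + RLENGTH)
    }
    return n
}

BEGIN {
    n = my_gmatch("a1b22c333", "[0-9]+", M, S)
    for (i = 1; i <= n; i++)
        print S[i] ": " M[i]
}
```

This prints 2: 1, then 4: 22, then 7: 333. Note that the regex is passed as the string "[0-9]+", not as a /re/ literal.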
gsplit(str, ITEMS, re, SEPS)
Takes an optional SEPS array argument to collect all the "separator" text
that matches re. Mostly equivalent to gawk's version of split. Unlike split,
though, this function (as well as the other regex functions here) honors
zero-width matches, so long as the underlying awk implementation's match does.
That is, gsplit("abxd", ITEMS, "x*") will result in ITEMS[1]="a", ITEMS[2]="b",
ITEMS[3]="d". I don't expect this often to be useful, but it does seem the more
correct behavior.
OS/filesystem
-------------
getopt(optstring, OPTIONS, basename, version, usage_msg)
Automatically handles "--help", "--version", and "--" arguments.
Optstring is parsed as for getopt(3), with some extensions.
* The presence of a single letter "w" not followed by a flag means that an
option "-w" should be accepted in the ARGV list up until the presence of
"--" or a non-option ARGV.
* If optstring contains a letter followed by a colon, as "x:", that means
that the option requires an argument. Either "-xarg" or "-x" and a
following "arg" in the ARGV list are accepted. If "-x" occurs multiple
times in the ARGV list, only the last argument will be remembered.
* If optstring contains a letter followed by a question mark, as "y?", that
means that the option accepts an optional argument (the default if no
argument is provided is ""). If "-y" occurs multiple times in the ARGV
list, only the argument of its last occurrence (which may be the default)
will be remembered.
* If optstring contains a letter followed by a plus, as "z+", that means
that the option requires an argument, and that all arguments should be
remembered when the option is repeated. (The arguments are concatenated into
a single string separated by SUBSEP).
As with getopt(3), multiple short arguments may be joined together in the
ARGVs: with the optstring "wx:", ARGV[1]="-wxarg" is parsed the same as
ARGV[1]="-w" ARGV[2]="-x" ARGV[3]="arg".
Long-form options can also be enabled by passing them as keys of the
OPTIONS array. If OPTIONS contains the following key/values:
OPTIONS["doubleu"]=""
OPTIONS["ex"]=":"
OPTIONS["why"]="?default"
OPTIONS["zee"]="+"
then getopt will also accept in the ARGV list the options "--doubleu" with no
argument, "--ex" with a required argument, "--why" with an optional argument
("default" is used when no argument is supplied), and "--zee" with a required
argument, multiple instances of which are remembered.
Armed with these parsing instructions, getopt reads through the ARGV list
until it reaches "--", or a non-option argument, or the end of the list. It
clears the array OPTIONS and stores the actual options it finds there. So the
ARGV list "-wxarg" "-ybar" "-zfizz" "-zbuzz" results in:
OPTIONS["w"]=1
OPTIONS["x"]="arg"
OPTIONS["y"]="bar"
OPTIONS["z"]="fizz<SUBSEP>buzz"
The ARGV list "--doubleu" "--ex=arg" "--why" "bar" "--zee=fizz" "--zee" "buzz"
would produce a similar result.
The ARGV list is mutated to remove all the parsed options and their
arguments. (All those entries are removed from the list, not merely set to "".)
This function has no facility for tracking that --doubleu and -w should
have the same meaning; if the caller wants such behavior, it should check for
both keys in the result.
isreadable(path)
Returns true if a readable file exists at path.
filesize(path, [followlink])
Returns the size of the file at path; or "" if it's a non-regular file; or
-1 if the file doesn't exist. If followlink=1 and the file at path is a link,
return information about the link's target rather than the link itself.
filetype(path, [followlink])
Returns one of the characters "fdlcbsp" to indicate the type of the
filesystem entry at path; or -1 if there is no such entry. If followlink=1 and
the file at path is a link, return information about the link's target rather
than the link itself.
basename(path, suffix)
Just like basename(1).
dirname(path)
Just like dirname(1).
getfile(path)
Reads the file at path and returns it as a single string.
getpipe(cmd)
Reads the stdout of cmd and returns it as a single string.
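Reading a whole command's output in awk is usually done with a getline loop. Here is a minimal sketch under some assumptions: the name my_getpipe is hypothetical, RS has its default value, lines are rejoined with "\n", and the final newline is dropped. The library's function may treat trailing newlines differently.

```awk
# Hypothetical stand-in for getpipe(); not the actual library code.
function my_getpipe(cmd,    line, out, sep) {
    out = ""
    sep = ""
    while ((cmd | getline line) > 0) {
        out = out sep line
        sep = "\n"
    }
    close(cmd)      # allow the same command to be run again later
    return out
}

BEGIN {
    print my_getpipe("echo hello; echo world")
}
```

The explicit "> 0" test matters: getline returns -1 on error, which a bare while condition would treat as true and loop on forever.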
mktemp()
Returns the path of a new file in TMPDIR that will be deleted when awk
exits, even if it exits abnormally.
This doesn't yet, but should, save the temp file in a new private directory
under TMPDIR.
# vim: ft=txt: