-
Notifications
You must be signed in to change notification settings - Fork 1
/
index.html
702 lines (511 loc) · 14 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
<!DOCTYPE html>
<html>
<head>
<style type="text/css">
@import url(https://fonts.googleapis.com/css?family=Yanone+Kaffeesatz);
@import url(https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Droid+Sans+Mono:400,700,400italic);
@import url(https://fonts.googleapis.com/css?family=Ubuntu+Mono:400,700,400italic);
@import url(remark-style.css);
</style>
</head>
<body>
<textarea id="source">
class: center, middle
# Command-Line-Fu
An introduction to the UNIX command line for munging data.
(Source on [github](https://github.com/bjelline/command-line-fu), made with [remark](https://github.com/gnab/remark))
---
## "UNIX philosophy"
* small, sharp tools
* that can be chained together
* to solve complex problems
* goes back to the introduction of the pipe and grep in 1973
where simple means: providing well thought out functionality,
implemented to perfection (see [grep algorigthm and implementation](http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html))
[wikipedia](https://en.wikipedia.org/wiki/Unix_philosophy)
---
## Command Line Keys
* use the up-arrow (and down-arrow) to go back to previouse commands
* use the tabulator key for file name completion: type just the first letter of a filename, then try tab
* use CTRL-A to move your cursor to the beginning of the line
* use CTRL-E to move your cursor to the end of the line
* use CTRL-K to delete from the cursor to the end of the line
---
## Command Line History
* !! repeats the last command
* !$ repeat the last word of the last command
```bash
$ mkdir this_long_directory_name
$ cd !$
```
---
## What kind of file is that?
first, let's look at the size:
```bash
$ ls -l parole
-rw-r----- 1 bjelline admin 2286089 Feb 25 12:13 parole
$ ls -lh parole
-rw-r----- 1 bjelline admin 2.2M Feb 25 12:13 parole
```
---
## What kind of file is that?
you don't know what might be in the file?
guess the file type:
```bash
$ file parole
parole: ASCII text
```
if it's text, we can take a peek with less.
```bash
$ less parole
```
---
## more or less
use key commands in less (in many contexts)
* q to quit
* SPACE to page one down
* CTRL-B to go back one page
* CTRL-F to go back one page
* G to jump to the end
* CTRL-G to find out where you a (line number + percent)
* 50% jump to middle of file (works for any number 0-100)
* /cookie to search for cookie
* n to jump to the next occurance
* b to jump back to the previous occurance
---
## How many lines are there?
```bash
$ wc -l parole
24291 parole
```
See also man wc
---
## rename and move stuff around
now we know that 'parole' is a csv file
let's rename it:
```bash
$ mv parole parole.csv
```
See also man mv
---
## grep
find certain lines, get rid of certain files
grep will look for a string in each line, and print
the line if the string is contained.
```bash
$ grep TRAMA parole.csv
```
---
## grep
If there is a space in your search string you must quote it:
```bash
$ grep 'JAMES, MARK' parole.csv
```
grep is case sensitive, use -i to make it case insensitive:
```bash
$ grep -i trama parole.csv
```
---
## grep inverted
the option v turns grep around: all lines are printed, except those
matching the pattern. this way you can get rid of certain files
```bash
$ grep -v ....
```
---
## grep enhanced
to build more complex search, with search patterns, use egrep
```bash
$ egrep 'TRAMA|TENNEY' parole.csv
```
---
## sorting
unsorted stuff
```bash
$ cat names.txt
DUQUIN, JON
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
BROWN, STEPHANIE
BROWN, WILLIAM
BOATWRIGHT, MAURICE
PAYNE, RAYMOND
DUQUIN, JON
BROWN, WILLIAM
MCKINNEY, ROSS B
MORALES, VICTOR
DUQUIN, JON
MCALPINE, ERIC M
BYNUM, STEVEN
```
---
## sorting
sorting
```bash
$ sort names.txt
BOATWRIGHT, MAURICE
BROWN, STEPHANIE
BROWN, WILLIAM
BROWN, WILLIAM
BYNUM, STEVEN
DUQUIN, JON
DUQUIN, JON
DUQUIN, JON
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
MCKINNEY, ROSS B
MORALES, VICTOR
PAYNE, RAYMOND
```
---
## sort and make uniq
```bash
$ sort names.txt | uniq
BOATWRIGHT, MAURICE
BROWN, STEPHANIE
BROWN, WILLIAM
BYNUM, STEVEN
DUQUIN, JON
MCALPINE, ERIC M
MCKINNEY, ROSS B
MORALES, VICTOR
PAYNE, RAYMOND
```
---
## sort and count
```bash
$ sort names.txt | uniq -c
1 BOATWRIGHT, MAURICE
1 BROWN, STEPHANIE
2 BROWN, WILLIAM
1 BYNUM, STEVEN
3 DUQUIN, JON
4 MCALPINE, ERIC M
1 MCKINNEY, ROSS B
1 MORALES, VICTOR
1 PAYNE, RAYMOND
```
---
## sort and count, sort by number
```bash
$ sort names.txt | uniq -c | sort -n
1 BOATWRIGHT, MAURICE
1 BROWN, STEPHANIE
1 BYNUM, STEVEN
1 MCKINNEY, ROSS B
1 MORALES, VICTOR
1 PAYNE, RAYMOND
2 BROWN, WILLIAM
3 DUQUIN, JON
4 MCALPINE, ERIC M
```
---
## sort and count, sort by number, reversed
```bash
$ sort names.txt | uniq -c | sort -rn
4 MCALPINE, ERIC M
3 DUQUIN, JON
2 BROWN, WILLIAM
1 PAYNE, RAYMOND
1 MORALES, VICTOR
1 MCKINNEY, ROSS B
1 BYNUM, STEVEN
1 BROWN, STEPHANIE
1 BOATWRIGHT, MAURICE
```
---
## top 3 by number of occurences
```bash
$ sort names.txt | uniq -c | sort -rn | head -3
4 MCALPINE, ERIC M
3 DUQUIN, JON
2 BROWN, WILLIAM
```
---
## redirection
Here we use the greater than sign for output-redirection to a file.
This will overwrite the file if it already exists, so be careful when to use it!
```bash
$ grep DENIED parole.csv > parole-denied.csv
```
---
## redirection
if you want to add to an existing file use double greater than signs:
```bash
$ grep GRANTED parole.csv > parole-granted.csv
$ grep PAROLED parole.csv >> parole-granted.csv
```
---
## redirection
Using redirection you can get any program to write to a file.
How does this work? The UNIX command line tools send their output
to a "channel" called "standard output" (or STDOUT, stdout in different programming languages).
the redirection is the general way of sending this channel to a file.
See also man bash, search for REDIRECTION
---
## pipeline
How many people were denied parole? we just
need to count the lines:
```bash
$ grep DENIED parole.csv > parole-denied.csv
$ wc -l parole-denied.csv
```
---
## pipeline
But there is no need to create the file at all: we can creat a pipeline
with the vertical bar symbol. The pipeline connects the output
channel of grep to the input channel of wc:
```bash
$ grep DENIED parole.csv | wc -l
```
The vertical bar is often called "pipe", so you would read
this command as: grep denied parole.csv pipe wordcount minus L
Using the pipe you can create complex programs without writing any permanent code.
---
## pipeline
When building pipes and working with large data sets, always put "| less" at the end to take a peek.
```bash
$ sort names.txt | less
$ sort names.txt | uniq -c | less
$ sort names.txt | uniq -c | sort -n | less
$ sort names.txt | uniq -c | sort -rn | less
$ sort names.txt | uniq -c | sort -rn | head -30 | less
```
---
## cutting columns from files
cut, paste
---
## handling csv files
There are the commands 'cut' and 'paste' that you might want to
use to handle comma-separated-values files. But beware: as soon as
you have commas inside one of the columns cut will not work for you!
Here [csvkit](https://github.com/onyxfish/csvkit) comes in:
csvkit knows about strings and escaping
in csv and will cut out the correct columns:
```bash
$ csvcut -c1,2 parole.csv
```
Now we can use the power of pipelines to learn more about a csv file
```bash
$ csvcut -c1,2 parole.csv | sort .....
```
---
## handling json files
use [jq](https://stedolan.github.io/jq/) for querying json files:
```bash
$ less example.json
{
"status": "past",
"rating": {
"count": 1,
"average": 5
},
"utc_offset": -18000000,
"group": {
"name": "SOME CLUB",
"join_mode": "approval",
"who": "Members",
"id": 12345
},
"description": "<p>we haven an <b>OPEN HOUSE</b>",
"created": 1383331008000,
"updated": 1383930450000,
```
---
## handling json files
use [jq](https://stedolan.github.io/jq/) for querying json files:
```bash
$ cat example.json | jq '.'
{
"status": "past",
"rating": {
"count": 1,
"average": 5
},
"utc_offset": -18000000,
"group": {
"name": "SOME CLUB",
"join_mode": "approval",
"who": "Members",
"id": 12345
},
"description": "<p>we haven an <b>OPEN HOUSE</b>",
"created": 1383331008000,
"updated": 1383930450000,
```
---
## handling json files
use [jq](https://stedolan.github.io/jq/) for querying json files:
```bash
$ cat example.json | jq '{id: .id, rating: .rating.average}'
{
"id": "148644612",
"rating": 5
}
```
---
## handling json files with array
use [jq](https://stedolan.github.io/jq/) for querying json files:
```bash
$ cat array.json
[
{
"status": "past",
"rating": {
"count": 1,
"average": 5
},
"yes_rsvp_count": 8,
"headcount": 0,
"id": "148644612"
},
{
"status": "past",
"rating": {
"count": 0,
"average": 0
},
```
---
## handling json files with array
use [jq](https://stedolan.github.io/jq/) for querying json files:
```bash
$ cat array.json | jq '.[]|{id: .id, rating: .rating.average}' | less
{
"id": "148644612",
"rating": 5
}
{
"id": "150931712",
"rating": 4.75
}
```
---
class: smallcode
## compressed files: zip
see list of file with unzip -v before uncompressing
```bash
$ unzip -v groups_new.zip
Archive: groups_new.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Stored 0 0% 10-24-2014 12:50 00000000 groups_new/
912939 Defl:N 151623 83% 10-24-2014 10:34 d331e014 groups_new/category_1.json
1308266 Defl:N 226343 83% 10-24-2014 10:44 51ae53a1 groups_new/category_10.json
667660 Defl:N 112417 83% 10-24-2014 10:44 c3cbb3d4 groups_new/category_11.json
596065 Defl:N 94715 84% 10-24-2014 10:44 9feb58bf groups_new/category_12.json
...
336000 Defl:N 51356 85% 10-24-2014 10:44 0b4d4b64 groups_new/category_8.json
1545754 Defl:N 247832 84% 10-24-2014 10:44 c95bd009 groups_new/category_9.json
-------- ------- --- -------
40093888 6725152 83% 36 files
```
---
class: smallcode
## compressed files: zip
see content of files with unzip -c before uncompressing
```bash
$ unzip -c groups_new.zip | less
Archive: groups_new.zip
extracting: groups_new/
inflating: groups_new/category_1.json
[
{
"category": {
"shortname": "arts-culture",
"name": "fine arts/culture"
},
"city": "New York",
"rating": 4.52,
"description": "<p>discussing books chosen by the members of our group.</p>",
"join_mode": "approval",
"country": "US",
"who": "Book Lovers",
```
---
## compressed files: zip
uncompress
```bash
$ unzip groups_new.zip
Archive: groups_new.zip
creating: groups_new/
inflating: groups_new/category_1.json
inflating: groups_new/category_10.json
inflating: groups_new/category_11.json
inflating: groups_new/category_12.json
inflating: groups_new/category_13.json
inflating: groups_new/category_14.json
inflating: groups_new/category_15.json
extracting: groups_new/category_7.json
inflating: groups_new/category_8.json
inflating: groups_new/category_9.json
---
## compressed files: gz
a .gz file contains one compressed file only.
look at the contents of that file
```bash
$ zcat some_log.1.gz
```
---
## compressed files: tgz
tgz is short for .tar.gz: the tar file contains sequence
of several files (no compression), the .gz does the compression.
look at the list of files:
```bash
$ tar tvfz data.tgz
```
uncompress
```bash
$ tar xvfz data.tgz
```
---
class: center, middle
# Into the Database
---
## Loading into the Database
* (binary) Database Dump
* SQL Dump, e.g. pg_dump, pg_restore
* Load CSV from client, e.g. \COPY
* Load CSV from server, e.g. COPY
* Load CSV as foreign table
---
## some real world pipes
```bash
$ csvcut -c3 parole.csv |cut -c1-2|
$ csvcut -c3 parole.csv |cut -c1-2| awk '{ if (int($0) > 20) print "19$0"; else print "20$0";'
$ echo "YEAR INCARCERATION" > year_of_incarceration.txt
$ csvcut -c3 parole.csv |cut -c1-2| grep -v DI| perl -n -e 'if($_>16) { print "19$_" } else { print "20$_" }' >> year_of_incarceration.txt
$ csvcut -c5 parole.csv | cut -d\/ -f 3 | sed -e 's/DATE/YEAR/' |head -10 > birth_year.txt
$ csvcut -c11,12 parole-3.csv | perl -n -e '($in,$birth) = split /,/; if($.==1){print "AGE AT INCARCERATION\n"} else { print $in-$birth, "\n"}' > age_at_incarceration.txt
```
---
## a famous example
[source](http://www.leancrew.com/all-this/2011/12/more-shell-less-egg/)
```bash
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
```
McIlroy’s short, impossible-to-misunderstand explanation:
If you are not a UNIX adept, you may need a little explanation, but not much, to understand this pipeline of processes. The plan is easy:
* Make one-word lines by transliterating the complement (-c) of the alphabet into newlines (note the quoted newline), and squeezing out (-s) multiple newlines.
* Transliterate upper case to lower case.
* Sort to bring identical words together.
* Replace each run of duplicate words with a single representative and include a count (-c).
* Sort in reverse (-r) numeric (-n) order.
* Pass through a stream editor; quit (q) after printing the number of lines designated by the script’s first parameter (${1}).
</textarea>
<script type="text/javascript" src="https://gnab.github.io/remark/downloads/remark-latest.min.js">
</script>
<script type="text/javascript">
var slideshow = remark.create();
</script>
</body>
</html>