An introduction to the UNIX command line for munging data.
(Source on github, made with remark)
- small, sharp tools
- that can be chained together
- to solve complex problems
- goes back to the introduction of the pipe and grep in 1973
where simple means: providing well thought out functionality, implemented to perfection (see grep algorigthm and implementation)
- use the up-arrow (and down-arrow) to go back to previouse commands
- use the tabulator key for file name completion: type just the first letter of a filename, then try tab
- use CTRL-A to move your cursor to the beginning of the line
- use CTRL-E to move your cursor to the end of the line
- use CTRL-K to delete from the cursor to the end of the line
- !! repeats the last command
- !$ repeat the last word of the last command
$ mkdir this_long_directory_name
$ cd !$
first, let's look at the size:
$ ls -l parole
-rw-r----- 1 bjelline admin 2286089 Feb 25 12:13 parole
$ ls -lh parole
-rw-r----- 1 bjelline admin 2.2M Feb 25 12:13 parole
you don't know what might be in the file? guess the file type:
$ file parole
parole: ASCII text
if it's text, we can take a peek with less.
$ less parole
use key commands in less (in many contexts)
- q to quit
- SPACE to page one down
- CTRL-B to go back one page
- CTRL-F to go back one page
- G to jump to the end
- CTRL-G to find out where you a (line number + percent)
- 50% jump to middle of file (works for any number 0-100)
- /cookie to search for cookie
- n to jump to the next occurance
- b to jump back to the previous occurance
$ wc -l parole
24291 parole
See also man wc
now we know that 'parole' is a csv file let's rename it:
$ mv parole parole.csv
See also man mv
find certain lines, get rid of certain files
grep will look for a string in each line, and print the line if the string is contained.
$ grep TRAMA parole.csv
If there is a space in your search string you must quote it:
$ grep 'JAMES, MARK' parole.csv
grep is case sensitive, use -i to make it case insensitive:
$ grep -i trama parole.csv
the option v turns grep around: all lines are printed, except those matching the pattern. this way you can get rid of certain files
$ grep -v ....
to build more complex search, with search patterns, use egrep
$ egrep 'TRAMA|TENNEY' parole.csv
unsorted stuff
$ cat names.txt
DUQUIN, JON
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
BROWN, STEPHANIE
BROWN, WILLIAM
BOATWRIGHT, MAURICE
PAYNE, RAYMOND
DUQUIN, JON
BROWN, WILLIAM
MCKINNEY, ROSS B
MORALES, VICTOR
DUQUIN, JON
MCALPINE, ERIC M
BYNUM, STEVEN
sorting
$ sort names.txt
BOATWRIGHT, MAURICE
BROWN, STEPHANIE
BROWN, WILLIAM
BROWN, WILLIAM
BYNUM, STEVEN
DUQUIN, JON
DUQUIN, JON
DUQUIN, JON
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
MCALPINE, ERIC M
MCKINNEY, ROSS B
MORALES, VICTOR
PAYNE, RAYMOND
$ sort names.txt | uniq
BOATWRIGHT, MAURICE
BROWN, STEPHANIE
BROWN, WILLIAM
BYNUM, STEVEN
DUQUIN, JON
MCALPINE, ERIC M
MCKINNEY, ROSS B
MORALES, VICTOR
PAYNE, RAYMOND
$ sort names.txt | uniq -c
1 BOATWRIGHT, MAURICE
1 BROWN, STEPHANIE
2 BROWN, WILLIAM
1 BYNUM, STEVEN
3 DUQUIN, JON
4 MCALPINE, ERIC M
1 MCKINNEY, ROSS B
1 MORALES, VICTOR
1 PAYNE, RAYMOND
$ sort names.txt | uniq -c | sort -n
1 BOATWRIGHT, MAURICE
1 BROWN, STEPHANIE
1 BYNUM, STEVEN
1 MCKINNEY, ROSS B
1 MORALES, VICTOR
1 PAYNE, RAYMOND
2 BROWN, WILLIAM
3 DUQUIN, JON
4 MCALPINE, ERIC M
$ sort names.txt | uniq -c | sort -rn
4 MCALPINE, ERIC M
3 DUQUIN, JON
2 BROWN, WILLIAM
1 PAYNE, RAYMOND
1 MORALES, VICTOR
1 MCKINNEY, ROSS B
1 BYNUM, STEVEN
1 BROWN, STEPHANIE
1 BOATWRIGHT, MAURICE
$ sort names.txt | uniq -c | sort -rn | head -3
4 MCALPINE, ERIC M
3 DUQUIN, JON
2 BROWN, WILLIAM
Here we use the greater than sign for output-redirection to a file.
This will overwrite the file if it already exists, so be careful when to use it!
$ grep DENIED parole.csv > parole-denied.csv
if you want to add to an existing file use double greater than signs:
$ grep GRANTED parole.csv > parole-granted.csv
$ grep PAROLED parole.csv >> parole-granted.csv
Using redirection you can get any program to write to a file.
How does this work? The UNIX command line tools send their output to a "channel" called "standard output" (or STDOUT, stdout in different programming languages). the redirection is the general way of sending this channel to a file.
See also man bash, search for REDIRECTION
How many people were denied parole? we just need to count the lines:
$ grep DENIED parole.csv > parole-denied.csv
$ wc -l parole-denied.csv
But there is no need to create the file at all: we can creat a pipeline with the vertical bar symbol. The pipeline connects the output channel of grep to the input channel of wc:
$ grep DENIED parole.csv | wc -l
The vertical bar is often called "pipe", so you would read this command as: grep denied parole.csv pipe wordcount minus L
Using the pipe you can create complex programs without writing any permanent code.
When building pipes and working with large data sets, always put "| less" at the end to take a peek.
$ sort names.txt | less
$ sort names.txt | uniq -c | less
$ sort names.txt | uniq -c | sort -n | less
$ sort names.txt | uniq -c | sort -rn | less
$ sort names.txt | uniq -c | sort -rn | head -30 | less
cut, paste
There are the commands 'cut' and 'paste' that you might want to use to handle comma-separated-values files. But beware: as soon as you have commas inside one of the columns cut will not work for you!
Here csvkit comes in: csvkit knows about strings and escaping in csv and will cut out the correct columns:
$ csvcut -c1,2 parole.csv
Now we can use the power of pipelines to learn more about a csv file
$ csvcut -c1,2 parole.csv | sort .....
use jq for querying json files:
$ less example.json
{
"status": "past",
"rating": {
"count": 1,
"average": 5
},
"utc_offset": -18000000,
"group": {
"name": "SOME CLUB",
"join_mode": "approval",
"who": "Members",
"id": 12345
},
"description": "<p>we haven an <b>OPEN HOUSE</b>",
"created": 1383331008000,
"updated": 1383930450000,
use jq for querying json files:
$ cat example.json | jq '.'
{
"status": "past",
"rating": {
"count": 1,
"average": 5
},
"utc_offset": -18000000,
"group": {
"name": "SOME CLUB",
"join_mode": "approval",
"who": "Members",
"id": 12345
},
"description": "<p>we haven an <b>OPEN HOUSE</b>",
"created": 1383331008000,
"updated": 1383930450000,
use jq for querying json files:
$ cat example.json | jq '{id: .id, rating: .rating.average}'
{
"id": "148644612",
"rating": 5
}
use jq for querying json files:
$ cat array.json
[
{
"status": "past",
"rating": {
"count": 1,
"average": 5
},
"yes_rsvp_count": 8,
"headcount": 0,
"id": "148644612"
},
{
"status": "past",
"rating": {
"count": 0,
"average": 0
},
use jq for querying json files:
$ cat array.json | jq '.[]|{id: .id, rating: .rating.average}' | less
{
"id": "148644612",
"rating": 5
}
{
"id": "150931712",
"rating": 4.75
}
see list of file with unzip -v before uncompressing
$ unzip -v groups_new.zip
Archive: groups_new.zip
Length Method Size Cmpr Date Time CRC-32 Name
-------- ------ ------- ---- ---------- ----- -------- ----
0 Stored 0 0% 10-24-2014 12:50 00000000 groups_new/
912939 Defl:N 151623 83% 10-24-2014 10:34 d331e014 groups_new/category_1.json
1308266 Defl:N 226343 83% 10-24-2014 10:44 51ae53a1 groups_new/category_10.json
667660 Defl:N 112417 83% 10-24-2014 10:44 c3cbb3d4 groups_new/category_11.json
596065 Defl:N 94715 84% 10-24-2014 10:44 9feb58bf groups_new/category_12.json
...
336000 Defl:N 51356 85% 10-24-2014 10:44 0b4d4b64 groups_new/category_8.json
1545754 Defl:N 247832 84% 10-24-2014 10:44 c95bd009 groups_new/category_9.json
-------- ------- --- -------
40093888 6725152 83% 36 files
see content of files with unzip -c before uncompressing
$ unzip -c groups_new.zip | less
Archive: groups_new.zip
extracting: groups_new/
inflating: groups_new/category_1.json
[
{
"category": {
"shortname": "arts-culture",
"name": "fine arts/culture"
},
"city": "New York",
"rating": 4.52,
"description": "<p>discussing books chosen by the members of our group.</p>",
"join_mode": "approval",
"country": "US",
"who": "Book Lovers",
uncompress
$ unzip groups_new.zip
Archive: groups_new.zip
creating: groups_new/
inflating: groups_new/category_1.json
inflating: groups_new/category_10.json
inflating: groups_new/category_11.json
inflating: groups_new/category_12.json
inflating: groups_new/category_13.json
inflating: groups_new/category_14.json
inflating: groups_new/category_15.json
extracting: groups_new/category_7.json
inflating: groups_new/category_8.json
inflating: groups_new/category_9.json
---
## compressed files: gz
a .gz file contains one compressed file only.
look at the contents of that file
```bash
$ zcat some_log.1.gz
tgz is short for .tar.gz: the tar file contains sequence of several files (no compression), the .gz does the compression.
look at the list of files:
$ tar tvfz data.tgz
uncompress
$ tar xvfz data.tgz
- (binary) Database Dump
- SQL Dump, e.g. pg_dump, pg_restore
- Load CSV from client, e.g. \COPY
- Load CSV from server, e.g. COPY
- Load CSV as foreign table
$ csvcut -c3 parole.csv |cut -c1-2|
$ csvcut -c3 parole.csv |cut -c1-2| awk '{ if (int($0) > 20) print "19$0"; else print "20$0";'
$ echo "YEAR INCARCERATION" > year_of_incarceration.txt
$ csvcut -c3 parole.csv |cut -c1-2| grep -v DI| perl -n -e 'if($_>16) { print "19$_" } else { print "20$_" }' >> year_of_incarceration.txt
$ csvcut -c5 parole.csv | cut -d\/ -f 3 | sed -e 's/DATE/YEAR/' |head -10 > birth_year.txt
$ csvcut -c11,12 parole-3.csv | perl -n -e '($in,$birth) = split /,/; if($.==1){print "AGE AT INCARCERATION\n"} else { print $in-$birth, "\n"}' > age_at_incarceration.txt
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q
McIlroy’s short, impossible-to-misunderstand explanation:
If you are not a UNIX adept, you may need a little explanation, but not much, to understand this pipeline of processes. The plan is easy:
- Make one-word lines by transliterating the complement (-c) of the alphabet into newlines (note the quoted newline), and squeezing out (-s) multiple newlines.
- Transliterate upper case to lower case.
- Sort to bring identical words together.
- Replace each run of duplicate words with a single representative and include a count (-c).
- Sort in reverse (-r) numeric (-n) order.
- Pass through a stream editor; quit (q) after printing the number of lines designated by the script’s first parameter (${1}).