Skim through this article: https://fileinfo.com/help/binary_vs_text_files
- Text
- Text formats can be opened in any text editor (like Notepad, Sublime, TextEdit, etc.)
- Text formats are easy to version control as git can clearly see what changed between two versions of a file.
- Binary
- Binary formats cannot be opened in a text editor and often require a custom program (Acrobat Reader for pdf, Microsoft Excel for xlsx, Photoshop for psd, etc.)
- Binary formats are a bit more difficult to version control. The reason Git is so efficient is that it does not need to store a copy of every single version of a file but instead it can store just the changes (the lines of text that have changed) between two versions of a file. When working with binary files, git cannot do this.
A simple format that can be viewed in excel. Really easy to see in text editor, contains only one table.
This is what a csv looks like. The first line is the headers and each line there after is a row where the columns are separated by a comma.
date,close
22-Jan-17,46
22-Jan-17,45
23-Jan-17,45
24-Jan-17,42.1
24-Jan-17,46
24-Jan-17,57
25-Jan-17,42.3
25-Jan-17,36
26-Jan-17,42.6
When you open this in excel, the table will end up looking like this:
date | close |
---|---|
22-Jan-17 | 46.0 |
22-Jan-17 | 45.0 |
23-Jan-17 | 45.0 |
24-Jan-17 | 42.1 |
24-Jan-17 | 46.0 |
24-Jan-17 | 57.0 |
25-Jan-17 | 42.3 |
25-Jan-17 | 36.0 |
26-Jan-17 | 42.6 |
Same as CSV, but separates values with tabs rather than commas, can also be opened in excel.
This is what a TSV looks like. Note how the commas are replaced with the tab character.
date close
22-Jan-17 46
22-Jan-17 45
23-Jan-17 45
24-Jan-17 42.1
24-Jan-17 46
24-Jan-17 57
25-Jan-17 42.3
25-Jan-17 36
26-Jan-17 42.6
The table will render in the same way as the CSV in excel.
We'll dig into this momentarily.
XML is an older data format that started in the late 1990's. It uses the idea of a "tree" to represent data. If you're familar with HTML (hypertext markup language), XML is a superset of HTML. Instead of having a defined set of "tags" such as body
, head
title
, etc. XML allows one to define their own arbitrary tags.
For example, in the snippet below, we have a tag defined as note
. Each note
contains to
, from
, heading
, and body
.
<?xml version="1.0" encoding="UTF-8"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
If we wanted to represent a "list", we just put the same tag multiple times. For example, if we wanted the above note to be for multiple people, we could add an additional to
tag within the note
.
These days, JSON is much more popular than XML. However, you may still run into XML data in the wild since it reigned supreme for decades.
Here is an example of the House of Representatives providing data via XML: https://xml.house.gov/ Or a more international standard: http://www.akomantoso.org/?page_id=25
PDF data is hard to extract. If you have PDFs containing tables, you may be able to copy and paste the tables into a text file or into excel.
I have had success extracting data from PDFs with off the shelf tools when either (1) the PDF contains tables or (2) you have many PDFs that are exactly the same in structure with different data (for example many copies of a form). In these two cases, I'd consider trying
On macOS, install Tabula by running brew install tabula
.
- It also uses the idea of a "tree" to represent the structure of contents on a web page.
- Copy/Paste - try this first, sometimes you can just copy an HTML table and paste directly into excel then save as CSV
- import.io (and other similar tools) is especially good with extracting tables from websites, but can also extract data from other parts of a website using CSS selectors.
- Binary format
- Demo: Unzip roster.xlsx with Unarchiver
https://github.com/unitedstates/congress-legislators
- Kirsten's Slides on Government Data (slide 50)
- https://5stardata.info/en/
- http://opendatahandbook.org/