-
Notifications
You must be signed in to change notification settings - Fork 2
/
index.Rmd
240 lines (181 loc) · 10.7 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "dataRbeautiful: Part 1"
author: "by [Matthew J. Oldach](https://github.com/moldach/) - `r format(Sys.time(), '%d %B %Y')`"
output:
html_document:
css:
number_sections: FALSE
includes:
before_body: header.html
after_body: footer.html
---
> In this series of posts I will set out to recreate some of the visualizations from the book "Knowledge is Beautiful" by David McCandless in R.
David McCandless is author of two bestselling infographics books and gave a [TED talk about data visualization](http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization). I bought his second book ["Knowledge is Beautiful"](https://informationisbeautiful.net/2014/knowledge-is-beautiful/), in 2015, which contains 196 beautiful infographics.
At that time, I was really into **The Walking Dead**, and his book inspired me to make my own infographic:
![](C:/Users/Matthew/Documents/WalkingDead_infographic.jpg)
Recently, I was trying to think of some fun data visualization projects and decided to choose a couple from the book that could be recreated as close to possible in `R`.
The book is an excellent resource for those who like these sorts of exercises as **every single visualization in the book is paired with an online dataset to** [explore at your interest!!!](http://www.informationisbeautiful.net/data/). I never knew how rich the datasets were until I tried to recreate my first visualization, "**Best in Show**". The dataset for [**Best in Show**](bit.ly/KIB_BestDogs) alone, is an excel file with eight sheets!
McCandles says the whole book took him 15,832 hours over two-years and I don't doubt it. Whipping up a quick EDA plotcan be fast and simple for a meeting if you only care about actionable results. However, if your publishing something it takes some time to create a stunning visual.
# Best in Show
***
**Best in Show** is a scatter-plot of dog silhouettes, color-coded based on the category of dog, sized accordingly, and pointing either left-or-right, depending on their intelligence.
Let's load the environment; I like the `needs()` package which makes it simple to install & load packages into `R`.
```{r setup, message = FALSE}
knitr::opts_chunk$set(echo = TRUE)
if (!require("needs")) {
install.packages("needs", dependencies = TRUE)
library(needs)
}
```
As mentioned above, the data is an excel file with eight sheets. We can read in the Excel file with `read_excel()` and specify the sheet we would like to access with the `sheet =` argument.
```{r}
# install&Load readxl package
needs(here,
readxl,
stringr,
dplyr, # to use select()
magrittr) # to use %<>% operators
path = here()
# Specify the worksheet by name
dog_excel <- read_excel("~/bestinshow.xlsx", sheet = "Best in show full sheet", range = cell_rows(3:91))
# Pick the desired variables
dog_excel %<>%
select(1,3,5:6, "intelligence category", "size category")
# Rename the columns
colnames(dog_excel) <- c("breed", "category", "score", "popularity", "intelligence", "size")
# Remove first row (non-descript column names)
dog_excel <- dog_excel[-1,]
```
Take a look at the HTML table (supports filtering, pagination, and sorting).
```{r}
needs(DT)
datatable(dog_excel, options = list(pageLength = 5))
```
Take a look at the `intelligence` levels in the dataset.
```{r}
# What are the intelligence levels?
unique(dog_excel$intelligence)
```
McCandless' classified dogs `intelligence` as either "dumb" or "clever". But here we see there are categories in this *clean* dataset. So let's assign the top three factors as `clever` and the other three as `dumb`
```{r}
dog_excel$intelligence %<>%
str_replace(pattern = "Brightest", replace = "clever", .) %>%
str_replace(pattern = "Above average", replace = "clever", .) %>%
str_replace(pattern = "Excellent", replace = "clever", .) %>%
str_replace(pattern = "Average", replace = "dumb", .) %>%
str_replace(pattern = "Fair", replace = "dumb", .) %>%
str_replace(pattern = "Lowest", replace = "dumb", .)
```
There are 87 dogs in the original visualization. I found 24 silhouettes under an [**Creative Commons Attribution-Share Alike 4.0 License**](https://creativecommons.org/licenses/by-sa/4.0/) on [**SuperColoring.com**](http://www.supercoloring.com). This means I can freely copy and redistribute in any medium or format as long as I give a link to the webpage and indicate the author's name and the license. I've included that information in a `.csv.` and we will use this to subset McCandless's data.
```{r, message = FALSE}
needs(readr)
dog_silhouettes <- read_csv("~/dog_silhouettes.csv")
dog_df <- dog_excel %>%
inner_join(dog_silhouettes, by = "breed")
# change popularity from string into numeric values
dog_df$popularity <- as.numeric(dog_df$popularity)
```
We need to use the `magick` package to make the white background surrounding the `.svg` silhouettes transparent, scale images to a common size (by width), and save them as `.png`'s.
```{r}
needs(magick)
file.names <- dir(path, pattern = ".svg")
for (file in file.names){
# read in the file
img <- image_read_svg(file)
# scale all images to a common scale
img_scaled <- image_scale(img, 700)
# make the background transparent
img_trans <- image_transparent(img_scaled, 'white')
# get rid of .svg ending
file_name <- str_replace(file, pattern = ".svg", replace = "")
# write out the file as a .png
image_write(img_trans, paste0(file_name, ".png"))
}
```
Some of the dog silhouette's are pointing their heads in the opposite direction. We need to use imag_flop() from the `magick` package so that all are facing the same direction. Later we can subset the data frame by intelligence so `clever` and `dumb` dogs face opposite directions.
```{r}
path = here()
flop.images <- c("~/labrador-retriever-black-silhouette.png", "~/border-terrier-black-silhouette.png", "~/boxer-black-silhouette.png", "~/french-bulldog-black-silhouette.png", "~/german-shepherd-black-silhouette.png", "~/golden-retriever-black-silhouette.png", "~/greyhound-black-silhouette.png", "~/rottweiler-black-silhouette.png")
for(i in flop.images){
i <- str_replace(i, pattern = "~/", replace = "")
img <- image_read(i)
img_flop <- image_flop(img)
image_write(img_flop, i)
}
```
The next step is to color the dogs according to category.
Let's write a function that can be given three arguments: 1) `df` a data frame, 2) `category` the type of dog it is, 3) `color` a specific color for each `category`.
The first step is to select only those dogs which belong to the `category` of interest. This can be done with `filter()` from the `dplyr` package. However, if you want to use `dplyr` functions, like `filter()`, you need to [follow the instruction at this website](https://rpubs.com/hadley/dplyr-programming); set with `enquo`.
```{r}
img_color <- function(df, category, color){
# Make filter a quosure
category = enquo(category)
# subset df on category
new_df <- df %>%
filter(category == !!category)
# get directory paths of images for the for loop
category_names <- new_df$breed %>%
tolower() %>%
str_replace_all(" ", "-") %>%
paste0("-black-silhouette.png")
for(name in category_names){
img <- image_read(name)
img_color <- image_fill(img, color, "+250+250")
image_write(img_color, name)
}
}
```
I wanted to replicate the figures as close as possible within R so to replicate the colors of the visualizations I scanned the book, saved the images, and then used this [tool to get the html color codes](https://html-color-codes.info/colors-from-image/).
Now let's Use the created function above to color dog silhouettes according to their `category`.
```{r}
# Herding color "#D59E7B"
img_color(dog_df, "herding", "#D59E7B")
# Hound color "#5E4D6C"
img_color(dog_df, "hound", "#5E4D6C")
# Non-sporting color "#6FA86C"
img_color(dog_df, "non-sporting", "#6FA86C")
# Sporting color "#B04946"
img_color(dog_df, "sporting", "#B04946")
# Terrier color "#A98B2D"
img_color(dog_df, "terrier", "#A98B2D")
# Toy color "#330000"
img_color(dog_df, "toy", "#330000")
# Working color "#415C55"
img_color(dog_df, "working", "#415C55")
```
Okay it's finally time to make the visualization.
Usually one plots points with `geom_point()` from `ggplot2` but in this case I want images for each of the breed's instead. We can use the `ggimage` package, and with a little tweaking we can flop images based on `intelligence`. `ggimage` does not support color as an aesthetic like `ggplot2` which is why I manually assigned colors & sizes earlier.
Since `popularity` scores range from 1 to 140 with 1 being the **most** popular we will need to reverse the y-axis with `scale_y_reverse()`.
```{r}
needs(ggplot2,
ggimage)
# add "~/" to get filenames of images for plotting
dog_df$name <- paste0("~/", dog_df$name)
# create a ggplot/ggimage object
p <- ggplot(subset(dog_df, intelligence == "clever"), aes(x = score, y = popularity, image = name), alpha = 0.5) + geom_image(image_fun = image_flop) + geom_image(data=subset(dog_df, intelligence == "dumb")) +
labs(title = "Best in Show", subtitle = "The ultimate datadog", caption = "Source: bit.ly/KIB_BestDogs") +
labs(x = NULL, y = NULL) +
theme(panel.background = element_blank(),
legend.position = "top",
legend.box = "horizontal",
plot.title = element_text(size = 13,
# I'm not sure what font he chose so I'll pick something I think looks similar
family = "AvantGarde",
face = "bold",
lineheight = 1.2),
plot.subtitle = element_text(size = 10,
family = "AvantGarde"),
plot.caption = element_text(size = 5,
hjust = 0.99),
axis.text = element_blank(),
axis.ticks = element_blank()) +
scale_y_reverse()
```
The final step is to add text annotations underneath the dog breeds. Since I subset `dog_df` by `intelligence` if I try to annotate with `geom_text()` it will only annotate part of the data. We will need to the `annotate()` function instea since geome are not mapped from variables of a data frame, but are instead passed in as vectors.
```{r, warning = FALSE}
# Add annotations
p + annotate("text", x=dog_df$score[1:24], y=((dog_df$popularity[1:24])+6), label = dog_df$breed[1:24], size = 2.0)
```
The visualization looks similar to the original and highlights most of the aesthetics that were included. One could always add the additional details in Adobe Illustrator or Inkscape to make it look more like [the final visualization](https://informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/)
![](C:/Users/Matthew/Documents/best-in-show.png)
Stay tuned for Part II!