# Introduction to ggplot2
**Learning objectives:**
- We are going to learn about the layered grammar of graphics in the {ggplot2} package in R.
- We are going to learn about the key components of every graphic.
## Introduction
```{r,echo=FALSE,warning=FALSE,message=FALSE}
library(png)
library(grid)
library(gridExtra)
img1 <- rasterGrob(as.raster(readPNG("images/grammar-of-graphics.png")),interpolate = FALSE)
img2 <- rasterGrob(as.raster(readPNG("images/ggplot2_logo.png")),interpolate = FALSE)
grid.arrange(img1,img2,ncol=2)
```
Leland Wilkinson (Grammar of Graphics, 1999) formalized two main principles in his plotting framework:
- Graphics = distinct layers of grammatical elements
- Meaningful plots through aesthetic mappings
The essential grammatical elements to create any visualization with {ggplot2} are:
![](images/ge_all.png)
## Data Layer
```{r,warning=FALSE,message=FALSE}
# load data
data(CPS85 , package = "mosaicData")
```
The Data Layer specifies the data being plotted.
![](images/ge_data.png)
```{r}
head(CPS85,n=3)
ggplot2::ggplot(data = CPS85)
```
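As a quick check of what the data layer alone produces, the sketch below inspects the plot object. It uses the built-in `mtcars` data instead of `CPS85` (an assumption, so that the example runs without {mosaicData}), and the name `p_data` is only illustrative. At this stage the plot carries the data but no layers, which is why it renders as an empty panel.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

# mtcars ships with R; it stands in for CPS85 here (assumption)
p_data <- ggplot(data = mtcars)

# the data frame is stored on the plot object
identical(p_data$data, mtcars)

# no geoms have been added yet, so there is nothing to draw
length(p_data$layers)
```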
## Aesthetic Layer
- This involves linking variables in the data to graphical properties of the plot (e.g., **x**, **y**, **color**, **shape**, **size**).
![](images/ge_aes.png)
![](images/common-aesthetics-1.png)
```{r,warning=FALSE,message=FALSE}
# specify dataset and mapping
library(ggplot2)
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage))
```
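The mapping can also be read back from the plot object, which may help when checking which variable ended up on which aesthetic. This is a minimal sketch on the built-in `mtcars` data (an assumption, so that it runs on its own); `p_aes` is an illustrative name, and `rlang::as_label()` is used only to print the mapped variable as text.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

# mtcars stands in for CPS85 (assumption)
p_aes <- ggplot(data = mtcars,
                mapping = aes(x = wt, y = mpg))

# which aesthetics have been mapped
names(p_aes$mapping)

# which variable each aesthetic points to
rlang::as_label(p_aes$mapping$x)
rlang::as_label(p_aes$mapping$y)
```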
## Geometries Layer
The next essential element for data visualization is the geometries layer or geom layer for short.
![](images/ge_geom.png)
- Geoms are the geometric objects (**points**, **lines**, **bars**, etc.) that can be placed on a graph.
```{r,warning=FALSE,message=FALSE}
# add points
ggplot(data = CPS85,
       mapping = aes(x = exper, y = wage)) +
  geom_point()
```
```{r,warning=FALSE,message=FALSE}
# remove the outlier by keeping only wages below 40
library(dplyr)
plotdata <- filter(CPS85, wage < 40)
# redraw scatterplot
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point()
```
```{r,warning=FALSE,message=FALSE}
# make points blue, larger, and semi-transparent
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3)
```
```{r,warning=FALSE,message=FALSE}
# add a line of best fit.
ggplot(data = plotdata,
       mapping = aes(x = exper, y = wage)) +
  geom_point(color = "cornflowerblue",
             alpha = .7,
             size = 3) +
  geom_smooth(method = "lm")
```
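Each geom added with `+` becomes one more layer on the plot, drawn in the order it was added. The sketch below makes that visible by inspecting the plot object; it assumes the built-in `mtcars` data, and `p_layers` is an illustrative name.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

# mtcars stands in for the wage data (assumption)
p_layers <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# one entry per geom, in drawing order
length(p_layers$layers)

# the geom class behind each layer
sapply(p_layers$layers, function(layer) class(layer$geom)[1])
```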
## Statistics Layer
- The statistics layer allows us to plot statistical values calculated from the data.
- It transforms the input variables into the values that are actually displayed.
![](images/ge_stats.png)
```{r,warning=FALSE,message=FALSE}
# fit a separate linear trend for each sex;
# geom_smooth() computes the fitted values with a statistical transformation
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 1.5)
```
![](images/stat_func.png)
![](images/visualization-stat-bar.png)
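A bar chart is perhaps the simplest case of a stat turning input values into displayed values: `geom_bar()` uses `stat_count()` by default, so the bar heights are counts that do not exist as a column in the data. The sketch below assumes the built-in `mtcars` data and uses `layer_data()` from {ggplot2} to look at the computed values; `p_bar` and `bar_values` are illustrative names.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

# only x is mapped; the heights are computed by the stat
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# layer_data() returns the values after the stat has run
bar_values <- layer_data(p_bar)
bar_values[, c("x", "count")]

# the same tally computed by hand
table(mtcars$cyl)
```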
## Coordinates Layer
The coordinates layer allows us to adjust the x and y coordinates.
![](images/ge_coord.png)
- We can adjust the min and max values, as well as the major ticks.
```{r}
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm",
              se = FALSE) +
  coord_cartesian(xlim = c(0, 60), ylim = c(0, 30))
```
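One detail worth checking is that `coord_cartesian()` zooms the view without discarding observations. The sketch below contrasts it with setting `limits` on a scale, a comparison added here for illustration rather than taken from the chapter. It assumes the built-in `mtcars` data and compares the boxplot median for 4-cylinder cars under both approaches; all object names are illustrative.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# zoom with the coordinates layer: all rows still feed the statistics
zoomed <- p_box + coord_cartesian(ylim = c(15, 25))

# limits on the scale: rows outside 15-25 are dropped before the statistics
clipped <- p_box + scale_y_continuous(limits = c(15, 25))

# median of the first box (4 cylinders) under each approach
median_zoomed <- layer_data(zoomed)$middle[1]
median_clipped <- suppressWarnings(layer_data(clipped)$middle[1])

c(raw = median(mtcars$mpg[mtcars$cyl == 4]),
  zoomed = median_zoomed,
  clipped = median_clipped)
```

If the zoomed median matches the raw median while the clipped one does not, the coordinates layer has left the underlying data untouched.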
## Grouping
In addition to mapping variables to the x and y axes, variables can be mapped to the **color**, **shape**, **size**, **transparency**, and other visual characteristics of geometric objects.
- This allows groups of observations to be superimposed in a single graph.
- Let’s add sex to the plot and represent it by color.
```{r,warning=FALSE,message=FALSE}
# indicate sex using color
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 1.5)
```
- It appears that men tend to make more money than women. Additionally, there may be a stronger relationship between experience and wages for men than for women.
## Scales
- Scales control how variables are mapped to the visual characteristics of the plot.
- Scale functions (which start with `scale_`) allow us to modify this mapping.
![](images/scale-guides.png)
```{r,warning=FALSE,message=FALSE}
# modify the x and y axes and specify the colors to be used
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 1.5) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue"))
```
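To see what the scale functions above actually do to the mapping, the sketch below looks at the colours assigned to the points and at the formatter passed to `labels`. It assumes the built-in `mtcars` data, with `factor(am)` standing in for `sex`; `p_scale` is an illustrative name. An unnamed `values` vector is matched to the factor levels in order.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

p_scale <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  scale_color_manual(values = c("indianred3", "cornflowerblue"))

# colours actually assigned to the points, tallied per colour
table(layer_data(p_scale)$colour)

# group sizes in the data, for comparison
table(mtcars$am)

# the formatter turns numeric breaks into dollar strings
scales::dollar(c(0, 5, 10))
```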
## Facets Layer
The facet layer allows us to create subplots within the same graphic object.
![](images/ge_facet.png)
```{r,warning=FALSE,message=FALSE}
# reproduce plot for each level of job sector
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector)
```
- It appears that the differences between men and women depend on the job sector under consideration.
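Faceting splits the observations across panels, one panel per level of the faceting variable. The sketch below shows this on the built-in `mtcars` data (an assumption, with `cyl` standing in for `sector`); `p_facet` is an illustrative name.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

p_facet <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl)

# PANEL records which subplot each point is drawn in
table(layer_data(p_facet)$PANEL)
```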
## Labels
- Graphs should be easy to interpret, and informative labels are a key element in achieving this goal.
- The `labs()` function provides customized labels for the axes and legends. Additionally, a custom title, subtitle, and caption can be added.
```{r,warning=FALSE,message=FALSE}
# add informative labels
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .7) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender")
```
- Now a viewer doesn’t need to guess what the labels `exper` and `wage` mean, or where the data come from.
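The labels set with `labs()` are stored on the plot object, so they can be checked without rendering the plot. This is a minimal sketch on the built-in `mtcars` data (an assumption); `p_labs` and the label texts are illustrative.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

p_labs <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Fuel economy and weight",
       x = "Weight (1000 lbs)",
       y = "Miles per gallon")

# labels supplied through labs() replace the raw variable names
p_labs$labels$title
p_labs$labels$x
p_labs$labels$y
```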
## Themes
The themes layer refers to all non-data ink.
![](images/ge_themes.png)
- Finally, we can fine tune the appearance of the graph using themes.
- Theme functions (which start with theme_) control background colors, fonts, grid-lines, legend placement, and other non-data related features of the graph.
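A complete theme such as `theme_minimal()` can be combined with targeted overrides through `theme()`, for example to move the legend. The sketch below assumes the built-in `mtcars` data; `custom_theme` and `p_theme` are illustrative names.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

# a complete theme plus one targeted override
custom_theme <- theme_minimal() +
  theme(legend.position = "bottom")

# the override is stored as a theme element
custom_theme$legend.position

# apply the combined theme to a plot
p_theme <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  custom_theme
p_theme
```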
```{r,warning=FALSE,message=FALSE}
# use a minimalist theme
ggplot(data = plotdata,
       mapping = aes(x = exper,
                     y = wage,
                     color = sex)) +
  geom_point(alpha = .6) +
  geom_smooth(method = "lm",
              se = FALSE) +
  scale_x_continuous(breaks = seq(0, 60, 10)) +
  scale_y_continuous(breaks = seq(0, 30, 5),
                     labels = scales::dollar) +
  scale_color_manual(values = c("indianred3",
                                "cornflowerblue")) +
  facet_wrap(~sector) +
  labs(title = "Relationship between wages and experience",
       subtitle = "Current Population Survey",
       caption = "source: http://mosaic-web.org/",
       x = "Years of Experience",
       y = "Hourly Wage",
       color = "Gender") +
  theme_minimal()
```
- Now we have something. It appears that men earn more than women in management, manufacturing, sales, and the “other” category.
- They are most similar in **clerical**, **professional**, and **service positions**. The data contain no women in the **construction sector**. For management positions, wages appear to be related to experience for men, but not for women (this may be the most interesting finding). This also appears to be true for sales.
These findings are tentative. They are based on a limited sample size and do not involve statistical testing to assess whether differences may be due to chance variation.
## Placing the data and mapping options
Plots created with ggplot2 always start with the **`ggplot()` function**. In the examples above, the data and mapping options were placed in this function, so they apply to every `geom_` function that follows.
We can also place these options directly within a geom. In that case, they apply only to that specific geom.
```{r,warning=FALSE,message=FALSE}
# placing color mapping in the ggplot function
ggplot(plotdata,
       aes(x = exper,
           y = wage,
           color = sex)) +
  geom_point(alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 2),
              se = FALSE,
              linewidth = 1.5)
```
- Since the mapping of sex to color appears in the `ggplot()` function, it applies to both `geom_point()` and `geom_smooth()`. The color of each point indicates the sex, and a separate colored trend line is produced for men and women. Compare this to:
```{r,warning=FALSE,message=FALSE}
# placing color mapping in the geom_point function
ggplot(plotdata,
       aes(x = exper,
           y = wage)) +
  geom_point(aes(color = sex),
             alpha = .7,
             size = 3) +
  geom_smooth(method = "lm",
              formula = y ~ poly(x, 2),
              se = FALSE,
              linewidth = 1.5)
```
- Since the sex to color mapping only appears in the geom_point function, it is only used there. A single trend line is created for all observations.
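The difference between the two placements can also be confirmed from the computed data, by counting how many separate trend lines the smooth layer fits. The sketch below assumes the built-in `mtcars` data, with `factor(am)` standing in for `sex`; `p_global` and `p_local` are illustrative names.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

# colour mapped in ggplot(): inherited by both geoms
p_global <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(am))) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# colour mapped only in geom_point(): geom_smooth() does not see it
p_local <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = factor(am))) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# number of separate trend lines computed by the smooth layer (layer 2)
length(unique(layer_data(p_global, 2)$group))
length(unique(layer_data(p_local, 2)$group))
```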
## Graphs as objects
A ggplot2 graph can be saved as a named R object (like a data frame), manipulated further, and then printed or saved to disk.
```{r,warning=FALSE,message=FALSE}
# prepare data
data(CPS85 , package = "mosaicData")
plotdata <- CPS85[CPS85$wage < 40,]
# create scatterplot and save it
myplot <- ggplot(data = plotdata,
                 aes(x = exper, y = wage)) +
  geom_point()
# print the graph
myplot
# make the points larger and blue
# then print the graph
myplot <- myplot + geom_point(size = 3, color = "blue")
myplot
# print the graph with a title and line of best fit
# but don't save those changes
myplot + geom_smooth(method = "lm") +
  labs(title = "Mildly interesting graph")
# print the graph with a black and white theme
# but don't save those changes
myplot + theme_bw()
```
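Two points from the paragraph above can be sketched briefly: adding to a plot returns a new object and leaves the original unchanged, and a plot object can be written to disk with `ggsave()`. The example assumes the built-in `mtcars` data; the object names and the temporary file path are illustrative only.

```{r,warning=FALSE,message=FALSE}
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# adding a layer returns a new object; base_plot itself is unchanged
with_line <- base_plot + geom_smooth(method = "lm", formula = y ~ x)
length(base_plot$layers)
length(with_line$layers)

# write the plot to a temporary PNG file (path chosen only for illustration)
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = with_line, width = 6, height = 4)
file.exists(out_file)
```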
## Resources
- [ggplot2 Book](https://ggplot2-book.org/introduction.html)
- [ggplot2 Cheatsheet](https://github.com/rstudio/cheatsheets/blob/main/data-visualization-2.1.pdf)
- [R Graph Gallery](https://r-graph-gallery.com)
- [R Graphics Cookbook](https://r-graphics.org)
- [ggplot2 Extensions Gallery](https://exts.ggplot2.tidyverse.org/gallery/)
- [Introduction to the Grammar of Graphics](https://murraylax.org/rtutorials/gog.html)
## Meeting Videos {-}
### Cohort 1 {-}
`r knitr::include_url("https://www.youtube.com/embed/pDdDjnYmHAs")`
<details>
<summary> Meeting chat log </summary>
```
00:05:30 Lydia Gibson: I can’t hear your audio that well
00:56:33 Oluwafemi Oyedele: https://exts.ggplot2.tidyverse.org/gallery/
00:56:42 Oluwafemi Oyedele: https://r-graph-gallery.com/
00:56:43 Kotomi Oda: Thank you for the presentation!
00:56:48 Oluwafemi Oyedele: https://r-graphics.org/
00:56:58 Oluwafemi Oyedele: https://ggplot2-book.org/introduction.html
00:57:05 Lydia Gibson: https://docs.google.com/spreadsheets/d/1yrXUdZ95upU3kISocqinvaDh-U2JK3mnjhtj4Fr27H8/edit#gid=0
```
</details>