-
Notifications
You must be signed in to change notification settings - Fork 15
/
join.qmd
115 lines (84 loc) · 3.11 KB
/
join.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
---
title: "Join two data frames"
date-modified: 'today'
date-format: long
license: CC BY-NC
bibliography: references.bib
---
```{r}
#| warning: false
#| message: false
library(tidyverse)
```
## Join
There are a series of [join commands](https://dplyr.tidyverse.org/reference/index.html#section-two-table-verbs)
- left_join, inner_join, right_join, full_join,
- semi_join, anti_join
![{dplyr} joins visualized in venn diagrams](images/join_diagram.png)
First let's import the favorability ratings data from fivethirtyeight.com
## data
These exercises use the following [`ggplot2` training datasets](https://ggplot2.tidyverse.org/reference/index.html#section-data)
- dplyr::starwars
- Data from fivethrityeight.org (modified)
```{r}
#| message: false
#| warning: false
fav_ratings <- read_csv("data/538_favorability_popularity.csv", skip = 11)
fav_ratings
```
```{r}
starwars
```
Join on a **key** that is common across two data frames. For best results use a numeric key that promotes precision. Unfortunately our data frames do not have a numeric key to use as our join key. We'll use an alphabetic key, `name`, and this will highlight what can go wrong. along with troubleshooting strategies.
The `name` variable is the key because it is common to both tables: `fav_ratings` and `starwars`.
::: column-margin
```{=html}
<iframe width="300" height="200" src="https://www.youtube.com/embed/GJUcnEV_6O0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
```
:::
```{r}
fav_ratings |>
left_join(starwars, by = join_by(name)) |>
arrange(desc(fav_rating))
```
### anti_join
anti_join will determine data values that are in the left table and not the right table.
```{r}
anti_join(fav_ratings, starwars) |> arrange(name)
anti_join(starwars, fav_ratings) |> arrange(name)
```
### semi_join or inner_join
Join only the rows that match.
```{r}
fav_ratings |>
semi_join(starwars)
```
### Regex
Regular expressions (regex) and {stringr} can be handy when manipulating character variables into join keys. We use regex when troubleshooting what is not matching. In other computatons we might also use regex for more sophisticated text manipulations, specifically for finding patterns in text.
- str_detect()
- str_to_lower()
- str_remove()
- str_trim()
- str_extract()
::: column-margin
```{=html}
<iframe width="300" height="200" src="https://www.youtube.com/embed/UCrYhpGT_B0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
```
:::
```{r}
starwars %>%
filter(str_detect(name,
regex("3p|palpatine|obi|amidala|leia|d2",
ignore_case = TRUE))) %>%
arrange(name)
```
```{r}
fav_ratings %>%
filter(str_detect(name,
regex("3p|palpatine|obi|amidala|leia|d2",
ignore_case = TRUE))) %>%
arrange(name)
```
## See Also
- dplyr::**bind_rows**()
- dplyr::**bind_cols**()