-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathhw1_solutions.Rmd
237 lines (168 loc) · 8.22 KB
/
hw1_solutions.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
---
title: "Homework 1: Transforming data"
author: "your name here"
date: Due 2015-02-05
output: html_document
---
Topics covered in this homework include:
- dplyr and the five verbs
- working with factors
- third normal form
- tidy data
(@) **Please calculate 2+2 in the space below.**
```{r}
2+2
```
### Set-up the soccer data
The code below clears memory and then loads dplyr and the soccer data.
```{r echo=FALSE}
rm(list=ls())
suppressPackageStartupMessages(library(dplyr))
load(url("http://www.princeton.edu/~mjs3/soc504_s2015/CrowdstormingDataJuly1st.RData"))
soccer.data <- tbl_df(soccer.data)
```
### A robustness check
In lab, we calculated the rate of red cards for players of different skin tone. Now, we are going to see how robust our conclusions were to some of the choices that we made in the analysis. In particular, it is important to know that in soccer there are actually two ways to get a red card: a direct red card and getting two yellow cards (which equals one red card).
(@directreds) **Create a table like the one were made in lab where the outcome of interest is rate of direct red cards.**
```{r}
soccer.data %>%
filter(!(is.na(rater1) | is.na(rater2))) %>%
mutate(skin.color = (rater1 + rater2) / 2) %>%
group_by(skin.color) %>%
summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
mutate(red.rate=total.reds / total.games)
```
(@allreds) **Imagine that you submitted the table above in a paper (Of course, in a real paper you would create a graph, but we have not learned `ggplot2` yet.) Create a table like the one above but where the outcome of interest is rate of all forms of red cards (direct red cards + two yellow cards). The column red.cards is direct red cards and the column yellow.reds is the red cards that result from two yellow cards.**
```{r}
soccer.data %>%
filter(!(is.na(rater1) | is.na(rater2))) %>%
mutate(skin.color = (rater1 + rater2) / 2) %>%
mutate(all.reds = red.cards + yellow.reds) %>%
group_by(skin.color) %>%
summarise(total.all.reds = sum(all.reds), total.games = sum(games)) %>%
mutate(red.rate=total.all.reds / total.games)
```
(@) In words, compare your answers in questions @directreds and @allreds. Do this choice make a difference?
```{answer}
It does not seem to make a difference if you use only direct red cards or all types of red cards.
```
### Looking at subsets of the data, by country
Imagine that you presented these results at ASA, and an audience member speculated the relationship between skin tone and red cards would be different in the different soccer leagues.
(@byleague) **Create a table that shows, for each league, the rate of red cards by skin color. In this case, please use direct red cards (red.cards) as you outcome.**
```{r}
soccer.data %>%
filter(!(is.na(rater1) | is.na(rater2))) %>%
mutate(skin.color = (rater1 + rater2) / 2) %>%
group_by(league.country, skin.color) %>%
summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
mutate(red.rate=total.reds / total.games)
# it is really hard to see much in that table, we can also make a graph
library(ggplot2)
soccer.data %>%
filter(!(is.na(rater1) | is.na(rater2))) %>%
mutate(skin.color = (rater1 + rater2) / 2) %>%
group_by(league.country, skin.color) %>%
summarise(total.reds = sum(red.cards), total.games = sum(games)) %>%
mutate(red.rate=total.reds / total.games) %>%
ggplot( aes(x=skin.color, y=red.rate, color=league.country, size=total.games)) + geom_point() + stat_smooth(method = "loess", se=FALSE)
```
(@) **In words, what would you conclude from your response to @byleague?**
```{answer}
The overall rate of red cards is different in the different leagues, but the relationship between skin color and red cards seems pretty similar.
```
## Watch how this works with a different dataset: Gapminder
Just to show you that this all works with different data, you will now do some analysis with the [Gapminder](http://www.gapminder.org/) data, as currated and cleaned by [Jenny Bryan](https://github.com/jennybc/gapminder).
```{r echo=FALSE}
require(dplyr)
load(url("http://www.princeton.edu/~mjs3/soc504_s2015/gapminder.RData"))
gapminder <- tbl_df(gapminder)
glimpse(gapminder)
head(gapminder)
tail(gapminder)
```
(@) **Is this data in third normal form?**
```{answer}
No. Continent is not part of the primary key, and it does not "refer to the key, the whole key, and nothing but the key."
The continent that each country belongs to should only be stored once. As is, it could get out of sync.
```
(@) Explain:
```{answer}
Continent is not part of the primary key, and it does not "refer to the key, the whole key, and nothing but the key."
The continent that each country belongs to should only be stored once. As is, it could get out of sync.
A better way to create a new table that shows which continent each country is in.
```
(@) **Is this an optimal structure for data storage?**
```{answer}
No. The continent that each country belongs to should only be stored once. As is, it could get out of sync.
```
(@) **Is this a sensible structure for data analysis?**
```{answer}
Yes. It makes it easy to filter by continent. The best way to do this is to store the data in third normal form and then do joins duing the analysis stage. We'll learn more about joins later in the semester.
```
(@) **For each continent, show the mean GDP in each of the years in the data.**
```{r}
gapminder %>%
group_by(continent, year) %>%
summarise(mean(gdpPercap))
```
(@) **Which country had the highest GDP per captia in Africa in 1952?** Note you don't need to produce a data.frame with a single country to answer this question. A data.frame with the appropriate countries sorted is enough.
```{r}
gapminder %>%
filter(continent=="Africa", year==1952) %>%
arrange(desc(gdpPercap)) %>%
select(country, gdpPercap)
```
(@) **Which country had the highest GDP (not GDP per captia) in any year in the data?** Note you don't need to produce a single country to answer this question. A data.frame with the appropriate countries sorted is enough.
```{r}
gapminder %>%
mutate(gdp = gdpPercap * pop) %>%
arrange(desc(gdp)) %>%
select(country, year, gdp)
```
(@) **Which continent had the most variation in life expectancy in 2007?** Note you don't need to produce a single country to answer this question. A data.frame with the appropriate countries sorted is enough.
```{r}
gapminder %>%
filter(year=="2007") %>%
group_by(continent) %>%
summarise(life.exp.var = var(lifeExp))
```
(@openq) **Optional challenge: Create a question that will require you to use all 5 `dplyr` verbs: `filter`, `arrange`, `select`, `mutate`, and `summarise`. Then, write a query to answer it.**
```{answer}
[your text here, optional]
```
(@) **Challenge problem: Now show the code to answer question @openq.**
```{r}
# your code here; optional
```
## More practice with data structures
`R` comes with the dataset `ldeaths`, which records the monthly deaths from bronchitis, emphysema and asthma in the UK, 1974–1979. To see the data type `ldeaths`. For more information type "?ldeaths"
(@) **Is this data tidy?**
```{answer}
no
```
(@) **Explain**
```{answer}
There are three variables: year, month, and count. It is not the case that each row is an observation and each column is a variable.
```
`R` comes with the dataset `mtcars`, which shows fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models), as taken from the 1974 Motor Trend US magazine. To see the data type `mtcars`. For more information type "?mtcars"
(@) **Is this data tidy?**
```{answer}
yes
```
(@) **Explain**
```{answer}
Each row is an observation, and each column is a variable. One way that it could be better would be to have the name of the car as a column, instead of as a rowname.
```
`R` comes with the dataset `quakes`, which shows 1000 seismic events near Fiji. To see the data type `quakes`. For more information type "?quakes"
(@) **Is this data tidy?**
```{answer}
yes
```
(@) **Explain**
```{answer}
Each row is an observation and each column is a variable. One way that it could be better would be to have some kind of unique identifier for each earthquake.
```
#### The command below is helpful for debugging, please don't change it
```{r echo=FALSE}
sessionInfo()
```