-
Notifications
You must be signed in to change notification settings - Fork 0
/
PS07_source.qmd
556 lines (422 loc) · 21.5 KB
/
PS07_source.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
---
title: "Problem Set 07"
author: "Your Name"
date: last-modified
date-format: "[Last modified on] MMMM DD, YYYY HH:mm:ss zzz"
format:
html: default
pdf: default
editor: source
---
```{r include = FALSE}
# Do not edit this code block/chunk!
set.seed(13) # 13 Gives the year 2014
library(knitr)
knitr::opts_chunk$set(echo = TRUE, fig.align = "center", comment = NA, message = FALSE, warning = FALSE, fig.width = 16/2, fig.height = 9/2)
library(ggplot2)
library(dplyr)
library(forcats)
library(moderndive)
data(gss_cat)
YEAR <- sample(unique(gss_cat$year), 1)
n <- gss_cat |> filter(year == YEAR) |> select(marital) |> nrow()
```
# Background
For this exercise, we will mimic the tactile sampling you did in class with virtual sampling. We will use some data from the [general social survey](http://gss.norc.org/), an annual personal-interview survey conducted in the United States. The survey is designed to monitor changes in both social characteristics and attitudes.
The **population** of interest will be **ALL** `r n` individuals living in a single neighborhood in `r YEAR`. As an analogy to the tactile sampling you did in class, the neighborhood is the "bowl" and the `r n` people are the little balls.
If you get stuck as you are working through this Problem Set, we **strongly recommend** you re-read [Chapter 7](https://moderndive.com/7-sampling.html#sampling-definitions) in [ModernDive](https://moderndive.com/index.html), in particular subsections 7.3.1 on ["Terminology & notation"](https://moderndive.com/7-sampling.html#terminology-and-notation) and 7.3.2 on ["Statistical definitions"](https://moderndive.com/7-sampling.html#sampling-definitions). These terminology, notation, and definitions related to sampling are definitely tricky at first; the best method to master them is practice, practice, practice.
### Key points on symbols:
| Symbol | POPULATION PARAMETER | SAMPLE STATISTIC |
|-----------------|----------------------|------------------|
| Number of cases | $N$ | $n$ |
| Proportion | $p$ | $\hat{p}$ |
| Standard error | $SE$ | $\widehat{SE}$ |
::: {.callout-caution icon="false" title="R Code"}
```{r}
sym <- c("Number of cases", "Proportion", "Standard error")
PP <- c("$N$", "$p$", "$SE$")
SS <- c("$n$", "$\\hat{p}$", "$\\widehat{SE}$")
DF <- data.frame(sym = sym, PP = PP, SS = SS)
kable(DF, col.names = c("Symbol", "Population Parameter",
"Sample Statistic"),
caption = "Key Points on Symbols")
```
:::
## Setup
First load the necessary packages:
::: {.callout-caution icon="false" title="R Code"}
```{r}
library(ggplot2)
library(dplyr)
library(forcats)
library(moderndive)
```
:::
You can load and and view the `gss_cat` data set from the `forcats` package using the code below:
::: {.callout-caution icon="false" title="R Code"}
```{r}
data(gss_cat)
glimpse(gss_cat)
```
:::
Be sure to examine the data in the **viewer**. Type `?gss_cat` in the **console** to see a description of the variables in this data set.
# Exploratory Data Wrangling
This data set includes many of years of data and many variables. To start, we will restrict our analysis to only `r YEAR`, and to only the variable indicating the `marital` status of each respondent.
::: {.callout-caution icon="false" title="R Code"}
```{r}
YEAR
gss_YEAR <- gss_cat |>
filter(year == YEAR) |>
select(marital)
```
:::
The following shows the different responses for `marital` status:
::: {.callout-caution icon="false" title="R Code"}
```{r}
gss_YEAR |>
distinct(marital)
```
:::
## Setting a Seed
**Setting a seed:** We will take some random samples in this Problem Set. In order to make sure `R` takes the same random sample every time you run your code, you can do what is called "setting a seed". Do this in any code chunk where you take a random sample.
You can set a seed with any number.
```{r}
set.seed(45)
```
# The True Population Proportion $p$ of Divorced People
Again, for this exercise, the **population** of interest will be **ALL** `r n` individuals living in this single neighborhood in `r YEAR`. Since we have data on **ALL** `r n` people living in the neighborhood, we can compute the **exact population proportion** $p$ of divorced people directly using **ALL** the data as follows:
::: {.callout-caution icon="false" title="R Code"}
```{r}
gss_YEAR |>
summarize(divorced = sum(marital == "Divorced"),
N = n(), p2 = mean(marital == "Divorced")) |>
mutate(p = divorced / N) -> ans
ans
# Or
(p <- mean(gss_YEAR$marital=="Divorced"))
```
:::
::: callout-note
- We use $N$ for the size of the full population of `r n` people, and $p$ because we are calculating the TRUE population proportion $p$.
- Inference to the population is not needed. We do not need to use a **sample** to try to infer something about the **true population proportion** $p$ of divorced people in this neighborhood in `r YEAR`. We know that $p = `r ans$p`$.
:::
In other words, this problem set is not a realistic reflection of a real life problem. However, for this problem set, we are *simulating* the act of sampling from this neighborhood population to understand and study how factors like sample size influence **sampling variation**.
# Demo: Sampling 50 People in the Neighborhood
## Estimating $\hat{p}$ from a Single Sample
We are first going to use random sampling to **ESTIMATE** the true **population** proportion $p$ of the neighborhood that are divorced with only a **sample** of 50 people.
::: callout-note
This will represent a situation of only having the resources to knock on 50 doors to get responses from people in this neighborhood!
:::
Be sure to look at the results in the viewer. Remember, you can set the seed to whatever value you like. For this exercise, use a four digit seed that is your birth month and day. For example, if your birthday is on Halloween then you will use the seed 1031.
::: {.callout-caution icon="false" title="R Code"}
```{r}
set.seed(0804)
n50_1rep <- gss_YEAR |>
rep_sample_n(size = 50, reps = 1)
head(n50_1rep)
# OR
set.seed(0804)
n50_sample <- sample(gss_YEAR$marital, size = 50, replace = FALSE)
head(n50_sample)
```
:::
Next, let's calculate the **sample proportion** $\hat{p}$ of people who identified as `Divorced` in our sample of 50 people.
::: {.callout-caution icon="false" title="R Code"}
```{r}
ans1 <- n50_1rep |>
summarize(divorce_count = sum(marital == "Divorced"),
phat = mean(marital == "Divorced"),
n = n()) |>
mutate(p_hat = divorce_count/ n)
ans1
# OR
(phat <- mean(n50_sample=="Divorced"))
```
:::
This sample proportion $\hat{p}$ is an **ESTIMATE**, and our **best guess** of what the **true population** proportion $p$ of `Divorced` people is in this neighborhood, based on a sample of only $50$ people. The estimate ($\hat{p} = `r ans1$p_hat`$) is reasonably close to the true population proportion $p =`r ans$p`$ we calculated from the full population.
::: {.callout-note icon="false" title="Problem 1"}
Ask two of your classmates what their estimate of $\hat{p}$ was. How do the $\hat{p}$ estimates from different samples compare?
:::
::: {.callout-important icon="false" collapse="false" title="Problem 1 Answers"}
- Delete this and put your text answer here.
:::
::: {.callout-note icon="false" title="Problem 2"}
**Why** did everyone get a different estimate?
:::
::: {.callout-important icon="false" collapse="false" title="Problem 2 Answers"}
- Delete this and put your text answer here.
:::
## Estimating $\widehat{SE}$ from a Single Sample
Typically we only have the opportunity to collect **one sample** for our study. Consequently, we have to use the amount of variability in our **single sample** as an estimate of the amount of variability we might expect in our results if we had taken a random sample of 50 different people. The $\widehat{SE}_{\hat{p}}$ serves as an **ESTIMATE** of **sampling variability** if you only have a **single sample**. The formula for estimating the standard error of $\hat{p}$ is given in (@eq-se).
$$
\widehat{SE}_{\hat{p}} \approx \sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}}
$$ {#eq-se}
::: callout-note
Note that we use $n$ for the size of the sample, that p "wears a hat", like so: $\hat{p}$ because we are ESTIMATING a proportion based on only a sample, and that the SE "wears a hat" as well because we are ESTIMATING $SE$ based on only a sample.
:::
The standard error of $\hat{p}$ can be estimated in `R` as follows:
::: {.callout-caution icon="false" title="R Code"}
```{r}
n50_1rep |>
summarize(divorce_count = sum(marital == "Divorced"),
n = n()) |>
mutate(p_hat = divorce_count/ n,
se_hat = sqrt(p_hat * (1 - p_hat) / n)) -> n50_1rep_ans
n50_1rep_ans
```
:::
# Demo: Generating a Sampling Distribution of $\hat{p}$
If you ran the code chunk that takes a random sample of 50 cases a thousand more times....and wrote down every $\hat{p}$ you got, you would have what is called a simulated "sampling distribution".
::: callout-note
A sampling distribution shows every \[or nearly every!\] possible result a sampling statistic can have under every \[or nearly every!\] possible sample **of a given sample size** from a population.
:::
## Simulated Sampling Distribution of $\hat{p}$ for $n = 50$
Instead of running the sampling code chunk for $n = 50$ over and over, we can "collect" 1000 samples of $n = 50$ easily with `R`. The following code chunk takes 1000 **different** samples of $n = 50$ and stores them in the data frame `n50_1000rep`:
::: {.callout-caution icon="false" title="R Code"}
```{r}
set.seed(19)
n50_1000rep <- gss_YEAR |>
rep_sample_n(size = 50, reps = 1000)
head(n50_1000rep)
# Consider a different approach
set.seed(19)
reps <- 1000
n <- 50
phat <- numeric(reps)
for(i in 1:reps){
ss <- sample(gss_YEAR$marital, size = 50, replace = FALSE)
phat[i] <- mean(ss == "Divorced")
}
head(phat)
```
:::
Be sure to look at `n50_1000rep` in the data viewer to get a sense of these 1000 samples look like.
::: {.callout-note icon="false" title="Problem 3"}
What is the name of the column that identifies which of the 1000 samples each row is from?
:::
::: {.callout-important icon="false" collapse="false" title="Problem 3 Answers"}
- Delete this and put your text answer here.
:::
::: {.callout-note icon="false" title="Problem 4"}
What is the sample size $n$ for each of the $1000$ samples we took? (i.e. how many humans are sampled in each replicate)?
:::
::: {.callout-important icon="false" collapse="false" title="Problem 4 Answers"}
- Delete this and put your text answer here.
:::
The following code chunk calculates the sample proportion $\hat{p}$ of people who reported they were divorced for each of the **1000 samples**.
::: {.callout-caution icon="false" title="R Code"}
```{r}
p_hat_n50_1000rep <- n50_1000rep |>
group_by(replicate) |>
summarize(divorce_count = sum(marital == "Divorced"),
n = n()) |>
mutate(p_hat = divorce_count / n)
```
:::
Examine the first five rows of the results:
::: {.callout-caution icon="false" title="R Code"}
```{r}
p_hat_n50_1000rep |>
slice(1:5)
# Or
p_hat_n50_1000rep |>
head(n = 5)
```
:::
## Visualize the Sampling Distribution of $\hat{p}$ for $n = 50$
We can plot the **sampling distribution** of these 1000 $\hat{p}$ estimates of divorced respondents with a histogram using the code below.
::: {.callout-caution icon="false" title="R Code"}
```{r}
#| label: "fig-n50"
#| fig-cap: "Simulated sampling distribution of p_hat based on 1000 samples of size 50 created with `ggplot2`"
# histogram of phat using ggplot2
ggplot(p_hat_n50_1000rep, aes(x = p_hat)) +
geom_histogram(binwidth = 0.02, color = "black", fill = "aquamarine3") +
labs(x = "Sample proportion of divorced respondents",
title = "Sampling distribution of p_hat based on n = 50") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
xlim(-0.05, 0.6) -> hist1
hist1
```
```{r}
#| label: "fig-n50base"
#| fig-cap: "Simulated sampling distribution of p_hat based on 1000 samples of size 50 created with base `R`"
# hist of phat using base R
hist(phat, main = "Sampling distribution of phat based on n = 50",
col = "aquamarine3", breaks = 15)
```
:::
::: {.callout-note icon="false" title="Problem 5"}
Based on @fig-n50 and @fig-n50base, what appears to be a very common value of $\hat{p}$? What is a very uncommon value? Specifically, find the 1%, 99%, the mean, and the standard deviation of the values stored in `p_hat` to help answer the question.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 5 Answers"}
```{r}
# Type your code and comments below
```
- Delete this and put your text answer here.
:::
## Mean and Standard Error of the Sampling Distribution of $\hat{p}$ for $n = 50$
We can estimate the mean of the sampling distribution by calculating the mean of all 1000 $\hat{p}$ estimates, and the standard error of the sampling distribution by calculating the standard deviation of all 1000 $\hat{p}$ values as follows:
::: {.callout-caution icon="false" title="R Code"}
```{r}
p_hat_n50_1000rep |>
summarize(M_p_hat = mean(p_hat),
SE_p_hat = sd(p_hat))
```
:::
::: {.callout-note icon="false" title="Problem 6"}
How do the values computed in the above code chunk compare to the estimates we got for $\hat{p}$ and $\widehat{SE}_{\hat{p}}$ for `Divorced` respondents based on your **single** sample of 50 people earlier in this Problem Set?
:::
::: {.callout-important icon="false" collapse="false" title="Problem 6 Answers"}
- Delete this and put your text answer here.
:::
::: {.callout-note icon="false" title="Problem 7"}
Use the `rep_sample_n` function to collect 1000 virtual samples of size $n = 15$. Store the 1000 virtual samples in an object named `n15_1000rep`. Use a seed of 910.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 7 Answers"}
```{r}
# Type your code and comments inside the code chunk
```
:::
::: {.callout-note icon="false" title="Problem 8"}
Calculate sample proportion $\hat{p}$ of people who reported they were `Divorced` for each replicate of your $n = 15$ sampling. Store the results in `ques8` and display the first six rows of `ques8`.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 8 Answers"}
```{r}
# Type your code and comments inside the code chunk
```
:::
::: {.callout-note icon="false" title="Problem 9"}
Visualize the sampling distribution of $\hat{p}$ from your $n = 15$ sampling with a purple histogram.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 9 Answers"}
```{r}
#| label: "fig-n15"
#| fig-cap: "Simulated sampling distribution of p_hat based on samples of size 15"
# Type your code and comments inside the code chunk
```
:::
::: {.callout-note icon="false" title="Problem 10"}
Calculate the mean of the $n = 15$ sampling distribution, and the standard error of the $n = 15$ sampling distribution.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 10 Answers"}
```{r}
# Type your code and comments inside the code chunk
```
:::
::: {.callout-note icon="false" title="Problem 11"}
How does the standard error of the $n = 15$ sampling distribution compare to the standard error of the $n = 50$ sampling distribution?
:::
::: {.callout-important icon="false" collapse="false" title="Problem 11 Answers"}
- Delete this and put your text answer here.
:::
::: {.callout-note icon="false" title="Problem 12"}
Explain any observed differences from 11.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 12 Answers"}
- Delete this and put your text answer here.
:::
::: {.callout-note icon="false" title="Problem 13"}
Use the `rep_sample_n` function to collect 1000 virtual samples of size $n = 600$. Store the 1000 virtual samples in an object named `n600_1000rep`. Use a seed of 84.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 13 Answers"}
```{r}
# Type your code and comments inside the code chunk
```
:::
::: {.callout-note icon="false" title="Problem 14"}
Calculate the proportion $\hat{p}$ of people who reported they were `Divorced`for each replicate of your $n = 600$ sampling. Store the results in `ques14` and display the first six rows of `ques14`.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 14 Answers"}
```{r}
# Type your code and comments inside the code chunk
```
:::
::: {.callout-note icon="false" title="Problem 15"}
Calculate the mean of the $n = 600$ sampling distribution, and the standard error of the $n = 600$ sampling distribution.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 15 Answers"}
```{r}
# Type your code and answers below
```
:::
::: {.callout-note icon="false" title="Problem 16"}
Was there more **variability** from sample to sample when we took a sample size of 600 or a sample size of 50? **Explain what evidence you have for assessing this**.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 16 Answers"}
- Delete this and put your text answer here.
:::
::: {.callout-note icon="false" title="Problem 17"}
Which sampling distribution looked more normally distributed (bell shaped and symmetrical); the one built on $n$ = 15, 50 or 600? **Why?**
:::
::: {.callout-important icon="false" collapse="false" title="Problem 17 Answers"}
- Delete this and put your text answer here.
```{r, fig.height = 8}
# Type your code and comments below
```
:::
## Estimating $\hat{p}$ and the Standard Error of $\hat{p}$ from a Single Sample (revisited)
In most instances, we do not have access to the full population as we did in this GSS data; instead we have to take a **sample** to try to say something about the **larger population**. Furthermore, in the real world, we typically only take a **single** sample from the population, due to time or money constraints.
So how do we **ESTIMATE** a $\hat{p}$ and a standard error of $\hat{p}$ when we only have a single sample, and not 1000 repeated samples? As demonstrated at the very beginning of the Problem Set we:
- estimate $\hat{p}$ from the sample
- use the formula for the standard error of $\hat{p}$ given in (@eq-se) and repeated below to estimate SE based on a single sample
$$\widehat{SE}_{\hat{p}} \approx \sqrt{\frac{\hat{p} \times (1-\hat{p})}{n}}$$
::: {.callout-note icon="false" title="Problem 18"}
Imagine we collected only a single small sample of 15 respondents as given from the code below.
```{r, label= "Test1"}
set.seed(53)
n15_1rep <- gss_YEAR |>
rep_sample_n(size = 15, reps = 1)
# and
n50_1rep <- gss_YEAR |>
rep_sample_n(size = 50, reps = 1)
```
Following the example from the beginning of the Problem Set (roughly line 138), estimate the **sample proportion** $\hat{p}$ of people who identified as `Divorced` based on `n15_1rep`... AS WELL AS the **standard error of** $\hat{p}$.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 18 Answers"}
```{r, label = "Test2"}
# Type your code and comments inside the code chunk
```
:::
::: callout-note
You should have gotten a value reasonably close to the estimate we made earlier from our sampling distribution for $n = 15$! Note that when you must estimate a standard error from **only a single sample**, the formula **contains the sample size, n**. The larger the sample size n, the larger the number in the denominator of the SE formula.
:::
Fill in the Markdown table below with all the standard errors you computed for this problem set. In other words:
::: {.callout-note icon="false" title="Problem 19"}
- Replace `x` with the standard error you obtained by taking the standard deviation of the $n = 15$ sampling distribution. Replace `a` with the standard error you obtained for a single sample of $n = 15$ using the mathematical formula.
- When you are done, make sure all the `|` in the table still line up so your results print out in a table!
:::
::: {.callout-important icon="false" collapse="false" title="Problem 19 Answers"}
| Sample size n | SE via sd of sampling distribution | SE via one sample and formula |
|------------------|-----------------------------|-------------------------|
| 15 | x | a |
| 50 | y | b |
```{r, label = "test3"}
# Your code below
```
:::
::: {.callout-note icon="false" title="Problem 20"}
Based on what you observed for 19, **IF** you collected a single sample from 600 respondents, do you think the standard error will be smaller or larger than the one you calculated for $n = 15$. **Explain your reasoning**.
:::
::: {.callout-important icon="false" collapse="false" title="Problem 20 Answers"}
- Delete this and put your text answer here.
:::
# Turning in Your Work
You will need to make sure you commit and push all of your changes to the github education repository where you obtained the lab.
::: callout-tip
- Make sure you **render a final copy with all your changes** and work.
- Look at your final html file to make sure it contains the work you expect and is formatted properly.
:::
# Logging out of the Server
There are many statistics classes and students using the Server. To keep the server running as fast as possible, it is best to sign out when you are done. To do so, follow all the same steps for closing Quarto document:
::: callout-tip
- Save all your work.
- Click on the orange button in the far right corner of the screen to quit `R`
- Choose **don't save** for the **Workspace image**
- When the browser refreshes, you can click on the sign out next to your name in the top right.
- You are signed out.
:::
```{r}
sessionInfo()
```