---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
options(width = 115)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# svrep
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/svrep)](https://CRAN.R-project.org/package=svrep)
[![Mentioned in Awesome Official Statistics ](https://awesome.re/mentioned-badge.svg)](http://www.awesomeofficialstatistics.org)
[![Codecov test coverage](https://codecov.io/gh/bschneidr/svrep/branch/main/graph/badge.svg)](https://app.codecov.io/gh/bschneidr/svrep?branch=main)
<!-- [![R-CMD-check](https://github.com/bschneidr/svrep/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/bschneidr/svrep/actions/workflows/R-CMD-check.yaml) -->
<!-- badges: end -->
svrep provides methods for creating, updating, and analyzing replicate weights for surveys. Functions from svrep can be used to implement adjustments to replicate designs (e.g., nonresponse weighting class adjustments) and analyze their effect on the replicate weights and on estimates of interest. The package also facilitates the creation of bootstrap and generalized bootstrap replicate weights.
## Installation
You can install the released version of svrep from [CRAN](https://CRAN.R-project.org) with:
``` r
install.packages("svrep")
```
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("bschneidr/svrep")
```
## Citation
When using the 'svrep' package, please make sure to cite it in any resulting publications. This is appreciated by the package maintainer and helps to incentivize ongoing development, maintenance, and support.
> Schneider B. (2023). "svrep: Tools for Creating, Updating, and Analyzing Survey Replicate Weights". R package version 0.6.0.
When using the 'svrep' package, please also cite the 'survey' package and R itself, since they are essential to the use of 'svrep'. Call `citation('svrep')`, `citation('survey')`, and `citation('base')` for more information and to generate BibTeX entries for citing these packages as well as R.
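For example, the following calls display the citation information and BibTeX entries:
```{r, eval=FALSE}
# Display citation information and BibTeX entries
citation('svrep')
citation('survey')
citation('base')
```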
## Example usage
### Creating replicate weights
Suppose we have data from a survey selected using a complex sampling method such as cluster sampling. To represent the complex survey design, we can create a survey design object using the 'survey' package.
```{r, message=FALSE, warning=FALSE}
library(survey)
library(svrep)
data(api, package = "survey")
set.seed(2021)
# Create a survey design object for a sample
# selected using single-stage cluster sampling without replacement
dclus1 <- svydesign(
  data = apiclus1,
  ids = ~ dnum,
  weights = ~ pw,
  fpc = ~ fpc
)
```
To help us estimate sampling variances, we can create bootstrap replicate weights. The function `as_bootstrap_design()` creates bootstrap replicate weights appropriate for common complex sampling designs, using bootstrapping methods from the 'survey' package as well as additional methods such as the Rao-Wu-Yue-Beaumont method (a generalization of the Rao-Wu bootstrap).
```{r}
# Create replicate-weights survey design
orig_rep_design <- dclus1 |> as_bootstrap_design(
  replicates = 500,
  type = "Rao-Wu-Yue-Beaumont"
)
print(orig_rep_design)
```
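The replicate design can be used with the usual estimation functions from the 'survey' package. As a quick illustration, we can estimate a mean along with its bootstrap standard error:
```{r}
# Estimate a mean and its bootstrap standard error
svymean(x = ~ api00, design = orig_rep_design)
```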
For especially complex survey designs (e.g., systematic samples), the generalized survey bootstrap can be used.
```{r}
# Load example data for a stratified systematic sample
data('library_stsys_sample', package = 'svrep')
# First, ensure the data are sorted in the same order as was used in sampling
library_stsys_sample <- library_stsys_sample[
  order(library_stsys_sample$SAMPLING_SORT_ORDER),
]
# Create a survey design object
design_obj <- svydesign(
  data = library_stsys_sample,
  strata = ~ SAMPLING_STRATUM,
  ids = ~ 1,
  fpc = ~ STRATUM_POP_SIZE
)
# Convert to generalized bootstrap replicate design
gen_boot_design_sd2 <- as_gen_boot_design(
  design = design_obj,
  variance_estimator = "SD2",
  replicates = 500
)
```
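As a sketch of how this design can be used, we can estimate a population total and its standard error (here assuming `TOTCIR`, a total circulation variable included in the example data):
```{r}
# Estimate a population total and its standard error
# (assumes TOTCIR, a total circulation variable in the example data)
svytotal(x = ~ TOTCIR, na.rm = TRUE, design = gen_boot_design_sd2)
```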
For relatively simple designs, we can also use the random-groups jackknife.
```{r}
# Create random-group jackknife replicates
# for a single-stage survey with many first-stage sampling units
rand_grp_jk_design <- apisrs |>
  svydesign(data = _,
            ids = ~ 1,
            weights = ~ pw) |>
  as_random_group_jackknife_design(
    replicates = 20
  )
```
### Adjusting for nonresponse or unknown eligibility
In social surveys, unit nonresponse is extremely common. It is also somewhat common for respondent cases to be classified as "ineligible" for the survey based on their responses. Sampled cases are thus typically classified as "respondents", "nonrespondents", "ineligible cases", and "unknown eligibility" cases.
```{r}
# Create variable giving response status
orig_rep_design$variables[['response_status']] <- sample(
x = c("Respondent", "Nonrespondent",
"Ineligible", "Unknown eligibility"),
prob = c(0.6, 0.2, 0.1, 0.1),
size = nrow(orig_rep_design),
replace = TRUE
)
table(orig_rep_design$variables$response_status)
```
It is common practice to adjust weights when there is nonresponse or when there are sampled cases whose eligibility for the survey is unknown. The most common form of adjustment is "weight redistribution": for example, weights from nonrespondents are reduced to zero, and weights from respondents are correspondingly increased so that the total weight in the sample is unchanged. To account for these adjustments when estimating variances for survey statistics, the adjustments are repeated separately for each set of replicate weights. This process can be easily implemented using the `redistribute_weights()` function.
```{r}
# Adjust weights for unknown eligibility
ue_adjusted_design <- redistribute_weights(
  design = orig_rep_design,
  reduce_if = response_status %in% c("Unknown eligibility"),
  increase_if = !response_status %in% c("Unknown eligibility")
)
```
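As a quick sanity check, we can confirm that the redistribution leaves the total weight in the sample unchanged:
```{r}
# The total weight should be the same before and after redistribution
sum(weights(orig_rep_design, type = "sampling"))
sum(weights(ue_adjusted_design, type = "sampling"))
```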
By supplying column names to the `by` argument of `redistribute_weights()`, adjustments are conducted separately in different groups. This can be used to conduct nonresponse weighting class adjustments.
```{r}
nr_adjusted_design <- redistribute_weights(
  design = ue_adjusted_design,
  reduce_if = response_status == "Nonrespondent",
  increase_if = response_status == "Respondent",
  by = c("stype")
)
```
### Comparing estimates from different sets of weights
To assess whether weighting adjustments have an impact on the estimates we care about, we can compare the estimates calculated using the different sets of weights. The function `svyby_repwts()` makes this comparison easy.
```{r}
# Estimate overall means (and their standard errors) from each design
overall_estimates <- svyby_repwts(
  rep_designs = list('original' = orig_rep_design,
                     'nonresponse-adjusted' = nr_adjusted_design),
  formula = ~ api00,
  FUN = svymean
)
print(overall_estimates, row.names = FALSE)
# Estimate domain means (and their standard errors) from each design
domain_estimates <- svyby_repwts(
  rep_designs = list('original' = orig_rep_design,
                     'nonresponse-adjusted' = nr_adjusted_design),
  formula = ~ api00,
  by = ~ stype,
  FUN = svymean
)
print(domain_estimates, row.names = FALSE)
```
We can even test for differences in estimates from the two sets of weights and calculate confidence intervals for their difference.
```{r}
estimates <- svyby_repwts(
  rep_designs = list('original' = orig_rep_design,
                     'nonresponse-adjusted' = nr_adjusted_design),
  formula = ~ api00,
  FUN = svymean
)
vcov(estimates)
diff_between_ests <- svycontrast(
  stat = estimates,
  contrasts = list(
    "Original vs. Adjusted" = c(-1, 1)
  )
)
print(diff_between_ests)
confint(diff_between_ests)
```
### Diagnosing potential issues with weights
When adjusting replicate weights, there are several diagnostics which can be used to ensure that the adjustments were carried out correctly and that they do more good than harm. The function `summarize_rep_weights()` helps by allowing you to quickly summarize the replicate weights.
For example, when carrying out nonresponse adjustments, we might want to verify that all of the weights for nonrespondents have been set to zero in each replicate. We can use `summarize_rep_weights()` to compare summary statistics for each replicate, and we can use its `by` argument to group the summaries by one or more variables.
```{r}
summarize_rep_weights(
  rep_design = nr_adjusted_design,
  type = 'specific',
  by = "response_status"
) |>
  subset(Rep_Column %in% 1:2)
```
At the end of the adjustment process, we can inspect the number of rows and columns and examine the variability of the weights across all of the replicates.
```{r}
nr_adjusted_design |>
subset(response_status == "Respondent") |>
summarize_rep_weights(
type = 'overall'
)
```
### Sample-based calibration
When we rake or poststratify to estimated control totals rather than to "true" population values, we may need to account for the variance of the estimated control totals. This ensures that calibrated estimates appropriately reflect the sampling error of both the primary survey of interest and the survey from which the control totals were estimated. The 'svrep' package provides two functions that accomplish this. The function `calibrate_to_estimate()` requires the user to supply a vector of control totals and its variance-covariance matrix, while the function `calibrate_to_sample()` requires the user to supply a dataset with replicate weights to use for estimating control totals and their sampling variance.
As an example, suppose we have a survey measuring vaccination status of adults in Louisville, Kentucky. For variance estimation, we use 100 bootstrap replicates.
```{r}
data("lou_vax_survey")
# Load example data
lou_vax_survey <- svydesign(
data = lou_vax_survey,
ids = ~ 1,
weights = ~ SAMPLING_WEIGHT
) |>
as_bootstrap_design(
replicates = 100,
mse = TRUE
)
# Adjust for nonresponse
lou_vax_survey <- lou_vax_survey |>
redistribute_weights(
reduce_if = RESPONSE_STATUS == "Nonrespondent",
increase_if = RESPONSE_STATUS == "Respondent"
) |>
subset(RESPONSE_STATUS == "Respondent")
```
To reduce nonresponse bias or coverage error for the survey, we can rake the survey to population totals for demographic groups estimated by the Census Bureau in the American Community Survey (ACS). To estimate the population totals for raking purposes, we can use microdata with replicate weights.
```{r}
# Load microdata to use for estimating control totals
data("lou_pums_microdata")
acs_benchmark_survey <- survey::svrepdesign(
  data = lou_pums_microdata,
  variables = ~ UNIQUE_ID + AGE + SEX + RACE_ETHNICITY + EDUC_ATTAINMENT,
  weights = ~ PWGTP,
  repweights = "PWGTP\\d{1,2}",
  type = "successive-difference",
  mse = TRUE
)
```
We can see that the distribution of race/ethnicity among respondents differs from the distribution of race/ethnicity in the ACS benchmarks.
```{r}
# Compare demographic estimates from the two data sources
estimate_comparisons <- data.frame(
  'Vax_Survey' = lou_vax_survey |>
    svymean(x = ~ RACE_ETHNICITY) |>
    coef(),
  'ACS_Benchmark' = acs_benchmark_survey |>
    svymean(x = ~ RACE_ETHNICITY) |>
    coef()
)
rownames(estimate_comparisons) <- gsub(
  pattern = "RACE_ETHNICITY", replacement = "",
  x = rownames(estimate_comparisons)
)
print(estimate_comparisons)
```
There are two options for calibrating the sample to the control totals from the benchmark survey. With the first approach, we supply point estimates and their variance-covariance matrix to the function `calibrate_to_estimate()`.
```{r, warning=FALSE, message=FALSE}
# Estimate control totals and their variance-covariance matrix
control_totals <- svymean(
  x = ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
  design = acs_benchmark_survey
)
point_estimates <- coef(control_totals)
vcov_estimates <- vcov(control_totals)
# Calibrate the vaccination survey to the estimated control totals
vax_survey_raked_to_estimates <- calibrate_to_estimate(
  rep_design = lou_vax_survey,
  estimate = point_estimates,
  vcov_estimate = vcov_estimates,
  cal_formula = ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
  calfun = survey::cal.raking
)
```
With the second approach, we supply the control survey's replicate design to `calibrate_to_sample()`.
```{r, warning=FALSE, message=FALSE}
vax_survey_raked_to_acs_sample <- calibrate_to_sample(
  primary_rep_design = lou_vax_survey,
  control_rep_design = acs_benchmark_survey,
  cal_formula = ~ RACE_ETHNICITY + EDUC_ATTAINMENT,
  calfun = survey::cal.raking
)
```
After calibration, we can see that the estimated vaccination rate has decreased, and the estimated standard error of the estimated vaccination rate has increased.
```{r, warning=FALSE, message=FALSE}
# Compare the two sets of estimates
svyby_repwts(
  rep_designs = list(
    'NR-adjusted' = lou_vax_survey,
    'Raked to estimate' = vax_survey_raked_to_estimates,
    'Raked to sample' = vax_survey_raked_to_acs_sample
  ),
  formula = ~ VAX_STATUS,
  FUN = svymean,
  keep.names = FALSE
)
```
### Saving results to a data file
Once we're satisfied with the weights, we can create a data frame with the analysis variables and columns of final full-sample weights and replicate weights. This format is easy to export to data files that can be loaded into R or other software later.
```{r}
data_frame_with_final_weights <- vax_survey_raked_to_estimates |>
  as_data_frame_with_weights(
    full_wgt_name = "RAKED_WGT",
    rep_wgt_prefix = "RAKED_REP_WGT_"
  )
# Preview first 10 column names
colnames(data_frame_with_final_weights) |> head(10)
```
```{r, eval=FALSE}
# Write the data to a CSV file
write.csv(
  x = data_frame_with_final_weights,
  file = "survey-data_with-updated-weights.csv"
)
```
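Because the weights are stored as ordinary columns, the file can later be read back into R and converted into a replicate design again. Below is a minimal sketch, assuming the column names created above and treating the replicate weights as bootstrap weights:
```{r, eval=FALSE}
# Read the data back in and re-create a replicate design
# (a sketch: the column names match those created above)
survey_data <- read.csv("survey-data_with-updated-weights.csv")

rebuilt_design <- survey::svrepdesign(
  data = survey_data,
  weights = ~ RAKED_WGT,
  repweights = "RAKED_REP_WGT_",
  type = "bootstrap",
  mse = TRUE
)
```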