Hello package maintainers!
I am building bootstrap confidence intervals for groups, and I'm having trouble creating the multiple resampled datasets from which to build them.
Using the palmerpenguins library as an example:
library(tidyverse)
library(infer)
library(palmerpenguins)
There are 344 total observations and each species has a different number of observations:
nrow(penguins)
# [1] 344
penguins %>% group_by(species) %>% count()
# A tibble: 3 × 2
# Groups: species [3]
# species n
# <fct> <int>
#1 Adelie 152
#2 Chinstrap 68
#3 Gentoo 124
I want to group by species and, for each species, draw multiple samples with replacement, each the same size as that species' original number of observations.
set.seed(100)
slices <- penguins %>%
group_by(species) %>%
rep_slice_sample(prop = 1, replace = TRUE, reps = 10)
That should give me 344 * 10 = 3440 rows in the full resampled data set, and it does. But when you look at the data, the number of observations per species differs from one replicate to the next. Every Adelie replicate should have n = 152, every Chinstrap replicate 68, and every Gentoo replicate 124. Instead we find this:
slices %>% group_by(species, replicate) %>% count()
# A tibble: 30 × 3
# Groups: species, replicate [30]
# species replicate n
# <fct> <int> <int>
# 1 Adelie 1 148
# 2 Adelie 2 147
# 3 Adelie 3 148
# 4 Adelie 4 151
# 5 Adelie 5 138
# 6 Adelie 6 157
# 7 Adelie 7 161
# 8 Adelie 8 157
# 9 Adelie 9 151
#10 Adelie 10 138
# ℹ 20 more rows
# ℹ Use `print(n = ...)` to see more rows
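For comparison, here is a minimal sketch of the per-group resampling I expected. It assumes that dplyr's `slice_sample()` samples within each group of a grouped data frame. The name `manual_slices` and the `lapply()` loop are only illustrative and are not part of infer:

```r
library(dplyr)
library(palmerpenguins)

set.seed(100)

# Illustrative workaround (not infer's API): build each replicate by hand.
# For every replicate, resample with replacement inside each species,
# keeping the species' original number of rows (prop = 1).
manual_slices <- bind_rows(
  lapply(seq_len(10), function(i) {
    penguins %>%
      group_by(species) %>%
      slice_sample(prop = 1, replace = TRUE) %>%
      ungroup() %>%
      mutate(replicate = i)
  })
)

# Per-species, per-replicate counts
manual_slices %>% count(species, replicate)
```

With this approach each replicate should contain 152 Adelie, 68 Chinstrap, and 124 Gentoo rows, which is the behavior I was hoping to get from `rep_slice_sample()` on grouped data.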
What am I missing?
Thanks for your insight.