Suggestion of new function: describe_missing() #454

@rempsyc

When writing (psychology) scientific papers, great care must be taken in reporting the state of item-level missing data for each psychological questionnaire. For example, Parent (2013) writes:

I recommend that authors (a) state their tolerance level for missing data by scale or subscale (e.g., “We calculated means for all subscales on which participants gave at least 75% complete data”) and then (b) report the individual missingness rates by scale per data point (i.e., the number of missing values out of all data points on that scale for all participants) and the maximum by participant (e.g., “For Attachment Anxiety, a total of 4 missing data points out of 100 were observed, with no participant missing more than a single data point”).

To comply with this recommendation, I have developed the function nice_na(), which summarizes NA values according to those guidelines. The function reports both absolute counts and percentages for specified groups of columns and supports specifying scales through regex. Reprex:

library(rempsyc)

# If the questionnaire items start with the same name, e.g.,
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  open_1 = fun(), open_2 = fun(), open_3 = fun(),
  extrovert_1 = fun(), extrovert_2 = fun(), extrovert_3 = fun(),
  agreeable_1 = fun(), agreeable_2 = fun(), agreeable_3 = fun()
)

head(df, 3)
#>     ID open_1 open_2 open_3 extrovert_1 extrovert_2 extrovert_3 agreeable_1
#> 1  idz      4     NA      1           5           6           1           7
#> 2 <NA>      9      4      3           1          10          NA           7
#> 3  idz      1      4      1           9           2          NA           8
#>   agreeable_2 agreeable_3
#> 1           7           9
#> 2           7           2
#> 3           7           8

# One can list the scale names directly:
nice_na(df, scales = c("ID", "open", "extrovert", "agreeable"))
#>                       var items na cells na_percent na_max na_max_percent
#> 1                   ID:ID     1  7    14      50.00      1            100
#> 2           open_1:open_3     3 11    42      26.19      3            100
#> 3 extrovert_1:extrovert_3     3 17    42      40.48      3            100
#> 4 agreeable_1:agreeable_3     3 10    42      23.81      3            100
#> 5                   Total    10 45   140      32.14     10            100
#>   all_na
#> 1      7
#> 2      3
#> 3      3
#> 4      3
#> 5      2

Created on 2023-09-02 with reprex v2.0.2
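
For readers who want to see the idea without installing rempsyc, here is a minimal base-R sketch of the per-scale summary. It is not the nice_na() implementation: the helper name summarise_na_sketch() and the toy data are hypothetical, it assumes each scale is identified by a shared column-name prefix, and it omits the Total row and the na_max_percent column.

```r
# Hypothetical helper for illustration only; not the rempsyc implementation.
# Assumes every scale is identified by a shared column-name prefix.
summarise_na_sketch <- function(data, scales) {
  rows <- lapply(scales, function(prefix) {
    # Columns whose names start with the scale prefix
    cols <- grep(paste0("^", prefix), names(data), value = TRUE)
    # Logical matrix: TRUE where a cell is missing
    is_na <- is.na(data[, cols, drop = FALSE])
    # Number of missing items per participant (row)
    na_per_row <- rowSums(is_na)
    data.frame(
      scale = prefix,
      items = length(cols),                         # number of items in the scale
      na = sum(is_na),                              # missing cells in the scale
      cells = length(is_na),                        # rows x items
      na_percent = round(100 * mean(is_na), 2),     # missing cells as a percentage
      na_max = max(na_per_row),                     # most items missed by one participant
      all_na = sum(na_per_row == length(cols))      # participants missing every item
    )
  })
  do.call(rbind, rows)
}

# Toy data (made up for this sketch)
toy <- data.frame(
  open_1 = c(1, NA, 3, NA),
  open_2 = c(2, NA, NA, 4),
  calm_1 = c(NA, 2, 3, 4)
)

result <- summarise_na_sketch(toy, scales = c("open", "calm"))
print(result)
```

Anchoring the pattern with "^" keeps a prefix such as "open" from matching a column that merely contains that string elsewhere in its name. Whether nice_na() anchors its patterns the same way is not something this sketch assumes.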


Would you like this function to migrate from rempsyc to datawizard?

For the name, I was thinking of data_missing_items, or just data_missing, since the function also works without scale items and the name matches our other data_ functions such as data_duplicated. It could also be describe_missing, in line with describe_distribution (actually, I think that one makes the most sense).
