Suggestion of new function: describe_missing() #561

Draft
wants to merge 13 commits into
base: main

Conversation

rempsyc
Member

@rempsyc rempsyc commented Nov 11, 2024

Fixes #454

@rempsyc rempsyc marked this pull request as draft November 11, 2024 11:31
@rempsyc rempsyc marked this pull request as ready for review November 11, 2024 21:19
Member

@etiennebacher etiennebacher left a comment

Thank you, I think it would be good to have describe_missing() but the way it is implemented and documented looks very field-specific to me. I find the output of skimr::skim() easier to understand with n_missing and complete_rate for instance. I'm also not familiar at all with aggregating stats on missing values across several variables (e.g. Ozone:Wind) and the default output looks unexpected to me (I'd rather expect one row per variable).
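For reference, the skimr-style view mentioned here (one row per variable, with `n_missing` and `complete_rate`) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `missing_per_column()` is a hypothetical helper, not part of datawizard or skimr, and it is not the implementation proposed in this PR.

```r
# Hypothetical helper (not datawizard or skimr code): one row per column.
missing_per_column <- function(data) {
  # Count NAs in each column.
  n_missing <- vapply(data, function(x) sum(is.na(x)), integer(1))
  data.frame(
    variable = names(data),
    n_missing = unname(n_missing),
    # Share of non-missing values, between 0 and 1.
    complete_rate = 1 - unname(n_missing) / nrow(data),
    stringsAsFactors = FALSE
  )
}

res <- missing_per_column(airquality)
res
```

Each row describes exactly one column, which is the shape the reviewer says they would expect by default.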

Comment on lines 3 to 15
#' @description Provides a detailed description of missing values in a data frame.
#' This function reports both absolute and percentage missing values of specified
#' column lists or scales, following recommended guidelines. Some authors recommend
#' reporting item-level missingness per scale, as well as a participant's maximum
#' number of missing items by scale. For example, Parent (2013) writes:
#'
#' *I recommend that authors (a) state their tolerance level for missing data by scale
#' or subscale (e.g., "We calculated means for all subscales on which participants gave
#' at least 75% complete data") and then (b) report the individual missingness rates
#' by scale per data point (i.e., the number of missing values out of all data points
#' on that scale for all participants) and the maximum by participant (e.g., "For Attachment
#' Anxiety, a total of 4 missing data points out of 100 were observed, with no participant
#' missing more than a single data point").*
Member

This sounds a bit too focused on survey data, while this function can be interesting for all kinds of data. I'd rather keep the first one or two sentences here and move the rest to a specific section in 'Details' (but even there, this seems very field-specific).

Member Author

I moved everything after "Some authors recommend" to @details.

Also, the way I see it, a lot of packages and functions can report basic missing data features, like skimr::skim() (that's the "easy" part). What is missing is a way to handle, as you highlight, survey data in that field-specific way. I thought it would still fit with datawizard even if it offers additional field-specific features, although we can probably try to make it more general for other users. In the Details section, I added a paragraph giving more context about scales as used in psychology:

#' In psychology, it is common to ask participants to answer questionnaires in
#' which people answer several questions about a specific topic. For example,
#' people could answer 10 different questions about how extroverted they are.
#' In turn, researchers calculate the average for those 10 questions (called
#' items). These questionnaires are called (e.g., Likert) "scales" (such as the
#' Rosenberg Self-Esteem Scale, also known as the RSES).

Member Author

@rempsyc rempsyc Dec 17, 2024

I suppose one question we have to answer is: do we want describe_missing() to only report basic, field-general missing info, a bit more like skim(), or do we also want it to include the features specific to the survey format? (Said another way, should we remove or keep the survey feature?)

#' missing more than a single data point").*
#'
#' @param data The data frame to be analyzed.
#' @param vars Variable (or lists of variables) to check for missing values (NAs).
Member

We use select, exclude, etc. in all other dataframe functions, I think we should here as well.

Member Author

@rempsyc rempsyc Dec 17, 2024

Here it works a little bit differently than select elsewhere. vars takes a list of character vectors (such as list(c("openness_1", "openness_2", "openness_3"), c("extroversion_1", "extroversion_2", "extroversion_3"))) to take into account the nested structure of the items / columns. I can rename it to select, but do you think it will create confusion or expectations that it should rely on and work with .select_nse? Or should we include select and exclude in addition to vars? I'm not sure how .select_nse could accommodate the nested structure like I'm doing right now 🤔
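To make the nested structure concrete, here is a minimal base R sketch of how a list of character vectors could be aggregated into one row per scale, including the per-participant maximum that Parent (2013) recommends reporting. `missing_per_scale()` and the `toy` data are hypothetical and only illustrate the idea; this is not the code in R/describe_missing.R.

```r
# Hypothetical helper, not the PR's implementation: `scales` is a list of
# character vectors, each naming the columns (items) of one scale.
missing_per_scale <- function(data, scales) {
  rows <- lapply(scales, function(cols) {
    na_mat <- is.na(data[, cols, drop = FALSE]) # logical matrix: rows x items
    per_row <- rowSums(na_mat)                  # missing items per participant
    data.frame(
      # Label a multi-column scale as "first:last", as in the output above.
      variable = if (length(cols) > 1) {
        paste0(cols[1], ":", cols[length(cols)])
      } else {
        cols
      },
      n_columns = length(cols),
      n_missing = sum(na_mat),
      cells = length(na_mat),
      missing_percent = round(100 * mean(na_mat), 2),
      missing_max = max(per_row),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data: two scales with two items each, four participants.
toy <- data.frame(
  openness_1 = c(1, NA, 3, 4),
  openness_2 = c(NA, NA, 2, 5),
  extroversion_1 = c(2, 3, NA, 1),
  extroversion_2 = c(4, 4, 4, 4)
)
out <- missing_per_scale(
  toy,
  list(c("openness_1", "openness_2"), c("extroversion_1", "extroversion_2"))
)
out
```

The grouping lives entirely in the list passed by the caller, which is why it does not map directly onto a flat `select` argument.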

#' @keywords missing values NA guidelines
#' @return A dataframe with the following columns:
#' - `var`: Variables selected.
#' - `items`: Number of items for selected variables.
Member

I think unique_values instead of items would be clearer.

Member Author

Hum, so in this case "number of items" refers to the number of columns selected for each "scale" or combination of variables. Maybe I should use that instead, as I'm afraid unique_values would suggest unique responses for a given column.

Member Author

@rempsyc rempsyc Dec 17, 2024

It is indeed specific, as in psychology we tend to think of variables as made of several "items". So items 1-10 create a variable such as a personality trait "extroversion". I'm not sure what to call it because "variable" might be confused with "scale" (i.e., a composite score). Maybe I could just rename that output column "columns", but I'm open to your suggestions if you have more. A more accurate name (for psychology) would be n_items, so perhaps we can do n_columns?

@rempsyc rempsyc marked this pull request as draft December 2, 2024 15:55
@rempsyc
Member Author

rempsyc commented Dec 17, 2024

Thanks for the feedback and comments! We can definitely rename the column names for more clarity e.g., to use missing_ instead of na_ and other suggestions (I initially chose na to make shorter column names so the whole output could fit on my rather narrow console). I can also add a new column complete_rate to mirror skim(). Otherwise, skim() and describe_missing() have the same relative structure (variables in the first column and aggregate stats on the other columns).

the default output looks unexpected to me (I'd rather expect one row per variable).

There is one row per variable / scale, but each variable / scale can be defined by multiple items / columns, and so the output has to be able to accommodate that (the current strategy is to use the : indicator to show which variables each row includes).

But if I understand correctly, you would like the default to report one row per column, for all columns, instead of reporting all columns as an aggregate (i.e., always exactly 1 row). Although for large datasets this would create a long output, that could work.

@rempsyc
Member Author

rempsyc commented Dec 17, 2024

Ok so I changed the default so that when no scales or variables are specified, all columns are reported on separate rows:

However, this behaviour is overridden if scales or variables are specified:

library(datawizard)

# Use the entire data frame
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)
describe_missing(df)
#>           variable n_columns n_missing cells missing_percent complete_percent
#> 1               ID         1         7    14           50.00            50.00
#> 2       openness_1         1         4    14           28.57            71.43
#> 3       openness_2         1         4    14           28.57            71.43
#> 4       openness_3         1         3    14           21.43            78.57
#> 5   extroversion_1         1         6    14           42.86            57.14
#> 6   extroversion_2         1         6    14           42.86            57.14
#> 7   extroversion_3         1         5    14           35.71            64.29
#> 8  agreeableness_1         1         3    14           21.43            78.57
#> 9  agreeableness_2         1         4    14           28.57            71.43
#> 10 agreeableness_3         1         3    14           21.43            78.57
#> 11           Total        10        45   140           32.14            67.86
#>    missing_max missing_max_percent all_missing
#> 1            1                 100           7
#> 2            1                 100           4
#> 3            1                 100           4
#> 4            1                 100           3
#> 5            1                 100           6
#> 6            1                 100           6
#> 7            1                 100           5
#> 8            1                 100           3
#> 9            1                 100           4
#> 10           1                 100           3
#> 11          10                 100           2

# If the questionnaire items start with the same name,
# one can list the scale names directly:
describe_missing(df, scales = c("ID", "openness", "extroversion", "agreeableness"))
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2

# Otherwise you can provide nested columns manually:
describe_missing(df,
                 select = list(
                   c("ID"),
                   c("openness_1", "openness_2", "openness_3"),
                   c("extroversion_1", "extroversion_2", "extroversion_3"),
                   c("agreeableness_1", "agreeableness_2", "agreeableness_3")
                 )
)
#>                          variable n_columns n_missing cells missing_percent
#> 1                              ID         1         7    14           50.00
#> 2           openness_1:openness_3         3        11    42           26.19
#> 3   extroversion_1:extroversion_3         3        17    42           40.48
#> 4 agreeableness_1:agreeableness_3         3        10    42           23.81
#> 5                           Total        10        45   140           32.14
#>   complete_percent missing_max missing_max_percent all_missing
#> 1            50.00           1                 100           7
#> 2            73.81           3                 100           3
#> 3            59.52           3                 100           3
#> 4            76.19           3                 100           3
#> 5            67.86          10                 100           2

Created on 2024-12-16 with reprex v2.1.1

@etiennebacher
Member

I feel like most unresolved comments and questions regarding the documentation and the implementation are related to the scope of this function. I'd rather have a "generalist" function à la skimr rather than something specialized for psychology that I think could live in the rempsyc package.

@easystats/core-team what do you think? Are you interested in having some of those field-specific features in this function?

@etiennebacher etiennebacher mentioned this pull request Dec 17, 2024
@mattansb
Member

I tend to agree. This function should be more general purpose, and maybe a psych-centric wrapper can be housed in @rempsyc's package (I also just now noticed your handle is the name of the package 😅)


codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 95.00000% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.25%. Comparing base (81dd0e0) to head (0e83588).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
R/describe_missing.R 95.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   91.14%   91.25%   +0.11%     
==========================================
  Files          76       77       +1     
  Lines        6045     6144      +99     
==========================================
+ Hits         5510     5607      +97     
- Misses        535      537       +2     


@DominiqueMakowski
Member

If I understand, the main outstanding issue is what to do with the "scales" argument. I would indeed remove it (soz Rémi ^^) and replace it with a by argument, as in our other functions. If users want to compute the amount of missing data per dimension, they should use a more traditional approach: first pivot to longer, then run describe_missing(select="item", by="dimension"). Otherwise I'm afraid it gets messy if we have a bespoke scales argument only for this function.

@rempsyc
Member Author

rempsyc commented Dec 18, 2024

Alright, in this case, I think I can introduce select, exclude, and by and make it more consistent with the rest of datawizard 🤓

@rempsyc
Member Author

rempsyc commented Dec 19, 2024

Alright, this is a much simplified version, which now also supports "by". So this is what I have so far:

library(datawizard)

describe_missing(airquality, select = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Ozone        37           24.18            75.82
#> 2  Solar.R         7            4.58            95.42
#> 3     Wind         0            0.00           100.00
#> 4     Temp         0            0.00           100.00
#> 5    Total        44            7.19            92.81

describe_missing(airquality, exclude = "Ozone:Temp")
#>   variable n_missing missing_percent complete_percent
#> 1    Month         0               0              100
#> 2      Day         0               0              100
#> 3    Total         0               0              100

# Testing the 'by' argument for survey scales
set.seed(15)
fun <- function() {
  c(sample(c(NA, 1:10), replace = TRUE), NA, NA, NA)
}
df <- data.frame(
  ID = c("idz", NA),
  openness_1 = fun(), openness_2 = fun(), openness_3 = fun(),
  extroversion_1 = fun(), extroversion_2 = fun(), extroversion_3 = fun(),
  agreeableness_1 = fun(), agreeableness_2 = fun(), agreeableness_3 = fun()
)

df_long <- reshape_longer(
  df,
  select = -1,
  names_sep = "_",
  names_to = c("dimension", "item"))

describe_missing(df_long, 
                 select = -c(1, 3), 
                 by = "dimension")
#>        variable n_missing missing_percent complete_percent
#> 1 agreeableness        10           23.81            76.19
#> 2  extroversion        17           40.48            59.52
#> 3      openness        11           26.19            73.81
#> 4         Total        38           15.08            84.92

Created on 2024-12-19 with reprex v2.1.1

Anything else you'd find desirable in the function?

@@ -16,6 +16,10 @@ BREAKING CHANGES AND DEPRECATIONS
- if `select` (previously `pattern`) is a named vector, then all elements
must be named, e.g. `c(length = "Sepal.Length", "Sepal.Width")` errors.

NEW FUNCTIONS

* `describe_missing()`, to report on missing values in a data frame.
Member

Suggested change
- * `describe_missing()`, to report on missing values in a data frame.
+ * `describe_missing()`, to report on missing values in a data frame (#561).

#' @title Describe Missing Values in Data According to Guidelines
#'
#' @description Provides a detailed description of missing values in a data frame.
#' This function reports both absolute and percentage missing values of specified
Member

Suggested change
- #' This function reports both absolute and percentage missing values of specified
+ #' This function reports both absolute number and percentage of missing values of specified

Comment on lines +10 to +11
#' variables and summary statistics will be computed for each group. Useful
#' for survey data by first reshaping the data to the long format.
Member

I think the last sentence is very specific and hard to understand. This use case is part of the conversation so we have it in mind, but it's a bit obscure for an external reader. It should be removed IMO.

Comment on lines +12 to +13
#' @param sort Logical. Whether to sort the result from highest to lowest
#' percentage of missing data.
Member

Given this can be done with an extra data_arrange(), I don't think it's necessary to add this argument.
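As a rough illustration of why a dedicated `sort` argument may be unnecessary, the result can be reordered after the fact. The sketch below uses base R `order()` on a toy data frame whose values are copied from the airquality example in this thread (rows deliberately shuffled, "Total" row omitted). With datawizard, something along the lines of `data_arrange(out, "-missing_percent")` should give the same ordering, but that call is an assumption here and was not run.

```r
# Toy summary shaped like the describe_missing() output shown in this thread;
# the object name `out` is only illustrative.
out <- data.frame(
  variable = c("Solar.R", "Wind", "Ozone", "Temp"),
  n_missing = c(7, 0, 37, 0),
  missing_percent = c(4.58, 0, 24.18, 0)
)

# Base R: highest percentage of missing data first.
sorted <- out[order(out$missing_percent, decreasing = TRUE), ]
sorted
```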

#' - `n_missing`: Number of missing values.
#' - `missing_percent`: Percentage of missing values.
#' - `complete_percent`: Percentage of non-missing values.
#' @param ... Arguments passed down to other functions. Currently not used.
Member

Do we actually need ...?

If we keep it, it should be positioned before @return

Comment on lines +44 to +48
#' describe_missing(
#' df_long,
#' select = -c(1, 3),
#' by = "dimension"
#' )
Member

This fails with an unclear message if there is more than one variable in by.

The way this argument works is also not very clear to me. For instance, I'd find it more natural if the by variables were used to return a list of dataframes, e.g.:

# group 1, subgroup 1
<output of describe_missing() for this particular (nested) group>

# group 1, subgroup 2
<output of describe_missing() for this particular (nested) group>

etc.
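The list-of-data-frames behaviour described above could look roughly like the following base R sketch. `missing_by_group()` is a hypothetical helper, not the PR's implementation: it splits the data on the `by` column(s) and summarises the remaining columns within each group. With several `by` variables, `split()` names the list elements after the combined levels (e.g. "5.1").

```r
# Hypothetical sketch, not the PR's implementation: one summary per group.
missing_by_group <- function(data, by) {
  value_cols <- setdiff(names(data), by)
  # One data frame per (combination of) group level(s); empty groups dropped.
  groups <- split(data[value_cols], data[by], drop = TRUE)
  lapply(groups, function(g) {
    n_missing <- vapply(g, function(x) sum(is.na(x)), integer(1))
    data.frame(
      variable = names(g),
      n_missing = unname(n_missing),
      missing_percent = round(100 * unname(n_missing) / nrow(g), 2),
      stringsAsFactors = FALSE
    )
  })
}

res <- missing_by_group(airquality, by = "Month")
names(res)
res[["5"]]
```

Because the `by` columns are removed before summarising, the caller does not need to exclude them through `select`.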

Member

Additionally, the current implementation means that select has to be used to exclude the non-by variables, otherwise the output is weird:

> describe_missing(df_long, by = "dimension")
        variable n_missing missing_percent complete_percent
1  agreeableness        21           50.00            50.00
2  agreeableness         0            0.00           100.00
3  agreeableness        10           23.81            76.19
4   extroversion        21           50.00            50.00
5   extroversion         0            0.00           100.00
6   extroversion        17           40.48            59.52
7       openness        21           50.00            50.00
8       openness         0            0.00           100.00
9       openness        11           26.19            73.81
10         Total       101           20.04            79.96

Comment on lines +14 to +18
#' @return A dataframe with the following columns:
#' - `variable`: Variables selected.
#' - `n_missing`: Number of missing values.
#' - `missing_percent`: Percentage of missing values.
#' - `complete_percent`: Percentage of non-missing values.
Member

Should it also have the total number of obs for better comparison? Although this number would be repeated for all rows...

5 participants