Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prep quietly erases factor levels when strings_as_factors = TRUE (the default) #715

Open
mrkaye97 opened this issue Jun 1, 2021 · 2 comments
Labels
feature a feature request or enhancement

Comments

@mrkaye97
Copy link

mrkaye97 commented Jun 1, 2021

Hey everyone, just spent a couple hours debugging an issue where prep() was quietly replacing missing factor levels with NA in a variable that ultimately wasn't included in any preprocessing steps in my recipe or in my model (trained later). In my case, I ended up wanting to keep foo around to refer back to, but noticed I had a handful of missing values. It took me a while to figure out that this was happening because of strings_as_factors = TRUE (by default).

library(tidyverse)
library(tidymodels)

train <- tibble(
  foo = c('a', 'b', 'c'),
  bar = 1:3,
  baz = 10:12
)

test <- tibble(
  foo = c('a', 'b', 'd'),
  bar = 6:8,
  baz = 100:102
)

r <- recipe(baz ~ ., train) %>%
  prep()

test_2 <- bake(r, test)

test_2
#> # A tibble: 3 x 3
#>   foo     bar   baz
#>   <fct> <int> <int>
#> 1 a         6   100
#> 2 b         7   101
#> 3 NA        8   102

Created on 2021-06-01 by the reprex package (v1.0.0)

On much more complex data, of course. As it turned out, prep() was the issue, since there's a strings_as_factors argument I didn't know about. Here's the fixed result:

library(tidyverse)
library(tidymodels)

train <- tibble(
  foo = c('a', 'b', 'c'),
  bar = 1:3,
  baz = 10:12
)

test <- tibble(
  foo = c('a', 'b', 'd'),
  bar = 6:8,
  baz = 100:102
)

r <- recipe(baz ~ ., train) %>%
  prep(strings_as_factors = FALSE)

test_2 <- bake(r, test)

test_2
#> # A tibble: 3 x 3
#>   foo     bar   baz
#>   <chr> <int> <int>
#> 1 a         6   100
#> 2 b         7   101
#> 3 d         8   102

Created on 2021-06-01 by the reprex package (v1.0.0)

It makes sense to me why trying to coerce the test set's foos to a factor would give me an NA on d, since d doesn't exist in the training set.

That said, this seems like the type of thing that would be helpful to throw a warning about. That would also be more in line with this case (where we get the warning I'd expect):

library(tidyverse)
library(tidymodels)

train <- tibble(
  foo = c('a', 'b', 'c'),
  bar = 1:3,
  baz = 10:12
)

test <- tibble(
  foo = c('a', 'b', 'd'),
  bar = 6:8,
  baz = 100:102
)

r <- recipe(baz ~ ., train) %>%
  step_dummy(-baz, -bar) %>%
  prep()

test_2 <- bake(r, test)
#> Warning: There are new levels in a factor: d

test_2
#> # A tibble: 3 x 4
#>     bar   baz foo_b foo_c
#>   <int> <int> <dbl> <dbl>
#> 1     6   100     0     0
#> 2     7   101     1     0
#> 3     8   102    NA    NA

Created on 2021-06-01 by the reprex package (v1.0.0)

but it seems to me like it'd be worthwhile to just give me a heads up (warning or otherwise) that bake found some factor levels that were missing in the data that the recipe was prepped on, and that they were replaced.

Let me know if anyone else has thoughts on this! Just seemed like a dangerous default / dangerous default behavior to me.

Thanks!

@juliasilge
Copy link
Member

juliasilge commented Jun 7, 2021

Thanks so much for this report @mrkaye97! 🙌 That sounds super frustrating, if you weren't expecting the recipe to learn factor levels from your training data and apply to your testing data. We are currently in the process of moving where the strings_as_factors arg lives, from prep() to the recipe itself (outlined in #331), which may not have helped you in this case but we do think should be surfaced more clearly.

We may want to consider a more general warning for the levels found during bake() rather than step-level warnings.

@mrkaye97
Copy link
Author

mrkaye97 commented Jun 8, 2021

Thanks @juliasilge! Just looked into #331 and #706 and they both look good to me! Agreed on the general warning for the levels found during bake() though. AFAICT, this issue would only really surface in a case like mine -- where the variable with the level being dropped isn't actually used in the recipe, workflow, model, etc.

Do you think that a warning like this might work, or is this too brittle of a solution? (ccing @topepo here too)

#' @importFrom dplyr setdiff
strings2factors <- function(x, info) {
  check_lvls <- has_lvls(info)
  if (!any(check_lvls)) {
    return(x)
  }
  info <- info[check_lvls]
  vars <- names(info)
  info <- info[vars %in% names(x)]
  for (i in seq_along(info)) {
    lcol <- names(info)[i]

    ## check for missing factor levels
    missing_levels <- setdiff(
      as.character(x[[lcol]]),
      info[[i]]$values
    )
    
    ## if any levels are missing, warn the user about which will be coerced to `NA`
    if (length(missing_levels) != 0) {
      rlang::warn(
        sprintf("The following factor levels have been coerced to NA: %s",
                paste(missing_levels, collapse = ', '))
      )
    }

    x[, lcol] <-
      factor(as.character(x[[lcol]]),
             levels = info[[i]]$values,
             ordered = info[[i]]$ordered)
  }
  x
}

It seems like this type of approach could work to me, but I haven't looked through the rest of the code to know if this would have problematic effects on other uses of strings2factors.

Let me know what you think! Happy to PR this fix.

Separately -- thanks for all the awesome work you all are doing! Love the package and the whole tidymodels universe!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants