prep quietly erases factor levels when strings_as_factors = TRUE (the default) #715
Comments
Thanks so much for this report @mrkaye97! 🙌 That sounds super frustrating, if you weren't expecting the recipe to learn factor levels from your training data and apply to your testing data. We are currently in the process of moving where the We may want to consider a more general warning for the levels found during
Thanks @juliasilge! Just looked into #331 and #706 and they both look good to me! Agreed on the general warning for the levels found during Do you think that a warning like this might work, or is this too brittle of a solution? (ccing @topepo here too)

    #' @importFrom dplyr setdiff
    strings2factors <- function(x, info) {
      check_lvls <- has_lvls(info)
      if (!any(check_lvls)) {
        return(x)
      }
      info <- info[check_lvls]
      vars <- names(info)
      info <- info[vars %in% names(x)]
      for (i in seq_along(info)) {
        lcol <- names(info)[i]

        ## check for missing factor levels
        missing_levels <- setdiff(
          as.character(x[[lcol]]),
          info[[i]]$values
        )

        ## if any levels are missing, warn the user about which will be coerced to `NA`
        if (length(missing_levels) != 0) {
          rlang::warn(
            sprintf("The following factor levels have been coerced to NA: %s",
                    paste(missing_levels, collapse = ', '))
          )
        }

        x[, lcol] <-
          factor(as.character(x[[lcol]]),
                 levels = info[[i]]$values,
                 ordered = info[[i]]$ordered)
      }
      x
    }

It seems like this type of approach could work to me, but I haven't looked through the rest of the code to know if this would have problematic effects on other uses of Let me know what you think! Happy to PR this fix.

Separately -- thanks for all the awesome work you all are doing! Love the package and the whole
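The check proposed in the comment above can be tried in isolation. What follows is only a minimal base-R sketch under stated assumptions, not the recipes implementation: `warn_unseen_levels()` is a hypothetical helper name, `known` stands in for `info[[i]]$values`, and base `warning()` and `setdiff()` are used in place of `rlang::warn()` and the `dplyr` import. One difference from the proposal is that `NA` is dropped before the comparison, so values that were already missing are not reported as coerced.

```r
# Hypothetical helper (not part of recipes): warn about values that are not
# among the known levels, then coerce to a factor with exactly those levels.
warn_unseen_levels <- function(values, known, ordered = FALSE) {
  values <- as.character(values)

  # Values present in the new data but absent from the known levels.
  # NA is dropped first so pre-existing missing values are not reported.
  unseen <- setdiff(values[!is.na(values)], known)

  if (length(unseen) > 0) {
    warning(
      sprintf(
        "The following factor levels have been coerced to NA: %s",
        paste(unseen, collapse = ", ")
      ),
      call. = FALSE
    )
  }

  factor(values, levels = known, ordered = ordered)
}

# "d" is not a known level, so the helper warns and returns NA in that slot.
# The handler records the warning text and muffles it so the script continues.
caught <- character(0)
res <- withCallingHandlers(
  warn_unseen_levels(c("a", "b", "d"), known = c("a", "b", "c")),
  warning = function(w) {
    caught <<- c(caught, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
print(caught)
print(is.na(res))
```

Whether a warning on every unseen level is too noisy for other callers of `strings2factors()` is the open question raised in the comment; the sketch does not settle that.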
Hey everyone, just spent a couple hours debugging an issue where prep() was quietly replacing missing factor levels with NA in a variable that ultimately wasn't included in any preprocessing steps in my recipe or in my model (trained later). In my case, I ended up wanting to keep foo around to refer back to, but noticed I had a handful of missing values. It took me a while to figure out that this was happening because of strings_as_factors = TRUE (by default).
Created on 2021-06-01 by the reprex package (v1.0.0)
On much more complex data, of course. As it turned out, prep() was the issue, since there's a strings_as_factors argument I didn't know about. Here's the fixed result:
Created on 2021-06-01 by the reprex package (v1.0.0)
It makes sense to me why trying to coerce the test set's foos to a factor would give me an NA on d, since d doesn't exist in the training set.
That said, this seems like the type of thing that would be helpful to throw a warning about. That would also be more in line with this case (where we get the warning I'd expect):
Created on 2021-06-01 by the reprex package (v1.0.0)
but it seems to me like it'd be worthwhile to just give me a heads up (warning or otherwise) that bake found some factor levels that were missing in the data that the recipe was prepped on, and that they were replaced.
Let me know if anyone else has thoughts on this! Just seemed like a dangerous default / dangerous default behavior to me.
Thanks!
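For readers without the original reprex output, the underlying mechanism can be illustrated in base R alone. This is a sketch with made-up toy values, not the reporter's data or the recipes code path: it only assumes that the levels are fixed from the training data and then applied to new data, as described above. `factor()` maps any value outside `levels` to `NA` and does not warn.

```r
# Toy training column; its unique values become the learned levels.
train_foo <- c("a", "b", "c")
train_levels <- sort(unique(train_foo))

# New data containing "d", which never appeared in training.
test_foo <- c("a", "d")

# factor() maps values outside `levels` to NA without emitting a warning.
test_factor <- factor(test_foo, levels = train_levels)
print(test_factor)

# Count of values that were silently turned into NA.
n_lost <- sum(is.na(test_factor))
print(n_lost)
```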