From 154b54c62671c210a53e9eb1a6f790cb06860ecd Mon Sep 17 00:00:00 2001
From: topepo
Date: Fri, 11 Aug 2023 14:12:58 -0400
Subject: [PATCH 1/4] initial changes for new validation interface

---
 05-data-spending.Rmd            | 21 ++++++++++++++++++++-
 07-the-model-workflow.Rmd       |  4 ++++
 10-resampling.Rmd               | 15 +++++++++++----
 16-dimensionality-reduction.Rmd | 15 ++++++++-------
 DESCRIPTION                     |  5 +++--
 5 files changed, 46 insertions(+), 14 deletions(-)

diff --git a/05-data-spending.Rmd b/05-data-spending.Rmd
index d4291b5..68d32cc 100644
--- a/05-data-spending.Rmd
+++ b/05-data-spending.Rmd
@@ -112,7 +112,26 @@ It is common to hear about _validation sets_ as an answer to this question, espe
 Whether validation sets are a subset of the training set or a third allocation in the initial split of the data largely comes down to semantics.
 :::
 
-Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set.
+Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set. If you are going to use a validation set, you can start with a different splitting function^[This interface is available as of rsample version 1.2.0 (circa September 2023).]:
+
+```{r ames-val-split, message = FALSE, warning = FALSE}
+set.seed(52)
+# To put 60% into training, 20% in validation, and 20% in testing:
+ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2))
+ames_val_split
+```
+
+Printing the split now shows the size of the training set (`r format(nrow(training(ames_val_split)), big.mark = ",")`), validation set (`r format(nrow(validation(ames_val_split)), big.mark = ",")`), and test set (`r format(nrow(testing(ames_val_split)), big.mark = ",")`).
+
+To get the training and testing data, the same syntax is used:
+
+```{r ames-val-data, eval = FALSE}
+ames_train <- training(ames_val_split)
+ames_test <- testing(ames_val_split)
+ames_val <- validation(ames_val_split)
+```
+
+Section \@ref(validation) will demonstrate how to use the `ames_val_split` for resampling and model optimization
 
 ## Multilevel Data
 
diff --git a/07-the-model-workflow.Rmd b/07-the-model-workflow.Rmd
index 3ef47ba..b94495f 100644
--- a/07-the-model-workflow.Rmd
+++ b/07-the-model-workflow.Rmd
@@ -323,6 +323,10 @@ collect_predictions(final_lm_res) %>% slice(1:5)
 
 We'll see more about `last_fit()` in action and how to use it again in Section \@ref(bean-models).
 
+:::rmdnote
+When using validation sets, `last_fit()` has an option called `add_validation_set` to specify if we should train the final model solely on the training set (the default) or the combination of the training and validation sets.
+:::
+
 ## Chapter Summary {#workflows-summary}
 
 In this chapter, you learned that the modeling process encompasses more than just estimating the parameters of an algorithm that connects predictors to an outcome. This process also includes preprocessing steps and operations taken after a model is fit. We introduced a concept called a *model workflow* that can capture the important components of the modeling process. Multiple workflows can also be created inside of a *workflow set*. The `last_fit()` function is convenient for fitting a final model to the training set and evaluating with the test set. 
diff --git a/10-resampling.Rmd b/10-resampling.Rmd
index 5008398..853d1b1 100644
--- a/10-resampling.Rmd
+++ b/10-resampling.Rmd
@@ -302,15 +302,22 @@ With the `r pkg(rsample)` package, a validation set is like any other resampling
 knitr::include_graphics("premade/validation-alt.svg")
 ```
 
-To create a validation set object that uses 3/4 of the data for model fitting:
-
+To build on the code from Section \@ref(what-about-a-validation-set), the function `validation_set()` can take the result of `initial_validation_split()` and convert it to an `rset` object that is similar to the ones produced by functions such as `vfold_cv()`:
 ```{r resampling-validation-split}
-set.seed(1002)
-val_set <- validation_split(ames_train, prop = 3/4)
+# Previously:
+
+set.seed(52)
+# To put 60% into training, 20% in validation, and 20% in testing:
+ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2))
+ames_val_split
+
+# Object used for resampling:
+val_set <- validation_set(ames_val_split)
 val_set
 ```
 
+As you'll see in Section \@ref(resampling-performance), the `fit_resamples()` function will be used to compute correct estimates of performance using resampling. The `val_set` object can be used in this and other functions even though it is a single "resample" of the data.
 
 ### Bootstrapping {#bootstrap}
 
diff --git a/16-dimensionality-reduction.Rmd b/16-dimensionality-reduction.Rmd
index cfad3b9..e5dd161 100644
--- a/16-dimensionality-reduction.Rmd
+++ b/16-dimensionality-reduction.Rmd
@@ -103,17 +103,19 @@ For our analyses, we start by holding back a testing set with `initial_split()`.
 ```{r dimensionality-split}
 set.seed(1601)
-bean_split <- initial_split(beans, strata = class, prop = 3/4)
+bean_split <- initial_validation_split(beans, strata = class, prop = c(0.75, 0.125))
+bean_split
 
 bean_train <- training(bean_split)
 bean_test <- testing(bean_split)
+bean_validation <- validation(bean_split)
 
 set.seed(1602)
-bean_val <- validation_split(bean_train, strata = class, prop = 4/5)
+bean_val <- validation_set(bean_split)
 bean_val$splits[[1]]
 ```
 
-To visually assess how well different methods perform, we can estimate the methods on the training set (n = `r analysis(bean_val$splits[[1]]) %>% nrow()` beans) and display the results using the validation set (n = `r assessment(bean_val$splits[[1]]) %>% nrow()`).
+To visually assess how well different methods perform, we can estimate the methods on the training set (n = `r format(nrow(bean_train), big.mark = ",")` beans) and display the results using the validation set (n = `r format(nrow(bean_validation), big.mark = ",")`).
 
 Before beginning any dimensionality reduction, we can spend some time investigating our data. Since we know that many of these shape features are probably measuring similar concepts, let's take a look at the correlation structure of the data in Figure \@ref(fig:beans-corr-plot) using this code.
 
@@ -145,7 +147,7 @@ It's time to look at the beans data in a smaller space. 
We can start with a basi library(bestNormalize) bean_rec <- # Use the training data from the bean_val split object - recipe(class ~ ., data = analysis(bean_val$splits[[1]])) %>% + recipe(class ~ ., data = bean_train) %>% step_zv(all_numeric_predictors()) %>% step_orderNorm(all_numeric_predictors()) %>% step_normalize(all_numeric_predictors()) @@ -224,7 +226,6 @@ Using `bake()` with a recipe is much like using `predict()` with a model; the op For example, the validation set samples can be processed: ```{r dimensionality-bake} -bean_validation <- bean_val$splits %>% pluck(1) %>% assessment() bean_val_processed <- bake(bean_rec_trained, new_data = bean_validation) ``` @@ -260,7 +261,7 @@ First, as previously mentioned, using `prep(recipe, retain = TRUE)` keeps the ex ```{r dimensionality-new-data-null} bake(bean_rec_trained, new_data = NULL) %>% nrow() -bean_val$splits %>% pluck(1) %>% analysis() %>% nrow() +bean_train %>% nrow() ``` If the training set is not pathologically large, using this value of `retain` can save a lot of computational time. @@ -275,7 +276,7 @@ Since recipes are the primary option in tidymodels for dimensionality reduction, ```{r dimensionality-function} library(ggforce) -plot_validation_results <- function(recipe, dat = assessment(bean_val$splits[[1]])) { +plot_validation_results <- function(recipe, dat = bean_validation) { recipe %>% # Estimate any additional steps prep() %>% diff --git a/DESCRIPTION b/DESCRIPTION index 9ce3065..8ce998d 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -67,7 +67,7 @@ Imports: rlang, rmarkdown, rpart, - rsample (>= 0.0.9), + rsample (>= 1.2.0), rstanarm, rules, sessioninfo, @@ -88,7 +88,8 @@ Imports: xgboost, yardstick Remotes: - tidymodels/learntidymodels + tidymodels/learntidymodels, + tidymodels/rsample biocViews: mixOmics Encoding: UTF-8 SystemRequirements: FFmpeg (>= 3.2); with at least libx264 and lame (mp3) From e50bcb580285d15dd87cb00269f6c20db0e834b1 Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Fri, 25 Aug 2023 11:29:58 -0400 Subject: [PATCH 2/4] rsample now on cran --- DESCRIPTION | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/DESCRIPTION b/DESCRIPTION index 8ce998d..6b600e0 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -88,8 +88,7 @@ Imports: xgboost, yardstick Remotes: - tidymodels/learntidymodels, - tidymodels/rsample + tidymodels/learntidymodels biocViews: mixOmics Encoding: UTF-8 SystemRequirements: FFmpeg (>= 3.2); with at least libx264 and lame (mp3) From 53c4bd7135a74b46b59e8d2a41d410b24cdbad5b Mon Sep 17 00:00:00 2001 From: Max Kuhn Date: Tue, 5 Sep 2023 13:22:32 -0400 Subject: [PATCH 3/4] a little more explanation --- 16-dimensionality-reduction.Rmd | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/16-dimensionality-reduction.Rmd b/16-dimensionality-reduction.Rmd index e5dd161..84cbb4b 100644 --- a/16-dimensionality-reduction.Rmd +++ b/16-dimensionality-reduction.Rmd @@ -106,11 +106,14 @@ set.seed(1601) bean_split <- initial_validation_split(beans, strata = class, prop = c(0.75, 0.125)) bean_split +# Return data frames: bean_train <- training(bean_split) -bean_test <- testing(bean_split) +bean_test <- testing(bean_split) bean_validation <- validation(bean_split) + set.seed(1602) +# Return an 'rset' object to use with the tune functions: bean_val <- validation_set(bean_split) bean_val$splits[[1]] ``` From 6f67adc27c9b2b01ee2f466b45fb2c2eb70a80ca Mon Sep 17 00:00:00 2001 From: Julia Silge Date: Wed, 6 Sep 2023 10:27:55 -0600 Subject: [PATCH 4/4] Apply suggestions from code review 
---
 05-data-spending.Rmd      | 4 ++--
 07-the-model-workflow.Rmd | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/05-data-spending.Rmd b/05-data-spending.Rmd
index 68d32cc..59cf3d8 100644
--- a/05-data-spending.Rmd
+++ b/05-data-spending.Rmd
@@ -123,7 +123,7 @@ ames_val_split
 
 Printing the split now shows the size of the training set (`r format(nrow(training(ames_val_split)), big.mark = ",")`), validation set (`r format(nrow(validation(ames_val_split)), big.mark = ",")`), and test set (`r format(nrow(testing(ames_val_split)), big.mark = ",")`).
 
-To get the training and testing data, the same syntax is used:
+To get the training, validation, and testing data, the same syntax is used:
 
 ```{r ames-val-data, eval = FALSE}
 ames_train <- training(ames_val_split)
@@ -131,7 +131,7 @@ ames_test <- testing(ames_val_split)
 ames_val <- validation(ames_val_split)
 ```
 
-Section \@ref(validation) will demonstrate how to use the `ames_val_split` for resampling and model optimization
+Section \@ref(validation) will demonstrate how to use the `ames_val_split` object for resampling and model optimization.
 
 ## Multilevel Data
 
diff --git a/07-the-model-workflow.Rmd b/07-the-model-workflow.Rmd
index b94495f..d85868f 100644
--- a/07-the-model-workflow.Rmd
+++ b/07-the-model-workflow.Rmd
@@ -324,7 +324,7 @@ collect_predictions(final_lm_res) %>% slice(1:5)
 We'll see more about `last_fit()` in action and how to use it again in Section \@ref(bean-models).
 
 :::rmdnote
-When using validation sets, `last_fit()` has an option called `add_validation_set` to specify if we should train the final model solely on the training set (the default) or the combination of the training and validation sets.
+When using validation sets, `last_fit()` has an argument called `add_validation_set` to specify if we should train the final model solely on the training set (the default) or the combination of the training and validation sets.
 :::
 
 ## Chapter Summary {#workflows-summary}
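
Taken together, the four patches move the book from `validation_split()` to the three-way `initial_validation_split()` interface. For reviewers, here is a minimal sketch of how the pieces introduced above fit together. It assumes `library(tidymodels)` and the `ames` data from the modeldata package; the simple `Sale_Price ~ Gr_Liv_Area` workflow is illustrative only and is not part of the patches.

```r
library(tidymodels)   # attaches rsample (>= 1.2.0), tune, workflows, parsnip, ...

data(ames, package = "modeldata")

set.seed(52)
# Three-way split: 60% training, 20% validation, 20% testing
ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2))

# Three data frames, one accessor each:
ames_train <- training(ames_val_split)
ames_val   <- validation(ames_val_split)
ames_test  <- testing(ames_val_split)

# A single-resample rset for fit_resamples() and the other tune functions:
val_set <- validation_set(ames_val_split)

# Illustrative workflow only; not part of the patches above:
lm_wflow <-
  workflow() %>%
  add_model(linear_reg()) %>%
  add_formula(Sale_Price ~ Gr_Liv_Area)

# Estimate performance on the validation set:
val_res <- fit_resamples(lm_wflow, resamples = val_set)
collect_metrics(val_res)

# last_fit() accepts the three-way split directly; add_validation_set = TRUE
# refits the final model on training + validation before scoring the test set:
final_res <- last_fit(lm_wflow, ames_val_split, add_validation_set = TRUE)
collect_metrics(final_res)
```

The `add_validation_set` argument surfaces a real trade-off: once tuning is finished, folding the validation set back into the final fit trains on more data, at the cost of using rows that informed model selection.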