21 changes: 20 additions & 1 deletion 05-data-spending.Rmd
@@ -112,7 +112,26 @@ It is common to hear about _validation sets_ as an answer to this question, espe
Whether validation sets are a subset of the training set or a third allocation in the initial split of the data largely comes down to semantics.
:::

Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set.
Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set. If you are going to use a validation set, you can start with a different splitting function^[This interface is available as of rsample version 1.2.0 (circa September 2023).]:

```{r ames-val-split, message = FALSE, warning = FALSE}
set.seed(52)
# To put 60% into training, 20% in validation, and 20% in testing:
ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2))
ames_val_split
```

Printing the split now shows the size of the training set (`r format(nrow(training(ames_val_split)), big.mark = ",")`), validation set (`r format(nrow(validation(ames_val_split)), big.mark = ",")`), and test set (`r format(nrow(testing(ames_val_split)), big.mark = ",")`).

To get the training, validation, and testing data, the same syntax is used:

```{r ames-val-data, eval = FALSE}
ames_train <- training(ames_val_split)
ames_test <- testing(ames_val_split)
ames_val <- validation(ames_val_split)
```

Section \@ref(validation) will demonstrate how to use the `ames_val_split` object for resampling and model optimization.

## Multilevel Data

4 changes: 4 additions & 0 deletions 07-the-model-workflow.Rmd
@@ -323,6 +323,10 @@ collect_predictions(final_lm_res) %>% slice(1:5)

We'll see more about `last_fit()` in action and how to use it again in Section \@ref(bean-models).

:::rmdnote
When using validation sets, `last_fit()` has an argument called `add_validation_set` to specify whether the final model should be trained solely on the training set (the default) or on the combination of the training and validation sets; a sketch of this option follows the note.
:::
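
A minimal sketch of supplying that option, assuming a finalized workflow (called `final_wf` here, which is not defined in this diff) and the `ames_val_split` object created in Chapter 5:

```{r workflow-validation-sketch, eval = FALSE}
# Hypothetical sketch: `final_wf` stands in for a finalized workflow.
final_val_res <-
  final_wf %>%
  last_fit(ames_val_split, add_validation_set = TRUE)

# Test set performance for the model refit on training + validation data:
collect_metrics(final_val_res)
```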

## Chapter Summary {#workflows-summary}

In this chapter, you learned that the modeling process encompasses more than just estimating the parameters of an algorithm that connects predictors to an outcome. This process also includes preprocessing steps and operations taken after a model is fit. We introduced a concept called a *model workflow* that can capture the important components of the modeling process. Multiple workflows can also be created inside of a *workflow set*. The `last_fit()` function is convenient for fitting a final model to the training set and evaluating with the test set.
15 changes: 11 additions & 4 deletions 10-resampling.Rmd
@@ -302,15 +302,22 @@ With the `r pkg(rsample)` package, a validation set is like any other resampling
knitr::include_graphics("premade/validation-alt.svg")
```

To create a validation set object that uses 3/4 of the data for model fitting:

To build on the code from Section \@ref(what-about-a-validation-set), the function `validation_set()` can take the results of `initial_validation_split()` and convert them to an rset object similar to the ones produced by functions such as `vfold_cv()`:

```{r resampling-validation-split}
set.seed(1002)
val_set <- validation_split(ames_train, prop = 3/4)
# Previously:

set.seed(52)
# To put 60% into training, 20% in validation, and 20% in testing:
ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2))
ames_val_split

# Object used for resampling:
val_set <- validation_set(ames_val_split)
val_set
```

As you'll see in Section \@ref(resampling-performance), the `fit_resamples()` function will be used to compute correct estimates of performance using resampling. The `val_set` object can be used in this and other functions even though it is a single "resample" of the data.
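
As a preview, a minimal sketch of that usage, assuming a workflow such as the `lm_wflow` object built for the Ames data in earlier chapters:

```{r resampling-validation-preview, eval = FALSE}
# Hypothetical sketch: `lm_wflow` is the Ames linear regression workflow
# from earlier chapters. The model is fit on the training data and its
# performance is estimated on the single validation "resample".
val_res <- fit_resamples(lm_wflow, resamples = val_set)
collect_metrics(val_res)
```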

### Bootstrapping {#bootstrap}

20 changes: 12 additions & 8 deletions 16-dimensionality-reduction.Rmd
@@ -103,17 +103,22 @@ For our analyses, we start by holding back a testing set with `initial_split()`.

```{r dimensionality-split}
set.seed(1601)
bean_split <- initial_split(beans, strata = class, prop = 3/4)
bean_split <- initial_validation_split(beans, strata = class, prop = c(0.75, 0.125))
bean_split

# Return data frames:
bean_train <- training(bean_split)
bean_test <- testing(bean_split)
bean_test <- testing(bean_split)
bean_validation <- validation(bean_split)


set.seed(1602)
bean_val <- validation_split(bean_train, strata = class, prop = 4/5)
# Return an 'rset' object to use with the tune functions:
bean_val <- validation_set(bean_split)
bean_val$splits[[1]]
```

To visually assess how well different methods perform, we can estimate the methods on the training set (n = `r analysis(bean_val$splits[[1]]) %>% nrow()` beans) and display the results using the validation set (n = `r assessment(bean_val$splits[[1]]) %>% nrow()`).
To visually assess how well different methods perform, we can estimate the methods on the training set (n = `r format(nrow(bean_train), big.mark = ",")` beans) and display the results using the validation set (n = `r format(nrow(bean_validation), big.mark = ",")`).

Before beginning any dimensionality reduction, we can spend some time investigating our data. Since we know that many of these shape features are probably measuring similar concepts, let's take a look at the correlation structure of the data in Figure \@ref(fig:beans-corr-plot) using this code.
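
The chunk itself is collapsed in this diff; as a minimal sketch, the correlations could be computed and plotted along these lines (assuming the corrplot package, which may differ from the code actually used in the book):

```{r beans-corr-sketch, eval = FALSE}
library(dplyr)
library(corrplot)

# Hypothetical sketch: correlation matrix of the numeric shape predictors
# (the outcome column `class` is dropped), drawn with one ellipse per pair.
beans %>%
  select(-class) %>%
  cor() %>%
  corrplot(method = "ellipse", tl.col = "black")
```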

@@ -145,7 +150,7 @@ It's time to look at the beans data in a smaller space. We can start with a basi
library(bestNormalize)
bean_rec <-
# Use the training data from the bean_val split object
recipe(class ~ ., data = analysis(bean_val$splits[[1]])) %>%
recipe(class ~ ., data = bean_train) %>%
step_zv(all_numeric_predictors()) %>%
step_orderNorm(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
@@ -224,7 +229,6 @@ Using `bake()` with a recipe is much like using `predict()` with a model; the op
For example, the validation set samples can be processed:

```{r dimensionality-bake}
bean_validation <- bean_val$splits %>% pluck(1) %>% assessment()
bean_val_processed <- bake(bean_rec_trained, new_data = bean_validation)
```

@@ -260,7 +264,7 @@ First, as previously mentioned, using `prep(recipe, retain = TRUE)` keeps the ex

```{r dimensionality-new-data-null}
bake(bean_rec_trained, new_data = NULL) %>% nrow()
bean_val$splits %>% pluck(1) %>% analysis() %>% nrow()
bean_train %>% nrow()
```

If the training set is not pathologically large, using this value of `retain` can save a lot of computational time.
@@ -275,7 +279,7 @@ Since recipes are the primary option in tidymodels for dimensionality reduction,

```{r dimensionality-function}
library(ggforce)
plot_validation_results <- function(recipe, dat = assessment(bean_val$splits[[1]])) {
plot_validation_results <- function(recipe, dat = bean_validation) {
recipe %>%
# Estimate any additional steps
prep() %>%
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -67,7 +67,7 @@ Imports:
rlang,
rmarkdown,
rpart,
rsample (>= 0.0.9),
rsample (>= 1.2.0),
rstanarm,
rules,
sessioninfo,