21 changes: 20 additions & 1 deletion 05-data-spending.Rmd
@@ -112,7 +112,26 @@ It is common to hear about _validation sets_ as an answer to this question, espe
Whether validation sets are a subset of the training set or a third allocation in the initial split of the data largely comes down to semantics.
:::

Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set.
Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set. If you are going to use a validation set, you can start with a different splitting function^[This interface is available as of rsample version 1.2.0 (circa September 2023).]:

```{r ames-val-split, message = FALSE, warning = FALSE}
set.seed(52)
# To put 60% into training, 20% in validation, and 20% in testing:
ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2))
ames_val_split
```

Printing the split now shows the size of the training set (`r format(nrow(training(ames_val_split)), big.mark = ",")`), validation set (`r format(nrow(validation(ames_val_split)), big.mark = ",")`), and test set (`r format(nrow(testing(ames_val_split)), big.mark = ",")`).

To get the training, validation, and testing data, the same syntax is used:

```{r ames-val-data, eval = FALSE}
ames_train <- training(ames_val_split)
ames_test <- testing(ames_val_split)
ames_val <- validation(ames_val_split)
```

Section \@ref(validation) will demonstrate how to use the `ames_val_split` object for resampling and model optimization.

## Multilevel Data

4 changes: 4 additions & 0 deletions 07-the-model-workflow.Rmd
@@ -323,6 +323,10 @@ collect_predictions(final_lm_res) %>% slice(1:5)

We'll see more about `last_fit()` in action and how to use it again in Section \@ref(bean-models).

:::rmdnote
When using validation sets, `last_fit()` has an argument called `add_validation_set` to specify whether the final model should be trained solely on the training set (the default) or on the combination of the training and validation sets; a sketch of this option follows the note.
:::
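
A minimal sketch of supplying that option, assuming a finalized workflow (called `final_wf` here, which is not defined in this diff) and the `ames_val_split` object created in Chapter 5:

```{r workflow-validation-sketch, eval = FALSE}
# Hypothetical sketch: `final_wf` stands in for a finalized workflow.
final_val_res <-
  final_wf %>%
  last_fit(ames_val_split, add_validation_set = TRUE)

# Test set performance for the model refit on training + validation data:
collect_metrics(final_val_res)
```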

## Chapter Summary {#workflows-summary}

In this chapter, you learned that the modeling process encompasses more than just estimating the parameters of an algorithm that connects predictors to an outcome. This process also includes preprocessing steps and operations taken after a model is fit. We introduced a concept called a *model workflow* that can capture the important components of the modeling process. Multiple workflows can also be created inside of a *workflow set*. The `last_fit()` function is convenient for fitting a final model to the training set and evaluating with the test set.
15 changes: 11 additions & 4 deletions 10-resampling.Rmd
@@ -302,15 +302,22 @@ With the `r pkg(rsample)` package, a validation set is like any other resampling
knitr::include_graphics("premade/validation-alt.svg")
```

To create a validation set object that uses 3/4 of the data for model fitting:

To build on the code from Section \@ref(what-about-a-validation-set), the function `validation_set()` can take the results of `initial_validation_split()` and convert them to an rset object similar to the ones produced by functions such as `vfold_cv()`:

```{r resampling-validation-split}
set.seed(1002)
val_set <- validation_split(ames_train, prop = 3/4)
# Previously:

set.seed(52)
# To put 60% into training, 20% in validation, and 20% in testing:
ames_val_split <- initial_validation_split(ames, prop = c(0.6, 0.2))
ames_val_split

# Object used for resampling:
val_set <- validation_set(ames_val_split)
val_set
```

As you'll see in Section \@ref(resampling-performance), the `fit_resamples()` function will be used to compute correct estimates of performance using resampling. The `val_set` object can be used in this and other functions even though it is a single "resample" of the data.
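
As a preview, a minimal sketch of that usage, assuming a workflow such as the `lm_wflow` object built for the Ames data in earlier chapters:

```{r resampling-validation-preview, eval = FALSE}
# Hypothetical sketch: `lm_wflow` is the Ames linear regression workflow
# from earlier chapters. The model is fit on the training data and its
# performance is estimated on the single validation "resample".
val_res <- fit_resamples(lm_wflow, resamples = val_set)
collect_metrics(val_res)
```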

### Bootstrapping {#bootstrap}

20 changes: 12 additions & 8 deletions 16-dimensionality-reduction.Rmd
@@ -103,17 +103,22 @@ For our analyses, we start by holding back a testing set with `initial_split()`.

```{r dimensionality-split}
set.seed(1601)
bean_split <- initial_split(beans, strata = class, prop = 3/4)
bean_split <- initial_validation_split(beans, strata = class, prop = c(0.75, 0.125))
bean_split

# Return data frames:
bean_train <- training(bean_split)
bean_test <- testing(bean_split)
bean_test <- testing(bean_split)
bean_validation <- validation(bean_split)


set.seed(1602)
bean_val <- validation_split(bean_train, strata = class, prop = 4/5)
# Return an 'rset' object to use with the tune functions:
bean_val <- validation_set(bean_split)
bean_val$splits[[1]]
```

To visually assess how well different methods perform, we can estimate the methods on the training set (n = `r analysis(bean_val$splits[[1]]) %>% nrow()` beans) and display the results using the validation set (n = `r assessment(bean_val$splits[[1]]) %>% nrow()`).
To visually assess how well different methods perform, we can estimate the methods on the training set (n = `r format(nrow(bean_train), big.mark = ",")` beans) and display the results using the validation set (n = `r format(nrow(bean_validation), big.mark = ",")`).

Before beginning any dimensionality reduction, we can spend some time investigating our data. Since we know that many of these shape features are probably measuring similar concepts, let's take a look at the correlation structure of the data in Figure \@ref(fig:beans-corr-plot) using this code.
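
The chunk itself is collapsed in this diff; as a minimal sketch, the correlations could be computed and plotted along these lines (assuming the corrplot package, which may differ from the code actually used in the book):

```{r beans-corr-sketch, eval = FALSE}
library(dplyr)
library(corrplot)

# Hypothetical sketch: correlation matrix of the numeric shape predictors
# (the outcome column `class` is dropped), drawn with one ellipse per pair.
beans %>%
  select(-class) %>%
  cor() %>%
  corrplot(method = "ellipse", tl.col = "black")
```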

@@ -145,7 +150,7 @@ It's time to look at the beans data in a smaller space. We can start with a basi
library(bestNormalize)
bean_rec <-
# Use the training data from the bean_val split object
recipe(class ~ ., data = analysis(bean_val$splits[[1]])) %>%
recipe(class ~ ., data = bean_train) %>%
step_zv(all_numeric_predictors()) %>%
step_orderNorm(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
@@ -224,7 +229,6 @@ Using `bake()` with a recipe is much like using `predict()` with a model; the op
For example, the validation set samples can be processed:

```{r dimensionality-bake}
bean_validation <- bean_val$splits %>% pluck(1) %>% assessment()
bean_val_processed <- bake(bean_rec_trained, new_data = bean_validation)
```

@@ -260,7 +264,7 @@ First, as previously mentioned, using `prep(recipe, retain = TRUE)` keeps the ex

```{r dimensionality-new-data-null}
bake(bean_rec_trained, new_data = NULL) %>% nrow()
bean_val$splits %>% pluck(1) %>% analysis() %>% nrow()
bean_train %>% nrow()
```

If the training set is not pathologically large, using this value of `retain` can save a lot of computational time.
@@ -275,7 +279,7 @@ Since recipes are the primary option in tidymodels for dimensionality reduction,

```{r dimensionality-function}
library(ggforce)
plot_validation_results <- function(recipe, dat = assessment(bean_val$splits[[1]])) {
plot_validation_results <- function(recipe, dat = bean_validation) {
recipe %>%
# Estimate any additional steps
prep() %>%
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -67,7 +67,7 @@ Imports:
rlang,
rmarkdown,
rpart,
rsample (>= 0.0.9),
rsample (>= 1.2.0),
rstanarm,
rules,
sessioninfo,