Draft
Changes from all commits
Commits
75 commits
bd0a8eb
Adjust Ch 1-9 figs for grayscale
juliasilge Feb 3, 2022
1e3ab6d
Adjust Ch 10-15 figs for grayscale
juliasilge Feb 3, 2022
b8ddc17
Adjust Ch 16-21 figs for grayscale
juliasilge Feb 3, 2022
449c34f
Fixes for Ch 16
juliasilge Feb 4, 2022
89ff841
Fixes for Ch 18
juliasilge Feb 4, 2022
5c91d17
Fixes for Ch 19
juliasilge Feb 4, 2022
9a121ea
Merge branch 'main' into figs-for-print
juliasilge Feb 4, 2022
6a6a3b5
Merge branch 'main' into figs-for-print
juliasilge Feb 4, 2022
4cd5d03
Fix trailing +
juliasilge Feb 4, 2022
9bf4f58
Remove Section 7.6. Closes #224.
juliasilge Feb 4, 2022
08b98b2
Merge branch 'main' into figs-for-print
juliasilge Feb 4, 2022
6a935b9
Merge from main
juliasilge Feb 4, 2022
5c2d810
Fix `ncol` in Fig 11
juliasilge Feb 4, 2022
8f92c16
Merge branch 'main' into figs-for-print
juliasilge Feb 10, 2022
33316c2
Steps to convert for O'Reilly
juliasilge Feb 13, 2022
88bd87b
Ignore this dir
juliasilge Feb 13, 2022
2b13565
More details on generating not entire book
juliasilge Feb 13, 2022
2118e62
Merge branch 'main' into figs-for-print
juliasilge Feb 13, 2022
b204472
move changes from #241 to this PR
topepo Feb 16, 2022
aa912e5
Merge from main
juliasilge Feb 17, 2022
3796c24
More work on rendering for O'Reilly
juliasilge Mar 3, 2022
389e2f0
Merge from main
juliasilge Mar 4, 2022
94a382f
Got it to work :sob:
juliasilge Mar 6, 2022
3ee373a
No kableExtra for print
juliasilge Mar 6, 2022
ea4fc31
Ignore new folder
juliasilge Mar 6, 2022
ae8d398
Add dedication
juliasilge Mar 8, 2022
a18fbee
Add dedication as .adoc
juliasilge Mar 8, 2022
af36f2d
Merge from main
juliasilge Mar 13, 2022
3006818
Update instructions for O'Reilly conversion
juliasilge Mar 13, 2022
4500f2d
Clarify some more what I'm doing
juliasilge Mar 13, 2022
2649441
typo fix
topepo Mar 16, 2022
0733dba
automate description of the remaining points
topepo Mar 16, 2022
04029ff
fix SA plot to include iterations after the last restart
topepo Mar 16, 2022
13b52bf
Merge from main
juliasilge Mar 20, 2022
2b7a096
No more "Section" for O'Reilly
juliasilge Mar 20, 2022
d06b22d
More sed, just so much sed
juliasilge Mar 21, 2022
1e34770
tune and textrecipes are on CRAN
juliasilge Mar 21, 2022
ea6e75f
color-sensitive ames plots
topepo Mar 22, 2022
0891522
some color changes
topepo Mar 22, 2022
6d6ad4d
update of Ames plots in chapter 4
topepo Mar 22, 2022
67b833e
Use file ext for O'Reilly
juliasilge Mar 25, 2022
a2aabc6
Merge branch 'figs-for-print' of https://github.com/tidymodels/TMwR i…
juliasilge Mar 25, 2022
3837e89
Merge from main
juliasilge Apr 8, 2022
9f2d6eb
Merge from main
juliasilge Apr 10, 2022
9be6827
Editorial feedback in Ch 14
juliasilge Apr 10, 2022
b5dc72f
Merge branch 'main' into figs-for-print
juliasilge Apr 11, 2022
e29f3a5
Fix contributor code
juliasilge Apr 11, 2022
5d28e3a
Keep those .md files because now they are precious
juliasilge Apr 11, 2022
6bca39a
fixed broken link
topepo Apr 11, 2022
d3a8e98
markdown files
topepo Apr 12, 2022
60cfb85
Add rendered asciidoc
juliasilge Apr 12, 2022
7f5e3f6
Fix noteboxes and images
juliasilge Apr 12, 2022
8e44b9f
figures
topepo Apr 12, 2022
2f47d08
Merge branch 'figs-for-print' of https://github.com/tidymodels/TMwR i…
topepo Apr 12, 2022
5d4d4ea
Merge from main
juliasilge Apr 13, 2022
e9f39f1
Merge branch 'main' into figs-for-print
juliasilge Apr 14, 2022
35985e6
refresh results
topepo Apr 14, 2022
af97a49
Render asciidoc
juliasilge Apr 14, 2022
40f5fd8
Missed a table
juliasilge Apr 14, 2022
3f58645
Merge from main
juliasilge Apr 15, 2022
0219f42
Render book actually on GH actions
juliasilge Apr 15, 2022
973aea6
Namespace function
juliasilge Apr 15, 2022
7a9b92e
More table fixes
juliasilge Apr 15, 2022
f0c4f94
No need to remove all these now
juliasilge Apr 15, 2022
9361438
New asciidoc
juliasilge Apr 15, 2022
9d92977
More fixes for tables
juliasilge Apr 16, 2022
1d719c1
re-run to generate figures
topepo Apr 20, 2022
fa3cd70
stop labels from going off page
topepo Apr 20, 2022
b463872
fix image plot ranges
topepo Apr 20, 2022
c06a3ea
re-run with fixed figures
topepo Apr 20, 2022
78cfaa3
re-run with fixed figures
topepo Apr 20, 2022
26074fb
Merge branch 'main' into figs-for-print
juliasilge May 4, 2022
d706ce1
Merge from main
juliasilge May 24, 2022
dc9c69d
high res png files
topepo Jun 26, 2022
e53dd65
Bump to 1.0.1 for first edition (print)
juliasilge Jul 22, 2022
7 changes: 6 additions & 1 deletion .github/workflows/bookdown.yaml
@@ -34,7 +34,7 @@ jobs:
- uses: r-lib/actions/setup-r-dependencies@v2

- name: Build site
run: Rscript -e 'bookdown::render_book("index.Rmd", quiet = TRUE)'
run: Rscript -e 'bookdown::render_book("index.Rmd", output_format = bookdown::html_book(keep_md = TRUE), quiet = TRUE)'

- name: Deploy to Netlify
if: contains(env.isExtPR, 'false')
@@ -52,3 +52,8 @@ jobs:
NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
timeout-minutes: 1

- uses: actions/upload-artifact@v1
with:
name: _book
path: _book/
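The two `run:` lines above are the old and new "Build site" commands; the added `output_format` argument with `keep_md = TRUE` preserves the intermediate Markdown files that later commits in this PR use for the O'Reilly conversion, and the new `upload-artifact` step publishes the rendered `_book/` directory. A minimal sketch of the updated render call, runnable locally from the book's root directory (assuming the bookdown package and the book sources are installed):

```r
library(bookdown)

# Sketch of the updated "Build site" step, run from the repository root.
# keep_md = TRUE keeps the intermediate .md files consumed by the O'Reilly
# conversion steps; output lands in _book/, which CI uploads as an artifact.
render_book(
  "index.Rmd",
  output_format = html_book(keep_md = TRUE),
  quiet = TRUE
)
```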
5 changes: 4 additions & 1 deletion .gitignore
@@ -6,7 +6,7 @@
_book
_main.*
libs
figures
^figures/*
_bookdown_files
figures/introduction-cricket-plot-1.svg
figures/introduction-descr-examples-1.pdf
@@ -19,3 +19,6 @@ figures/tidyverse-interaction-plots-1.svg
extras/iowa_highway.shx
extras/iowa_highway.shp
files_for_print*
tmwr-to-ch9*
extras/iowa_highway.zip
extras/iowa_highway/iowa_highway.shp
25 changes: 4 additions & 21 deletions 01-software-modeling.Rmd
@@ -7,7 +7,6 @@ knitr::opts_chunk$set(fig.path = "figures/")
library(tidyverse)
library(gridExtra)
library(tibble)
library(kableExtra)

data(ames, package = "modeldata")
```
@@ -66,7 +65,7 @@ For example, large scale measurements of RNA have been possible for some time us

An early method for evaluating such issues was the probe-level model, or PLM [@bolstad2004]. A statistical model would be created that accounted for the known differences in the data, such as the chip, the RNA sequence, the type of sequence, and so on. If there were other, unknown factors in the data, these effects would be captured in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When a problem did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g., a fingerprint) and a possible solution (wipe off the chip and rescan, repeat the sample, etc.). Figure \@ref(fig:software-descr-examples)(a) shows an application of this method for two microarrays taken from @Gentleman2005. The images show two different color values; darker areas are where the signal intensity was larger than the model expects, while lighter areas show lower-than-expected values. The left-hand panel demonstrates a fairly random pattern, while the right-hand panel exhibits an undesirable artifact in the middle of the chip.

```{r software-descr-examples, echo = FALSE, fig.cap = "Two examples of how descriptive models can be used to illustrate specific patterns", out.width = '80%', dev = "png", fig.height = 8, warning = FALSE, message = FALSE}
```{r software-descr-examples, echo = FALSE, fig.cap = "Two examples of how descriptive models can be used to illustrate specific patterns", out.width = '80%', fig.height = 8, warning = FALSE, message = FALSE}
load("RData/plm_resids.RData")

resid_cols <- RColorBrewer::brewer.pal(8, "Set1")[1:2]
@@ -255,28 +254,12 @@ monolog <-
"Model Evaluation", "2",
"Let’s drop K-NN from the model list. "
)
if (knitr::is_html_output()) {
tab <-
monolog %>%
monolog %>%
dplyr::select(Thoughts, Activity) %>%
kable(
knitr::kable(
caption = "Hypothetical inner monologue of a model developer.",
label = "inner-monologue"
) %>%
kable_styling() %>%
column_spec(2, width = "25%") %>%
column_spec(1, width = "75%", italic = TRUE)
} else {
tab <-
monolog %>%
dplyr::select(Thoughts, Activity) %>%
kable(
caption = "Hypothetical inner monologue of a model developer.",
label = "inner-monologue"
) %>%
kable_styling()
}
tab
)
```
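The diff above collapses the HTML-only kableExtra styling (the `kable_styling()` and `column_spec()` branch) into a single plain `knitr::kable()` call, which renders in HTML, LaTeX, and the asciidoc output used for print. A standalone sketch of the simplified pattern (the table rows here are illustrative, not the chapter's full monologue):

```r
library(knitr)

# Illustrative rows only; the chapter's table has more entries.
monolog <- data.frame(
  Thoughts = c("The out-of-sample results look reasonable.",
               "Let's drop K-NN from the model list."),
  Activity = c("Model Fitting", "Model Evaluation")
)

# Plain kable() works across output formats; kableExtra's kable_styling()
# and column_spec() helpers are HTML/LaTeX-only, which is why the diff
# removes them for the print build.
kable(monolog, caption = "Hypothetical inner monologue of a model developer.")
```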

## Chapter Summary {#software-summary}
14 changes: 5 additions & 9 deletions 03-base-r.Rmd
@@ -4,7 +4,6 @@
knitr::opts_chunk$set(fig.path = "figures/")
data(crickets, package = "modeldata")
library(tidyverse)
library(kableExtra)
```

Before describing how to use tidymodels for applying tidy data principles to building models with R, let's review how models are created, trained, and used in the core R language (often called "base R"). This chapter is a brief illustration of core language conventions that are important to be aware of even if you never use base R for models at all. This chapter is not exhaustive, but it provides readers (especially those new to R) the basic, most commonly used motifs.
@@ -75,7 +74,7 @@ rate ~ temp + species
Species is not a quantitative variable; in the data frame, it is represented as a factor column with levels `"O. exclamationis"` and `"O. niveus"`. The vast majority of model functions cannot operate on nonnumeric data. For species, the model needs to encode the species data in a numeric format. The most common approach is to use indicator variables (also known as dummy variables) in place of the original qualitative values. In this instance, since species has two possible values, the model formula will automatically encode this column as numeric by adding a new column that has a value of zero when the species is `"O. exclamationis"` and a value of one when the data correspond to `"O. niveus"`. The underlying formula machinery automatically converts these values for the data set used to create the model, as well as for any new data points (for example, when the model is used for prediction).

:::rmdnote
Suppose there were five species instead of two. The model formula would automatically add four additional binary columns that are binary indicators for four of the species. The _reference level_ of the factor (i.e., the first level) is always left out of the predictor set. The idea is that, if you know the values of the four indicator variables, the value of the species can be determined. We discuss binary indicator variables in more detail in Section \@ref(dummies).
Suppose there were five species instead of two. The model formula would automatically add four additional binary columns that are binary indicators for four of the species. The _reference level_ of the factor (i.e., the first level) is always left out of the predictor set. The idea is that, if you know the values of the four indicator variables, the value of the species can be determined. We discuss binary indicator variables in more detail in Chapter \@ref(recipes).
:::
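The indicator-variable encoding described above can be inspected directly with base R's `model.matrix()`; a small sketch with made-up values (not rows from the crickets data):

```r
# Two-level factor: model.matrix() adds one indicator column and leaves the
# reference level ("O. exclamationis") out of the predictor set.
species <- factor(c("O. exclamationis", "O. niveus", "O. niveus"))
temp <- c(20.8, 17.2, 18.3)  # made-up temperatures

model.matrix(~ temp + species)
# The "speciesO. niveus" column is 0 for the reference level and 1 otherwise.
```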

The model formula `rate ~ temp + species` creates a model with different y-intercepts for each species; the slopes of the regression lines could be different for each species as well. To accommodate this structure, an interaction term can be added to the model. This can be specified in a few different ways, and the most basic uses the colon:
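The code block that followed this sentence is elided in this diff view; for reference, R's formula syntax offers several equivalent ways to request the interaction (these lines are standard R, not necessarily the chapter's exact example):

```r
# Equivalent ways to specify both main effects plus the two-way interaction:
rate ~ temp + species + temp:species  # colon gives the interaction term itself
rate ~ temp * species                 # shorthand for the line above
rate ~ (temp + species)^2             # expands to the same model for two predictors
```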
@@ -199,7 +198,7 @@ For the most part, practitioners' understanding of what the formula does is domi
(temp + species)^2
```

Our focus, when seeing this, is that there are two predictors and the model should contain their main effects and the two-way interactions. However, this formula also implies that, since `species` is a factor, it should also create indicator variable columns for this predictor (see Section \@ref(dummies)) and multiply those columns by the `temp` column to create the interactions. This transformation represents our second bullet point on encoding; the formula also defines how each column is encoded and can create additional columns that are not in the original data.
Our focus, when seeing this, is that there are two predictors and the model should contain their main effects and the two-way interactions. However, this formula also implies that, since `species` is a factor, it should also create indicator variable columns for this predictor (see Chapter \@ref(recipes)) and multiply those columns by the `temp` column to create the interactions. This transformation represents our second bullet point on encoding; the formula also defines how each column is encoded and can create additional columns that are not in the original data.

:::rmdwarning
This is an important point that will come up multiple times in this text, especially when we discuss more complex feature engineering in Chapter \@ref(recipes) and beyond. The formula in R has some limitations, and our approaches to overcoming them contend with all three aspects.
@@ -246,14 +245,11 @@ prob_tbl <-
)

prob_tbl %>%
kable(
knitr::kable(
caption = "Heterogeneous argument names for different modeling functions.",
label = "probability-args",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, monospace = ifelse(prob_tbl$Function == "various", FALSE, TRUE)) %>%
column_spec(3, monospace = TRUE)
)
```

Note that the last example has a custom function to make predictions instead of using the more common `predict()` interface (the generic `predict()` method). This lack of consistency is a barrier to day-to-day usage of R for modeling.
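The point about heterogeneous prediction interfaces can be seen with base R alone: `stats::glm()` requests probabilities via `type = "response"`, while other packages in the table use different argument names for the same thing. This sketch uses the built-in `mtcars` data rather than an example from the book:

```r
# stats::glm() asks for class probabilities with type = "response";
# other modeling packages spell this differently, which is the
# inconsistency the table documents.
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)
head(predict(fit, type = "response"))
```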
@@ -396,7 +392,7 @@ conflict_prefer("filter", winner = "dplyr")

For convenience, `r pkg(tidymodels)` contains a function that captures most of the common naming conflicts that we might encounter:

```{r base-r-conflicts}
```{r base-r-conflicts}
tidymodels_prefer(quiet = FALSE)
```

24 changes: 19 additions & 5 deletions 04-ames.Rmd
@@ -40,6 +40,20 @@ data(ames, package = "modeldata")
dim(ames)
```

Figure \@ref(fig:ames-map) shows the locations of the properties in Ames. The locations will be revisited in the next section.

```{r ames-map}
#| out.width = "100%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "Property locations in Ames, IA.",
#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map. There is a significant area in the center of the map where no homes were sold."
# See file extras/ames_sf.R
knitr::include_graphics("premade/ames_plain.png")
```

The void of data points in the center of Ames corresponds to Iowa State University.

## Exploring Features of Homes in Ames

Let's start our exploratory data analysis by focusing on the outcome we want to predict: the last sale price of the house (in USD). We can create a histogram to see the distribution of sale prices in Figure \@ref(fig:ames-sale-price-hist).
@@ -92,16 +106,16 @@ Despite these drawbacks, the models used in this book use the log transformation
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
```
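Since the outcome is now modeled in log10 units, predictions must be back-transformed before reporting in dollars; a small round-trip sketch with illustrative prices (not Ames rows):

```r
# Round-trip of the transformation applied above: log10 for modeling,
# 10^x to return values to dollar units.
sale_price <- c(105000, 126000, 115000)  # illustrative, not Ames data
logged <- log10(sale_price)
stopifnot(all.equal(10^logged, sale_price))
```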

Another important aspect of these data for our modeling is their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, let's use both together to plot the data on a map in Figure \@ref(fig:ames-map).
Another important aspect of these data for our modeling is their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, Figure \@ref(fig:ames-chull) duplicates the data from Figure \@ref(fig:ames-map) with convex hulls around the data from each neighborhood.

```{r ames-map}
```{r ames-chull}
#| out.width = "100%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "Neighborhoods in Ames, IA",
#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map. There is a significant area in the center of the map where no homes were sold."
#| fig.cap = "Neighborhoods in Ames represented using a convex hull",
#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map with colored regions that show the locations of neighborhoods. Some neighborhoods overlap and a few are nested within other neighborhoods."
# See file extras/ames_sf.R
knitr::include_graphics("premade/ames.png")
knitr::include_graphics("premade/ames_chull.png")
```

We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to the campus of Iowa State University where there are no residential houses. Second, while there are a number of adjacent neighborhoods, others are geographically isolated. For example, as Figure \@ref(fig:ames-timberland) shows, Timberland is located apart from almost all other neighborhoods.
6 changes: 3 additions & 3 deletions 05-data-spending.Rmd
@@ -28,7 +28,7 @@ The other portion of the data is placed into the _test set_. This is held in res
How should we conduct this split of the data? The answer depends on the context.
:::

Suppose we allocate 80% of the data to the training set and the remaining 20% for testing. The most common method is to use simple random sampling. The [`r pkg(rsample)`](https://rsample.tidymodels.org/) package has tools for making data splits such as this; the function `initial_split()` was created for this purpose. It takes the data frame as an argument as well as the proportion to be placed into training. Using the data frame produced by the code snippet from the summary in Section \@ref(ames-summary) that prepared the Ames data set:
Suppose we allocate 80% of the data to the training set and the remaining 20% for testing. The most common method is to use simple random sampling. The [`r pkg(rsample)`](https://rsample.tidymodels.org/) package has tools for making data splits such as this; the function `initial_split()` was created for this purpose. It takes the data frame as an argument as well as the proportion to be placed into training. Using the data frame produced by the code snippet from the summary at the end of Chapter \@ref(ames):

```{r ames-split, message = FALSE, warning = FALSE}
library(tidymodels)
@@ -106,13 +106,13 @@ The proportion of data that should be allocated for splitting is highly dependen

When describing the goals of data splitting, we singled out the test set as the data that should be used to properly evaluate the performance of the final model(s). This raises the question: "How can we tell what is best if we don't measure performance until the test set?"

It is common to hear about _validation sets_ as an answer to this question, especially in the neural network and deep learning literature. During the early days of neural networks, researchers realized that measuring performance by re-predicting the training set samples led to results that were overly optimistic (significantly, unrealistically so). This led to models that overfit, meaning that they performed very well on the training set but poorly on the test set.^[This is discussed in much greater detail in Section \@ref(overfitting-bad).] To combat this issue, a small validation set of data were held back and used to measure performance as the network was trained. Once the validation set error rate began to rise, the training would be halted. In other words, the validation set was a means to get a rough sense of how well the model performed prior to the test set.
It is common to hear about _validation sets_ as an answer to this question, especially in the neural network and deep learning literature. During the early days of neural networks, researchers realized that measuring performance by re-predicting the training set samples led to results that were overly optimistic (significantly, unrealistically so). This led to models that overfit, meaning that they performed very well on the training set but poorly on the test set.^[This is discussed in much greater detail in Chapter \@ref(tuning).] To combat this issue, a small validation set of data was held back and used to measure performance as the network was trained. Once the validation set error rate began to rise, the training would be halted. In other words, the validation set was a means to get a rough sense of how well the model performed prior to the test set.
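One way to realize the training/validation/test scheme described above is with two successive `initial_split()` calls from rsample; the proportions and seed below are illustrative:

```r
library(rsample)

set.seed(501)  # illustrative seed
# First split: hold out 20% of the data for the test set.
first_split <- initial_split(ames, prop = 0.80)
ames_test <- testing(first_split)

# Second split: carve a validation set out of the remaining 80%
# (0.75 * 0.80 = 60% of the original data ends up in training).
second_split <- initial_split(training(first_split), prop = 0.75)
ames_train <- training(second_split)
ames_val <- testing(second_split)
```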

:::rmdnote
Whether validation sets are a subset of the training set or a third allocation in the initial split of the data largely comes down to semantics.
:::

Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set.
Validation sets are discussed more in Chapter \@ref(resampling) as a special case of _resampling_ methods that are used on the training set.

## Multilevel Data
