Draft
Changes from all commits
Commits
75 commits
bd0a8eb
Adjust Ch 1-9 figs for grayscale
juliasilge Feb 3, 2022
1e3ab6d
Adjust Ch 10-15 figs for grayscale
juliasilge Feb 3, 2022
b8ddc17
Adjust Ch 16-21 figs for grayscale
juliasilge Feb 3, 2022
449c34f
Fixes for Ch 16
juliasilge Feb 4, 2022
89ff841
Fixes for Ch 18
juliasilge Feb 4, 2022
5c91d17
Fixes for Ch 19
juliasilge Feb 4, 2022
9a121ea
Merge branch 'main' into figs-for-print
juliasilge Feb 4, 2022
6a6a3b5
Merge branch 'main' into figs-for-print
juliasilge Feb 4, 2022
4cd5d03
Fix trailing +
juliasilge Feb 4, 2022
9bf4f58
Remove Section 7.6. Closes #224.
juliasilge Feb 4, 2022
08b98b2
Merge branch 'main' into figs-for-print
juliasilge Feb 4, 2022
6a935b9
Merge from main
juliasilge Feb 4, 2022
5c2d810
Fix `ncol` in Fig 11
juliasilge Feb 4, 2022
8f92c16
Merge branch 'main' into figs-for-print
juliasilge Feb 10, 2022
33316c2
Steps to convert for O'Reilly
juliasilge Feb 13, 2022
88bd87b
Ignore this dir
juliasilge Feb 13, 2022
2b13565
More details on generating not entire book
juliasilge Feb 13, 2022
2118e62
Merge branch 'main' into figs-for-print
juliasilge Feb 13, 2022
b204472
move changes from #241 to this PR
topepo Feb 16, 2022
aa912e5
Merge from main
juliasilge Feb 17, 2022
3796c24
More work on rendering for O'Reilly
juliasilge Mar 3, 2022
389e2f0
Merge from main
juliasilge Mar 4, 2022
94a382f
Got it to work :sob:
juliasilge Mar 6, 2022
3ee373a
No kableExtra for print
juliasilge Mar 6, 2022
ea4fc31
Ignore new folder
juliasilge Mar 6, 2022
ae8d398
Add dedication
juliasilge Mar 8, 2022
a18fbee
Add dedication as .adoc
juliasilge Mar 8, 2022
af36f2d
Merge from main
juliasilge Mar 13, 2022
3006818
Update instructions for O'Reilly conversion
juliasilge Mar 13, 2022
4500f2d
Clarify some more what I'm doing
juliasilge Mar 13, 2022
2649441
typo fix
topepo Mar 16, 2022
0733dba
automate description of the remaining points
topepo Mar 16, 2022
04029ff
fix SA plot to include iterations after the last restart
topepo Mar 16, 2022
13b52bf
Merge from main
juliasilge Mar 20, 2022
2b7a096
No more "Section" for O'Reilly
juliasilge Mar 20, 2022
d06b22d
More sed, just so much sed
juliasilge Mar 21, 2022
1e34770
tune and textrecipes are on CRAN
juliasilge Mar 21, 2022
ea6e75f
color-sensitive ames plots
topepo Mar 22, 2022
0891522
some color changes
topepo Mar 22, 2022
6d6ad4d
update of Ames plots in chapter 4
topepo Mar 22, 2022
67b833e
Use file ext for O'Reilly
juliasilge Mar 25, 2022
a2aabc6
Merge branch 'figs-for-print' of https://github.com/tidymodels/TMwR i…
juliasilge Mar 25, 2022
3837e89
Merge from main
juliasilge Apr 8, 2022
9f2d6eb
Merge from main
juliasilge Apr 10, 2022
9be6827
Editorial feedback in Ch 14
juliasilge Apr 10, 2022
b5dc72f
Merge branch 'main' into figs-for-print
juliasilge Apr 11, 2022
e29f3a5
Fix contributor code
juliasilge Apr 11, 2022
5d28e3a
Keep those .md files because now they are precious
juliasilge Apr 11, 2022
6bca39a
fixed broken link
topepo Apr 11, 2022
d3a8e98
markdown files
topepo Apr 12, 2022
60cfb85
Add rendered asciidoc
juliasilge Apr 12, 2022
7f5e3f6
Fix noteboxes and images
juliasilge Apr 12, 2022
8e44b9f
figures
topepo Apr 12, 2022
2f47d08
Merge branch 'figs-for-print' of https://github.com/tidymodels/TMwR i…
topepo Apr 12, 2022
5d4d4ea
Merge from main
juliasilge Apr 13, 2022
e9f39f1
Merge branch 'main' into figs-for-print
juliasilge Apr 14, 2022
35985e6
refresh results
topepo Apr 14, 2022
af97a49
Render asciidoc
juliasilge Apr 14, 2022
40f5fd8
Missed a table
juliasilge Apr 14, 2022
3f58645
Merge from main
juliasilge Apr 15, 2022
0219f42
Render book actually on GH actions
juliasilge Apr 15, 2022
973aea6
Namespace function
juliasilge Apr 15, 2022
7a9b92e
More table fixes
juliasilge Apr 15, 2022
f0c4f94
No need to remove all these now
juliasilge Apr 15, 2022
9361438
New asciidoc
juliasilge Apr 15, 2022
9d92977
More fixes for tables
juliasilge Apr 16, 2022
1d719c1
re-run to generate figures
topepo Apr 20, 2022
fa3cd70
stop labels from going off page
topepo Apr 20, 2022
b463872
fix image plot ranges
topepo Apr 20, 2022
c06a3ea
re-run with fixed figures
topepo Apr 20, 2022
78cfaa3
re-run with fixed figures
topepo Apr 20, 2022
26074fb
Merge branch 'main' into figs-for-print
juliasilge May 4, 2022
d706ce1
Merge from main
juliasilge May 24, 2022
dc9c69d
high res png files
topepo Jun 26, 2022
e53dd65
Bump to 1.0.1 for first edition (print)
juliasilge Jul 22, 2022
7 changes: 6 additions & 1 deletion .github/workflows/bookdown.yaml
@@ -34,7 +34,7 @@ jobs:
- uses: r-lib/actions/setup-r-dependencies@v2

- name: Build site
run: Rscript -e 'bookdown::render_book("index.Rmd", quiet = TRUE)'
run: Rscript -e 'bookdown::render_book("index.Rmd", output_format = bookdown::html_book(keep_md = TRUE), quiet = TRUE)'

- name: Deploy to Netlify
if: contains(env.isExtPR, 'false')
@@ -52,3 +52,8 @@ jobs:
NETLIFY_AUTH_TOKEN: ${{ secrets.NETLIFY_AUTH_TOKEN }}
NETLIFY_SITE_ID: ${{ secrets.NETLIFY_SITE_ID }}
timeout-minutes: 1

- uses: actions/upload-artifact@v1
with:
name: _book
path: _book/
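The two `run:` lines above are the old and new "Build site" commands; the added `output_format` argument with `keep_md = TRUE` preserves the intermediate Markdown files that later commits in this PR use for the O'Reilly conversion, and the new `upload-artifact` step publishes the rendered `_book/` directory. A minimal sketch of the updated render call, runnable locally from the book's root directory (assuming the bookdown package and the book sources are installed):

```r
library(bookdown)

# Sketch of the updated "Build site" step, run from the repository root.
# keep_md = TRUE keeps the intermediate .md files consumed by the O'Reilly
# conversion steps; output lands in _book/, which CI uploads as an artifact.
render_book(
  "index.Rmd",
  output_format = html_book(keep_md = TRUE),
  quiet = TRUE
)
```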
5 changes: 4 additions & 1 deletion .gitignore
@@ -6,7 +6,7 @@
_book
_main.*
libs
figures
^figures/*
_bookdown_files
figures/introduction-cricket-plot-1.svg
figures/introduction-descr-examples-1.pdf
@@ -19,3 +19,6 @@ figures/tidyverse-interaction-plots-1.svg
extras/iowa_highway.shx
extras/iowa_highway.shp
files_for_print*
tmwr-to-ch9*
extras/iowa_highway.zip
extras/iowa_highway/iowa_highway.shp
25 changes: 4 additions & 21 deletions 01-software-modeling.Rmd
@@ -7,7 +7,6 @@ knitr::opts_chunk$set(fig.path = "figures/")
library(tidyverse)
library(gridExtra)
library(tibble)
library(kableExtra)

data(ames, package = "modeldata")
```
@@ -66,7 +65,7 @@ For example, large scale measurements of RNA have been possible for some time us

An early method for evaluating such issues was the probe-level model, or PLM [@bolstad2004]. A statistical model would be created that accounted for the known differences in the data, such as the chip, the RNA sequence, the type of sequence, and so on. If there were other, unknown factors in the data, these effects would be captured in the model residuals. When the residuals were plotted by their location on the chip, a good quality chip would show no patterns. When a problem did occur, some sort of spatial pattern would be discernible. Often the type of pattern would suggest the underlying issue (e.g., a fingerprint) and a possible solution (wipe off the chip and rescan, repeat the sample, etc.). Figure \@ref(fig:software-descr-examples)(a) shows an application of this method for two microarrays taken from @Gentleman2005. The images show two different color values; darker areas are where the signal intensity was larger than the model expects, while lighter areas show lower-than-expected values. The left-hand panel demonstrates a fairly random pattern, while the right-hand panel exhibits an undesirable artifact in the middle of the chip.

```{r software-descr-examples, echo = FALSE, fig.cap = "Two examples of how descriptive models can be used to illustrate specific patterns", out.width = '80%', dev = "png", fig.height = 8, warning = FALSE, message = FALSE}
```{r software-descr-examples, echo = FALSE, fig.cap = "Two examples of how descriptive models can be used to illustrate specific patterns", out.width = '80%', fig.height = 8, warning = FALSE, message = FALSE}
load("RData/plm_resids.RData")

resid_cols <- RColorBrewer::brewer.pal(8, "Set1")[1:2]
@@ -255,28 +254,12 @@ monolog <-
"Model Evaluation", "2",
"Let’s drop K-NN from the model list. "
)
if (knitr::is_html_output()) {
tab <-
monolog %>%
monolog %>%
dplyr::select(Thoughts, Activity) %>%
kable(
knitr::kable(
caption = "Hypothetical inner monologue of a model developer.",
label = "inner-monologue"
) %>%
kable_styling() %>%
column_spec(2, width = "25%") %>%
column_spec(1, width = "75%", italic = TRUE)
} else {
tab <-
monolog %>%
dplyr::select(Thoughts, Activity) %>%
kable(
caption = "Hypothetical inner monologue of a model developer.",
label = "inner-monologue"
) %>%
kable_styling()
}
tab
)
```
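The diff above collapses the HTML-only kableExtra styling (the `kable_styling()` and `column_spec()` branch) into a single plain `knitr::kable()` call, which renders in HTML, LaTeX, and the asciidoc output used for print. A standalone sketch of the simplified pattern (the table rows here are illustrative, not the chapter's full monologue):

```r
library(knitr)

# Illustrative rows only; the chapter's table has more entries.
monolog <- data.frame(
  Thoughts = c("The out-of-sample results look reasonable.",
               "Let's drop K-NN from the model list."),
  Activity = c("Model Fitting", "Model Evaluation")
)

# Plain kable() works across output formats; kableExtra's kable_styling()
# and column_spec() helpers are HTML/LaTeX-only, which is why the diff
# removes them for the print build.
kable(monolog, caption = "Hypothetical inner monologue of a model developer.")
```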

## Chapter Summary {#software-summary}
14 changes: 5 additions & 9 deletions 03-base-r.Rmd
@@ -4,7 +4,6 @@
knitr::opts_chunk$set(fig.path = "figures/")
data(crickets, package = "modeldata")
library(tidyverse)
library(kableExtra)
```

Before describing how to use tidymodels for applying tidy data principles to building models with R, let's review how models are created, trained, and used in the core R language (often called "base R"). This chapter is a brief illustration of core language conventions that are important to be aware of even if you never use base R for models at all. This chapter is not exhaustive, but it provides readers (especially those new to R) the basic, most commonly used motifs.
@@ -75,7 +74,7 @@ rate ~ temp + species
Species is not a quantitative variable; in the data frame, it is represented as a factor column with levels `"O. exclamationis"` and `"O. niveus"`. The vast majority of model functions cannot operate on nonnumeric data. For species, the model needs to encode the species data in a numeric format. The most common approach is to use indicator variables (also known as dummy variables) in place of the original qualitative values. In this instance, since species has two possible values, the model formula will automatically encode this column as numeric by adding a new column that has a value of zero when the species is `"O. exclamationis"` and a value of one when the data correspond to `"O. niveus"`. The underlying formula machinery automatically converts these values for the data set used to create the model, as well as for any new data points (for example, when the model is used for prediction).

:::rmdnote
Suppose there were five species instead of two. The model formula would automatically add four additional binary columns that are binary indicators for four of the species. The _reference level_ of the factor (i.e., the first level) is always left out of the predictor set. The idea is that, if you know the values of the four indicator variables, the value of the species can be determined. We discuss binary indicator variables in more detail in Section \@ref(dummies).
Suppose there were five species instead of two. The model formula would automatically add four additional binary columns that are binary indicators for four of the species. The _reference level_ of the factor (i.e., the first level) is always left out of the predictor set. The idea is that, if you know the values of the four indicator variables, the value of the species can be determined. We discuss binary indicator variables in more detail in Chapter \@ref(recipes).
:::
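The indicator-variable encoding described above can be inspected directly with base R's `model.matrix()`; a small sketch with made-up values (not rows from the crickets data):

```r
# Two-level factor: model.matrix() adds one indicator column and leaves the
# reference level ("O. exclamationis") out of the predictor set.
species <- factor(c("O. exclamationis", "O. niveus", "O. niveus"))
temp <- c(20.8, 17.2, 18.3)  # made-up temperatures

model.matrix(~ temp + species)
# The "speciesO. niveus" column is 0 for the reference level and 1 otherwise.
```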

The model formula `rate ~ temp + species` creates a model with different y-intercepts for each species; the slopes of the regression lines could be different for each species as well. To accommodate this structure, an interaction term can be added to the model. This can be specified in a few different ways, and the most basic uses the colon:
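The code block that followed this sentence is elided in this diff view; for reference, R's formula syntax offers several equivalent ways to request the interaction (these lines are standard R, not necessarily the chapter's exact example):

```r
# Equivalent ways to specify both main effects plus the two-way interaction:
rate ~ temp + species + temp:species  # colon gives the interaction term itself
rate ~ temp * species                 # shorthand for the line above
rate ~ (temp + species)^2             # expands to the same model for two predictors
```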
@@ -199,7 +198,7 @@ For the most part, practitioners' understanding of what the formula does is domi
(temp + species)^2
```

Our focus, when seeing this, is that there are two predictors and the model should contain their main effects and the two-way interactions. However, this formula also implies that, since `species` is a factor, it should also create indicator variable columns for this predictor (see Section \@ref(dummies)) and multiply those columns by the `temp` column to create the interactions. This transformation represents our second bullet point on encoding; the formula also defines how each column is encoded and can create additional columns that are not in the original data.
Our focus, when seeing this, is that there are two predictors and the model should contain their main effects and the two-way interactions. However, this formula also implies that, since `species` is a factor, it should also create indicator variable columns for this predictor (see Chapter \@ref(recipes)) and multiply those columns by the `temp` column to create the interactions. This transformation represents our second bullet point on encoding; the formula also defines how each column is encoded and can create additional columns that are not in the original data.

:::rmdwarning
This is an important point that will come up multiple times in this text, especially when we discuss more complex feature engineering in Chapter \@ref(recipes) and beyond. The formula in R has some limitations, and our approaches to overcoming them contend with all three aspects.
@@ -246,14 +245,11 @@ prob_tbl <-
)

prob_tbl %>%
kable(
knitr::kable(
caption = "Heterogeneous argument names for different modeling functions.",
label = "probability-args",
escape = FALSE
) %>%
kable_styling(full_width = FALSE) %>%
column_spec(1, monospace = ifelse(prob_tbl$Function == "various", FALSE, TRUE)) %>%
column_spec(3, monospace = TRUE)
)
```

Note that the last example has a custom function to make predictions instead of using the more common `predict()` interface (the generic `predict()` method). This lack of consistency is a barrier to day-to-day usage of R for modeling.
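The point about heterogeneous prediction interfaces can be seen with base R alone: `stats::glm()` requests probabilities via `type = "response"`, while other packages in the table use different argument names for the same thing. This sketch uses the built-in `mtcars` data rather than an example from the book:

```r
# stats::glm() asks for class probabilities with type = "response";
# other modeling packages spell this differently, which is the
# inconsistency the table documents.
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)
head(predict(fit, type = "response"))
```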
@@ -396,7 +392,7 @@ conflict_prefer("filter", winner = "dplyr")

For convenience, `r pkg(tidymodels)` contains a function that captures most of the common naming conflicts that we might encounter:

```{r base-r-conflicts}
```{r base-r-conflicts}
tidymodels_prefer(quiet = FALSE)
```

24 changes: 19 additions & 5 deletions 04-ames.Rmd
@@ -40,6 +40,20 @@ data(ames, package = "modeldata")
dim(ames)
```

Figure \@ref(fig:ames-map) shows the locations of the properties in Ames. The locations will be revisited in the next section.

```{r ames-map}
#| out.width = "100%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "Property locations in Ames, IA.",
#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map. There is a significant area in the center of the map where no homes were sold."
# See file extras/ames_sf.R
knitr::include_graphics("premade/ames_plain.png")
```

The void of data points in the center of Ames corresponds to Iowa State University.

## Exploring Features of Homes in Ames

Let's start our exploratory data analysis by focusing on the outcome we want to predict: the last sale price of the house (in USD). We can create a histogram to see the distribution of sale prices in Figure \@ref(fig:ames-sale-price-hist).
@@ -92,16 +106,16 @@ Despite these drawbacks, the models used in this book use the log transformation
ames <- ames %>% mutate(Sale_Price = log10(Sale_Price))
```
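Since the outcome is now modeled in log10 units, predictions must be back-transformed before reporting in dollars; a small round-trip sketch with illustrative prices (not Ames rows):

```r
# Round-trip of the transformation applied above: log10 for modeling,
# 10^x to return values to dollar units.
sale_price <- c(105000, 126000, 115000)  # illustrative, not Ames data
logged <- log10(sale_price)
stopifnot(all.equal(10^logged, sale_price))
```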

Another important aspect of these data for our modeling is their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, let's use both together to plot the data on a map in Figure \@ref(fig:ames-map).
Another important aspect of these data for our modeling is their geographic locations. This spatial information is contained in the data in two ways: a qualitative `Neighborhood` label as well as quantitative longitude and latitude data. To visualize the spatial information, Figure \@ref(fig:ames-chull) duplicates the data from Figure \@ref(fig:ames-map) with convex hulls around the data from each neighborhood.

```{r ames-map}
```{r ames-chull}
#| out.width = "100%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "Neighborhoods in Ames, IA",
#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map. There is a significant area in the center of the map where no homes were sold."
#| fig.cap = "Neighborhoods in Ames represented using a convex hull",
#| fig.alt = "A scatter plot of house locations in Ames superimposed over a street map with colored regions that show the locations of neighborhoods. Some neighborhoods overlap and a few are nested within other neighborhoods."
# See file extras/ames_sf.R
knitr::include_graphics("premade/ames.png")
knitr::include_graphics("premade/ames_chull.png")
```

We can see a few noticeable patterns. First, there is a void of data points in the center of Ames. This corresponds to the campus of Iowa State University where there are no residential houses. Second, while there are a number of adjacent neighborhoods, others are geographically isolated. For example, as Figure \@ref(fig:ames-timberland) shows, Timberland is located apart from almost all other neighborhoods.
6 changes: 3 additions & 3 deletions 05-data-spending.Rmd
@@ -28,7 +28,7 @@ The other portion of the data is placed into the _test set_. This is held in res
How should we conduct this split of the data? The answer depends on the context.
:::

Suppose we allocate 80% of the data to the training set and the remaining 20% for testing. The most common method is to use simple random sampling. The [`r pkg(rsample)`](https://rsample.tidymodels.org/) package has tools for making data splits such as this; the function `initial_split()` was created for this purpose. It takes the data frame as an argument as well as the proportion to be placed into training. Using the data frame produced by the code snippet from the summary in Section \@ref(ames-summary) that prepared the Ames data set:
Suppose we allocate 80% of the data to the training set and the remaining 20% for testing. The most common method is to use simple random sampling. The [`r pkg(rsample)`](https://rsample.tidymodels.org/) package has tools for making data splits such as this; the function `initial_split()` was created for this purpose. It takes the data frame as an argument as well as the proportion to be placed into training. Using the data frame produced by the code snippet from the summary at the end of Chapter \@ref(ames):

```{r ames-split, message = FALSE, warning = FALSE}
library(tidymodels)
@@ -106,13 +106,13 @@ The proportion of data that should be allocated for splitting is highly dependen

When describing the goals of data splitting, we singled out the test set as the data that should be used to properly evaluate the performance of the final model(s). This raises the question: "How can we tell what is best if we don't measure performance until the test set?"

It is common to hear about _validation sets_ as an answer to this question, especially in the neural network and deep learning literature. During the early days of neural networks, researchers realized that measuring performance by re-predicting the training set samples led to results that were overly optimistic (significantly, unrealistically so). This led to models that overfit, meaning that they performed very well on the training set but poorly on the test set.^[This is discussed in much greater detail in Section \@ref(overfitting-bad).] To combat this issue, a small validation set of data were held back and used to measure performance as the network was trained. Once the validation set error rate began to rise, the training would be halted. In other words, the validation set was a means to get a rough sense of how well the model performed prior to the test set.
It is common to hear about _validation sets_ as an answer to this question, especially in the neural network and deep learning literature. During the early days of neural networks, researchers realized that measuring performance by re-predicting the training set samples led to results that were overly optimistic (significantly, unrealistically so). This led to models that overfit, meaning that they performed very well on the training set but poorly on the test set.^[This is discussed in much greater detail in Chapter \@ref(tuning).] To combat this issue, a small validation set of data was held back and used to measure performance as the network was trained. Once the validation set error rate began to rise, the training would be halted. In other words, the validation set was a means to get a rough sense of how well the model performed prior to the test set.
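One way to realize the training/validation/test scheme described above is with two successive `initial_split()` calls from rsample; the proportions and seed below are illustrative:

```r
library(rsample)

set.seed(501)  # illustrative seed
# First split: hold out 20% of the data for the test set.
first_split <- initial_split(ames, prop = 0.80)
ames_test <- testing(first_split)

# Second split: carve a validation set out of the remaining 80%
# (0.75 * 0.80 = 60% of the original data ends up in training).
second_split <- initial_split(training(first_split), prop = 0.75)
ames_train <- training(second_split)
ames_val <- testing(second_split)
```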

:::rmdnote
Whether validation sets are a subset of the training set or a third allocation in the initial split of the data largely comes down to semantics.
:::

Validation sets are discussed more in Section \@ref(validation) as a special case of _resampling_ methods that are used on the training set.
Validation sets are discussed more in Chapter \@ref(resampling) as a special case of _resampling_ methods that are used on the training set.

## Multilevel Data
