Commit d4db7dc

2025 may release (#86)
* release chapters only
* renv update
* no errata anymore
* fix some chunks with fig- labels and fencing
* fix some chunks with fig- labels and fencing
* stable port number
* temp fill in color value
* something is off with theme_light_bl()
* kknn no longer on cran
* fix some tables
* renv refresh
* other updates
* upgraded tables
* temp remotes changes
* re-render
* update snapshot
* change release date
1 parent f92cb0a commit d4db7dc

31 files changed: +881 -685 lines changed

DESCRIPTION

Lines changed: 2 additions & 2 deletions
@@ -63,7 +63,6 @@ Imports:
  jsonlite,
  kableExtra,
  kernlab,
- kknn,
  klaR,
  knitr,
  leaflet,
@@ -139,7 +138,8 @@ Remotes:
  Bioconductor/BiocParallel,
  mixOmicsTeam/mixOmics,
  stevenpawley/colino,
- JamesHWade/measure
+ JamesHWade/measure,
+ tidymodels/[email protected]
  Config/testthat/edition: 3
  Encoding: UTF-8
  LazyData: true

R/shiny-polynomial.R

Lines changed: 1 addition & 0 deletions
@@ -152,3 +152,4 @@ server <- function(input, output, session) {
  }

  app <- shinyApp(ui, server)
+

RData/deliveries_cubist.RData

12 Bytes
Binary file not shown.

RData/deliveries_lm.RData

-6.68 KB
Binary file not shown.

RData/mlp_rf_mtr.RData

-2 Bytes
Binary file not shown.

_freeze/chapters/categorical-predictors/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/contributing/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/embeddings/execute-results/html.json

Lines changed: 3 additions & 5 deletions
Large diffs are not rendered by default.

_freeze/chapters/feature-selection/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/grid-search/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/initial-data-splitting/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/interactions-nonlinear/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/introduction/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/iterative-search/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/missing-data/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/numeric-predictors/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/overfitting/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/resampling/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/chapters/whole-game/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_freeze/index/execute-results/html.json

Lines changed: 2 additions & 2 deletions
Large diffs are not rendered by default.

_quarto.yml

Lines changed: 2 additions & 1 deletion
@@ -1,5 +1,7 @@
  project:
  type: book
+ preview:
+ port: 3763

  filters:
  - shinylive
@@ -80,7 +82,6 @@ book:
  - chapters/grid-search.qmd
  - chapters/iterative-search.qmd
  - chapters/feature-selection.qmd
- - chapters/comparing-models.qmd
  - part: "Classification"
  - part: "Regression"
  - part: "Characterization"

chapters/categorical-predictors.qmd

Lines changed: 16 additions & 10 deletions
@@ -138,7 +138,7 @@ As a simple example, consider the customer type predictor with categories: "cont
  @tbl-indicators shows how this works for the customer type. The rows depict the possible values in the data, while the columns are the resulting features used in place of the original column. This table uses the most common indicator encoding method called _reference cell_ parameterization (also called a _treatment contrast_). First, a category is chosen as the reference value. In @tbl-indicators, the first alpha-numeric value is used (`"contract"`), but this is an arbitrary choice. After this, we create separate columns for all possible values except for the reference value. Each of these columns has a value of one when the data matches the column for that value (and is zero otherwise).

  ```{r}
- #| label: indicators
+ #| label: tbl-indicators
  #| tbl-cap: "Indicator columns produced from a categorical column using a reference cell parameterization."

  customer_types <-
@@ -154,7 +154,8 @@ customer_types %>%
  rename_all(~ gsub("_", " ", .x)) %>%
  gt() %>%
  tab_spanner(label = "Indicator Columns", columns = c(-`customer type`)) %>%
- tab_options(table.width = pct(50))
+ tab_options(table.width = pct(50)) |>
+ tab_style_body(style = cell_text(color = "gray70"), values = 0)
  ```

  The rationale for excluding one column is that you can infer the reference value if you know the values of all of the existing indicator columns^[In other words, we know that the vector of indicators (0, 0, 0) must represent the contract customers.]. Including all possible indicator columns embeds a redundancy in the data. As we will see shortly, data that contain this type of redundancy pose problems for some models like linear regression.
@@ -188,20 +189,24 @@ hot_mod <-
  @tbl-ref-cell-effects shows the encoding in @tbl-indicators and adds columns for a numeric outcome and the intercept term. The outcome column shows the average daily rate (in €). Using a standard estimation procedure (called ordinary least squares), the bottom row of @tbl-ref-cell-effects shows the `r ncol(ref_cell_mod$x)` parameter estimates. Since all of the indicators for the contract customer row are zero, the intercept column estimates the mean value for that level ($\widehat{\beta}_0$ = `r round(coef(ref_cell_mod)[["(Intercept)"]], 1)`). The variable $x_{i1}$ only has an indicator for the "group" customers. Hence, its estimate corresponds to the difference in the average group outcome values (`r mean_adr$rounded[mean_adr$customer_type == "group"]`) minus the effect of the reference cell: `r mean_adr$rounded[mean_adr$customer_type == "contract"]` - `r mean_adr$rounded[mean_adr$customer_type == "group"]`. From this, the resulting estimate ($\widehat{\beta}_1$ = `r round(coef(ref_cell_mod)[["customer_typegroup"]], 2)`) is the effect of the group customers above and beyond the impact of the contract customers. The parameter estimates for the other possible values follow analogous interpretations.

  ```{r}
- #| label: ref-cell-effects
+ #| label: tbl-ref-cell-effects
  #| tbl-cap: "An example of linear regression parameter estimates corresponding to a reference cell parameterization."

  format_encoding(ref_cell_mod) %>%
- tab_options(table.width = pct(66))
+ tab_options(table.width = pct(66)) |>
+ tab_style_body(style = cell_text(color = "gray70"), values = 0) |>
+ cols_width(info ~ pct(25))
  ```

  Another popular method for making indicator variables is called **one-hot encoding** (also known as a cell means encoding). This technique, shown in @tbl-one-hot, makes indicators for all possible levels of the predictor and does _not_ show an intercept column (for reasons described shortly). In this model parameterization, indicators are specific to each value in the data, and the linear regression estimates are the average response values for each customer type.

  ```{r}
- #| label: one-hot
+ #| label: tbl-one-hot
  #| tbl-cap: "One-hot encoded indicator variables from a categorical column of data."
  format_encoding(hot_mod) %>%
- tab_options(table.width = pct(69))
+ tab_options(table.width = pct(69)) |>
+ tab_style_body(style = cell_text(color = "gray70"), values = 0) |>
+ cols_width(info ~ pct(25))
  ```

  One-hot encodings are often used in nonlinear models, especially in neural networks and tree-based models. Indicators are not generally required for the latter but can be used^[This is discussed in greater detail for one of the case studies in @sec-reg-summary.].
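
As an aside on the context paragraphs in this hunk, the two parameterizations can be reproduced with base R's `model.matrix()`. This is a minimal sketch and not code from the commit: the `bookings` data frame and its `adr` values are made up for illustration, and only the level names `contract` and `group` come from the chapter text.

```r
# Hypothetical toy data: two bookings per customer type. The `adr` values are
# invented for illustration and are not taken from the hotel rate data.
bookings <- data.frame(
  customer_type = factor(
    rep(c("contract", "group", "transient", "transient_party"), each = 2)
  ),
  adr = c(70, 74, 80, 90, 100, 104, 60, 64)
)

# Reference cell (treatment contrast): an intercept plus one indicator for
# every level except the first one ("contract").
ref_cell_x <- model.matrix(~ customer_type, data = bookings)

# One-hot (cell means): one indicator per level and no intercept column.
one_hot_x <- model.matrix(~ customer_type - 1, data = bookings)

ref_cell_fit <- lm(adr ~ customer_type, data = bookings)
one_hot_fit <- lm(adr ~ customer_type - 1, data = bookings)

# Per-level averages, for comparison with the coefficients.
level_means <- tapply(bookings$adr, bookings$customer_type, mean)

# Intercept = mean of the reference level; each remaining coefficient is that
# level's mean minus the reference mean.
coef(ref_cell_fit)

# One-hot coefficients are the level means themselves.
coef(one_hot_fit)
```

With these made-up numbers, the intercept equals the contract mean, the `customer_typegroup` coefficient equals the group mean minus the contract mean, and the one-hot coefficients match `level_means`.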
@@ -319,7 +324,7 @@ hash_256 <-
  ```

  ```{r}
- #| label: feature-hash
+ #| label: tbl-feature-hash
  #| tbl-cap: "Signed indicators for agent via feature hashing."

  remake_name <- function(x) {
@@ -377,8 +382,9 @@ bind_rows(hash_top, hash_middle, hash_bottom, hash_summary) %>%
  align = "right",
  columns = everything()
  ) %>%
- tab_options(table.width = pct(70))
-
+ tab_options(table.width = pct(70)) |>
+ tab_style_body(style = cell_text(color = "gray70"), values = " 0") |>
+ cols_width(Agent ~ pct(25))
  ```

  The main downside of this method is that the use of hash values makes it impossible to explain the model. If the tenth feature column is critical, we can't explain why this is the case for new data (since the hash function is practically non-reversible and may include collisions). This may be fine if the primary objective is prediction rather than interpretation. When the goal is to optimize predictive performance, then the number of hashing columns to use can be included as a tuning parameter. The model tuning process can then determine an optimal value of the number of hashing columns.
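
The signed hashing scheme discussed in this context paragraph can be illustrated with a toy sketch in base R. Everything here is an assumption for illustration: `toy_hash()` is a simple polynomial string hash rather than the hash function used in the book, and `hash_features()` and the agent names are hypothetical. The number of columns is an argument, which is the quantity the text suggests tuning.

```r
# Toy polynomial string hash (hypothetical; real implementations typically
# rely on a stronger hash function).
toy_hash <- function(x, seed = 0) {
  h <- seed
  for (code in utf8ToInt(x)) {
    h <- (h * 31 + code) %% 2147483647
  }
  h
}

# Map each category value to one signed indicator among `num_cols` columns.
hash_features <- function(values, num_cols) {
  out <- matrix(0, nrow = length(values), ncol = num_cols)
  for (i in seq_along(values)) {
    # The first hash picks the column (1-based).
    col <- toy_hash(values[i]) %% num_cols + 1
    # A second hash with a different seed picks the sign.
    sgn <- if (toy_hash(values[i], seed = 7) %% 2 == 0) 1 else -1
    out[i, col] <- sgn
  }
  out
}

agents <- c("agent_a", "agent_b", "agent_c", "agent_a")  # hypothetical names
hashed <- hash_features(agents, num_cols = 4)
hashed
```

Each row holds a single `1` or `-1`, and repeated agents always map to the same column. Distinct agents can share a column (a collision) and the mapping cannot be inverted, which is the interpretability cost the paragraph describes.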
@@ -542,7 +548,7 @@ The amount of shrinkage was driven mainly by the number of bookings per agent. @
  To reiterate how these values are used for pre-processing this type of predictor, @tbl-effect-estimates shows the linear mixed model analysis results. The numeric column is our primary model's data for representing the agent names. This avoids creating a large number of indicator variables for this predictor.

  ```{r}
- #| label: effect-estimates
+ #| label: tbl-effect-estimates
  #| tbl-cap: "Examples of the numeric values that are used in place of each agent's data when effect encodings are used."
  effect_chr <-
  encoded_results %>%

chapters/embeddings.qmd

Lines changed: 10 additions & 10 deletions
@@ -501,7 +501,7 @@ Note that the first component alone captured `r round(barley_cumulative_variance
  ::: {.figure-content}

  ```{shinylive-r}
- #| label: fig-linear-scores
+ #| label: shiny-linear-scores
  #| out-width: "80%"
  #| viewerHeight: 550
  #| standalone: true
@@ -519,8 +519,8 @@ source("https://raw.githubusercontent.com/aml4td/website/main/R/shiny-setup.R")
  source("https://raw.githubusercontent.com/aml4td/website/main/R/shiny-linear-scores.R")

  app
- `
- ``
+ ```
+
  :::

  A visualization of the four new features for different linear embedding methods. The data shown are the validation set results.
@@ -536,7 +536,7 @@ For PCA, it can be very instructive to visualize the loadings for each component
  ::: {.figure-content}

  ```{shinylive-r}
- #| label: fig-linear-loadings
+ #| label: shiny-linear-loadings
  #| viewerHeight: 550
  #| standalone: true

@@ -552,8 +552,8 @@ source(
  )

  app
- ``
- `
+ ```
+
  :::

  The loadings for the first four components of each linear embedding method as a function of wavelength.
@@ -939,7 +939,7 @@ Take @fig-mds-example(a) as an example. There are ten points in two dimensions (
  #| label: mds-example-computations
  #| include: false

- pens <- penguins[complete.cases(penguins),]
+ pens <- modeldata::penguins[complete.cases(modeldata::penguins),]

  n <- 10
  set.seed(119)
@@ -1274,7 +1274,7 @@ For supervised UMAP, there is an additional weighting parameter (between zero an
  ::: {.figure-content}

  ```{shinylive-r}
- #| label: fig-umap
+ #| label: shiny-umap
  #| viewerHeight: 550
  #| standalone: true

@@ -1288,8 +1288,8 @@ source("https://raw.githubusercontent.com/aml4td/website/main/R/shiny-setup.R")
  source("https://raw.githubusercontent.com/aml4td/website/main/R/shiny-umap.R")

  app
- ``
- `
+ ```
+
  :::

  A visualization of UMAP results for the barley data using different values for several tuning parameters. The points are the validation set values.

chapters/grid-search.qmd

Lines changed: 2 additions & 1 deletion
@@ -575,6 +575,7 @@ At each resampling estimate beyond the first $B_{min}$ iterations, the current c
  \end{algorithmic}
  \end{algorithm}
  ```
+
  :::

  ::: {.column width="10%"}
@@ -755,7 +756,7 @@ Using the same 10-fold cross-validation scheme, @fig-1d-boost shows the results
  ::: {.figure-content}

  ```{r}
- #| label: 1d-boost
+ #| label: shiny-1d-boost
  #| echo: false
  #| fig-align: center
  #| out-width: 70%

chapters/initial-data-splitting.qmd

Lines changed: 2 additions & 2 deletions
@@ -52,7 +52,7 @@ This chapter will examine how we can appropriately utilize our data. Except in @
  These data, originally published by @ames, are an excellent teaching example. Data were collected for `r format(nrow(ames), big.mark = ",")` houses in Ames, Iowa, via the local assessor's office. A variety of different characteristics of the houses were measured. [Chapter 4](https://www.tmwr.org/ames.html) of @tmwr contains a detailed examination of these data. For illustration, we will focus on a smaller set of predictors, summarized in Tables [-@tbl-ames-numeric] and [-@tbl-ames-categorical]. The geographic locations of the properties are shown in @fig-ames-selection.

  ```{r}
- #| label: ames-numeric
+ #| label: tbl-ames-numeric
  #| echo: false
  #| warning: false
  #| message: false
@@ -145,7 +145,7 @@ bind_cols(
  ```

  ```{r}
- #| label: ames-categorical
+ #| label: tbl-ames-categorical
  #| echo: false
  #| tbl-cap: A summary of categorical predictors in the Ames housing data.
  #| html-table-processing: none
