```{r iterative-setup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(finetune)
library(patchwork)
library(kableExtra)
library(av)
library(doMC)
registerDoMC(cores = parallel::detectCores(logical = TRUE))
tidymodels_prefer()
## -----------------------------------------------------------------------------
source("extras/verify_results.R")
source("extras/sa_2d_plot.R")
source("extras/bo_3panel_plot.R")
load(file.path("RData", "svm_large.RData"))
## -----------------------------------------------------------------------------
data(cells)
cells <- cells %>% select(-case)
set.seed(1304)
cell_folds <- vfold_cv(cells)
roc_res <- metric_set(roc_auc)
```
# Iterative Search {#iterative-search}
Chapter \@ref(grid-search) demonstrated how grid search takes a pre-defined set of candidate values, evaluates them, then chooses the best settings. Iterative search methods pursue a different strategy. During the search process, they predict which values to test next.
:::rmdnote
When grid search is infeasible or inefficient, iterative methods are a sensible approach for optimizing tuning parameters.
:::
This chapter outlines two search methods. First, we discuss _Bayesian optimization_, which uses a statistical model to predict better parameter settings. After that, the chapter describes a global search method called _simulated annealing_.
We use the same data on cell characteristics as the previous chapter for illustration, but change the model. This chapter uses a support vector machine model because it provides nice two-dimensional visualizations of the search processes.
## A Support Vector Machine Model {#svm}
We once again use the cell segmentation data, described in Section \@ref(evaluating-grid), for modeling, with a support vector machine (SVM) model to demonstrate sequential tuning methods. See @apm for more information on this model. The two tuning parameters to optimize are the SVM cost value and the radial basis function kernel parameter $\sigma$. Both parameters can have a profound effect on the model complexity and performance.
The SVM model uses a dot product and, for this reason, it is necessary to center and scale the predictors. Like the multilayer perceptron model, this model would benefit from the use of PCA feature extraction. However, we will not use this third tuning parameter in this chapter so that we can visualize the search process in two dimensions.
Along with the previously used objects (shown in Section \@ref(grid-summary)), the tidymodels objects `svm_rec`, `svm_spec`, and `svm_wflow` define the model process:
```{r iterative-svm-defs, message = FALSE}
library(tidymodels)
tidymodels_prefer()
svm_rec <-
recipe(class ~ ., data = cells) %>%
step_YeoJohnson(all_numeric_predictors()) %>%
step_normalize(all_numeric_predictors())
svm_spec <-
svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
set_engine("kernlab") %>%
set_mode("classification")
svm_wflow <-
workflow() %>%
add_model(svm_spec) %>%
add_recipe(svm_rec)
```
The default parameter ranges for the two tuning parameters `cost` and `rbf_sigma` are:
```{r iterative-svm-param}
cost()
rbf_sigma()
```
For illustration, let's slightly change the kernel parameter range, to improve the visualizations of the search:
```{r iterative-svm-param-set}
svm_param <-
svm_wflow %>%
extract_parameter_set_dials() %>%
update(rbf_sigma = rbf_sigma(c(-7, -1)))
```
Before discussing specific details about iterative search and how it works, let's explore the relationship between the two SVM tuning parameters and the area under the ROC curve for this specific data set. We constructed a very large regular grid, composed of 2,500 candidate values, and evaluated the grid using resampling. This is obviously impractical in regular data analysis and tremendously inefficient. However, it elucidates the path that the search process should take and where the numerically optimal value(s) occur.
Figure \@ref(fig:roc-surface) shows the results of evaluating this grid, with lighter color corresponding to higher (better) model performance. There is a large swath in the lower diagonal of the parameter space that is relatively flat with poor performance. A ridge of best performance occurs in the upper-right portion of the space. The black dot indicates the best settings. The transition from the plateau of poor results to the ridge of best performance is very sharp. There is also a sharp drop in the area under the ROC curve just to the right of the ridge.
```{r roc-surface}
#| out.width = "80%",
#| echo = FALSE,
#| warning = FALSE,
#| fig.cap = "Heatmap of the mean area under the ROC curve for a high density grid of tuning parameter values. The best point is a solid dot in the upper-right corner. ",
#| fig.alt = "A heatmap of the mean area under the ROC curve for a high density grid of tuning parameter values. The best point is a solid dot in the upper-right corner. The surface has a ridge of high performance moving to the lower right. "
# See file extras/cells_svm_large.R
knitr::include_graphics("premade/roc_surface.png")
```
The following search procedures require at least some resampled performance statistics before proceeding. For this purpose, the following code creates a small regular grid that resides in the flat portion of the parameter space. The `tune_grid()` function resamples this grid:
```{r iterative-svm-initial, message = FALSE}
set.seed(1401)
start_grid <-
svm_param %>%
update(
cost = cost(c(-6, 1)),
rbf_sigma = rbf_sigma(c(-6, -4))
) %>%
grid_regular(levels = 2)
set.seed(1402)
svm_initial <-
svm_wflow %>%
tune_grid(resamples = cell_folds, grid = start_grid, metrics = roc_res)
collect_metrics(svm_initial)
```
This initial grid shows fairly equivalent results, with no individual point much better than any of the others. These results can be ingested by the iterative tuning functions discussed in the following sections to be used as initial values.
## Bayesian Optimization
Bayesian optimization techniques analyze the current resampling results and create a predictive model to suggest tuning parameter values that have yet to be evaluated. The suggested parameter combination is then resampled. These results are then used in another predictive model that recommends more candidate values for testing, and so on. The process proceeds for a set number of iterations or until no further improvements occur. @Shahriari and @frazier2018tutorial are good introductions to Bayesian optimization.
When using Bayesian optimization, the primary concerns are how to create the model and how to select parameters recommended by that model. First, let's consider the technique most commonly used for Bayesian optimization, the Gaussian process model.
### A Gaussian process model
Gaussian process (GP) [@SCHULZ20181] models are well-known statistical techniques that have a history in spatial statistics (under the name of _kriging methods_). They can be derived in multiple ways, including as a Bayesian model; see @RaWi06 for an excellent reference.
Mathematically, a GP is a collection of random variables whose joint probability distribution is multivariate Gaussian. In the context of our application, this is the collection of performance metrics for the tuning parameter candidate values. For the previous initial grid of four samples, the realizations of these four random variables were `r knitr::combine_words(round(collect_metrics(svm_initial)$mean, 4))`. These are assumed to be distributed as multivariate Gaussian. The inputs that define the independent variables/predictors for the GP model are the corresponding tuning parameter values (shown in Table \@ref(tab:initial-gp-data)).
```{r iterative-gp-data-table, echo = FALSE, results = "asis"}
collect_metrics(svm_initial) %>%
select(ROC = mean, cost, rbf_sigma) %>%
as.data.frame() %>%
format(digits = 4, scientific = FALSE) %>%
kable(
caption = "Resampling statistics used as the initial substrate to the Gaussian process model.",
label = "initial-gp-data"
) %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::add_header_above(c("outcome" = 1, "predictors" = 2))
```
Gaussian process models are specified by their mean and covariance functions, although the latter has the most effect on the nature of the GP model. The covariance function is often parameterized in terms of the input values (denoted as $x$). As an example, a commonly used covariance function is the squared exponential^[This equation is also the same as the _radial basis function_ used in kernel methods, such as the SVM model that is currently being used. This is a coincidence; this covariance function is unrelated to the SVM tuning parameter that we are using. ] function:
$$\operatorname{cov}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \exp\left(-\frac{1}{2}|\boldsymbol{x}_i - \boldsymbol{x}_j|^2\right) + \sigma^2_{ij}$$
where $\sigma^2_{ij}$ is a constant error variance term that is zero when $i=j$. This equation translates to:
> As the distance between two tuning parameter combinations increases, the covariance between the performance metrics decreases exponentially.
The nature of the equation also implies that the variation of the outcome metric is minimized at the points that have already been observed (i.e., when $|\boldsymbol{x}_i - \boldsymbol{x}_j|^2$ is zero).
The nature of this covariance function allows the Gaussian process to represent highly nonlinear relationships between model performance and the tuning parameters even when only a small amount of data exists.
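To make this concrete, here is a minimal sketch of the squared exponential covariance (ignoring the constant error variance term); the helper function and the input values are ours, purely for illustration:
```{r iterative-sqexp-sketch, eval = FALSE}
# Squared exponential covariance between two tuning parameter vectors:
sq_exp_cov <- function(x_i, x_j) {
  exp(-0.5 * sum((x_i - x_j)^2))
}
sq_exp_cov(c(0.1, 0.2), c(0.1, 0.2))  # identical inputs: covariance of 1
sq_exp_cov(c(0.1, 0.2), c(0.5, 0.6))  # nearby inputs: covariance ~0.85
sq_exp_cov(c(0.1, 0.2), c(2.0, 2.0))  # distant inputs: covariance ~0.03
```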
:::rmdwarning
However, fitting these models can be difficult in some cases, and the model becomes more computationally expensive as the number of tuning parameter combinations increases.
:::
An important virtue of this model is that, since a full probability model is specified, the predictions for new inputs can reflect the entire distribution of the outcome. In other words, new performance statistics can be predicted in terms of both mean and variance.
Suppose that two new tuning parameters were under consideration. In Table \@ref(tab:tuning-candidates), candidate _A_ has a slightly better mean ROC value than candidate _B_ (the current best is `r round(max(collect_metrics(svm_initial)$mean), 4)`). However, its variance is 16-fold larger than that of _B_ (a four-fold larger standard deviation). Is this good or bad? Choosing option _A_ is riskier but has potentially higher return. The increase in variance also reflects that this new value is farther from the existing data than _B_. The next section considers these aspects of GP predictions for Bayesian optimization in more detail.
```{r iterative-gp-pred, echo = FALSE, result = "asis"}
best_val <- max(collect_metrics(svm_initial)$mean)
tmp <- tibble(candidate = LETTERS[1:2], .mean = c(.90, .89), .sd = c(0.02, 0.005))
tmp %>%
select(candidate, mean = .mean, variance = .sd) %>%
mutate(variance = variance^2) %>%
as.data.frame() %>%
format(digits = 4, scientific = FALSE) %>%
kable(
caption = "Two example tuning parameters considered for further sampling.",
label = "tuning-candidates"
) %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::add_header_above(c(" " = 1, "GP Prediction of ROC AUC" = 2))
```
:::rmdnote
Bayesian optimization is an iterative process.
:::
Based on the initial grid of four results, the GP model is fit, candidates are predicted, and a fifth tuning parameter combination is selected. We compute performance estimates for the new configuration, the GP is refit with the five existing results (and so on).
### Acquisition functions
Once the Gaussian process is fit to the current data, how is it used? Our goal is to choose the next tuning parameter combination that is most likely to have "better results" than the current best. One approach to do this is to create a large candidate set (perhaps using a space-filling design) and then make mean and variance predictions on each. Using this information, we choose the most advantageous tuning parameter value.
A class of objective functions, called _acquisition functions_, facilitates the trade-off between mean and variance. Recall that the predicted variance of the GP model is mostly driven by how far away a candidate is from the existing data. The trade-off between the predicted mean and variance for new candidates is frequently viewed through the lens of exploration and exploitation:
* _Exploration_ biases the selection towards regions where there are fewer (if any) observed candidate models. This tends to give more weight to candidates with higher variance and focuses on finding new results.
* _Exploitation_ principally relies on the mean prediction to find the best (mean) value. It focuses on existing results.
```{r iterative-aq-data, include = FALSE}
source("extras/nonlinear_function.R")
grid <-
tibble(x = seq(0, 1, length.out = 200)) %>%
mutate(y = purrr::map_dbl(x, nonlin_function, error = FALSE))
set.seed(121)
current_iter <- tibble(x = c(.09, .41, .55, .7, .8)) %>%
mutate(y = purrr::map_dbl(x, nonlin_function))
gp <- GPfit::GP_fit(matrix(current_iter$x, ncol = 1), current_iter$y)
gp_pred <-
predict(gp, matrix(grid$x, ncol = 1))$complete_data %>%
as_tibble() %>%
setNames(c("x", ".mean", ".sd")) %>%
mutate(.sd = sqrt(.sd))
gp_pred <-
gp_pred %>%
bind_cols(
exp_improve() %>%
predict(gp_pred, maximize = TRUE, iter = 1, best = max(current_iter$y)) %>%
setNames("exp_imp")
) %>%
bind_cols(
conf_bound(kappa = .1) %>%
predict(gp_pred, maximize = TRUE, iter = 1, best = max(current_iter$y)) %>%
setNames("conf_int_01")
) %>%
bind_cols(
conf_bound(kappa = 1) %>%
predict(gp_pred, maximize = TRUE, iter = 1, best = max(current_iter$y)) %>%
setNames("conf_int_1")
)
```
To demonstrate, let's look at a toy example with a single parameter whose values lie in [0, 1], where the performance metric is $R^2$. The true function is shown in Figure \@ref(fig:performance-profile), along with `r xfun::numbers_to_words(nrow(current_iter))` candidate values that have existing results as points.
```{r performance-profile}
#| echo = FALSE,
#| fig.height = 4,
#| fig.cap = "Hypothetical true performance profile over an arbitrary tuning parameter, with five estimated points",
#| fig.alt = "A hypothetical true performance profile over an arbitrary tuning parameter. Five estimated points are also shown. The profile is highly nonlinear with a peak in-between two of the observed points."
y_lab <- expression(Estimated ~ italic(R^2))
ggplot(grid, aes(x = x, y = y)) +
geom_line(color = "red", alpha = .5, linewidth = 1.25) +
labs(y = y_lab, x = "Tuning Parameter") +
geom_point(data = current_iter)
```
For these data, the GP model fit is shown in Figure \@ref(fig:estimated-profile). The shaded region indicates the mean $\pm$ 1 standard error. The two vertical lines indicate two candidate points that are examined in more detail later.
The shaded confidence region demonstrates the squared exponential variance function; it becomes very large between points and converges to zero at the existing data points.
```{r estimated-profile}
#| echo = FALSE,
#| fig.height = 4,
#| fig.cap = "Estimated performance profile generated by the Gaussian process model. The shaded region shows one-standard-error bounds.",
#| fig.alt = "The estimated performance profile generated by the Gaussian process model. The shaded region shows one-standard-error bounds. Two vertical lines show potential points to be sampled in the next iteration."
y_lab <- expression(Estimated ~ italic(R^2))
gp_pred %>%
ggplot(aes(x = x)) +
geom_point(data = current_iter, aes(y = y)) +
geom_line(aes(y = .mean)) +
geom_vline(xintercept = c(0.1, 0.25), lty = 2, alpha = .5) +
geom_ribbon(aes(ymin = .mean - 1 * .sd, ymax = .mean + 1 * .sd), alpha = .1) +
labs(y = y_lab, x = "Tuning Parameter")
```
This nonlinear trend passes through each observed point, but the model is not perfect. There are no observed points near the true optimum setting and, in this region, the fit could be much better. Despite this, the GP model can effectively point us in the right direction.
From a pure exploitation standpoint, the best choice would select the parameter value that has the best mean prediction. Here, this would be a value of `r round(gp_pred$x[which.max(gp_pred$.mean)], 3)`, just to the right of the existing best observed point at 0.09.
As a way to encourage exploration, a simple (but not often used) approach is to find the tuning parameter associated with the largest confidence interval. For example, by using a single standard deviation for the $R^2$ confidence bound, the next point to sample would be `r round(gp_pred$x[which.max(gp_pred$conf_int_1)], 3)`. This is slightly more into the region with no observed results. Increasing the number of standard deviations used in the upper bound would push the selection farther into empty regions.
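As a sketch of this confidence bound approach, with made-up GP predictions over three candidate values:
```{r iterative-conf-bound-sketch, eval = FALSE}
grid_x <- c(0.10, 0.25, 0.50)          # candidate parameter values
mu     <- c(0.88, 0.85, 0.70)          # illustrative GP mean predictions
sigma  <- c(0.01, 0.05, 0.02)          # illustrative GP standard deviations

kappa <- 1                             # number of standard deviations
grid_x[which.max(mu + kappa * sigma)]  # 0.25 has the largest upper bound
```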
One of the most commonly used acquisition functions is _expected improvement_. The notion of improvement requires a value for the current best results (unlike the confidence bound approach). Since the GP can describe a new candidate point using a distribution, we can weight the parts of the distribution that show improvement using the probability of the improvement occurring.
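For a Gaussian prediction, the expected improvement has a well-known closed form; a minimal sketch (using illustrative numbers, not values from the fitted GP) is:
```{r iterative-exp-improve-sketch, eval = FALSE}
# Closed-form expected improvement for a Gaussian prediction with mean `mu`
# and standard deviation `sigma` when maximizing a metric:
expected_improvement <- function(mu, sigma, best) {
  z <- (mu - best) / sigma
  (mu - best) * pnorm(z) + sigma * dnorm(z)
}
# Both candidates below have means slightly worse than a current best of
# 0.86, but the high-variance candidate has far more probability mass above
# the best value and earns a larger expected improvement:
expected_improvement(mu = 0.855, sigma = 0.005, best = 0.86)  # ~0.0004
expected_improvement(mu = 0.850, sigma = 0.040, best = 0.86)  # ~0.0115
```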
For example, consider two candidate parameter values of 0.10 and 0.25 (indicated by the vertical lines in Figure \@ref(fig:estimated-profile)). Using the fitted GP model, their predicted $R^2$ distributions are shown in Figure \@ref(fig:two-candidates) along with a reference line for the current best results.
```{r two-candidates}
#| echo = FALSE,
#| fig.height = 4,
#| fig.width = 6.25,
#| out.width = "80%",
#| fig.cap = "Predicted performance distributions for two sampled tuning parameter values",
#| fig.alt = "Predicted performance distributions for two sampled tuning parameter values. For one, the distribution is slightly better than the current value with a small spread. The other parameter value is slightly worse but has a very wide spread."
small_pred <-
predict(gp, c(0.1, 0.25))$complete_data %>%
as_tibble() %>%
setNames(c("x", ".mean", ".sd")) %>%
mutate(
value = c(0.1, 0.25),
.sd = sqrt(.sd),
max = .mean + 3 * .sd,
min = .mean - 3 * .sd
)
small_pred <-
small_pred %>%
bind_cols(
exp_improve() %>%
predict(small_pred, maximize = TRUE, iter = 1, best = max(current_iter$y)) %>%
setNames("exp_imp")
)
get_density <- function(dat) {
res <- tibble(x = seq(dat$min, dat$max, length.out = 200)) %>%
mutate(density = dnorm(x, dat$.mean, dat$.sd),
`Parameter Value` = format(dat$value))
res
}
x_lab <- expression(Predicted ~ italic(R^2) ~ Distribution)
small_pred %>%
group_by(value) %>%
do(get_density(.)) %>%
ungroup() %>%
ggplot(aes(x = x, y = density, color = `Parameter Value`, lty = `Parameter Value`)) +
geom_line() +
geom_vline(xintercept = max(current_iter$y), lty = 3) +
labs(x = x_lab) +
scale_color_brewer(palette = "Set1")
```
When only considering the mean $R^2$ prediction, a parameter value of 0.10 is the better choice (see Table \@ref(tab:two-exp-improve)). The tuning parameter recommendation for 0.25 is, on average, predicted to be worse than the current best. However, since it has higher variance, it has more overall probability area above the current best. As a result, it has a larger expected improvement:
```{r iterative-gp-data-table-2, echo = FALSE, results = "asis"}
small_pred %>%
select(`Parameter Value` = x, Mean = .mean, `Std Dev` = .sd, `Expected Improvement` = exp_imp) %>%
as.data.frame() %>%
format(digits = 4, scientific = FALSE) %>%
kable(
caption = "Expected improvement for the two candidate tuning parameters.",
label = "two-exp-improve"
) %>%
kableExtra::kable_styling(full_width = FALSE) %>%
kableExtra::add_header_above(c(" " = 1, "Predictions" = 3))
```
When expected improvement is computed across the range of the tuning parameter, the recommended point to sample is much closer to 0.25 than 0.10, as shown in Figure \@ref(fig:expected-improvement).
```{r expected-improvement}
#| echo = FALSE,
#| fig.height= 5.5,
#| fig.cap = "The estimated performance profile generated by the Gaussian process model (top panel) and the expected improvement (bottom panel). The vertical line indicates the point of maximum improvement.",
#| fig.alt = "The estimated performance profile generated by the Gaussian process model (top panel) and the expected improvement (bottom panel). The vertical line indicates the point of maximum improvement where estimated performance is high and the predicted variation is also large."
y_lab <- expression(Estimated ~ italic(R^2))
p1 <-
gp_pred %>%
ggplot(aes(x = x)) +
geom_point(data = current_iter, aes(y = y)) +
geom_line(aes(y = .mean)) +
geom_ribbon(aes(ymin = .mean - 1 * .sd, ymax = .mean + 1 * .sd), alpha = .1) +
labs(y = y_lab, x = NULL) +
theme(
axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
plot.margin = unit(c(0, 0, 0, 0), "null")
)
p2 <-
gp_pred %>%
ggplot(aes(x = x, y = exp_imp)) +
geom_line() +
labs(y = "Expected Improvement", x = "Tuning Parameter") +
theme(plot.margin = unit(c(0, 0, 0, 0), "null")) +
geom_vline(xintercept = gp_pred$x[which.max(gp_pred$exp_imp)], lty = 2)
p1/p2
```
Numerous acquisition functions have been proposed and discussed; in tidymodels, expected improvement is the default.
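Putting the pieces together, the following self-contained sketch runs a few iterations of Bayesian optimization on a made-up one-dimensional objective, using the `r pkg(GPfit)` package (as in the figures above) and the closed-form expected improvement. Everything here is illustrative rather than what `tune_bayes()` does internally:
```{r iterative-bo-loop-sketch, eval = FALSE}
library(GPfit)

# A made-up performance profile to maximize over [0, 1]:
objective <- function(x) sin(12 * x) * x + 0.5

set.seed(30)
x_obs <- c(0.05, 0.35, 0.65, 0.95)   # initial design points
y_obs <- objective(x_obs)
grid  <- seq(0, 1, length.out = 200) # candidate set to score

for (i in 1:5) {
  gp    <- GP_fit(matrix(x_obs, ncol = 1), y_obs)    # fit the GP
  pred  <- predict(gp, matrix(grid, ncol = 1))$complete_data
  mu    <- pred[, 2]                                 # predicted means
  sigma <- sqrt(pmax(pred[, 3], 0))                  # predicted std devs
  best  <- max(y_obs)
  z     <- (mu - best) / sigma
  ei    <- (mu - best) * pnorm(z) + sigma * dnorm(z) # expected improvement
  ei[!is.finite(ei)] <- 0                            # guard zero-variance points
  x_new <- grid[which.max(ei)]                       # next candidate to "resample"
  x_obs <- c(x_obs, x_new)
  y_obs <- c(y_obs, objective(x_new))                # evaluate, then refit
}
max(y_obs)  # best value found after five iterations
```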
```{r iterative-cells-bo-calcs, echo = FALSE}
# We will do the calculations here but use some nonstandard options. First,
# purrr is used to capture the output in a vector so that we can show the
# results piecemeal. Also, a hidden option is used to save the grid of candidate
# values for each iteration of the search. These will be used to make an
# animation in a later chunk.
#
# This means that any changes to this chunk have to be made to the next chunk
# (where the code is shown and not executed).
ctrl <- control_bayes(verbose = TRUE)
ctrl$save_gp_scoring <- TRUE
tune_bayes_sssshhh <- purrr::quietly(tune_bayes)
set.seed(1403)
svm_bo_sshh <-
svm_wflow %>%
tune_bayes_sssshhh(
resamples = cell_folds,
metrics = roc_res,
initial = svm_initial,
param_info = svm_param,
iter = 25,
control = ctrl
)
# Make sure that the results have not changed since 2022-02-20. Save the results with:
# svm_bo_metrics <- collect_metrics(svm_bo_sshh$result)
# save(svm_bo_metrics, file = "RData/svm_bo_metrics.RData")
# Check with:
verify_consistent_bo(collect_metrics(svm_bo_sshh$result))
svm_bo <- svm_bo_sshh$result
svm_bo_output <- svm_bo_sshh$messages
gp_candidates <- collect_gp_results(svm_bo)
```
### The `tune_bayes()` function {#tune-bayes}
To implement iterative search via Bayesian optimization, use the `tune_bayes()` function. Its syntax is very similar to `tune_grid()` but with several additional arguments:
* `iter` is the maximum number of search iterations.
* `initial` can be either an integer, an object produced using `tune_grid()`, or one of the racing functions. Using an integer specifies the size of a space-filling design that is sampled prior to the first GP model.
* `objective` specifies which acquisition function should be used. The `r pkg(tune)` package contains functions to pass here, such as `exp_improve()` or `conf_bound()`.
* The `param_info` argument, in this case, specifies the range of the parameters as well as any transformations that are used. These are used to define the search space. In situations where the default parameter objects are insufficient, `param_info` is used to override the defaults.
The `control` argument now uses the results of `control_bayes()`. Some helpful arguments there are (an example call follows this list):
* `no_improve` is an integer that will stop the search if improved parameters are not discovered within `no_improve` iterations.
* `uncertain` is also an integer (or `Inf`) that will take an _uncertainty sample_ if there is no improvement within `uncertain` iterations. This will select the next candidate that has large variation. It has the effect of pure exploration since it does not consider the mean prediction.
* `verbose` is a logical that will print logging information as the search proceeds.
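For example, a control object that combines these options might look like the following (the values are illustrative):
```{r iterative-control-bayes-example, eval = FALSE}
ctrl <- control_bayes(
  no_improve = 10L,  # stop after 10 iterations without improvement
  uncertain  = 5L,   # take an uncertainty sample after 5 flat iterations
  verbose    = TRUE  # log the search as it proceeds
)
```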
Let's use the first SVM results from Section \@ref(svm) as the initial substrate for the Gaussian process model. Recall that, for this application, we want to maximize the area under the ROC curve. Our code is:
```{r iterative-cells-bo, eval = FALSE}
ctrl <- control_bayes(verbose = TRUE)
set.seed(1403)
svm_bo <-
svm_wflow %>%
tune_bayes(
resamples = cell_folds,
metrics = roc_res,
initial = svm_initial,
param_info = svm_param,
iter = 25,
control = ctrl
)
```
```{r iterative-cells-info, include = FALSE}
bo_res <- collect_metrics(svm_bo) %>% mutate(current_best = FALSE)
for(i in 1:nrow(bo_res)) {
bo_res$current_best[i] <- bo_res$mean[i] > max(bo_res$mean[1:(i-1)])
}
init_vals <- bo_res %>% dplyr::filter(.iter == 0)
best_init <- max(bo_res$mean[bo_res$.iter == 0])
best_bo <- max(bo_res$mean)
best_bo_iter <- bo_res$.iter[which.max(bo_res$mean)]
new_best_iter <- bo_res$.iter[which(bo_res$current_best)]
new_best_iter <- new_best_iter[new_best_iter > 0]
num_improve <- length(new_best_iter)
last_iter <- max(collect_metrics(svm_bo)$.iter)
iter_1_roc <- bo_res$mean[bo_res$.iter == 1]
iter_1_imp <- iter_1_roc > best_init
iter_1_text <-
paste0(
ifelse(iter_1_imp, "showed an improvement, resulting in an ROC value of ",
"failed to improve the outcome with an ROC value of "),
round(iter_1_roc, 5), "."
)
iter_2_roc <- bo_res$mean[bo_res$.iter == 2]
iter_2_imp <- iter_2_roc > max(bo_res$mean[bo_res$.iter < 2])
iter_2_text <-
dplyr::case_when(
!iter_1_imp & !iter_2_imp ~
paste0("the second iteration also failed to yield an improvement."),
!iter_1_imp & iter_2_imp ~
paste0("the second iteration did yield a better result with an area under the ROC curve of ",
round(iter_2_roc, 5), "."),
iter_1_imp & !iter_2_imp ~
paste0("the second iteration did not continue the trend with a suboptimal ROC value of ",
round(iter_2_roc, 5), "."),
iter_1_imp & !iter_2_imp ~
paste0("the second iteration further increased the outcome value (ROC = ",
round(iter_2_roc, 5), ").")
)
if (num_improve > 1) {
improve_text <-
paste0(
"There were a total of ",
num_improve,
" improvements in the outcome along the way at iterations ",
knitr::combine_words(new_best_iter),
"."
)
} else {
improve_text <-
paste0("There was only a single improvement in the outcome at iteration ",
new_best_iter,
".")
}
if (last_iter < 25) {
last_bo_text <-
paste0(
"There were no more improvements and the default option is to stop if no progress is made after `no_improve = ",
ctrl$no_improve,
"` more steps. The last step was:"
)
} else {
last_bo_text <- "The last step was:"
}
```
The search process starts with an initial best value of `r round(best_init, 5)` for the area under the ROC curve. A Gaussian process model uses these `r xfun::numbers_to_words(nrow(init_vals))` statistics to create a model. The large candidate set is automatically generated and scored using the expected improvement acquisition function. The first iteration `r iter_1_text` After fitting another Gaussian process model with the new outcome value, `r iter_2_text`
The log of the first two iterations, produced by the `verbose` option, was:
```{r iterative-cells-bo-print-first, echo = FALSE}
so_stop_index <- grep("Iteration 3", svm_bo_output)
if (length(so_stop_index) > 0) {
cat(svm_bo_output[1:(so_stop_index - 2)], sep = "")
}
```
The search continues. `r improve_text` The best result occurred at iteration `r max(new_best_iter)` with an area under the ROC curve of `r round(best_bo, 5)`.
```{r iterative-cells-bo-print-impr, echo = FALSE}
all_imp_index <- grep("♥", svm_bo_output)
so_stop_index <- all_imp_index[length(all_imp_index)]
if (length(all_imp_index) > 0) {
so_start_index <- so_stop_index - 10
cat(svm_bo_output[so_start_index:(so_stop_index + 1)], sep = "")
}
```
`r last_bo_text`
```{r iterative-cells-bo-print-last, echo = FALSE}
so_start <- paste("Iteration", last_iter)
so_start_index <- grep(so_start, svm_bo_output)
if (length(so_start_index) > 0) {
cat(svm_bo_output[so_start_index:length(svm_bo_output)], sep = "")
}
```
The functions that are used to interrogate the results are the same as those used for grid search (e.g., `collect_metrics()`, etc.). For example:
```{r iterative-bo-best}
show_best(svm_bo)
```
The `autoplot()` function has several options for iterative search methods. Figure \@ref(fig:progress-plot) shows how the outcome changed over the search by using `autoplot(svm_bo, type = "performance")`.
```{r progress-plot}
#| fig.height = 4,
#| echo = FALSE,
#| fig.cap = 'The progress of the Bayesian optimization produced when the `autoplot()` method is used with `type = "performance"`',
#| fig.alt = "The progress of the Bayesian optimization produced when the `autoplot()` method is used with `type = 'performance'`. The plot shows the estimated performance on the y axis versus the iteration number on the x axis. Confidence intervals are shown for the points."
autoplot(svm_bo, type = "performance")
```
An additional plot type, `type = "parameters"`, shows the parameter values over iterations.
The animation below visualizes the results of the search. The black $\times$ values show the starting values contained in `svm_initial`. The top-left blue panel shows the predicted mean value of the area under the ROC curve. The red panel on the top-right displays the predicted variation in the ROC values while the bottom plot visualizes the expected improvement. In each panel, darker colors indicate less attractive values (e.g., small mean values, large variation, and small improvements).
```{r iterative-bo-progress, include = FALSE}
av_capture_graphics(
make_bo_animation(gp_candidates, svm_bo),
output = "bo_search.mp4",
width = 760,
height = 760,
res = 100,
vfilter = 'framerate=fps=10',
framerate = 1/3
)
```
<video width="720" height="720" controls>
<source src="bo_search.mp4" type="video/mp4">
</video>
The predicted mean surface is very inaccurate in the first few iterations of the search. Despite this, it does help guide the process to the region of good performance. In other words, the Gaussian process model is wrong but shows itself to be very useful. Within the first ten iterations, the search is sampling near the optimum location.
While the best tuning parameter combination is on the boundary of the parameter space, Bayesian optimization will often choose new points on other sides of the boundary. While we can adjust the ratio of exploration and exploitation, the search tends to sample boundary points early on.
:::rmdnote
If the search is seeded with an initial grid, a space-filling design would probably be a better choice than a regular design. It samples more unique values of the parameter space and would improve the predictions of the standard deviation in the early iterations.
:::
Finally, if the user interrupts the `tune_bayes()` computations, the function returns the current results (instead of resulting in an error).
## Simulated Annealing
_Simulated annealing_ (SA) [@kirkpatrick1983optimization;@van1987simulated] is a general nonlinear search routine inspired by the process in which metal cools. It is a global search method that can effectively navigate many different types of search landscapes, including discontinuous functions. Unlike most gradient-based optimization routines, simulated annealing can reassess previous solutions.
### Simulated annealing search process
The process of using simulated annealing starts with an initial value and embarks on a controlled random walk through the parameter space. Each new candidate parameter value is a small perturbation of the previous value that keeps the new point within a local neighborhood.
The candidate point is resampled to obtain its corresponding performance value. If this achieves better results than the previous parameters, it is accepted as the new best and the process continues. If the results are worse than the previous value, the search procedure may still use this parameter to define further steps. This depends on two factors. First, the likelihood of accepting a bad result decreases as performance becomes worse. In other words, a slightly worse result has a better chance of acceptance than one with a large drop in performance. The other factor is the number of search iterations. Simulated annealing wants to accept fewer suboptimal values as the search proceeds. From these two factors, the _acceptance probability_ for a bad result can be formalized as:
$$\operatorname{Pr}[\text{accept suboptimal parameters at iteration } i] = \exp(c\times D_i \times i)$$
where $i$ is the iteration number, $c$ is a user-specified constant, and $D_i$ is the percent difference between the old and new values (where negative values imply worse results). For a bad result, we determine the acceptance probability and compare it to a random uniform number. If the random number is greater than the probability value, the search discards the current parameters and the next iteration creates its candidate value in the neighborhood of the previous value. Otherwise, the next iteration forms the next set of parameters based on the current (suboptimal) values.
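A minimal sketch of this acceptance rule (implementing the formula above, not the internal `r pkg(finetune)` code) is:
```{r iterative-sa-accept-sketch, eval = FALSE}
# `pct_diff` is D_i (negative for worse results), `iter` is i, and
# `coef` is the user-specified constant c:
accept_suboptimal <- function(pct_diff, iter, coef = 0.02) {
  prob <- exp(coef * pct_diff * iter)  # acceptance probability
  runif(1) <= prob                     # accept when the uniform draw is below it
}

exp(0.02 * -2 * 10)        # a 2% loss at iteration 10: probability ~0.67
set.seed(11)
accept_suboptimal(-2, 10)  # random accept/reject decision
```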
:::rmdnote
The acceptance probabilities of simulated annealing allow the search to proceed in the wrong direction, at least for the short term, with the potential to find a much better region of the parameter space in the long run.
:::
How are the acceptance probabilities influenced? The heatmap in Figure \@ref(fig:acceptance-prob) shows how the acceptance probability can change over iterations, performance, and the user-specified coefficient.
```{r acceptance-prob}
#| echo = FALSE,
#| dev = "png",
#| fig.height = 4.5,
#| out.width = "80%",
#| fig.cap = "Heatmap of the simulated annealing acceptance probabilities for different coefficient values",
#| fig.alt = "A heatmap of the simulated annealing acceptance probabilities for different coefficient values. The probabilities are affected by the both the iteration number as well as how far off the the performance is from the current best."
get_accept_probs <- function(coef, pct_diff) {
# pct loss to abs value
candidate <- .8 - (pct_diff * .8 /100)
x <- finetune:::acceptance_prob(0.8, candidate, 1:50, coef = coef, maximize = TRUE)
tibble(
`Acceptance Probability` = x,
iteration = 1:50,
pct_diff = pct_diff,
coefficient = coef
)
}
prob_settings <- crossing(pct_diff = 1:10, coefficient = c(10, 20, 30)/1000)
prob_res <- purrr::map2_dfr(prob_settings$coefficient, prob_settings$pct_diff, get_accept_probs)
ggplot(prob_res, aes(x = iteration, y = pct_diff, fill = `Acceptance Probability`)) +
geom_raster() +
facet_wrap( ~ coefficient, labeller = label_both) +
scale_fill_gradientn(
colours = scales::brewer_pal(palette = "Greens")(8),
limits = 0:1
) +
labs(y = "Percent Loss", x = "Iteration")
```
The user can adjust the coefficients to find a probability profile that suits their needs. In `finetune::control_sim_anneal()`, the default for this `cooling_coef` argument is `r control_sim_anneal()$cooling_coef`. Decreasing this coefficient will encourage the search to be more forgiving of poor results.
This process continues for a set number of iterations but can halt if no globally best results occur within a pre-determined number of iterations. However, it can be very helpful to set a _restart threshold_: if there is a string of failures, this feature revisits the last globally best parameter settings and starts anew.
The most important detail is to define how to perturb the tuning parameters from iteration to iteration. There are a variety of methods in the literature for this. We follow the method given in @gsa called _generalized simulated annealing_. For continuous tuning parameters, we define a small radius to specify the local "neighborhood." For example, suppose there are two tuning parameters and each is bounded by zero and one. The simulated annealing process generates random values on the surrounding radius and randomly chooses one to be the current candidate value.
In our implementation, the neighborhood is determined by scaling the current candidate to be between zero and one based on the range of the parameter object, so radius values between 0.05 and 0.15 seem reasonable. For these values, the fastest that the search could go from one side of the parameter space to the other is about 10 iterations. The size of the radius controls how quickly the search explores the parameter space. In our implementation, a range of radii is specified so different magnitudes of "local" define the new candidate values.
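A rough sketch of this neighborhood generation for two parameters already scaled to [0, 1] (again illustrative, not the `r pkg(finetune)` internals):
```{r iterative-sa-neighbors-sketch, eval = FALSE}
set.seed(42)
current <- c(cost = 0.05, rbf_sigma = 0.90)   # current point, in scaled units
radius  <- runif(50, min = 0.05, max = 0.15)  # random radii
theta   <- runif(50, min = 0, max = 2 * pi)   # random angles

candidates <- data.frame(
  cost      = current[["cost"]]      + radius * cos(theta),
  rbf_sigma = current[["rbf_sigma"]] + radius * sin(theta)
)

# Keep only candidates inside the parameter boundaries, then pick one
# at random as the next configuration:
in_bounds  <- candidates$cost >= 0 & candidates$cost <= 1 &
              candidates$rbf_sigma >= 0 & candidates$rbf_sigma <= 1
candidates <- candidates[in_bounds, ]
candidates[sample(nrow(candidates), 1), ]
```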
To illustrate, we'll use the two main `r pkg(glmnet)` tuning parameters:
* The amount of total regularization (`penalty`). The default range for this parameter is $10^{`r penalty()$range[[1]]`}$ to $10^{`r penalty()$range[[2]]`}$. It is typical to use a log (base-10) transformation for this parameter.
* The proportion of the lasso penalty (`mixture`). This is bounded at zero and one with no transformation.
```{r iterative-neighborhood-calcs, echo = FALSE}
#| echo = FALSE,
#| message = FALSE,
#| warning = FALSE,
#| out.width = "80%",
#| fig.cap = "How simulated annealing determines the local neighborhood for two numeric tuning parameters. The clouds of points show possible next values where one would be selected at random. ",
#| fig.alt = "An illustration of how simulated annealing determines what is the local neighborhood for two numeric tuning parameters. The clouds of points show possible next values where one would be selected at random. The candidate points are small circular clouds surrounding the current best point."
glmn_param <- parameters(penalty(), mixture())
pen_rng <- unlist(range_get(penalty(), original = TRUE))
mix_rng <- 0:1
iter_1 <- tibble(penalty = 0.025, mixture = .05)
next_neighbors <-
finetune:::random_real_neighbor(iter_1, iter_1, glmn_param, retain = 300) %>%
mutate(Iteration = 1)
set.seed(1)
neighbors_values <- next_neighbors
best_values <- iter_1 %>% mutate(Iteration = 1)
scoring <- function(x) {
- log10(x$penalty) * .1 + x$mixture * 2 + rnorm(nrow(x), sd = .5)
}
path <- best_values
for (i in 2:6) {
set.seed(i + 5)
next_scores <- scoring(next_neighbors)
next_ind <- which.max(next_scores)
next_value <- next_neighbors %>% slice(next_ind) %>% mutate(Iteration = i)
best_values <-
bind_rows(
best_values,
next_value
)
path <- bind_rows(path, best_values %>% mutate(Iteration = i))
next_neighbors <-
finetune:::random_real_neighbor(next_value %>% select(-Iteration),
path %>% select(-Iteration),
glmn_param, retain = 300) %>%
mutate(Iteration = i)
neighbors_values <-
bind_rows(
neighbors_values,
next_neighbors
)
}
```
The process starts with initial values of `penalty = 0.025` and `mixture = 0.050`. Using a radius that randomly fluctuates between 0.050 and 0.150, the data are appropriately scaled, random values are generated on radii around the initial point, then one is randomly chosen as the candidate. For illustration, we will assume that all candidate values are improvements. Using the new value, a set of new random neighbors are generated, one is chosen, and so on. Figure \@ref(fig:iterative-neighborhood) shows `r xfun::numbers_to_words(max(best_values$Iteration))` iterations as the search proceeds toward the upper left corner.
```{r iterative-neighborhood}
#| echo = FALSE,
#| message = FALSE,
#| warning = FALSE,
#| out.width = "80%",
#| fig.cap = "An illustration of how simulated annealing determines what is the local neighborhood for two numeric tuning parameters. The clouds of points show possible next values where one would be selected at random. ",
#| fig.alt = "An illustration of how simulated annealing determines what is the local neighborhood for two numeric tuning parameters. The clouds of points show possible next values where one would be selected at random. The candidate points are small circular clouds surrounding the current best point."
ggplot(neighbors_values, aes(x = penalty, y = mixture)) +
geom_point(alpha = .3, size = 3/4, aes(color = factor(Iteration)), show.legend = FALSE) +
scale_x_continuous(trans = "log10", limits = pen_rng) +
scale_y_continuous(limits = mix_rng) +
geom_point(data = best_values) +
geom_path(data = path) +
geom_point(data = path) +
facet_wrap(vars(Iteration), labeller = label_both) +
labs(
x = paste(penalty()$label, "(penalty)"),
y = paste(mixture()$label, "(mixture)")
)
```
Note that, during some iterations, the candidate sets along the radius exclude points outside of the parameter boundaries. Also, our implementation biases the choice of the next tuning parameter configurations _away_ from new values that are very similar to previous configurations.
For non-numeric parameters, we assign a probability for how often the parameter value changes.
### The `tune_sim_anneal()` function {#tune-sim-anneal}
To implement iterative search via simulated annealing, use the `tune_sim_anneal()` function. The syntax for this function is nearly identical to `tune_bayes()`. There are no options for acquisition functions or uncertainty sampling. The `control_sim_anneal()` function has some details that define the local neighborhood and the cooling schedule (an example call follows this list):
* `no_improve`, for simulated annealing, is an integer that will stop the search if no global best or improved results are discovered within `no_improve` iterations. Accepted suboptimal or discarded parameters count as "no improvement."
* `restart` is the number of iterations with no new best results before starting from the previous best results.
* `radius` is a numeric vector on (0, 1) that defines the minimum and maximum radius of the local neighborhood around the initial point.
* `flip` is a probability value that defines the chances of altering the value of categorical or integer parameters.
* `cooling_coef` is the $c$ coefficient in $\exp(c\times D_i \times i)$ that modulates how quickly the acceptance probability decreases over iterations. Larger values of `cooling_coef` decrease the probability of accepting a suboptimal parameter setting.
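As an illustrative example, a control object that sets these values explicitly (the values shown are examples, not recommendations) could be:
```{r iterative-control-sa-example, eval = FALSE}
ctrl_sa <- control_sim_anneal(
  no_improve   = 10L,           # stop after 10 iterations with no improvement
  restart      = 8L,            # restart from the best result after 8 poor iterations
  radius       = c(0.05, 0.15), # neighborhood size, in scaled units
  flip         = 0.75,          # chance of changing non-numeric parameters
  cooling_coef = 0.02,          # the constant c in exp(c * D_i * i)
  verbose      = TRUE
)
```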
For the cell segmentation data, the syntax is very consistent with the previously used functions:
```{r iterative-cells-sa-calcs, include = FALSE}
# As we did above, this chunk executes the code with some extra options that
# capture the output and save some internal objects for plotting.
# This means that any changes to this chunk have to be made to the next chunk
# (where the code is shown and not executed).
ctrl_sa <- control_sim_anneal(no_improve = 10L, verbose = TRUE, save_history = TRUE)
tune_sim_anneal_sssshhh <- purrr::quietly(tune_sim_anneal)
set.seed(1404)
svm_sa_sshh <-
svm_wflow %>%
tune_sim_anneal_sssshhh(
resamples = cell_folds,
metrics = roc_res,
initial = svm_initial,
param_info = svm_param,
iter = 50,
control = ctrl_sa
)
# Make sure that the results have not changed since 2022-02-20. Save the results with:
# svm_sa_metrics <- collect_metrics(svm_sa_sshh$result)
# save(svm_sa_metrics, file = "RData/svm_sa_metrics.RData")
# Check with:
verify_consistent_sa(collect_metrics(svm_sa_sshh$result))
svm_sa <- svm_sa_sshh$result
svm_sa_output <- svm_sa_sshh$messages
# We set tune_sim_anneal() to save a file to the temp directory.
file.copy(
file.path(tempdir(), "sa_history.RData"),
"RData/sa_history.RData",
overwrite = TRUE
)
```
```{r iterative-cells-sa, eval = FALSE}
ctrl_sa <- control_sim_anneal(verbose = TRUE, no_improve = 10L)
set.seed(1404)
svm_sa <-
svm_wflow %>%
tune_sim_anneal(
resamples = cell_folds,
metrics = roc_res,
initial = svm_initial,
param_info = svm_param,
iter = 50,
control = ctrl_sa
)
```
```{r iterative-sa-history, include = FALSE}
load("RData/sa_history.RData")
## -----------------------------------------------------------------------------
restart_iter <- result_history$.iter[result_history$results == "restart from best"]
restart_num <- length(restart_iter)
sa_iter_list <- knitr::combine_words(restart_iter)
restart_txt <-
dplyr::case_when(
restart_num == 0 ~ paste0("There were no restarts during the search."),
restart_num == 1 ~ paste0("There was a single restart at iteration ", restart_iter)[1],
TRUE ~ paste0("There were ", restart_num, " restarts at iterations ", sa_iter_list)[1]
)
discard_num<- length(result_history$.iter[result_history$results == "discard suboptimal"])
if (discard_num > 0) {
restart_txt <-
paste0(
restart_txt,
" as well as ",
discard_num,
" discarded ",
ifelse(discard_num > 1, "candidates ", "candidate "),
"during the process."
)
} else {
restart_txt <- paste0(restart_txt, ".")
}
## -----------------------------------------------------------------------------
best_iters <- result_history$.iter[result_history$results == "new best"]
best_init <- max(result_history$mean[result_history$.iter == 0])
best_sa_res <- max(result_history$mean[result_history$.iter > 0])
best_sa_inds <- result_history$.iter[which.max(result_history$mean)]
best_txt <-
dplyr::case_when(
restart_num == 1 ~ paste0("a new global optimum once at iteration ", best_iters, "."),
TRUE ~ paste0("new global optimums at ", length(best_iters), " different iterations.")[1]
)
best_txt <- best_txt[1]
if (length(best_iters) > 1) {
best_txt <-
paste0(
best_txt,
" The earliest improvement was at iteration ",
min(best_iters),
" and the final optimum occured at iteration ",
max(best_iters),
". The best overall results occured at iteration ",
best_sa_inds, " with a mean area under the ROC curve of ",
round(best_sa_res, 4), " (compared to an initial best of ",
round(best_init, 4), ")."
)
}
```
The simulated annealing process discovered `r best_txt` `r restart_txt`
The `verbose` option prints details of the search process. The output for the first five iterations was:
```{r iterative-cells-sa-print-start, echo = FALSE}
so_stop_index <- grep("^ 5", svm_sa_output)
if (length(so_stop_index) > 0) {
cat(svm_sa_output[1:so_stop_index], sep = "\n")
}
```
The output for last ten iterations was:
```{r iterative-cells-sa-print-end, echo = FALSE}
last_sa_iter <- max(result_history$.iter)
so_start_index <- grep(paste0("^", last_sa_iter - 10), svm_sa_output)
so_stop_index <- grep(paste0("^", last_sa_iter), svm_sa_output)
if (length(so_stop_index) > 0) {
cat(svm_sa_output[so_start_index:so_stop_index], sep = "\n")
}
```
As with the other `tune_*()` functions, the corresponding `autoplot()` function produces visual assessments of the results. Using `autoplot(svm_sa, type = "performance")` shows the performance over iterations (Figure \@ref(fig:sa-iterations)) while `autoplot(svm_sa, type = "parameters")` plots performance versus specific tuning parameter values (Figure \@ref(fig:sa-parameters)).
```{r sa-iterations}
#| fig.height = 4,
#| echo = FALSE,
#| fig.cap = 'Progress of the simulated annealing process shown when the `autoplot()` method is used with `type = "performance"`',
#| fig.alt = "The progress of the simulated annealing process shown when the `autoplot()` method is used with `type = 'performance'`. The plot shows the estimated performance on the y axis versus the iteration number on the x axis. Confidence intervals are shown for the points."
autoplot(svm_sa, type = "performance")
```
```{r sa-parameters}
#| fig.height = 4,
#| echo = FALSE,
#| fig.cap = 'Performance versus tuning parameter values when the `autoplot()` method is used with `type = "parameters"`.',
#| fig.alt = "A visualization of performance versus tuning parameter values when the `autoplot()` method is used with `type = 'parameters'`. The plot shows different panels for each tuning parameter in their transformed units."
autoplot(svm_sa, type = "parameters")
```
A visualization of the search path helps to understand where the search process did well and where it went astray:
```{r iterative-sa-plot, include = FALSE}
av_capture_graphics(
sa_2d_plot(svm_sa, result_history, svm_large),
output = "sa_search.mp4",
width = 720,
height = 720,
res = 120,
vfilter = 'framerate=fps=10',
framerate = 1/3
)
```
<video width="720" height="720" controls>
<source src="sa_search.mp4" type="video/mp4">
</video>
Like `tune_bayes()`, manually stopping execution will return the completed iterations.
## Chapter Summary {#iterative-summary}
This chapter described two iterative search methods for optimizing tuning parameters. Bayesian optimization uses a predictive model trained on existing resampling results to suggest tuning parameter values, while simulated annealing walks through the hyperparameter space to find good values. Both can be effective at finding good values alone or as a follow-up method used after an initial grid search to further fine-tune performance.