
Commit 23de4db

rwillans authored
RJW chap 11 visualisations notes (#12)
* Initial commit of Chap 3 notes
* Comment out package install and create new slide for meeting videos
* All slides up to the rabies example
* commit before presenting
* final tweak to slides
* final commit pre presentation

--------

Co-authored-by: lgibson7 <“[email protected]”>
Co-authored-by: Lydia Gibson <[email protected]>
1 parent b07dd23 commit 23de4db

5 files changed: +273 −13 lines
Lines changed: 162 additions & 9 deletions
# Techniques for Machine Learning Applications

**Learning Objectives:**

- Manipulate data through feature engineering
- Select the most suitable model for your data
- Learn about machine learning algorithms

## Goals of the Analysis and Nature of Data

### Output is *Continuous*

- Example: how do explanatory variables such as lifestyle or chronic diagnoses affect life expectancy (LE) or disability-adjusted life years (DALYs)?
- Traditional **regression models**, including linear regression (OLS), ridge, and lasso
- Coefficient estimates quantify the association between changes in input and changes in outcome

### Output is *Categorical* or *Binary*

- Outcome is categorical (e.g., disease/no disease)
- Can use logistic regression (more common for *explaining*)
- Or classification algorithms (more common for *predicting*)

### Systemic Modelling / Simulation

- For complex systems modelled by multiple equations
- Typically more *predictive*
- A series of equations is fitted to the data, for example an SIR model (see the sketch below)
- May wish to vary parameters for sensitivity analysis, or explore how changes to inputs affect the predicted outcome
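
As a rough illustration of fitting such a system of equations, here is a minimal SIR sketch using the {deSolve} package; the package choice, parameter values, and initial conditions are all assumptions for illustration, not from the chapter.

```r
# Minimal SIR model sketch with {deSolve}; beta/gamma values are illustrative
library(deSolve)

sir <- function(time, state, parms) {
  with(as.list(c(state, parms)), {
    dS <- -beta * S * I
    dI <- beta * S * I - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}

out <- ode(
  y = c(S = 0.99, I = 0.01, R = 0),    # initial proportions
  times = seq(0, 100, by = 1),
  func = sir,
  parms = c(beta = 0.3, gamma = 0.1)   # vary these for sensitivity analysis
)
head(out)
```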

### Time-Series

- Data has a temporal or seasonal aspect (e.g., seasonal influenza)
- Models like ARIMA can be used to capture autocorrelation and trends, as sketched below
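
A minimal ARIMA sketch on a simulated monthly series; the {forecast} package and the simulated data are assumptions for illustration.

```r
# Fit an ARIMA model to a simulated seasonal series; all values illustrative
library(forecast)

set.seed(1)
flu <- ts(20 + 10 * sin(2 * pi * (1:120) / 12) + rnorm(120, sd = 2),
          frequency = 12)

fit <- auto.arima(flu)   # selects the (p, d, q)(P, D, Q) orders by AICc
forecast(fit, h = 12)    # 12-month-ahead forecast
```
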
## Statistical and Machine Learning Methods

- Several pre-analysis steps are common to many methods

### Exploratory Data Analysis

- Aim is to understand the data
- Descriptive statistics of central tendency and variation
- Basic plots of distributions / skewness (histograms)
- Correlation plots

### Feature Engineering / Transforming Variables

- Reducing skew (log or other transformation)
- Encoding categorical variables as dummies
- Creating new predictor variables and interaction terms
- Centering / scaling variables (see the recipe sketch below)
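
A minimal {recipes} sketch of these steps; the mtcars data and the specific step choices are assumptions for illustration, not the chapter's own recipe.

```r
# Feature-engineering steps expressed as a {recipes} pipeline
library(recipes)
library(dplyr)

cars <- mtcars |> mutate(cyl = factor(cyl))  # give the data a categorical column

rec <- recipe(mpg ~ ., data = cars) |>
  step_log(disp) |>                        # reduce skew
  step_dummy(all_nominal_predictors()) |>  # encode categorical variables as dummies
  step_interact(~ hp:wt) |>                # create an interaction term
  step_normalize(all_numeric_predictors()) # centre and scale

prep(rec) |> bake(new_data = NULL) |> head()
```
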
## Case Study: Predicting Rabies

### Goal

Predict DALYs due to rabies in the 'Asia' and 'Global' regions, using the `hmsidwR::rabies` dataset

### Exploratory Data Analysis (EDA)

- Dataset contains all-cause and rabies mortality plus DALYs for the Asia and Global regions, subdivided by year
- Values have an estimate plus upper and lower bounds in separate columns
- 240 observations across 7 variables
- Examining the data shows that death rates (`dx_rabies`) and DALYs (`dalys_rabies`) differ in magnitude and scale

```{r rabiesdata, message=FALSE, warning=FALSE}
library(tidyverse)

rabies <- hmsidwR::rabies %>%
  filter(year >= 1990 & year <= 2019) %>%
  select(-upper, -lower) %>%
  pivot_wider(names_from = measure, values_from = val) %>%
  filter(cause == "Rabies") %>%
  rename(dx_rabies = Deaths, dalys_rabies = DALYs) %>%
  select(-cause)

rabies %>% head()
```

- After scaling, these values are closer together in magnitude, avoiding the issue of larger variables dominating others in prediction

```{r}
library(patchwork)

p1 <- rabies %>%
  ggplot(aes(x = year, group = location, linetype = location)) +
  geom_line(aes(y = dx_rabies), linewidth = 1) +
  geom_line(aes(y = dalys_rabies))

p2 <- rabies %>%
  # apply a scale transformation to the numeric variables
  mutate(year = as.integer(year),
         across(where(is.double), scale)) %>%
  ggplot(aes(x = year, group = location, linetype = location)) +
  geom_line(aes(y = dx_rabies), linewidth = 1) +
  geom_line(aes(y = dalys_rabies))

p1 + p2
```

### Training and Resampling

- The dataset was split into 80% training and 20% final test, stratified by location
- The 80% training set was then used to create a series of 'folds', or resamples, of the data
- These folds can then be used to validate how well each model (and its selected parameters) matches unseen data
- K-fold cross-validation was used to generate 10 folds via the `vfold_cv()` function from {rsample} (part of tidymodels), as sketched below
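
A minimal sketch of the split-and-resample step with {rsample}; the seed and object names are assumptions, while the 80/20 split, location stratification, and 10 folds follow the notes above.

```r
# Split the data, then build 10 cross-validation folds from the training set
library(rsample)

set.seed(123)                    # seed value is an assumption
rabies_split <- initial_split(rabies, prop = 0.8, strata = location)
rabies_train <- training(rabies_split)
rabies_test  <- testing(rabies_split)

folds <- vfold_cv(rabies_train, v = 10, strata = location)
folds
```
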
### Preprocessing

- Handled using 'recipes' as part of tidymodels pipelines
- **Recipe 0** - all predictors, no transformations [reference model]
- **Recipe 1** - encoding of a dummy variable for region, standardised numeric variables
- **Recipe 2** - as Recipe 1, with an additional step to reduce skewness of the `dalys_rabies` outcome
- Advantage of the 'recipe' approach in tidymodels is that recipes can be piped / swapped out easily (a sketch follows)
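
A hedged sketch of what Recipes 1 and 2 might look like for this data; the exact steps and the Yeo-Johnson choice are assumptions, as the book's own recipes may differ.

```r
# Two candidate preprocessing recipes for the rabies training data
library(recipes)

rec1 <- recipe(dalys_rabies ~ ., data = rabies_train) |>
  step_dummy(location) |>                   # encode region as a dummy
  step_normalize(all_numeric_predictors())  # standardise numeric variables

rec2 <- rec1 |>
  step_YeoJohnson(all_outcomes())           # reduce skewness of the outcome
```
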
### Multicollinearity

- DALYs and mortality are likely to be strongly correlated (DALYs = years of life lost + years lived with disability)
- All-cause and cause-specific mortality will also have some correlation
- This can cause issues with some prediction methods, making it hard for the model to determine which variables have the best predictive power
- In this analysis, dealt with by the choice of prediction method: random forests and GLMs with a lasso penalty are both robust to multicollinearity
### Model 1: Random Forest

- Specified using the `rand_forest()` function within the tidymodels framework
- Hyperparameters tuned using cross-validation and `tune_grid()` / grid search (a sketch follows the figure)
- Optimal parameters gave an RMSE of 0.506
- Fig 7.4a shows a close relationship between predictions and observed data

[![Fig 7.4a from chapter](https://fgazzelloni.quarto.pub/06-techniques_files/figure-html/fig-rf-predictions-1.png)](https://fgazzelloni.quarto.pub/06-techniques.html#fig-rf-predictions-1)
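
A hedged sketch of the random forest specification and tuning; the ranger engine, the tuned parameters, and the grid size are assumptions about the setup, reusing the illustrative `rec1` and `folds` objects from the sketches above.

```r
# Random forest specification, tuned over the cross-validation folds
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 1000) |>
  set_engine("ranger") |>
  set_mode("regression")

rf_wf <- workflow() |>
  add_recipe(rec1) |>
  add_model(rf_spec)

rf_res <- tune_grid(rf_wf, resamples = folds, grid = 10)
show_best(rf_res, metric = "rmse")
```
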
### Model 2: GLM with Lasso Penalty

- Generalised linear model with a penalty term ($\lambda$)
- Cross-validation (as for Model 1) used to tune the $\lambda$ parameter
- Results in a lower RMSE than the random forest (see the sketch below)
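
A hedged sketch of the lasso specification; the glmnet engine and the penalty grid are assumptions, again reusing the illustrative objects from the sketches above.

```r
# Lasso-penalised linear model: mixture = 1 gives a pure lasso penalty
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

lasso_wf <- workflow() |>
  add_recipe(rec1) |>
  add_model(lasso_spec)

lasso_res <- tune_grid(
  lasso_wf,
  resamples = folds,
  grid = grid_regular(penalty(), levels = 30)
)
select_best(lasso_res, metric = "rmse")
```
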
### Additional Models!

- Last section showed code using `parsnip` model specifications and `workflow_set()` to test more models (see the sketch below)
- An SVM with a Yeo-Johnson transformation of the output may actually improve on the GLM (judged on RMSE)
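
A hedged sketch of screening several models at once with {workflowsets}; the SVM specification and the recipe pairings are assumptions about the setup.

```r
# Compare recipe/model combinations in one workflow set
svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) |>
  set_engine("kernlab") |>
  set_mode("regression")

all_models <- workflow_set(
  preproc = list(base = rec1, yj = rec2),
  models  = list(rf = rf_spec, lasso = lasso_spec, svm = svm_spec)
)

all_res <- workflow_map(all_models, "tune_grid", resamples = folds, grid = 10)
rank_results(all_res, rank_metric = "rmse")
```
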
## Summary

This chapter focussed on ML techniques as a holistic analysis pipeline, not on individual ML algorithms or methods. Best practices are summarised at the end of the chapter:

- Conduct exploratory data analysis to understand the underlying structure of the data and relationships between variables.
- Apply feature engineering techniques to create new variables and enhance the model's predictive power.
- Select machine learning models that are contextually appropriate and robust for public health data analysis, such as random forests, generalised linear models, and others.
- Use parameter calibration techniques such as cross-validation, regularisation, Monte Carlo methods, and grid search to optimise model performance.
- Evaluate model performance using appropriate metrics and visualisation tools to assess predictive accuracy and relevance.

TL;DR: It's not just about applying an individual ML model, but about considering the goals, dataset, preprocessing, calibration, and evaluation of the model.

## Meeting Videos {.unnumbered}

### Cohort 1 {.unnumbered}

`r knitr::include_url("https://www.youtube.com/embed/URL")`

<details>
<summary>Meeting chat log</summary>

```
LOG
```

</details>

11_interpreting-model-results-through-visualisation.Rmd

Lines changed: 111 additions & 4 deletions
**Learning objectives:**

- Visualize predicted vs. observed values and assess residuals
- Interpret model metrics with VIP, accuracy, and partial dependency plots
- Create and customize ROC curves and compute AUC for classification models

## Why Plot Model Fits? {-}

- Single measures like MSE or $R^2$ can occasionally be misleading, as Anscombe's quartet shows (see the sketch below the figure)

![Anscombe's Quartet: all 4 plots have the same linear model fit and $R^2$ but different input-output relationships](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Anscombe%27s_quartet_3.svg/960px-Anscombe%27s_quartet_3.svg.png)
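
Anscombe's quartet ships with base R, so the claim is easy to verify; this snippet is an illustrative addition, not from the notes.

```r
# All four Anscombe fits give near-identical coefficients and R^2
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), paste0("y", i)), data = anscombe)
})
sapply(fits, function(f) c(coef(f), r.squared = summary(f)$r.squared))
```
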
## Predicted vs Actual Plots {-}

- Predicted vs Actual plot: scatter of model predictions vs observed values
- 45° line = perfect match of prediction and actual (very rare!)
- Near the line = predictions close to true values (high accuracy)
- Far from the line = larger errors, indicating the model performs less well
- Can reveal patterns of bias, e.g. consistent under-prediction or over-prediction (see the sketch after the figure)

![Example body fat prediction model](images/ch10-bodyfat_predicted_vs_actual.png)
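
A minimal sketch of building such a plot with {ggplot2}; the lm model and mtcars data are illustrative assumptions, not the body-fat model above.

```r
# Predicted vs actual scatter with a dashed 45-degree reference line
library(ggplot2)

fit <- lm(mpg ~ wt + hp, data = mtcars)

data.frame(actual = mtcars$mpg, predicted = fitted(fit)) |>
  ggplot(aes(predicted, actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(title = "Predicted vs Actual", x = "Predicted", y = "Actual")
```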
## Residual Plots {-}

- Residual = Actual − Predicted: measures the error for each prediction
- Residual plot: residuals on the y-axis vs. predicted value or an input on the x-axis (see the sketch after the figure)
- Ideal outcome: residuals normally distributed around 0 (no pattern)
- A pattern in the residuals indicates model misspecification (e.g. a curve suggests a missing non-linear term)
- Heteroskedasticity: residuals grow along the x-axis (fan shape); the error variance isn't constant and standard errors may be wrong
- Can identify outliers: large residuals may indicate points that distort the fit or need investigating

![Figure 11.6: residuals for the meningitis model](https://fgazzelloni.quarto.pub/10-applications_files/figure-html/fig-meningitis-residuals-1.png)
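
A matching residual-plot sketch, reusing the illustrative `fit` from the predicted-vs-actual example above.

```r
# Residuals vs predicted: look for curves (misspecification) or fans
# (heteroskedasticity) around the zero line
data.frame(predicted = fitted(fit), residual = resid(fit)) |>
  ggplot(aes(predicted, residual)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Predicted")
```
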
## Influential Observations {-}

- High-leverage point: an observation with extreme values (far from the average input/output)
- These points pull the linear model fit disproportionately (can change slope/coefficients)
- High leverage + large residual = influential outlier (can skew the model)
- Can use diagnostics like Cook's distance (in the {stats} package) to identify influential points
- Investigate high-leverage cases and consider fixes if needed

```r
# `data_pred` and `mod3` come from the chapter's COVID-19 example
library(tidyverse)
library(RColorBrewer)

data_pred |>
  mutate(cook_over_1 = cooks.distance(mod3) > 1) |>
  ggplot(aes(Deaths, Residuals, colour = cook_over_1)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red") +
  scale_colour_manual(values = brewer.pal(3, "Set1")[1:2],
                      labels = c("Cook's D ≤ 1", "Cook's D > 1"),
                      name = "Influence") +
  labs(title = "Residuals vs. Deaths due to COVID-19",
       x = "Deaths", y = "Residuals")
```

![Points where Cook's Distance > 1](images/chap10_cook.png)

## Comparing Models {-}

- Multiple-model comparison: e.g. a bar chart of $R^2$/AIC for each model
- Actual vs predicted plots with an additional aesthetic or facet for each model
- Aids the choice of model / specification / hyperparameters

![Previous Example - COVID-19 model](https://fgazzelloni.quarto.pub/10-applications_files/figure-html/fig-covid19_model-comparison-1.png)
![Figure 11.5 from the book](https://fgazzelloni.quarto.pub/10-applications_files/figure-html/fig-meningitis-1.png)

## Communicating Results {-}

- Some models are conducive to clear visualisation, e.g. a decision-tree model can be plotted with {rpart.plot} (sketch below the figure)
- This sets out the fitted algorithmic choices and their effect on the predicted outputs

![Fig 11.8: Decision tree for Ischaemic Stroke](https://fgazzelloni.quarto.pub/10-applications_files/figure-html/fig-dv-tree-1.png)
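
A minimal {rpart.plot} sketch on a dataset bundled with {rpart}; the kyphosis data and formula are illustrative assumptions, not the stroke model above.

```r
# Fit a small classification tree and draw its splits and leaf predictions
library(rpart)
library(rpart.plot)

tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
rpart.plot(tree)
```
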
- Random forests can show variable importance plots
- These show the ranked importance of each predictor
- Low-ranking predictors can be dropped if they have little effect on the model (see the {vip} sketch below)

![Fig 11.9: Variable Importance for Ischaemic Stroke](https://fgazzelloni.quarto.pub/10-applications_files/figure-html/fig-rf-importance-1.png)

- Neither visualisation makes a *causal* claim, but both can often give clues for making inferences.
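
A hedged variable-importance sketch with {vip} and {ranger}; the iris data and impurity importance are illustrative assumptions.

```r
# Fit a random forest and plot ranked variable importance
library(ranger)
library(vip)

rf <- ranger(Species ~ ., data = iris, importance = "impurity")
vip(rf)
```
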
## ROC Plots {-}

- ROC curve = sensitivity (true positive rate) vs. 1 − specificity (false positive rate) across all thresholds; shows the trade-off
- AUC (area under the curve) is the overall performance indicator (1.0 = perfect, 0.5 = chance); higher AUC = better model on average
- Use the ROC curve to pick a threshold that fits your needs: e.g., for initial diagnoses, you might choose a threshold giving high sensitivity (accepting more false positives to catch more true cases)
- If false positives are costly, pick a threshold with higher specificity (fewer false alarms, but you may miss some positives)
- Compare models with ROC/AUC: a higher AUC, or a curve closer to the top-left, means a stronger model (see the {yardstick} sketch after the figure)

![Figure 11.10: ROC Curve for Ischaemic Stroke](https://fgazzelloni.quarto.pub/10-applications_files/figure-html/fig-roc-curve-1.png)
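
A hedged sketch of computing AUC and plotting a ROC curve with {yardstick}, using its bundled `two_class_example` data rather than the stroke model above.

```r
# AUC and ROC curve from predicted class probabilities
library(yardstick)
library(ggplot2)

data(two_class_example)

roc_auc(two_class_example, truth, Class1)                  # overall AUC
roc_curve(two_class_example, truth, Class1) |> autoplot()  # ROC curve
```
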
## Partial Dependence Plots {-}

- Plot how the prediction / output changes with changes in one input
- **Assuming** other variables are held constant (typically at their dataset averages)
- Can show marginal effects or key thresholds as the value of the input changes
- Be aware that assuming other variables don't change may not be realistic: inputs may be correlated (see the {pdp} sketch after the figure)

![Figure 11.11: Partial Dependence Plot](https://fgazzelloni.quarto.pub/10-applications_files/figure-html/fig-partial-dependence-1.png)
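
A hedged partial-dependence sketch with the {pdp} package; the randomForest model and mtcars data are illustrative assumptions.

```r
# Marginal effect of one input on the prediction, other inputs averaged out
library(pdp)
library(randomForest)

rf_mpg <- randomForest(mpg ~ ., data = mtcars)

partial(rf_mpg, pred.var = "wt", train = mtcars) |>
  plotPartial()
```
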
## Conclusion {-}

- Visualising the model helps evaluate the quality of the model specification
- It can also be a boon for communicating results to non-technical audiences
- Visualisations can also assist in fine-tuning models and evaluating individual predictor effects

## Meeting Videos {-}

Predicted Vs Actual Bodyfat.png (11.1 KB)

images/chap10_cook.png (11.6 KB)
