---
output:
pdf_document: default
html_document: default
---
# Introduction
**Learning objectives:**
- Introduce the concept of predictive models.
- Detail the evolution of statistical models.
- Highlight the challenges and risks associated with advanced predictive models.
- Emphasize the importance of model exploration tools to:
  - Understand models
  - Evaluate model performance
  - Ensure ethical compliance
## Purpose of Predictive Models {-}
*Predict values of a variable of interest based on values of other variables*.
$$
Y = f(X) + \epsilon
$$
We can predict:
- The risk of heart disease based on a patient's characteristics
- Political attitudes based on Facebook comments
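
As a minimal sketch of what playing the role of f(X) looks like in code (the `DALEX` package and its bundled `titanic_imputed` data are assumptions for illustration, not something prescribed by this chapter), a logistic regression turns passenger characteristics into a predicted probability:

```r
# A minimal sketch: a logistic regression plays the role of f(X),
# predicting a 0/1 outcome from several explanatory variables.
library("DALEX")   # only for the titanic_imputed example data

model <- glm(survived ~ gender + age + class + fare,
             data = titanic_imputed, family = "binomial")

# Predicted probability of survival for the first passenger
predict(model, newdata = titanic_imputed[1, ], type = "response")
```
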
## First Stage {-}
- The **least squares method**, published by **Legendre in 1805** and by **Gauss in 1809**, gave rise to the first **regression models** for predicting _continuous_ variables.
- **Classification models**, used to predict _nominal_ variables, were introduced by **Ronald Fisher in 1936** with linear discriminant analysis.
- Since then, many other statistical models have been developed, such as:
- Generalized linear models
- Classification and regression trees
- Rule-based models
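
A hedged sketch of those two starting points, using only base R, `MASS`, and the built-in `iris` data (the data set Fisher analysed in his 1936 paper):

```r
# Least squares (Legendre, Gauss): an ordinary linear regression
fit_ls <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
coef(fit_ls)

# Fisher (1936): linear discriminant analysis for a nominal variable
library("MASS")
fit_lda <- lda(Species ~ ., data = iris)
predict(fit_lda, iris[1, ])$class
```
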
## Starting the Big Data Era {-}
- With **more data** and **more computational power**, newer models have gained new capabilities:
- Flexibility
- Internal variable selection
- High precision of predictions
- Most of these modelling techniques combine **hundreds or thousands** of simpler models into one super-model, as happens with:
- Bagging
- Boosting
- Model stacking
- Large deep neural models may have over a **billion parameters**.
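
As an illustration of bagging (a sketch only; the `randomForest` package and the `titanic_imputed` data from `DALEX` are assumptions), a random forest averages the votes of hundreds of simple decision trees:

```r
# Bagging in practice: a random forest combines many simple trees
library("randomForest")
library("DALEX")   # only for the titanic_imputed example data

set.seed(1)
rf <- randomForest(factor(survived) ~ ., data = titanic_imputed, ntree = 500)
rf   # the printout reports the number of trees and the out-of-bag error
```
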
## The Dangers of Black Box Models {-}
- **IBM’s Watson for Oncology** was criticized by oncologists for delivering unsafe and inaccurate recommendations.
- **Amazon’s** system for **curriculum-vitae screening** was found to be biased against women.
- Algorithms behind the **Apple Credit Card** have been accused of being gender-biased.
> The General Data Protection Regulation (GDPR, 2018) **creates a right to explanation**
## A New Modeling Process {-}
![](img/00-welcome/02-UMEP-Importance.png){width=80% height=80%}
## We need more tools {-}
The true bottleneck in predictive modelling is the lack of tools for **model exploration**:
- **Model explanation**: Obtaining insight into model-based predictions.
- **Model examination**: Evaluating the model's performance and understanding its weaknesses.
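
One possible model-exploration workflow, sketched with the `DALEX` package (an assumption for illustration; any comparable toolbox would do): a fitted model is wrapped in an explainer object, which then supports both explanation and examination.

```r
# Sketch of a model-exploration workflow (assumes DALEX and randomForest)
library("DALEX")
library("randomForest")

set.seed(1)
rf <- randomForest(factor(survived) ~ ., data = titanic_imputed)

# Wrap the model together with the data and the true outcome
explainer <- explain(rf,
                     data  = titanic_imputed[, colnames(titanic_imputed) != "survived"],
                     y     = titanic_imputed$survived,
                     label = "random forest")

# Model examination: performance measures computed from residuals
model_performance(explainer)
```
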
## Ethics requirements for any predictive model {-}
1. **Prediction’s validation**: One should be able to verify how strong the evidence is that supports the prediction.
2. **Prediction’s justification**: One should be able to understand which variables affect the prediction and to what extent.
3. **Prediction’s speculation**: One should be able to understand how the prediction would change if the values of the variables included in the model changed.
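
These three requirements can be mapped onto concrete function calls. A minimal sketch, again assuming the `DALEX` package and its `titanic_imputed` data (the function names are DALEX's, not terms from this chapter):

```r
library("DALEX")

X <- titanic_imputed[, colnames(titanic_imputed) != "survived"]
y <- titanic_imputed$survived

model     <- glm(survived ~ gender + age + class + fare,
                 data = titanic_imputed, family = "binomial")
explainer <- explain(model, data = X, y = y, label = "logistic regression")
passenger <- X[1, ]

# 1. Validation: how strong is the evidence supporting the predictions?
model_performance(explainer)

# 2. Justification: which variables affect this prediction, and by how much?
predict_parts(explainer, new_observation = passenger)

# 3. Speculation: how would the prediction change if a variable changed?
predict_profile(explainer, new_observation = passenger)
```
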
## Terminology {-}
|**Term**|**Equivalent terms**|
|:------:|:------------------:|
|Independent variables|Explanatory variables <br> Predictors <br> Covariates <br> Input variables <br> Features|
|Dependent variable|Predicted <br> Response <br> Output <br> Target|
|Observations|Instances <br> Cases|
|Fit|Trained|
|Coefficients|Parameters|
## Terminology {-}
- Levels of explanation methods:
- **Instance-level**: They are designed to extract information about the behaviour of a model related to a specific observation (or instance).
- **Dataset-level**: They allow obtaining information about the behaviour of the model for an entire dataset.
- Types of dependent variables:
- Categorical
- Continuous
- Numbers with an ordering ( _age_ or _number of children_ )
- Counts ( _number of floors_ or _number of steps_ )
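
In code, the two levels correspond to two families of functions. A sketch using DALEX naming conventions (the same assumed package and data as in the earlier sketches):

```r
library("DALEX")

X <- titanic_imputed[, colnames(titanic_imputed) != "survived"]
model     <- glm(survived ~ gender + age + class + fare,
                 data = titanic_imputed, family = "binomial")
explainer <- explain(model, data = X, y = titanic_imputed$survived, verbose = FALSE)

# Instance-level: the model's behaviour for one observation
predict_parts(explainer, new_observation = X[1, ])

# Dataset-level: the model's behaviour over the entire dataset
model_parts(explainer)
```
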
## Glass-box {-}
It is a model that is easy to understand, as it has:
- A simple structure, for example:
  - Decision or regression trees
  - Linear models
- A limited number of coefficients _(typically not much more than 10)_
![](img/00-welcome/03-tree-model.png){width=37.5% height=60% fig-align="center"}
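
A sketch of two glass-box models on the built-in `iris` data (the `rpart` package is an assumption): both can be printed and read directly.

```r
# A small decision tree: the printed output is the full decision structure
library("rpart")
tree <- rpart(Sepal.Length ~ ., data = iris)
tree

# A linear model with only a handful of coefficients
lin <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
coef(lin)
```
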
## Glass-box benefits {-}
- We can see which explanatory **variables are included** in the model and which **are not**.
- We can easily link **changes in the model's predictions** with changes in particular explanatory variables.
- We can challenge the model on the grounds of domain knowledge.
## Model-specific {-}
- For **linear models** and **generalized linear models**
- Q-Q plots
- Diagnostic plots
- Tests for model structure
- Tools for identification of outliers
- For **random forest** and **boosting**
- Out-of-bag method of evaluating performance
- Variable importance
- Possible interactions
- For **neural networks**
- Layer-wise relevance propagation technique
- Saliency maps technique
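
A sketch of two of these model-specific toolkits: base R diagnostics for a linear model, and the `randomForest` package's out-of-bag error and variable importance (`titanic_imputed` from `DALEX` is used only as example data; none of this is prescribed by the chapter).

```r
library("randomForest")
library("DALEX")   # only for the titanic_imputed example data

# Linear model: classical diagnostic plots (residuals, Q-Q plot, leverage)
lin <- lm(fare ~ age + sibsp + parch, data = titanic_imputed)
plot(lin)

# Random forest: out-of-bag performance and variable importance
set.seed(1)
rf <- randomForest(factor(survived) ~ ., data = titanic_imputed, ntree = 500)
rf                 # the printout includes the out-of-bag error estimate
importance(rf)     # model-specific variable-importance measure
varImpPlot(rf)
```
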
## Model-specific Limitations {-}
- One cannot easily **compare explanations for two models** with different structures.
![](img/00-welcome/04-decision-meme.webp){width=60% height=40% fig-align="center"}
## Model-specific Limitations {-}
- We need to create new methods for each new model proposed.
![](img/00-welcome/05-never-ending-work.jpg){width=40% height=60% fig-align="center"}
## Model-agnostic techniques {-}
- We don't assume anything about the model structure.
- We only assume that the model operates on a p-dimensional vector of explanatory variables/features and that, for a single observation, it **returns a single value (score/probability)**, which is a real number.
![](img/00-welcome/06-win-dance-meme.png){width=40% height=80% fig-align="center"}
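
In practice, all a model-agnostic method needs is a uniform prediction interface: a function that maps a model and a data frame of explanatory variables to one real-valued score per observation. A sketch (the wrapper names below are hypothetical, mirroring the `predict_function` idea used by packages such as DALEX):

```r
library("randomForest")
library("DALEX")   # only for the titanic_imputed example data

lr <- glm(survived ~ gender + age + fare, data = titanic_imputed, family = "binomial")
rf <- randomForest(factor(survived) ~ ., data = titanic_imputed)

# A common interface: (model, data) -> one real-valued score per row
score_lr <- function(model, newdata) predict(model, newdata, type = "response")
score_rf <- function(model, newdata) predict(model, newdata, type = "prob")[, 2]

# The same downstream exploration code can now treat both models identically
score_lr(lr, titanic_imputed[1, ])
score_rf(rf, titanic_imputed[1, ])
```
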
## Meeting Videos {-}
### Cohort 1 {-}
`r knitr::include_url("https://www.youtube.com/embed/URL")`
<details>
<summary> Meeting chat log </summary>
```
LOG
```
</details>