Commit 7c5ea47

add nested model bias warning
1 parent 96b7456 commit 7c5ea47

2 files changed: +525 -0 lines changed
Lines changed: 343 additions & 0 deletions
# [`vtreat`](https://github.com/WinVector/pyvtreat) Nested Model Bias Warning

For quite a while we have been teaching that estimating variable re-encodings on the exact same data that is later *naively* used to train a model leads to an undesirable nested model bias. The `vtreat` package (both the [`R` version](https://github.com/WinVector/vtreat) and the [`Python` version](https://github.com/WinVector/pyvtreat)) incorporates a cross-frame method that allows one to use all the training data both to learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent [PyData LA talk](http://www.win-vector.com/blog/2019/12/pydata-los-angeles-2019-talk-preparing-messy-real-world-data-for-supervised-machine-learning/)).

The next version of `vtreat` will warn the user if they have improperly used the same data for both `vtreat` impact code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks for and warns against this situation.
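To see why the naive pattern is biased, here is a minimal standalone sketch of impact coding a pure-noise categorical variable: first naively on the same rows, then out-of-fold in the style of a cross-frame. This is our illustration, not `vtreat`'s internals; the variable names and the simple two-fold split are ours.

```python
import numpy
import pandas

numpy.random.seed(2019)
n = 500
d = pandas.DataFrame({
    'noise_cat': numpy.random.choice(['lev_%d' % i for i in range(100)], size=n),
    'y': numpy.random.normal(size=n),  # pure noise: no real relationship
})

# naive impact code: per-level mean of y, fit and applied on the same rows
naive_code = d.groupby('noise_cat')['y'].transform('mean')
print(numpy.corrcoef(naive_code, d['y'])[0, 1])  # large spurious correlation

# out-of-fold impact code: each row is encoded with means from the other fold
fold = numpy.arange(n) % 2
cross_code = numpy.zeros(n)
for f in (0, 1):
    means = d.loc[fold != f].groupby('noise_cat')['y'].mean()
    cross_code[fold == f] = d.loc[fold == f, 'noise_cat'].map(means).fillna(0.0)
print(numpy.corrcoef(cross_code, d['y'])[0, 1])  # near zero, as it should be
```

The naive encoding makes a useless variable look strongly predictive; that is exactly the leak the new warning is designed to catch.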
## Set up the Example

This example is copied from [some of our classification documentation](https://github.com/WinVector/pyvtreat/blob/master/Examples/Classification/Classification.md).

Load modules/packages.
```python
import pkg_resources
import pandas
import numpy
import numpy.random
import vtreat
import vtreat.util

numpy.random.seed(2019)
```
60+
{
61+
"cell_type": "markdown",
62+
"metadata": {},
63+
"source": [
64+
"Generate example data. \n",
65+
"\n",
66+
"* `y` is a noisy sinusoidal function of the variable `x`\n",
67+
"* `yc` is the output to be predicted: : whether `y` is > 0.5. \n",
68+
"* Input `xc` is a categorical variable that represents a discretization of `y`, along some `NaN`s\n",
69+
"* Input `x2` is a pure noise variable with no relationship to the output"
70+
]
71+
},
72+
{
73+
"cell_type": "code",
74+
"execution_count": 2,
75+
"metadata": {
76+
"collapsed": false,
77+
"jupyter": {
78+
"outputs_hidden": false
79+
},
80+
"pycharm": {
81+
"name": "#%%\n"
82+
}
83+
},
84+
"outputs": [
85+
{
86+
"data": {
87+
"text/html": [
88+
"<div>\n",
89+
"<style scoped>\n",
90+
" .dataframe tbody tr th:only-of-type {\n",
91+
" vertical-align: middle;\n",
92+
" }\n",
93+
"\n",
94+
" .dataframe tbody tr th {\n",
95+
" vertical-align: top;\n",
96+
" }\n",
97+
"\n",
98+
" .dataframe thead th {\n",
99+
" text-align: right;\n",
100+
" }\n",
101+
"</style>\n",
102+
"<table border=\"1\" class=\"dataframe\">\n",
103+
" <thead>\n",
104+
" <tr style=\"text-align: right;\">\n",
105+
" <th></th>\n",
106+
" <th>x</th>\n",
107+
" <th>y</th>\n",
108+
" <th>xc</th>\n",
109+
" <th>x2</th>\n",
110+
" <th>yc</th>\n",
111+
" </tr>\n",
112+
" </thead>\n",
113+
" <tbody>\n",
114+
" <tr>\n",
115+
" <th>0</th>\n",
116+
" <td>-1.088395</td>\n",
117+
" <td>-0.956311</td>\n",
118+
" <td>NaN</td>\n",
119+
" <td>-1.424184</td>\n",
120+
" <td>False</td>\n",
121+
" </tr>\n",
122+
" <tr>\n",
123+
" <th>1</th>\n",
124+
" <td>4.107277</td>\n",
125+
" <td>-0.671564</td>\n",
126+
" <td>level_-0.5</td>\n",
127+
" <td>0.427360</td>\n",
128+
" <td>False</td>\n",
129+
" </tr>\n",
130+
" <tr>\n",
131+
" <th>2</th>\n",
132+
" <td>7.406389</td>\n",
133+
" <td>0.906303</td>\n",
134+
" <td>level_1.0</td>\n",
135+
" <td>0.668849</td>\n",
136+
" <td>True</td>\n",
137+
" </tr>\n",
138+
" <tr>\n",
139+
" <th>3</th>\n",
140+
" <td>NaN</td>\n",
141+
" <td>0.222792</td>\n",
142+
" <td>level_0.0</td>\n",
143+
" <td>-0.015787</td>\n",
144+
" <td>False</td>\n",
145+
" </tr>\n",
146+
" <tr>\n",
147+
" <th>4</th>\n",
148+
" <td>NaN</td>\n",
149+
" <td>-0.975431</td>\n",
150+
" <td>NaN</td>\n",
151+
" <td>-0.491017</td>\n",
152+
" <td>False</td>\n",
153+
" </tr>\n",
154+
" </tbody>\n",
155+
"</table>\n",
156+
"</div>"
157+
],
158+
"text/plain": [
159+
" x y xc x2 yc\n",
160+
"0 -1.088395 -0.956311 NaN -1.424184 False\n",
161+
"1 4.107277 -0.671564 level_-0.5 0.427360 False\n",
162+
"2 7.406389 0.906303 level_1.0 0.668849 True\n",
163+
"3 NaN 0.222792 level_0.0 -0.015787 False\n",
164+
"4 NaN -0.975431 NaN -0.491017 False"
165+
]
166+
},
167+
"execution_count": 2,
168+
"metadata": {},
169+
"output_type": "execute_result"
170+
}
171+
],
172+
"source": [
173+
"def make_data(nrows):\n",
174+
" d = pandas.DataFrame({'x': 5*numpy.random.normal(size=nrows)})\n",
175+
" d['y'] = numpy.sin(d['x']) + 0.1*numpy.random.normal(size=nrows)\n",
176+
" d.loc[numpy.arange(3, 10), 'x'] = numpy.nan # introduce a nan level\n",
177+
" d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]\n",
178+
" d['x2'] = numpy.random.normal(size=nrows)\n",
179+
" d.loc[d['xc']=='level_-1.0', 'xc'] = numpy.nan # introduce a nan level\n",
180+
" d['yc'] = d['y']>0.5\n",
181+
" return d\n",
182+
"\n",
183+
"training_data = make_data(500)\n",
184+
"\n",
185+
"training_data.head()"
186+
]
187+
},
```python
outcome_name = 'yc'      # outcome variable / column
outcome_target = True    # value we consider positive
```
## Demonstrate the Warning

Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or `NA`s.

First create the data treatment transform design object, in this case a treatment for a binomial classification problem.

We use the training data `training_data` to fit the transform and to return a treated training set: completely numeric, with no missing values.
216+
{
217+
"cell_type": "code",
218+
"execution_count": 4,
219+
"metadata": {
220+
"collapsed": false,
221+
"jupyter": {
222+
"outputs_hidden": false
223+
},
224+
"pycharm": {
225+
"name": "#%%\n"
226+
}
227+
},
228+
"outputs": [],
229+
"source": [
230+
"treatment = vtreat.BinomialOutcomeTreatment(\n",
231+
" outcome_name=outcome_name, # outcome variable\n",
232+
" outcome_target=outcome_target, # outcome of interest\n",
233+
" cols_to_copy=['y'], # columns to \"carry along\" but not treat as input variables\n",
234+
") "
235+
]
236+
},
```python
d_prepared = treatment.fit_transform(training_data, training_data['yc'])
```
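As an aside (our sketch, not part of the original example), the cross-frame returned above can feed a downstream model directly. We assume `scikit-learn` is available and that your `vtreat` version exposes the `score_frame_` attribute with `variable` and `recommended` columns; check your installed version.

```python
# a minimal sketch, assuming scikit-learn is installed
from sklearn.linear_model import LogisticRegression

# pick the treated variables vtreat marks as recommended (assumed API)
score_frame = treatment.score_frame_
model_vars = list(score_frame.loc[score_frame['recommended'], 'variable'])

# the outcome column is carried into the prepared frame, as above
clf = LogisticRegression(max_iter=1000)
clf.fit(d_prepared[model_vars], d_prepared[outcome_name])
```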
`d_prepared` is the correct way to use the same training data for inferring the impact-coded variables.

We prepare new test or application data as follows.
```python
test_data = make_data(100)

test_prepared = treatment.transform(test_data)
```
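Continuing our sketch from above (again an illustration, not part of the original example): the model fit on the cross-frame applies to the prepared test frame without refitting the treatment.

```python
# score the held-out data with the model trained on the cross-frame
test_scores = clf.predict_proba(test_prepared[model_vars])[:, 1]
```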
The issue is: for training data we should not call `transform()`, but instead use the value returned by `.fit_transform()`.

The point is that we should not do the following:
```python
train_prepared_wrong = treatment.transform(training_data)
```

```
/Users/johnmount/opt/anaconda3/envs/ai_academy_3_7/lib/python3.7/site-packages/vtreat/vtreat_api.py:370: UserWarning: possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead)
  "possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead)")
```
Notice we now get a warning that we should not have done this: by doing so we may have introduced a nested model bias data leak.

And that is the new nested model bias warning feature.
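If you want this mistake to fail loudly in an automated pipeline, standard Python warning controls can escalate it. This is a sketch using only the standard library; note from the output above that the warning is raised as a plain `UserWarning`.

```python
import warnings

with warnings.catch_warnings():
    # escalate UserWarning to an exception inside this block
    warnings.simplefilter('error', category=UserWarning)
    # the naive call would now raise instead of merely warning:
    # train_prepared_wrong = treatment.transform(training_data)
```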
"The `R`-version of this document can be found [here](https://github.com/WinVector/vtreat/blob/master/Examples/Classification/ClassificationWarningExample.md)."
303+
]
304+
},
