WinVector
diff --git a/‎Examples/Classification/ClassificationWarningExample.ipynb‎
Lines changed: 343 additions & 0 deletions b/‎Examples/Classification/ClassificationWarningExample.ipynb‎
Lines changed: 343 additions & 0 deletions
@@ -0,0 +1,343 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# [`vtreat`](https://github.com/WinVector/pyvtreat) Nested Model Bias Warning\n",
+    "\n",
+    "For quite a while we have been teaching estimating variable re-encodings on the exact same data they\n",
+    "are later *naively* using to train a model on leads to an undesirable nested model bias.  The `vtreat`\n",
+    "package (both the [`R` version](https://github.com/WinVector/vtreat) and \n",
+    "[`Python` version](https://github.com/WinVector/pyvtreat)) both incorporate a cross-frame method\n",
+    "that allows one to use all the training data both to build learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent [PyData LA talk](http://www.win-vector.com/blog/2019/12/pydata-los-angeles-2019-talk-preparing-messy-real-world-data-for-supervised-machine-learning/)).\n",
+    "\n",
+    "The next version of `vtreat` will warn the user if they have improperly used the same data for both `vtreat` impact code inference and downstream modeling.  So in addition to us warning you not to do this, the package now also checks and warns against this situation.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Set up the Example"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "This example is copied from [some of our classification documentation](https://github.com/WinVector/pyvtreat/blob/master/Examples/Classification/Classification.md).\n",
+    "\n",
+    "\n",
+    "Load modules/packages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pkg_resources\n",
+    "import pandas\n",
+    "import numpy\n",
+    "import numpy.random\n",
+    "import vtreat\n",
+    "import vtreat.util\n",
+    "\n",
+    "numpy.random.seed(2019)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Generate example data. \n",
+    "\n",
+    "* `y` is a noisy sinusoidal function of the variable `x`\n",
+    "* `yc` is the output to be predicted: : whether `y` is > 0.5. \n",
+    "* Input `xc` is a categorical variable that represents a discretization of `y`, along some `NaN`s\n",
+    "* Input `x2` is a pure noise variable with no relationship to the output"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/html": [
+       "<div>\n",
+       "<style scoped>\n",
+       "    .dataframe tbody tr th:only-of-type {\n",
+       "        vertical-align: middle;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe tbody tr th {\n",
+       "        vertical-align: top;\n",
+       "    }\n",
+       "\n",
+       "    .dataframe thead th {\n",
+       "        text-align: right;\n",
+       "    }\n",
+       "</style>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\">\n",
+       "      <th></th>\n",
+       "      <th>x</th>\n",
+       "      <th>y</th>\n",
+       "      <th>xc</th>\n",
+       "      <th>x2</th>\n",
+       "      <th>yc</th>\n",
+       "    </tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr>\n",
+       "      <th>0</th>\n",
+       "      <td>-1.088395</td>\n",
+       "      <td>-0.956311</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>-1.424184</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>1</th>\n",
+       "      <td>4.107277</td>\n",
+       "      <td>-0.671564</td>\n",
+       "      <td>level_-0.5</td>\n",
+       "      <td>0.427360</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>2</th>\n",
+       "      <td>7.406389</td>\n",
+       "      <td>0.906303</td>\n",
+       "      <td>level_1.0</td>\n",
+       "      <td>0.668849</td>\n",
+       "      <td>True</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>3</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>0.222792</td>\n",
+       "      <td>level_0.0</td>\n",
+       "      <td>-0.015787</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "    <tr>\n",
+       "      <th>4</th>\n",
+       "      <td>NaN</td>\n",
+       "      <td>-0.975431</td>\n",
+       "      <td>NaN</td>\n",
+       "      <td>-0.491017</td>\n",
+       "      <td>False</td>\n",
+       "    </tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+      ],
+      "text/plain": [
+       "          x         y          xc        x2     yc\n",
+       "0 -1.088395 -0.956311         NaN -1.424184  False\n",
+       "1  4.107277 -0.671564  level_-0.5  0.427360  False\n",
+       "2  7.406389  0.906303   level_1.0  0.668849   True\n",
+       "3       NaN  0.222792   level_0.0 -0.015787  False\n",
+       "4       NaN -0.975431         NaN -0.491017  False"
+      ]
+     },
+     "execution_count": 2,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "def make_data(nrows):\n",
+    "    d = pandas.DataFrame({'x': 5*numpy.random.normal(size=nrows)})\n",
+    "    d['y'] = numpy.sin(d['x']) + 0.1*numpy.random.normal(size=nrows)\n",
+    "    d.loc[numpy.arange(3, 10), 'x'] = numpy.nan                           # introduce a nan level\n",
+    "    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]\n",
+    "    d['x2'] = numpy.random.normal(size=nrows)\n",
+    "    d.loc[d['xc']=='level_-1.0', 'xc'] = numpy.nan  # introduce a nan level\n",
+    "    d['yc'] = d['y']>0.5\n",
+    "    return d\n",
+    "\n",
+    "training_data = make_data(500)\n",
+    "\n",
+    "training_data.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "outcome_name = 'yc'    # outcome variable / column\n",
+    "outcome_target = True  # value we consider positive"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Demonstrate the Warning"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or `NA`s.\n",
+    "\n",
+    "First create the data treatment transform design object, in this case a treatment for a binomial classification problem.\n",
+    "\n",
+    "We use the training data `training_data` to fit the transform and the return a treated training set: completely numeric, with no missing values."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "collapsed": false,
+    "jupyter": {
+     "outputs_hidden": false
+    },
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "treatment = vtreat.BinomialOutcomeTreatment(\n",
+    "    outcome_name=outcome_name,      # outcome variable\n",
+    "    outcome_target=outcome_target,  # outcome of interest\n",
+    "    cols_to_copy=['y'],  # columns to \"carry along\" but not treat as input variables\n",
+    ")  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "test_prepared = treatment.fit_transform(training_data, training_data['yc'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`d_prepared` is the correct way to use the same training data for inferring the impact-coded variables.\n",
+    "\n",
+    "We prepare new test or application data as follows."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "test_data = make_data(100)\n",
+    "\n",
+    "test_prepared = treatment.transform(test_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The issue is: for training data we should not call `transform()`, but instead use the value returned by `.fit_transform()`.\n",
+    "\n",
+    "The point is we should not do the following:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/johnmount/opt/anaconda3/envs/ai_academy_3_7/lib/python3.7/site-packages/vtreat/vtreat_api.py:370: UserWarning: possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead)\n",
+      "  \"possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead)\")\n"
+     ]
+    }
+   ],
+   "source": [
+    "train_prepared_wrong = treatment.transform(training_data)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "Notice we now get a warning that we should not have done this, and in doing so we may have a nested model bias data leak.\n",
+    "\n",
+    "And that is the new nested model bias warning feature.\n",
+    "\n",
+    "The `R`-version of this document can be found [here](https://github.com/WinVector/vtreat/blob/master/Examples/Classification/ClassificationWarningExample.md)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.5"
+  },
+  "pycharm": {
+   "stem_cell": {
+    "cell_type": "raw",
+    "metadata": {
+     "collapsed": false
+    },
+    "source": []
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}