switch default cross-val to y-stratified
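
For readers unfamiliar with the term, the following is a minimal sketch of what a `y`-stratified `k`-way split does. It is not the code behind `KWayCrossPlanYStratified`. The function name `k_way_y_stratified_plan`, its `seed` argument, and the list-of-dicts return layout with `train` and `app` keys are assumptions made for this illustration. The idea is to sort rows by outcome, then deal each consecutive block of `k` rows across the `k` folds in random order, so every fold sees a similar distribution of `y`.

```python
import random


def k_way_y_stratified_plan(y, k, seed=None):
    """Illustrative only: build k folds whose y-distributions are similar.

    Returns a list of k dicts, each holding 'train' and 'app' row-index lists
    (a hypothetical layout chosen for this sketch).
    """
    rng = random.Random(seed)
    n = len(y)
    # Order row indices by outcome so neighbours have similar y values.
    order = sorted(range(n), key=lambda i: y[i])
    fold_of = [0] * n
    # Deal each block of k neighbouring rows across the folds at random.
    for start in range(0, n, k):
        block = order[start:start + k]
        folds = list(range(k))
        rng.shuffle(folds)
        for row, fold in zip(block, folds):
            fold_of[row] = fold
    # Each fold is applied to its own rows and trained on all the others.
    return [
        {
            'train': [i for i in range(n) if fold_of[i] != fold],
            'app': [i for i in range(n) if fold_of[i] == fold],
        }
        for fold in range(k)
    ]


# 15 negative and 5 positive outcomes, split 5 ways.
y_example = [0] * 15 + [1] * 5
plan_example = k_way_y_stratified_plan(y_example, k=5, seed=0)
positives_per_fold = [sum(y_example[i] for i in f['app']) for f in plan_example]
```

With a rare positive class, an unstratified split can leave some folds with no positives at all. In this sketch each application fold receives exactly one of the five positive rows, whatever the seed.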

JohnMount · JohnMount · commit 84fd0bf8080e · 2019-10-04T20:11:28.000-07:00
diff --git a/Examples/CustomizedCrossPlan/CustomizedCrossPlan.ipynb b/Examples/CustomizedCrossPlan/CustomizedCrossPlan.ipynb
@@ -159,7 +159,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "First, try preparing this data using `vtreat`. `Python` `vtreat` defaults to a simple `k`-fold cross validation plan."
+    "First, try preparing this data using `vtreat`.\n",
+    "\n",
+    "By default, `Python` `vtreat` uses `y`-stratified randomized `k`-way cross validation when creating and evaluating complex synthetic variables.\n",
+    "\n",
+    "Here we start with a simple `k`-way cross validation plan. This will work well for the majority of applications. However, there may be times when you need a more specialized cross validation scheme for your modeling projects. In this document, we'll show how to replace the cross validation scheme in `vtreat`."
    ]
   },
   {
@@ -178,6 +182,7 @@
     "    outcome_name='y',\n",
     "    outcome_target=1,\n",
     "    params=vtreat.vtreat_parameters({\n",
+    "        'cross_validation_plan': vtreat.cross_plan.KWayCrossPlan(),\n",
     "        'cross_validation_k': k\n",
     "    })\n",
     ")\n",
diff --git a/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md b/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md
@@ -108,7 +108,11 @@ d.describe()
 
 
 
-First, try preparing this data using `vtreat`. `Python` `vtreat` defaults to a simple `k`-fold cross validation plan.
+First, try preparing this data using `vtreat`.
+
+By default, `Python` `vtreat` uses `y`-stratified randomized `k`-way cross validation when creating and evaluating complex synthetic variables.
+
+Here we start with a simple `k`-way cross validation plan. This will work well for the majority of applications. However, there may be times when you need a more specialized cross validation scheme for your modeling projects. In this document, we'll show how to replace the cross validation scheme in `vtreat`.
 
 
 ```python
@@ -122,6 +126,7 @@ treatment_unstratified = vtreat.BinomialOutcomeTreatment(
     outcome_name='y',
     outcome_target=1,
     params=vtreat.vtreat_parameters({
+        'cross_validation_plan': vtreat.cross_plan.KWayCrossPlan(),
         'cross_validation_k': k
     })
 )
diff --git a/README.md b/README.md
@@ -12,7 +12,7 @@ in a statistically sound manner.
 Install `vtreat` with either of:
 
   * `pip install vtreat`
-  * `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.2.9.tar.gz`
+  * `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.3.0.tar.gz`
 
 # Details
 
@@ -1225,7 +1225,9 @@ To install, please run:
 pip install vtreat
 ```
 
+Some notes on controlling `vtreat` cross-validation can be found [here](https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md).
+
 ## Note on data types.
 
-`.fit_transform()` expects the first argument to be a `pandas.DataFrame` with trivial row-indexing, (i.e. `.reset_index(inplace=True, drop=True)`) and the second to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
+`.fit_transform()` expects the first argument to be a `pandas.DataFrame` with scalar column names and trivial row-indexing (i.e. `.reset_index(inplace=True, drop=True)`), and the second to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
 
diff --git a/coverage.txt b/coverage.txt
@@ -16,13 +16,13 @@ pkg/tests/test_util.py .                                                 [100%]
 Name                        Stmts   Miss  Cover
 -----------------------------------------------
 pkg/vtreat/__init__.py          7      0   100%
-pkg/vtreat/cross_plan.py       94     62    34%
+pkg/vtreat/cross_plan.py       94     52    45%
 pkg/vtreat/transform.py        13      8    38%
 pkg/vtreat/util.py             80      8    90%
 pkg/vtreat/vtreat_api.py      209     61    71%
 pkg/vtreat/vtreat_impl.py     467     80    83%
 -----------------------------------------------
-TOTAL                         870    219    75%
+TOTAL                         870    209    76%
 
 
-=========================== 7 passed in 6.12 seconds ===========================
+=========================== 7 passed in 6.14 seconds ===========================
diff --git a/pkg/README.md b/pkg/README.md
@@ -12,7 +12,7 @@ in a statistically sound manner.
 Install `vtreat` with either of:
 
   * `pip install vtreat`
-  * `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.2.9.tar.gz`
+  * `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.3.0.tar.gz`
 
 # Details
 
@@ -1225,7 +1225,9 @@ To install, please run:
 pip install vtreat
 ```
 
+Some notes on controlling `vtreat` cross-validation can be found [here](https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md).
+
 ## Note on data types.
 
-`.fit_transform()` expects the first argument to be a `pandas.DataFrame` with trivial row-indexing, (i.e. `.reset_index(inplace=True, drop=True)`) and the second to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
+`.fit_transform()` expects the first argument to be a `pandas.DataFrame` with scalar column names and trivial row-indexing (i.e. `.reset_index(inplace=True, drop=True)`), and the second to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
 
diff --git a/pkg/build/lib/vtreat/__init__.py b/pkg/build/lib/vtreat/__init__.py
@@ -10,7 +10,7 @@
 from vtreat.vtreat_api import *
 
 __docformat__ = "restructuredtext"
-__version__ = "0.2.9"
+__version__ = "0.3.0"
 
 __doc__ = """
 This<https://github.com/WinVector/pyvtreat> is the Python version of the vtreat data preparation system
diff --git a/pkg/build/lib/vtreat/vtreat_api.py b/pkg/build/lib/vtreat/vtreat_api.py
@@ -22,7 +22,7 @@ def vtreat_parameters(user_params=None):
         },
         "filter_to_recommended": True,
         "indicator_min_fraction": 0.1,
-        "cross_validation_plan": vtreat.cross_plan.KWayCrossPlan(),
+        "cross_validation_plan": vtreat.cross_plan.KWayCrossPlanYStratified(),
         "cross_validation_k": 5,
         "user_transforms": [],
         "sparse_indicators": True,
diff --git a/pkg/dist/vtreat-0.2.9.tar.gz b/pkg/dist/vtreat-0.2.9.tar.gz
diff --git a/pkg/dist/vtreat-0.3.0-py3-none-any.whl b/pkg/dist/vtreat-0.3.0-py3-none-any.whl
diff --git a/pkg/dist/vtreat-0.3.0.tar.gz b/pkg/dist/vtreat-0.3.0.tar.gz
diff --git a/pkg/setup.py b/pkg/setup.py
@@ -52,7 +52,7 @@
 
 setuptools.setup(
     name='vtreat',
-    version='0.2.9',
+    version='0.3.0',
     author='John Mount, Nina Zumel',
     author_email='jmount@win-vector.com',
     url='https://github.com/WinVector/pyvtreat',
diff --git a/pkg/vtreat.egg-info/PKG-INFO b/pkg/vtreat.egg-info/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: vtreat
-Version: 0.2.9
+Version: 0.3.0
 Summary: vtreat is a pandas.DataFrame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. 
 Home-page: https://github.com/WinVector/pyvtreat
 Author: John Mount, Nina Zumel
diff --git a/pkg/vtreat/__init__.py b/pkg/vtreat/__init__.py
@@ -10,7 +10,7 @@
 from vtreat.vtreat_api import *
 
 __docformat__ = "restructuredtext"
-__version__ = "0.2.9"
+__version__ = "0.3.0"
 
 __doc__ = """
 This<https://github.com/WinVector/pyvtreat> is the Python version of the vtreat data preparation system
diff --git a/pkg/vtreat/vtreat_api.py b/pkg/vtreat/vtreat_api.py
@@ -22,7 +22,7 @@ def vtreat_parameters(user_params=None):
         },
         "filter_to_recommended": True,
         "indicator_min_fraction": 0.1,
-        "cross_validation_plan": vtreat.cross_plan.KWayCrossPlan(),
+        "cross_validation_plan": vtreat.cross_plan.KWayCrossPlanYStratified(),
         "cross_validation_k": 5,
         "user_transforms": [],
         "sparse_indicators": True,