Commit 84fd0bf

switch default cross-val to y-stratified
1 parent 2f85c27 commit 84fd0bf

14 files changed: 29 additions and 15 deletions

Examples/CustomizedCrossPlan/CustomizedCrossPlan.ipynb

Lines changed: 6 additions & 1 deletion
@@ -159,7 +159,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"First, try preparing this data using `vtreat`. `Python` `vtreat` defaults to a simple `k`-fold cross validation plan."
+"First, try preparing this data using `vtreat`.\n",
+"\n",
+"By default, `Python` `vtreat` uses a `y`-stratified randomized k-way cross validation when creating and evaluating complex synthetic variables. \n",
+"\n",
+"Here we start with a simple `k`-way cross validation plan. This will work well for the majority of applications. However, there may be times when you need a more specialized cross validation scheme for your modeling projects. In this document, we'll show how to replace the cross validation scheme in `vtreat`."
 ]
 },
 {
@@ -178,6 +182,7 @@
 " outcome_name='y',\n",
 " outcome_target=1,\n",
 " params=vtreat.vtreat_parameters({\n",
+" 'cross_validation_plan': vtreat.cross_plan.KWayCrossPlan(),\n",
 " 'cross_validation_k': k\n",
 " })\n",
 ")\n",

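As a rough illustration of the difference between the plain k-way plan and the `y`-stratified plan named in this change, the sketch below builds both kinds of split using only the standard library. It is not `vtreat`'s implementation: the function names `k_way_plan` and `k_way_plan_y_stratified`, and the list-of-dicts return shape with `train` and `app` keys, are assumptions made for this example.

```python
import random


def _folds_to_plan(ordered_idx, n_rows, k):
    """Deal row indices round-robin into k folds and build train/app splits."""
    folds = [ordered_idx[j::k] for j in range(k)]
    plan = []
    for app in folds:
        held_out = set(app)
        train = [i for i in range(n_rows) if i not in held_out]
        plan.append({"train": train, "app": sorted(app)})
    return plan


def k_way_plan(n_rows, k, seed=0):
    """Plain k-way plan: rows are shuffled, then dealt into k folds."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    return _folds_to_plan(idx, n_rows, k)


def k_way_plan_y_stratified(y, k, seed=0):
    """Stratified k-way plan (illustrative, not vtreat's code).

    Rows are ordered by y, with ties in random order, and then dealt
    round-robin, so each fold receives a similar mix of outcome values.
    """
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)              # randomize the order of tied rows
    idx.sort(key=lambda i: y[i])  # stable sort keeps the shuffled tie order
    return _folds_to_plan(idx, len(y), k)


# A rare positive outcome: 6 positives among 60 rows.
y = [1] * 6 + [0] * 54
plan = k_way_plan_y_stratified(y, k=3)
positives_per_fold = [sum(y[i] for i in split["app"]) for split in plan]
```

With the stratified plan each application fold receives the same number of positive rows, whereas the plain shuffled plan offers no such guarantee when the outcome is rare.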
Examples/CustomizedCrossPlan/CustomizedCrossPlan.md

Lines changed: 6 additions & 1 deletion
@@ -108,7 +108,11 @@ d.describe()
 
 
 
-First, try preparing this data using `vtreat`. `Python` `vtreat` defaults to a simple `k`-fold cross validation plan.
+First, try preparing this data using `vtreat`.
+
+By default, `Python` `vtreat` uses a `y`-stratified randomized k-way cross validation when creating and evaluating complex synthetic variables.
+
+Here we start with a simple `k`-way cross validation plan. This will work well for the majority of applications. However, there may be times when you need a more specialized cross validation scheme for your modeling projects. In this document, we'll show how to replace the cross validation scheme in `vtreat`.
 
 
 ```python
@@ -122,6 +126,7 @@ treatment_unstratified = vtreat.BinomialOutcomeTreatment(
 outcome_name='y',
 outcome_target=1,
 params=vtreat.vtreat_parameters({
+'cross_validation_plan': vtreat.cross_plan.KWayCrossPlan(),
 'cross_validation_k': k
 })
 )

README.md

Lines changed: 4 additions & 2 deletions
@@ -12,7 +12,7 @@ in a statistically sound manner.
 Install `vtreat` with either of:
 
 * `pip install vtreat`
-* `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.2.9.tar.gz`
+* `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.3.0.tar.gz`
 
 # Details
 
@@ -1225,7 +1225,9 @@ To install, please run:
 pip install vtreat
 ```
 
+Some notes on controlling `vtreat` cross-validation can be found [here](https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md).
+
 ## Note on data types.
 
-`.fit_transform()` expects the first argument to be a `pandas.DataFrame` with trivial row-indexing, (i.e. `.reset_index(inplace=True, drop=True)`) and the second to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
+`.fit_transform()` expects the first argument to be a `pandas.DataFrame` with trivial row-indexing and scalar column names, (i.e. `.reset_index(inplace=True, drop=True)`) and the second to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
 

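The data-type note changed in this README hunk can be read as a small checklist. The sketch below writes that checklist out with `pandas`; `check_fit_transform_inputs` is a hypothetical helper invented for this example and is not part of `vtreat`. Treating "scalar column names" as "no tuple-valued (`MultiIndex`) column labels" is an assumption about what the note means.

```python
import pandas as pd


def check_fit_transform_inputs(X, y):
    """Hypothetical pre-flight check for the README note; not a vtreat API."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X must be a pandas.DataFrame")
    # "Trivial row-indexing": the index is exactly 0, 1, ..., n-1.
    if not X.index.equals(pd.RangeIndex(len(X))):
        raise ValueError("call X.reset_index(inplace=True, drop=True) first")
    # Assumed reading of "scalar column names": no MultiIndex columns.
    if isinstance(X.columns, pd.MultiIndex):
        raise ValueError("column names must be scalars, not tuples")
    # y must be vector-like with one entry per row of X.
    if len(y) != X.shape[0]:
        raise ValueError("len(y) must equal the number of rows of X")


d = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [0, 1, 0, 1]})

# Filtering rows leaves the index as 1, 2, 3, which is not trivial.
d_subset = d[d["x"] > 1.0].copy()

# Resetting restores the 0, 1, 2 index the note asks for.
d_subset.reset_index(inplace=True, drop=True)
check_fit_transform_inputs(d_subset[["x"]], d_subset["y"])
```

Row filtering is the most common way to end up with a non-trivial index, which is why the example resets the index immediately after subsetting.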
coverage.txt

Lines changed: 3 additions & 3 deletions
@@ -16,13 +16,13 @@ pkg/tests/test_util.py . [100%]
 Name                        Stmts   Miss  Cover
 -----------------------------------------------
 pkg/vtreat/__init__.py          7      0   100%
-pkg/vtreat/cross_plan.py       94     62    34%
+pkg/vtreat/cross_plan.py       94     52    45%
 pkg/vtreat/transform.py        13      8    38%
 pkg/vtreat/util.py             80      8    90%
 pkg/vtreat/vtreat_api.py      209     61    71%
 pkg/vtreat/vtreat_impl.py     467     80    83%
 -----------------------------------------------
-TOTAL                         870    219    75%
+TOTAL                         870    209    76%
 
 
-=========================== 7 passed in 6.12 seconds ===========================
+=========================== 7 passed in 6.14 seconds ===========================

pkg/README.md

Lines changed: 4 additions & 2 deletions
@@ -12,7 +12,7 @@ in a statistically sound manner.
 Install `vtreat` with either of:
 
 * `pip install vtreat`
-* `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.2.9.tar.gz`
+* `pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.3.0.tar.gz`
 
 # Details
 
@@ -1225,7 +1225,9 @@ To install, please run:
 pip install vtreat
 ```
 
+Some notes on controlling `vtreat` cross-validation can be found [here](https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md).
+
 ## Note on data types.
 
-`.fit_transform()` expects the first argument to be a `pandas.DataFrame` with trivial row-indexing, (i.e. `.reset_index(inplace=True, drop=True)`) and the second to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
+`.fit_transform()` expects the first argument to be a `pandas.DataFrame` with trivial row-indexing and scalar column names, (i.e. `.reset_index(inplace=True, drop=True)`) and the second to be a vector-like object with a `len()` equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
 

pkg/build/lib/vtreat/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -10,7 +10,7 @@
 from vtreat.vtreat_api import *
 
 __docformat__ = "restructuredtext"
-__version__ = "0.2.9"
+__version__ = "0.3.0"
 
 __doc__ = """
 This<https://github.com/WinVector/pyvtreat> is the Python version of the vtreat data preparation system

pkg/build/lib/vtreat/vtreat_api.py

Lines changed: 1 addition & 1 deletion
@@ -22,7 +22,7 @@ def vtreat_parameters(user_params=None):
 },
 "filter_to_recommended": True,
 "indicator_min_fraction": 0.1,
-"cross_validation_plan": vtreat.cross_plan.KWayCrossPlan(),
+"cross_validation_plan": vtreat.cross_plan.KWayCrossPlanYStratified(),
 "cross_validation_k": 5,
 "user_transforms": [],
 "sparse_indicators": True,

pkg/dist/vtreat-0.2.9.tar.gz

-21.7 KB
Binary file not shown.

pkg/dist/vtreat-0.3.0.tar.gz

21.7 KB
Binary file not shown.
