Commit 33ef26c

committed
fix reference
1 parent 12e4a64 commit 33ef26c

36 files changed: 591 additions and 592 deletions

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ Package: vtreat
 Type: Package
 Title: A Statistically Sound 'data.frame' Processor/Conditioner
 Version: 1.4.0
-Date: 2019-05-01
+Date: 2019-05-04
 Authors@R: c(
 person("John", "Mount", email = "[email protected]", role = c("aut", "cre")),
 person("Nina", "Zumel", email = "[email protected]", role = c("aut")),

NEWS.md

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 
-# vtreat 1.4.0 2019/05/01
+# vtreat 1.4.0 2019/05/04
 
 * Fancy level and variable names.
 * More tests on odd level names (and collisions).

README.Rmd

Lines changed: 4 additions & 6 deletions
@@ -134,9 +134,7 @@ precautions to guard against the following real world data issues:
 We re-encode such variables as a family of indicator or dummy
 variables for common levels plus an additional [impact
 code](http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/)
-(also called "effects coded" in Jacob Cohen, Patricia Cohen,
-*Applied Multiple Regression/Correlation Analysis for the Behavioral
-Sciences*, 2nd edition, 1983). This allows principled use
+(also called "effects coded"). This allows principled use
 (including smoothing) of huge categorical variables (like zip-codes)
 when building models. This is critical for some libraries (such as
 'randomForest', which has hard limits on the number of
@@ -316,9 +314,9 @@ dTrainN %.>%
 
 Related work:
 
-* _Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences_, 2nd edition, 1983, Jacob Cohen, Patricia Cohen (called the concept “effects coded variables”).
-* ["A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems"](http://dl.acm.org/citation.cfm?id=507538) Daniele Micci-Barreca, ACM SIGKDD Explorations, Volume 3 Issue 1, July 2001 Pages 27-32.
-* ["Modeling Trick: Impact Coding of Categorical Variables with Many Levels"](http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/) Nina Zumel, Win-Vector blog, 2012.
+* ["A Transformation for Simplifying the Interpretation of Coefficients of Binary Variables in Regression Analysis"](https://www.jstor.org/stable/2683780), Robert E. Sweeney and Edwin F. Ulveling; The American Statistician, vol. 26, no. 5, pp. 30-32, 1972.
+* ["A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems"](http://dl.acm.org/citation.cfm?id=507538) Daniele Micci-Barreca; ACM SIGKDD Explorations, Volume 3 Issue 1, July 2001 Pages 27-32.
+* ["Modeling Trick: Impact Coding of Categorical Variables with Many Levels"](http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/) Nina Zumel; Win-Vector blog, 2012.
 * "Big Learning Made Easy – with Counts!", Misha Bilenko, Cortana Intelligence and Machine Learning Blog, 2015.
 
 ## Installation

README.md

Lines changed: 19 additions & 19 deletions
@@ -123,9 +123,7 @@ precautions to guard against the following real world data issues:
 We re-encode such variables as a family of indicator or dummy
 variables for common levels plus an additional [impact
 code](http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/)
-(also called “effects coded” in Jacob Cohen, Patricia Cohen,
-*Applied Multiple Regression/Correlation Analysis for the Behavioral
-Sciences*, 2nd edition, 1983). This allows principled use (including
+(also called “effects coded”). This allows principled use (including
 smoothing) of huge categorical variables (like zip-codes) when
 building models. This is critical for some libraries (such as
 ‘randomForest’, which has hard limits on the number of allowed
@@ -276,14 +274,14 @@ dTestC <- data.frame(x=c('a', 'b', 'c', NA), z=c(10, 20, 30, NA))
 treatmentsC <- designTreatmentsC(dTrainC, colnames(dTrainC), 'y', TRUE,
                                  verbose=FALSE)
 print(treatmentsC$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')])
-#   origName   varName  code          rsq        sig extraModelDegrees
-# 1        x    x_catP  catP 1.559780e-01 0.22202097                 2
-# 2        x    x_catB  catB 1.142159e-05 0.99166241                 2
-# 3        z         z clean 2.376018e-01 0.13176020                 0
-# 4        z   z_isBAD isBAD 2.960654e-01 0.09248399                 0
-# 5        x  x_lev_NA   lev 2.960654e-01 0.09248399                 0
-# 6        x x_lev_x_a   lev 1.300057e-01 0.26490379                 0
-# 7        x x_lev_x_b   lev 6.067337e-03 0.80967242                 0
+#   origName   varName  code         rsq        sig extraModelDegrees
+# 1        x    x_catP  catP 0.057741424 0.45748159                 2
+# 2        x    x_catB  catB 0.019483838 0.66603146                 2
+# 3        z         z clean 0.237601767 0.13176020                 0
+# 4        z   z_isBAD isBAD 0.296065432 0.09248399                 0
+# 5        x  x_lev_NA   lev 0.296065432 0.09248399                 0
+# 6        x x_lev_x_a   lev 0.130005705 0.26490379                 0
+# 7        x x_lev_x_b   lev 0.006067337 0.80967242                 0
 
 # help("prepare")
 
@@ -321,9 +319,9 @@ treatmentsN = designTreatmentsN(dTrainN, colnames(dTrainN), 'y',
                                 verbose=FALSE)
 print(treatmentsN$scoreFrame[, c('origName', 'varName', 'code', 'rsq', 'sig', 'extraModelDegrees')])
 #   origName   varName  code          rsq       sig extraModelDegrees
-# 1        x    x_catP  catP 7.352941e-02 0.5159425                 2
-# 2        x    x_catN  catN 1.678556e-03 0.9232668                 2
-# 3        x    x_catD  catD 3.614228e-01 0.1149323                 2
+# 1        x    x_catP  catP 3.558824e-01 0.1184999                 2
+# 2        x    x_catN  catN 2.131202e-02 0.7301398                 2
+# 3        x    x_catD  catD 4.512437e-02 0.6135229                 2
 # 4        z         z clean 2.880952e-01 0.1701892                 0
 # 5        z   z_isBAD isBAD 3.333333e-01 0.1339746                 0
 # 6        x  x_lev_NA   lev 3.333333e-01 0.1339746                 0
@@ -379,17 +377,19 @@ dTrainN %.>%
 
 Related work:
 
-  - *Applied Multiple Regression/Correlation Analysis for the Behavioral
-    Sciences*, 2nd edition, 1983, Jacob Cohen, Patricia Cohen (called
-    the concept “effects coded variables”).
+  - [“A Transformation for Simplifying the Interpretation of
+    Coefficients of Binary Variables in Regression
+    Analysis”](https://www.jstor.org/stable/2683780), Robert E.
+    Sweeney and Edwin F. Ulveling; The American Statistician, vol. 26,
+    no. 5, pp. 30-32, 1972.
   - [“A preprocessing scheme for high-cardinality categorical attributes
     in classification and prediction
     problems”](http://dl.acm.org/citation.cfm?id=507538) Daniele
-    Micci-Barreca, ACM SIGKDD Explorations, Volume 3 Issue 1, July 2001
+    Micci-Barreca; ACM SIGKDD Explorations, Volume 3 Issue 1, July 2001
     Pages 27-32.
   - [“Modeling Trick: Impact Coding of Categorical Variables with Many
     Levels”](http://www.win-vector.com/blog/2012/07/modeling-trick-impact-coding-of-categorical-variables-with-many-levels/)
-    Nina Zumel, Win-Vector blog, 2012.
+    Nina Zumel; Win-Vector blog, 2012.
   - “Big Learning Made Easy – with Counts\!”, Misha Bilenko, Cortana
     Intelligence and Machine Learning Blog, 2015.
 