Large scale regularized logistic regression code (including a Hadoop implementation). See:

http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/
http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/
http://www.win-vector.com/blog/2010/11/learn-a-powerful-machine-learning-tool-logistic-regression-and-beyond/

The experimental class LogisticTrainPlus allows useful encoding of an arbitrary number of categorical levels. See: http://www.win-vector.com/blog/2012/08/a-bit-more-on-impact-coding/

All material is copyright Win-Vector LLC and distributed under the GPLv3 license (see: http://www.gnu.org/copyleft/gpl.html ).

This is demonstration/experimental code. If you just want to try out standard logistic regression without Hadoop, use R ( http://cran.r-project.org ). You may also want to consider Apache's Mahout, which does implement logistic regression (see: https://cwiki.apache.org/MAHOUT/logistic-regression.html and http://imiloainf.wordpress.com/2011/11/02/mahout-logistic-regression/ ). For the purpose of this project see also: http://www.win-vector.com/blog/2012/10/added-worked-example-to-logistic-regression-project/

This Logistic codebase is designed to support experimentation on variations of logistic regression, including:
- A pure Java implementation (thus directly usable in Java server environments).
- A simple multinomial implementation (that allows more than two possible result categories).
- The ability to work with data sets that are too large for memory, working directly from files or database tables.
- A demonstration of the steps needed to use standard Newton-Raphson in Hadoop.
- The ability to work with arbitrarily large categorical inputs.
- Explicit L2 model regularization.
- Safe optimization methods (such as conjugate gradient, line search and majorization) for situations where the standard Iteratively re-Weighted Least Squares/Newton-Raphson procedure fails. (A sketch of the regularized objective follows this list.)
- An overall framework for quickly trying implementation experiments (as opposed to novel usage experiments).
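For reference, the "multinomial" and "L2 regularization" items above refer to the usual textbook objective. This is only a sketch of that standard formulation (the exact scaling of the penalty term and the intercept handling in this codebase may differ):

P(y = k \mid x) \;=\; \frac{\exp(\beta_k \cdot x)}{\sum_{j} \exp(\beta_j \cdot x)}

\text{maximize over } \beta : \;\; \sum_{i} \log P(y_i \mid x_i) \;-\; \lambda \sum_{k} \lVert \beta_k \rVert_2^2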
What we mean by this code being "experimental" is that it has capabilities that many standard implementations do not; in fact most of the items in the above list are not usually made available to the logistic regression user. But our project is also stand-alone and not as well integrated into existing workflows as standard production systems. Before trying our code you may want to try R or Mahout.

In principle running the code is easy: you supply a training file as a TSV (tab separated values) or CSV (comma separated values) file, and write the column you want to predict as a schematic formula of the columns you wish to use in your model. In practice it is a bit harder: you have to have already set up your Java or Hadoop environment to bring in all required dependencies. Setting up a Hadoop configuration can range from simple (like the single machine tests in our project's JUnit test suite) to complicated. Also, for non-trivial clusters you often do not control the configuration (i.e. somebody else supplies the cluster environment), so we really can't tell you how to set up your non-trivial Hadoop environment (there are just too many variables). Some time ago we supplied a complete example of how to run an example on Amazon's Elastic MapReduce (but that was in 2010, so the environment may have changed a bit; the current code runs at least on Hadoop versions 0.20.0, 1.0.0 and 1.0.3 without modification).

Our intent was never to depend on Hadoop except in the case of very large data (and even then there are other options, like sub-sampling and Mahout), so we supply direct Java command line options. Below is a simple example of using these options (see also: the added example blog post). We are assuming a command-line environment (for example the Bash shell on OSX or Linux, but this can be done on Windows using either CMD or Cygwin). We show what works for us in our environment; you will have to adapt as your environment differs.

Example of running at the command line (using some Apache support classes, but not running under Hadoop):
- Get a data file. For our example we took the data file from the UCI Iris data example, saved it, and added a header line of the form "SepalLength,SepalWidth,PetalLength,PetalWidth,TrainingClass" to make the file machine readable. The edited file is available here: iris.data.txt .
- Get all the supporting code you need and set your Java CLASSPATH. To run this you need all of the classes from:
- https://github.com/WinVector/Logistic
- https://github.com/WinVector/SQLScrewdriver
- http://acs.lbl.gov/software/colt/
- Some of the Apache Commons jars (command line parsing and logging). We got these from the Hadoop-1.0.3 lib directory: http://hadoop.apache.org/releases.html#Download . It turns out we only need hadoop-1.0.3/lib/commons-logging-1.1.1.jar , hadoop-1.0.3/lib/commons-logging-api-1.0.4.jar and hadoop-1.0.3/lib/commons-cli-1.2.jar from the Hadoop package if we are not running a Map Reduce job.
This is complicated, but it is a one-time setup cost. In practice you would not manipulate classes directly at the command line but instead use an IDE like Eclipse or a build manager like Maven to do all of this work. But not everybody uses the same tools, and tools bring in even more dependencies and complications, so we show how to set up the classpath directly. In our shell (bash on OSX) we set our classpath variable as follows:
CLASSES="hadoop-1.0.3/lib/commons-logging-1.1.1.jar:hadoop-1.0.3/lib/commons-logging-api-1.0.4.jar:hadoop-1.0.3/lib/commons-cli-1.2.jar:Logistic/bin:Colt-1.2.0/bin:SQLScrewdriver/bin"
We are using the paths where we put the Hadoop files, joined with the path separator ":" (the separator is ";" on Windows). The paths you use will depend on where you put the files you downloaded.
- In the directory where you saved the training data file, run the logistic training procedure:
java -cp $CLASSES com.winvector.logistic.LogisticTrain -trainURI file:iris.data.txt -sep , -formula "TrainingClass ~ SepalLength + SepalWidth + PetalLength + PetalWidth" -resultSer iris_model.ser
This produces iris_model.ser , the trained model. The diagnostic printouts show the confusion matrix (tabulation of training class versus predicted class) and show a high degree of training accuracy.
INFO: Confusion matrix:
prediction          actual          actual              actual
index:outcome       0:Iris-setosa   1:Iris-versicolor   2:Iris-virginica
0:Iris-setosa       50              0                   0
1:Iris-versicolor   0               47                  1
2:Iris-virginica    0               3                   49
Notice that there are only 4 training errors (1 Iris-virginica classified as Iris-versicolor and 3 Iris-versicolor classified as Iris-virginica). The model coefficients are also printed as part of the diagnostics.
INFO: soln details:
outcome          outcomegroup  variable     kind     level  value
Iris-setosa      0             Const                        0.774522294889561
Iris-setosa      0             PetalLength  Numeric         -5.192578560594749
Iris-setosa      0             PetalWidth   Numeric         -2.357410090972314
Iris-setosa      0             SepalLength  Numeric         1.69382466234698
Iris-setosa      0             SepalWidth   Numeric         3.9224697382723903
Iris-versicolor  1             Const                        1.8730846627542541
Iris-versicolor  1             PetalLength  Numeric         -0.3747220505092903
Iris-versicolor  1             PetalWidth   Numeric         -2.839314336523609
Iris-versicolor  1             SepalLength  Numeric         1.1786843497402208
Iris-versicolor  1             SepalWidth   Numeric         0.2589801610139257
Iris-virginica   2             Const                        -2.6476069576668615
Iris-virginica   2             PetalLength  Numeric         5.567416774143793
Iris-virginica   2             PetalWidth   Numeric         5.196786825681218
Iris-virginica   2             SepalLength  Numeric         -2.8725550015586947
Iris-virginica   2             SepalWidth   Numeric         -4.1815354814927215
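A note on re-using the trained model: iris_model.ser is a standard Java serialized object, which is part of what makes this implementation usable from Java server environments. Below is a minimal sketch (not the project's documented API) of loading it in plain Java; it assumes only standard java.io serialization and that the Logistic, SQLScrewdriver and Colt classes are on the classpath. Inspect the loaded object (or the LogisticScore source) for the actual scoring methods.

import java.io.FileInputStream;
import java.io.ObjectInputStream;

public class LoadModelSketch {
    public static void main(final String[] args) throws Exception {
        // Deserialize the model written by LogisticTrain's -resultSer option.
        final ObjectInputStream ois =
            new ObjectInputStream(new FileInputStream("iris_model.ser"));
        try {
            final Object model = ois.readObject();
            // We do not assume the concrete model class here; just confirm the load.
            System.out.println("loaded model of class: " + model.getClass().getName());
            System.out.println(model);
        } finally {
            ois.close();
        }
    }
}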
- In the same directory, run the logistic scoring procedure:
java -cp $CLASSES com.winvector.logistic.LogisticScore -dataURI file:iris.data.txt -sep , -modelFile iris_model.ser -resultFile iris.scored.tsv
This produces iris.scored.tsv , the final result. The scored file is essentially the input file (in this case iris.data.txt) copied over with a few prediction columns (predicted category, confidence in predicted category and probability for each possible outcome category) prepended.
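If you want to consume iris.scored.tsv programmatically, here is a minimal sketch of reading it back in plain Java. It assumes only that the file is tab separated with a header row; the exact names of the prepended prediction columns depend on the model's outcome categories, so the sketch prints the header rather than hard-coding column names.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;

public class ReadScoredSketch {
    public static void main(final String[] args) throws Exception {
        final BufferedReader r = new BufferedReader(new FileReader("iris.scored.tsv"));
        try {
            // First row is the header: prediction columns followed by the original columns.
            final String header = r.readLine();
            System.out.println("columns: " + Arrays.toString(header.split("\t")));
            // Print the first few scored rows.
            for (int i = 0; i < 3; ++i) {
                final String row = r.readLine();
                if (row == null) {
                    break;
                }
                System.out.println(row);
            }
        } finally {
            r.close();
        }
    }
}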
For the Hadoop demonstrations of training and scoring the commands are as follows (though obviously some of the details depend on your Hadoop set-up):

- Get or build a jar containing the Logistic code, the SQLScrewdriver code and the free portions of the COLT library (pre-built: WinVectorLogistic.Hadoop0.20.2.jar).
- Make sure the training file is tab separated (instead of comma separated). For example, the iris data in such a format is here: iris.data.tsv .
- Run the Hadoop version of the trainer:
hadoop-1.0.3/bin/hadoop jar Logistic/WinVectorLogistic.Hadoop0.20.2.jar logistictrain iris.data.tsv "TrainingClass ~ SepalLength + SepalWidth + PetalLength + PetalWidth" iris_model.ser
- Run the Hadoop version of the scoring function:
hadoop-1.0.3/bin/hadoop jar Logistic/WinVectorLogistic.Hadoop0.20.2.jar logisticscore iris_model.ser iris.data.tsv scoreDir

Scored output is left in Hadoop format in the user specified scoreDir (in a slightly different format than that of the stand-alone programs). The passes take quite a long time due to the overhead of setting up and tearing down a Hadoop environment for such a small problem. Also, if you are running in a serious Hadoop environment (like Elastic MapReduce) you will have to change certain file names to the type of URI the system is using. In our Elastic MapReduce example we used S3 containers, which had forms like "s3n://bigModel.ser" and so on.

The Hadoop code can also be entered using the main()s found in com.winvector.logistic.demo.MapReduceLogisticTrain and com.winvector.logistic.demo.MapReduceScore . This allows interactive debugging through an IDE (like Eclipse) without having to use the "hadoop" command.
Again: this is experimental code. It can do things other code bases can not. If you need one of its features or capabilities it is very much worth the trouble. But if you can make do with a standard package like R you will have less trouble and be more able to interact with others' work.