Description
I did a run where I ran every ML measure for every table/column combination 20 times (not all runs completed all 20, but most did). I gathered various stats regarding those 20 runs. The results are summarized here. Following that is a more detailed explanation.
Summary
Bottom line: most of the methods can have substantial variation between individual measures. The exception is LinearRegression, but that one has the problem that different machines produce different measures.
Some of these measures can take a long time to run, so it seems that what we need to do is to make multiple measures per table/column/method combination, run each individual measure in its own slurm job, and produce a separate file for each one. Then, in gatherResults.py, we select the best of the multiple measures.
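As a sketch of what the selection step in gatherResults.py could look like: assuming (hypothetically) that each slurm job writes its single measure to a JSON file with a top-level "score" field, named like table.column.method.runN.json, the best-of-N pick is just a max over the matching files. The file layout and field name here are assumptions, not the actual format.

```python
import glob
import json

def best_score(pattern):
    """Return the highest score among all measure files matching the glob pattern.

    Assumes each file is JSON with a top-level "score" field (hypothetical layout).
    """
    best = None
    for path in glob.glob(pattern):
        with open(path) as f:
            score = json.load(f)["score"]
        if best is None or score > best:
            best = score
    return best
```

Because each run lives in its own file, partially completed batches (where not all 20 runs finished) still yield a usable best-of-N.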
Following are the summary results for each method, followed by the reasoning from the statistical measures.
LinearRegression
Either gives very bad scores (which vary significantly between runs) or consistently good scores. Which of the two appears to be machine dependent.
Assuming we are on a machine that gives a good score, then the variance between the good scores is very small. This means we can get away with selecting one run (or perhaps the best of just 3 or 4 runs).
MLPRegressor
There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Some of these measures can be quite slow.
BinaryAdaBoostClassifier
There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.
BinaryLogisticRegression
There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.
BinaryMLPClassifier
There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.
MulticlassDecisionTreeClassifier
There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.
MulticlassMLPClassifier
There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.
Reasoning
LinearRegression
The stats are:
LinearRegression:
Total samples: 96
Max max: 1.0
Min max: -7.52120439003072e+25
Average max: -7.845129805975915e+23
Stddev max: 7.676190056164506e+24
Average avg: -1.0235563739196867e+24
Stddev avg: 8.001905245682028e+24
Average stdev: 1.0690345450679683e+24
Stddev stdev: 1.0474356428102754e+25
Max max-min gap: 4.589633033653067e+26
intrusion.csv.dst_bytes.LinearRegression.json
Average max-min gap: 4.7808678232824684e+24
Stddev max-min gap: 4.68427459878935e+25
Max max-first gap: 2.570393772131859e+16
census.csv.age.LinearRegression.json
Average max-first gap: 510925575515149.75
Max 0.01ofMax: 19
census.csv.age.LinearRegression.json
Average 0.01ofMax: 1.5208333333333333
Stddev 0.01ofMax: 4.717223015976257
0 of 96 have both positive and negative scores
LinearRegression_good:
Total samples: 37
Max max: 1.0
Min max: 0.5058273747831172
Average max: 0.7956564371727154
Stddev max: 0.19854860391503396
Average avg: 0.7940715085198299
Stddev avg: 0.199761604374762
Average stdev: 0.0013304198752687325
Stddev stdev: 0.005112806519675665
Max max-min gap: 0.10141611802703243
intrusion.csv.dst_host_count.LinearRegression.json
Average max-min gap: 0.004846775495064539
Stddev max-min gap: 0.01745522784932013
Max max-first gap: 0.025371444764794693
intrusion.csv.dst_host_count.LinearRegression.json
Average max-first gap: 0.0011819185626353627
Max 0.01ofMax: 2
intrusion.csv.dst_host_count.LinearRegression.json
Average 0.01ofMax: 0.05405405405405406
Stddev 0.01ofMax: 0.3287979746107146
0 of 37 have both positive and negative scores
Note that methods labeled Method_good limit the stats to runs that had a score better than 0.5. There are 37 such samples (different table/column combinations).
`0 of 96 have both positive and negative scores`: This tells us that these runs have either all positive or all negative scores.
`Average max-min gap: 0.004846775495064539`: For the good cases, this tells us that on average (over the 37 samples) there is very little difference between the highest and lowest scores of the 20 runs.
`Max max-min gap: 0.10141611802703243`: For the good cases, this tells us that the largest max-min gap among the 37 samples was 0.1, which is higher than we want.
`Average 0.01ofMax: 0.05405405405405406`: For the good cases, this tells us that on average, a score within 0.01 of the best score was found at the first measure. `Max 0.01ofMax: 2` tells us that in the worst case, a score within 0.01 of the best score was not found until the third measure.
Assuming we hit a machine that gives a good score, we are safe taking only one measure, or at most 3-4 measures.
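To make the statistics above concrete, here is a minimal sketch of how the per-sample numbers could be computed, assuming `scores` is the list of measures for one table/column/method sample in the order they were run. The function name and dictionary keys are my own naming for illustration, not necessarily what the stats script uses.

```python
def run_stats(scores):
    """Summarize repeated measures for one table/column/method sample."""
    best = max(scores)
    return {
        # spread between the best and worst of the runs
        "max-min gap": best - min(scores),
        # how much better the best run is than the first run
        "max-first gap": best - scores[0],
        # 0-based index of the first measure within 0.01 of the best
        "0.01ofMax": next(i for i, s in enumerate(scores) if best - s <= 0.01),
    }
```

With a max-min gap near zero (as in the LinearRegression_good case), the 0.01ofMax index is almost always 0, which is what justifies taking a single measure there.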
MLPRegressor
MLPRegressor:
Total samples: 96
Max max: 0.9998981082247835
Min max: -3.062275587521884
Average max: 0.3866253580863069
Stddev max: 0.6136262397941669
Average avg: -40.430940130884984
Stddev avg: 233.61268430735157
Average stdev: 56.139268953175964
Stddev stdev: 294.5794165958341
Max max-min gap: 6790.90828833852
intrusion.csv.dst_host_count.MLPRegressor.json
Average max-min gap: 205.56190843811632
Stddev max-min gap: 1054.7628390978848
Max max-first gap: 1036.0233666015326
intrusion.csv.count.MLPRegressor.json
Average max-first gap: 18.93460919768702
Max 0.01ofMax: 15
intrusion.csv.dst_host_count.MLPRegressor.json
Average 0.01ofMax: 2.1770833333333335
Stddev 0.01ofMax: 4.015744780280599
23 of 96 have both positive and negative scores
`Average max-min gap: 205.56190843811632`: This is telling us that on average there is a huge difference between the scores of different runs for the same table/column.
`Max max-min gap: 6790.90828833852`: And this is the worst case among the 96 samples.
`Average 0.01ofMax: 2.1770833333333335`: This tells us that, on average, the first measure within 0.01 of the best score is found around the third measure.
`Stddev 0.01ofMax: 4.015744780280599`: This tells us that the number of measures needed to get within 0.01 varies a lot across table/column samples.
`Max 0.01ofMax: 15`: And in the worst case, a score within 0.01 of the best was not found until the 16th measure.
This all suggests that we need a lot of samples to get a good score for MLPRegressor. At least 20 I think. (But they can all be run on the same machine.)
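One way to get the 20+ measures per combination is a slurm job array, one array task per measure, each writing its own result file. Below is a hedged sketch that just builds the sbatch command string; the measure script name (runMeasure.py), its CLI, and the output-file naming are assumptions for illustration, not the project's actual interface.

```python
def sbatch_command(table, column, method, n_runs=20):
    """Build an sbatch command submitting n_runs array tasks for one
    table/column/method combination. runMeasure.py and its arguments
    are hypothetical; only sbatch's --array/--wrap flags are real."""
    return (
        f"sbatch --array=0-{n_runs - 1} --wrap "
        f"'python runMeasure.py {table} {column} {method} "
        f"--out {table}.{column}.{method}.run${{SLURM_ARRAY_TASK_ID}}.json'"
    )
```

Each task expands SLURM_ARRAY_TASK_ID at run time, so the 20 result files never collide and gatherResults.py can glob over them afterward.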
BinaryAdaBoostClassifier
BinaryAdaBoostClassifier:
Total samples: 53
Max max: 1.0
Min max: 0.0
Average max: 0.9068751303409786
Stddev max: 0.1562903553311418
Average avg: 0.8924151832469711
Stddev avg: 0.17410336201391738
Average stdev: 0.009009615621929833
Stddev stdev: 0.04088655322611039
Max max-min gap: 0.9
intrusion.csv.land.BinaryAdaBoostClassifier.json
Average max-min gap: 0.02959683917390543
Stddev max-min gap: 0.1304145612939037
Max max-first gap: 0.23809523809523808
intrusion.csv.root_shell.BinaryAdaBoostClassifier.json
Average max-first gap: 0.010474362349415359
Max 0.01ofMax: 19
intrusion.csv.land.BinaryAdaBoostClassifier.json
Average 0.01ofMax: 1.0566037735849056
Stddev 0.01ofMax: 3.8150264357367303
0 of 53 have both positive and negative scores
Note that 52 of the 53 samples had a max score > 0.5.
`Average max-min gap: 0.02959683917390543` tells us that the measures are usually pretty tight, but `Stddev max-min gap: 0.1304145612939037` says that there is substantial variance, and `Max max-min gap: 0.9` says that at least once in the 53 table/column samples the gap was quite large.
`Average 0.01ofMax: 1.0566037735849056` says that usually a score close to the best was found by the second measure, but `Stddev 0.01ofMax: 3.8150264357367303` says that the variance is high, and `Max 0.01ofMax: 19` says that in one case only the final measure (the best score itself) was within 0.01 of the best.
We conclude that we need 20-30 measures to get a good score.
BinaryLogisticRegression
BinaryLogisticRegression:
Total samples: 53
Max max: 1.0
Min max: 0.007111111111111111
Average max: 0.8055424136960064
Stddev max: 0.2663549125673896
Average avg: 0.7687631564140988
Stddev avg: 0.28473818880421475
Average stdev: 0.02943585817702546
Stddev stdev: 0.07057808561319116
Max max-min gap: 0.9655883664158716
intrusion.csv.is_guest_login.BinaryLogisticRegression.json
Average max-min gap: 0.09563457059857121
Stddev max-min gap: 0.21303781576313277
Max max-first gap: 0.9655883664158716
intrusion.csv.is_guest_login.BinaryLogisticRegression.json
Average max-first gap: 0.05620198898797505
Max 0.01ofMax: 17
intrusion.csv.dst_host_diff_srv_rate.BinaryLogisticRegression.json
Average 0.01ofMax: 1.5849056603773586
Stddev 0.01ofMax: 4.0212606245006866
0 of 53 have both positive and negative scores
Similar to the prior arguments, there can be substantial variance between measures, and we need 20-30 measures to get a good value.
BinaryMLPClassifier
BinaryMLPClassifier:
Total samples: 53
Max max: 1.0
Min max: 0.15384615384615383
Average max: 0.8854125579711414
Stddev max: 0.1694019541487938
Average avg: 0.8505652421867341
Stddev avg: 0.22191102593369064
Average stdev: 0.026263986505766817
Stddev stdev: 0.06864744882473209
Max max-min gap: 0.9523809523809523
intrusion.csv.land.BinaryMLPClassifier.json
Average max-min gap: 0.08668380475713197
Stddev max-min gap: 0.19552783992170478
Max max-first gap: 0.9523809523809523
intrusion.csv.land.BinaryMLPClassifier.json
Average max-first gap: 0.03828926786080096
Max 0.01ofMax: 17
intrusion.csv.srv_diff_host_rate.BinaryMLPClassifier.json
Average 0.01ofMax: 1.5660377358490567
Stddev 0.01ofMax: 3.516496475199013
0 of 53 have both positive and negative scores
Similar to the prior arguments, there can be substantial variance between measures, and we need 20-30 measures to get a good value.
MulticlassDecisionTreeClassifier
MulticlassDecisionTreeClassifier:
Total samples: 132
Max max: 1.0
Min max: 0.1369446222401963
Average max: 0.6354383346905583
Stddev max: 0.25276984943532
Average avg: 0.614685403358619
Stddev avg: 0.259548686245021
Average stdev: 0.01245704989667424
Stddev stdev: 0.022662225073119067
Max max-min gap: 0.4746633745946614
intrusion.csv.num_shells.MulticlassDecisionTreeClassifier.json
Average max-min gap: 0.0416258386946806
Stddev max-min gap: 0.07401481190525455
Max max-first gap: 0.4746633745946614
intrusion.csv.num_shells.MulticlassDecisionTreeClassifier.json
Average max-first gap: 0.02189192510450059
Max 0.01ofMax: 19
KRK_v1.csv.black_king_rank.MulticlassDecisionTreeClassifier.json
Average 0.01ofMax: 1.4621212121212122
Stddev 0.01ofMax: 3.5304678301464927
0 of 132 have both positive and negative scores
Similar to the prior arguments, there can be substantial variance between measures, and we need 20-30 measures to get a good value.
MulticlassMLPClassifier
MulticlassMLPClassifier:
Total samples: 132
Max max: 0.9997861232493044
Min max: 0.12425568754818966
Average max: 0.6228910729462037
Stddev max: 0.2568441238766107
Average avg: 0.595090856861174
Stddev avg: 0.26329307085976694
Average stdev: 0.016626135238015812
Stddev stdev: 0.02080889679037879
Max max-min gap: 0.40728186971216823
expedia_hotel_logs.csv.srch_destination_type_id.MulticlassMLPClassifier.json
Average max-min gap: 0.06087418836490585
Stddev max-min gap: 0.0728309728800052
Max max-first gap: 0.3550986625120539
expedia_hotel_logs.csv.srch_destination_type_id.MulticlassMLPClassifier.json
Average max-first gap: 0.027274826254010922
Max 0.01ofMax: 17
intrusion.csv.num_shells.MulticlassMLPClassifier.json
Average 0.01ofMax: 2.75
Stddev 0.01ofMax: 4.01404975303587
0 of 132 have both positive and negative scores
Similar to the prior arguments, there can be substantial variance between measures, and we need 20-30 measures to get a good value.