
Variance between individual ML measures #81

@yoid2000

Description

I ran every ML measure for every table/column combination 20 times (not all combinations completed all 20 runs, but most did), and gathered various stats over those runs. The results are summarized here, followed by a more detailed explanation.

Summary

Bottom line: Most of the methods can have substantial variation between individual measures. The exception is LinearRegression, but that one has the problem that different machines produce different measures.

Some of these measures can take a long time to run, so it seems that what we need to do is take multiple measures per table/column/method combination, run each individual measure as a separate slurm job producing a separate file, and then select the best of the multiple measures in gatherResults.py.
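As a rough sketch of what the selection step in gatherResults.py could look like (the file-naming convention, the per-run JSON layout, and the "score" field are all assumptions for illustration, not the actual format):

```python
import glob
import json


def best_measure(results_dir, table, column, method):
    """Pick the best (highest) score among several independent measure
    runs for one table/column/method combination.

    Assumes each slurm job wrote its own JSON file named
    <table>.<column>.<method>.run<N>.json containing a top-level
    "score" field -- both are hypothetical conventions here.
    """
    pattern = f"{results_dir}/{table}.{column}.{method}.run*.json"
    scores = []
    for path in glob.glob(pattern):
        with open(path) as f:
            scores.append(json.load(f)["score"])
    if not scores:
        return None  # no runs completed for this combination
    return max(scores)
```

Because each run lives in its own file, partially completed combinations (fewer than 20 runs) degrade gracefully: we simply take the best of whatever finished.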

Following are the summary results for each method, followed by the reasoning from the statistical measures.

LinearRegression

It either gives consistently very bad scores (which vary significantly between runs) or consistently good scores. This appears to be machine dependent.

Assuming we are on a machine that gives a good score, the variance between the good scores is very small. This means we can get away with selecting one run (or perhaps the best of just 3 or 4 runs).

MLPRegressor

There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Some of these measures can be quite slow.

BinaryAdaBoostClassifier

There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.

BinaryLogisticRegression

There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.

BinaryMLPClassifier

There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.

MulticlassDecisionTreeClassifier

There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.

MulticlassMLPClassifier

There is a lot of variance between measures. I think we need 20-30 measures to find a good one. Fortunately these measures are generally pretty fast.

Reasoning

LinearRegression

The stats are:

LinearRegression:
    Total samples: 96
    Max max: 1.0
    Min max: -7.52120439003072e+25
    Average max: -7.845129805975915e+23
    Stddev max: 7.676190056164506e+24
    Average avg: -1.0235563739196867e+24
    Stddev avg: 8.001905245682028e+24
    Average stdev: 1.0690345450679683e+24
    Stddev stdev: 1.0474356428102754e+25
    Max max-min gap: 4.589633033653067e+26
         intrusion.csv.dst_bytes.LinearRegression.json
    Average max-min gap: 4.7808678232824684e+24
    Stddev max-min gap: 4.68427459878935e+25
    Max max-first gap: 2.570393772131859e+16
         census.csv.age.LinearRegression.json
    Average max-first gap: 510925575515149.75
    Max 0.01ofMax: 19
         census.csv.age.LinearRegression.json
    Average 0.01ofMax: 1.5208333333333333
    Stddev 0.01ofMax: 4.717223015976257
    0 of 96 have both positive and negative scores

LinearRegression_good:
    Total samples: 37
    Max max: 1.0
    Min max: 0.5058273747831172
    Average max: 0.7956564371727154
    Stddev max: 0.19854860391503396
    Average avg: 0.7940715085198299
    Stddev avg: 0.199761604374762
    Average stdev: 0.0013304198752687325
    Stddev stdev: 0.005112806519675665
    Max max-min gap: 0.10141611802703243
         intrusion.csv.dst_host_count.LinearRegression.json
    Average max-min gap: 0.004846775495064539
    Stddev max-min gap: 0.01745522784932013
    Max max-first gap: 0.025371444764794693
         intrusion.csv.dst_host_count.LinearRegression.json
    Average max-first gap: 0.0011819185626353627
    Max 0.01ofMax: 2
         intrusion.csv.dst_host_count.LinearRegression.json
    Average 0.01ofMax: 0.05405405405405406
    Stddev 0.01ofMax: 0.3287979746107146
    0 of 37 have both positive and negative scores

Note that methods labeled Method_good limit the stats to runs that had a score better than 0.5. There are 37 samples (different table/column combinations).

0 of 96 have both positive and negative scores: This tells us that these runs have either all positive or all negative scores.

Average max-min gap: 0.004846775495064539: For the good cases, this is telling us that on average (over 37 samples) there is very little difference between the highest and lowest scores of the 20 runs.

Max max-min gap: 0.10141611802703243: For the good cases, this tells us that the largest max-min gap of the 37 samples was 0.1, which is higher than we want.

Average 0.01ofMax: 0.05405405405405406: For the good cases, this tells us that on average, a score within 0.01 of the best score was found at the first measure. Max 0.01ofMax: 2 tells us that in the worst case, a score within 0.01 of the best score was found on the third measure.
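For clarity, the per-sample statistics discussed here can be computed from the list of run scores roughly like this (a minimal sketch; the function name is hypothetical, and 0.01ofMax is taken as the 0-based index of the first run whose score is within 0.01 of the best, consistent with the reading above):

```python
def sample_stats(scores):
    """Per-sample stats from the scores of repeated runs, in run order.

    Returns (max-min gap, max-first gap, 0.01ofMax index).
    """
    best = max(scores)
    max_min_gap = best - min(scores)    # spread across the runs
    max_first_gap = best - scores[0]    # how much we lose taking only run 1
    # 0-based index of the first run scoring within 0.01 of the best
    within_001_of_max = next(i for i, s in enumerate(scores)
                             if s >= best - 0.01)
    return max_min_gap, max_first_gap, within_001_of_max
```

So an Average 0.01ofMax near 0 means the very first run is typically already within 0.01 of the best.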

Assuming we hit a machine that gives a good score, we are safe taking only one measure, or at most 3-4 measures.

MLPRegressor

MLPRegressor:
    Total samples: 96
    Max max: 0.9998981082247835
    Min max: -3.062275587521884
    Average max: 0.3866253580863069
    Stddev max: 0.6136262397941669
    Average avg: -40.430940130884984
    Stddev avg: 233.61268430735157
    Average stdev: 56.139268953175964
    Stddev stdev: 294.5794165958341
    Max max-min gap: 6790.90828833852
         intrusion.csv.dst_host_count.MLPRegressor.json
    Average max-min gap: 205.56190843811632
    Stddev max-min gap: 1054.7628390978848
    Max max-first gap: 1036.0233666015326
         intrusion.csv.count.MLPRegressor.json
    Average max-first gap: 18.93460919768702
    Max 0.01ofMax: 15
         intrusion.csv.dst_host_count.MLPRegressor.json
    Average 0.01ofMax: 2.1770833333333335
    Stddev 0.01ofMax: 4.015744780280599
    23 of 96 have both positive and negative scores

Average max-min gap: 205.56190843811632: This is telling us that on average there is a huge difference between the scores of different runs for the same table/column.

Max max-min gap: 6790.90828833852: And this is the worst case among the 96 samples.

Average 0.01ofMax: 2.1770833333333335: This tells us that, on average, it takes about 2 measures to get within 0.01 of the best score.

Stddev 0.01ofMax: 4.015744780280599: This tells us that the number of measures to get within 0.01 varies a lot for different table/column samples.

Max 0.01ofMax: 15: And in the worst case it took 15 measures to get within 0.01 of the best score.

This all suggests that we need a lot of samples to get a good score for MLPRegressor. At least 20 I think. (But they can all be run on the same machine.)

BinaryAdaBoostClassifier

BinaryAdaBoostClassifier:
    Total samples: 53
    Max max: 1.0
    Min max: 0.0
    Average max: 0.9068751303409786
    Stddev max: 0.1562903553311418
    Average avg: 0.8924151832469711
    Stddev avg: 0.17410336201391738
    Average stdev: 0.009009615621929833
    Stddev stdev: 0.04088655322611039
    Max max-min gap: 0.9
         intrusion.csv.land.BinaryAdaBoostClassifier.json
    Average max-min gap: 0.02959683917390543
    Stddev max-min gap: 0.1304145612939037
    Max max-first gap: 0.23809523809523808
         intrusion.csv.root_shell.BinaryAdaBoostClassifier.json
    Average max-first gap: 0.010474362349415359
    Max 0.01ofMax: 19
         intrusion.csv.land.BinaryAdaBoostClassifier.json
    Average 0.01ofMax: 1.0566037735849056
    Stddev 0.01ofMax: 3.8150264357367303
    0 of 53 have both positive and negative scores

Note that 52 of the 53 samples had a max score > 0.5.

Average max-min gap: 0.02959683917390543 tells us that usually the measures are pretty tight, but Stddev max-min gap: 0.1304145612939037 says that there is substantial variance, and Max max-min gap: 0.9 says that at least once in 53 table/column samples the gap was quite large.

Average 0.01ofMax: 1.0566037735849056 says that usually a score close to the best was found by the second measure, but Stddev 0.01ofMax: 3.8150264357367303 says that the variance is high, and Max 0.01ofMax: 19 says that in one case only one measure was within 0.01 of the best score (the best score itself).

Conclude that we need 20-30 measures to get a good score.

BinaryLogisticRegression

BinaryLogisticRegression:
    Total samples: 53
    Max max: 1.0
    Min max: 0.007111111111111111
    Average max: 0.8055424136960064
    Stddev max: 0.2663549125673896
    Average avg: 0.7687631564140988
    Stddev avg: 0.28473818880421475
    Average stdev: 0.02943585817702546
    Stddev stdev: 0.07057808561319116
    Max max-min gap: 0.9655883664158716
         intrusion.csv.is_guest_login.BinaryLogisticRegression.json
    Average max-min gap: 0.09563457059857121
    Stddev max-min gap: 0.21303781576313277
    Max max-first gap: 0.9655883664158716
         intrusion.csv.is_guest_login.BinaryLogisticRegression.json
    Average max-first gap: 0.05620198898797505
    Max 0.01ofMax: 17
         intrusion.csv.dst_host_diff_srv_rate.BinaryLogisticRegression.json
    Average 0.01ofMax: 1.5849056603773586
    Stddev 0.01ofMax: 4.0212606245006866
    0 of 53 have both positive and negative scores

Similar to the prior arguments, there can be substantial variance between measures, and we need 20-30 measures to get a good value.

BinaryMLPClassifier

BinaryMLPClassifier:
    Total samples: 53
    Max max: 1.0
    Min max: 0.15384615384615383
    Average max: 0.8854125579711414
    Stddev max: 0.1694019541487938
    Average avg: 0.8505652421867341
    Stddev avg: 0.22191102593369064
    Average stdev: 0.026263986505766817
    Stddev stdev: 0.06864744882473209
    Max max-min gap: 0.9523809523809523
         intrusion.csv.land.BinaryMLPClassifier.json
    Average max-min gap: 0.08668380475713197
    Stddev max-min gap: 0.19552783992170478
    Max max-first gap: 0.9523809523809523
         intrusion.csv.land.BinaryMLPClassifier.json
    Average max-first gap: 0.03828926786080096
    Max 0.01ofMax: 17
         intrusion.csv.srv_diff_host_rate.BinaryMLPClassifier.json
    Average 0.01ofMax: 1.5660377358490567
    Stddev 0.01ofMax: 3.516496475199013
    0 of 53 have both positive and negative scores

Similar to the prior arguments, there can be substantial variance between measures, and we need 20-30 measures to get a good value.

MulticlassDecisionTreeClassifier

MulticlassDecisionTreeClassifier:
    Total samples: 132
    Max max: 1.0
    Min max: 0.1369446222401963
    Average max: 0.6354383346905583
    Stddev max: 0.25276984943532
    Average avg: 0.614685403358619
    Stddev avg: 0.259548686245021
    Average stdev: 0.01245704989667424
    Stddev stdev: 0.022662225073119067
    Max max-min gap: 0.4746633745946614
         intrusion.csv.num_shells.MulticlassDecisionTreeClassifier.json
    Average max-min gap: 0.0416258386946806
    Stddev max-min gap: 0.07401481190525455
    Max max-first gap: 0.4746633745946614
         intrusion.csv.num_shells.MulticlassDecisionTreeClassifier.json
    Average max-first gap: 0.02189192510450059
    Max 0.01ofMax: 19
         KRK_v1.csv.black_king_rank.MulticlassDecisionTreeClassifier.json
    Average 0.01ofMax: 1.4621212121212122
    Stddev 0.01ofMax: 3.5304678301464927
    0 of 132 have both positive and negative scores

Similar to the prior arguments, there can be substantial variance between measures, and we need 20-30 measures to get a good value.

MulticlassMLPClassifier

MulticlassMLPClassifier:
    Total samples: 132
    Max max: 0.9997861232493044
    Min max: 0.12425568754818966
    Average max: 0.6228910729462037
    Stddev max: 0.2568441238766107
    Average avg: 0.595090856861174
    Stddev avg: 0.26329307085976694
    Average stdev: 0.016626135238015812
    Stddev stdev: 0.02080889679037879
    Max max-min gap: 0.40728186971216823
         expedia_hotel_logs.csv.srch_destination_type_id.MulticlassMLPClassifier.json
    Average max-min gap: 0.06087418836490585
    Stddev max-min gap: 0.0728309728800052
    Max max-first gap: 0.3550986625120539
         expedia_hotel_logs.csv.srch_destination_type_id.MulticlassMLPClassifier.json
    Average max-first gap: 0.027274826254010922
    Max 0.01ofMax: 17
         intrusion.csv.num_shells.MulticlassMLPClassifier.json
    Average 0.01ofMax: 2.75
    Stddev 0.01ofMax: 4.01404975303587
    0 of 132 have both positive and negative scores

Similar to the prior arguments, there can be substantial variance between measures, and we need 20-30 measures to get a good value.
