Commit 4688606

Feature/minepy (#6)

Authored by skadio and bkleyn
Co-authored-by: Kleynhans, Bernard <[email protected]>
Co-authored-by: Bernard Kleynhans <[email protected]>

1 parent 2e7fd9b commit 4688606
7 files changed: 134 additions & 122 deletions

7 files changed

+134
-122
lines changed

CHANGELOG.txt

Lines changed: 8 additions & 0 deletions

@@ -2,6 +2,14 @@
 CHANGELOG
 =========
 
+-------------------------------------------------------------------------------
+Feb, 09, 2022 1.1.1
+-------------------------------------------------------------------------------
+
+- Dropped [Maximal Information (MIC)](https://github.com/minepy/minepy) due to inactive backend library
+- MINEPY installation is not compatible with setuptools>=58 as noted in [this issue](https://github.com/minepy/minepy/issues/32)
+- In addition, MIC is rather slow on large datasets
+
 -------------------------------------------------------------------------------
 June, 16, 2021 1.1.0
 -------------------------------------------------------------------------------
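Migration note: a minimal sketch of moving off the dropped method, assuming the Selective / SelectionMethod API exercised by this commit's tests and README; "mutual_info" is one of the statistical methods that remains supported.

# Hypothetical migration sketch: swap the dropped "maximal_info" method
# for "mutual_info", which remains supported in 1.1.1.
from sklearn.datasets import load_iris

from feature.utils import get_data_label
from feature.selector import Selective, SelectionMethod

data, label = get_data_label(load_iris())

# Before (<= 1.1.0): SelectionMethod.Statistical(num_features=2, method="maximal_info")
method = SelectionMethod.Statistical(num_features=2, method="mutual_info")
selector = Selective(method)
selector.fit(data, label)
subset = selector.transform(data)
print("Scores:", list(selector.get_absolute_scores()))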

README.md

Lines changed: 1 addition & 1 deletion

@@ -41,7 +41,7 @@ print("Scores:", list(selector.get_absolute_scores()))
 | :---------------: | :-----: |
 | [Variance per Feature](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) | `threshold` |
 | [Correlation pairwise Features](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) | [Pearson Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) <br> [Kendall Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient) <br> [Spearman's Rank Correlation Coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient) <br> |
-| [Statistical Analysis](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) | [ANOVA F-test Classification](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html) <br> [F-value Regression](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html) <br> [Chi-Square](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) <br> [Mutual Information Classification](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) <br> [Maximal Information (MIC)](https://github.com/minepy/minepy) <br> [Variance Inflation Factor](https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html) |
+| [Statistical Analysis](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection) | [ANOVA F-test Classification](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html) <br> [F-value Regression](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html) <br> [Chi-Square](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html) <br> [Mutual Information Classification](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html) <br> [Variance Inflation Factor](https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html) |
 | [Linear Methods](https://en.wikipedia.org/wiki/Linear_regression) | [Linear Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression) <br> [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegression) <br> [Lasso Regularization](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso) <br> [Ridge Regularization](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge) <br> |
 | [Tree-based Methods](https://scikit-learn.org/stable/modules/tree.html) | [Decision Tree](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier) <br> [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier) <br> [Extra Trees Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html) <br> [XGBoost](https://xgboost.readthedocs.io/en/latest/) <br> [LightGBM](https://lightgbm.readthedocs.io/en/latest/) <br> [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) <br> [CatBoost](https://github.com/catboost)<br> [Gradient Boosting Tree](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) <br> |

feature/_version.py

Lines changed: 1 addition & 1 deletion

@@ -2,4 +2,4 @@
 # Copyright FMR LLC <[email protected]>
 # SPDX-License-Identifier: GNU GPLv3
 
-__version__ = "1.1.0"
+__version__ = "1.1.1"

feature/selector.py

Lines changed: 4 additions & 4 deletions

@@ -202,7 +202,8 @@ class Statistical(NamedTuple):
     the results might be sensitive to exact bin selection.
 
     Maximal information score (MIC) tries to address these gaps by
-    searching for the optimal binnign strategy.
+    searching for the optimal binning strategy.
+    Note: MIC is dropped from Selective due to inactive MINE library
 
     Notes on Randomness:
     - Mutual Info is non-deterministic, depends on the seed value.
@@ -218,7 +219,6 @@ class Statistical(NamedTuple):
             * anova: Anova and Anova F-test (default)
            * chi_square: Chi-Square
            * mutual_info: Mutual Information score
-            * maximal_info: Maximal Information score (MIC)
            * variance_inflation: Variance Inflation factor (VIF)
     """
     num_features: Num = 0.0
@@ -229,8 +229,8 @@ def _validate(self):
         check_true(self.num_features > 0, ValueError("Num features must be greater than zero."))
         if isinstance(self.num_features, float):
             check_true(self.num_features <= 1, ValueError("Num features ratio must be between [0..1]."))
-        check_true(self.method in ["anova", "chi_square", "mutual_info", "maximal_info", "variance_inflation"],
-                   ValueError("Statistical method can only be anova, chi_square, mutual_info, or maximal_info."))
+        check_true(self.method in ["anova", "chi_square", "mutual_info", "variance_inflation"],  # "maximal_info" dropped
+                   ValueError("Statistical method can only be anova, chi_square, or mutual_info."))
 
 class TreeBased(NamedTuple):
     """

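Behavior note: after this change, constructing a Statistical method with "maximal_info" fails the check_true guard above. A small sketch of the expected failure, assuming _validate runs when Selective consumes the method, as the diff implies:

# Hedged sketch of the post-change validation behavior (assumes Selective
# triggers SelectionMethod.Statistical._validate(); not part of this diff).
from feature.selector import Selective, SelectionMethod

try:
    method = SelectionMethod.Statistical(num_features=3, method="maximal_info")
    Selective(method)  # "maximal_info" is no longer in the allowed list
except ValueError as error:
    print(error)  # -> Statistical method can only be anova, chi_square, or mutual_info.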
feature/statistical.py

Lines changed: 13 additions & 11 deletions

@@ -5,7 +5,7 @@
 from functools import partial
 from typing import NoReturn, Tuple
 
-from minepy import MINE
+# from minepy import MINE (dropped)
 import numpy as np
 import pandas as pd
 from sklearn.feature_selection import chi2, f_classif, f_regression, mutual_info_classif, mutual_info_regression
@@ -36,11 +36,11 @@ def __init__(self, seed: int, num_features: Num, method: str):
         self.factory = {"regression_anova": f_regression,
                         "regression_chi_square": None,
                         "regression_mutual_info": partial(mutual_info_regression, random_state=self.seed),
-                        "regression_maximal_info": MINE(),
+                        # "regression_maximal_info": MINE(),  # dropped
                         "classification_anova": f_classif,
                         "classification_chi_square": chi2,
                         "classification_mutual_info": partial(mutual_info_classif, random_state=self.seed),
-                        "classification_maximal_info": MINE(),
+                        # "classification_maximal_info": MINE(),  # dropped
                         "unsupervised_variance_inflation": variance_inflation_factor}
 
     def get_model_args(self, selection_method) -> Tuple:
@@ -62,7 +62,7 @@ def dispatch_model(self, labels: pd.Series, *args):
         # Check scoring compatibility with task
         if score_func is None:
             raise TypeError(method + " cannot be used for task: " + get_task_string(labels))
-        elif isinstance(score_func, MINE) or method == "variance_inflation":
+        elif method == "variance_inflation":  # or isinstance(score_func, MINE) (dropped)
             self.imp = score_func
         else:
             # Set sklearn model selector based on scoring function
@@ -71,13 +71,15 @@ def dispatch_model(self, labels: pd.Series, *args):
     def fit(self, data: pd.DataFrame, labels: pd.Series) -> NoReturn:
 
         # Calculate absolute scores depending on the method
-        if isinstance(self.imp, MINE):
-            self.abs_scores = []
-            for col in data.columns:
-                self.imp.compute_score(data[col], labels)
-                score = self.imp.mic()
-                self.abs_scores.append(score)
-        elif self.method == "variance_inflation":
+
+        # NOTE: mine is dropped
+        # if isinstance(self.imp, MINE):
+        #     self.abs_scores = []
+        #     for col in data.columns:
+        #         self.imp.compute_score(data[col], labels)
+        #         score = self.imp.mic()
+        #         self.abs_scores.append(score)
+        if self.method == "variance_inflation":
             # VIF is unsupervised, regression between data and each feature
             self.abs_scores = np.array([variance_inflation_factor(data.values, i) for i in range(data.shape[1])])
         else:
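Workaround note: users who still need MIC can install minepy separately in an environment pinned to setuptools<58 (per the issue linked in the CHANGELOG) and score features outside Selective. A sketch with a hypothetical helper, mirroring the loop removed from fit() above and using the same MINE calls:

# Standalone MIC scoring, mirroring the removed fit() loop; assumes minepy
# is installed with setuptools<58 (https://github.com/minepy/minepy/issues/32).
from typing import List

import pandas as pd
from minepy import MINE

def mic_scores(data: pd.DataFrame, labels: pd.Series) -> List[float]:
    mine = MINE()
    scores = []
    for col in data.columns:
        mine.compute_score(data[col], labels)  # fit MINE on (feature, label)
        scores.append(mine.mic())              # maximal information coefficient
    return scores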

requirements.txt

Lines changed: 0 additions & 1 deletion

@@ -1,7 +1,6 @@
 catboost
 joblib
 lightgbm
-minepy
 numpy
 pandas
 scikit-learn

tests/test_stat_maximal.py

Lines changed: 107 additions & 104 deletions

@@ -2,112 +2,115 @@
 # Copyright FMR LLC <[email protected]>
 # SPDX-License-Identifier: GNU GPLv3
 
-from sklearn.datasets import load_boston, load_iris
-from feature.utils import get_data_label
-from feature.selector import Selective, SelectionMethod
+# from sklearn.datasets import load_boston, load_iris
+# from feature.utils import get_data_label
+# from feature.selector import Selective, SelectionMethod
 from tests.test_base import BaseTest
 
 
 class TestMaximalInfo(BaseTest):
 
-    def test_maximal_regress_top_k(self):
-        data, label = get_data_label(load_boston())
-        data = data.drop(columns=["CHAS", "NOX", "RM", "DIS", "RAD", "TAX", "PTRATIO", "INDUS"])
-
-        method = SelectionMethod.Statistical(num_features=3, method="maximal_info")
-        selector = Selective(method)
-        selector.fit(data, label)
-        subset = selector.transform(data)
-
-        # Reduced columns
-        self.assertEqual(subset.shape[1], 3)
-        self.assertListEqual(list(subset.columns), ['CRIM', 'AGE', 'LSTAT'])
-
-    def test_maximal_regress_top_percentile(self):
-        data, label = get_data_label(load_boston())
-        data = data.drop(columns=["CHAS", "NOX", "RM", "DIS", "RAD", "TAX", "PTRATIO", "INDUS"])
-
-        method = SelectionMethod.Statistical(num_features=0.6, method="maximal_info")
-        selector = Selective(method)
-        selector.fit(data, label)
-        subset = selector.transform(data)
-
-        # Reduced columns
-        self.assertEqual(subset.shape[1], 3)
-        self.assertListEqual(list(subset.columns), ['CRIM', 'AGE', 'LSTAT'])
-
-    def test_maximal_regress_top_k_all(self):
-        data, label = get_data_label(load_boston())
-        data = data.drop(columns=["CHAS", "NOX", "RM", "DIS", "RAD", "TAX", "PTRATIO", "INDUS"])
-
-        method = SelectionMethod.Statistical(num_features=5, method="maximal_info")
-        selector = Selective(method)
-        selector.fit(data, label)
-        subset = selector.transform(data)
-
-        # Reduced columns
-        self.assertEqual(data.shape[1], subset.shape[1])
-        self.assertListEqual(list(data.columns), list(subset.columns))
-
-    def test_maximal_regress_top_percentile_all(self):
-        data, label = get_data_label(load_boston())
-        data = data.drop(columns=["CHAS", "NOX", "RM", "DIS", "RAD", "TAX", "PTRATIO", "INDUS"])
-
-        method = SelectionMethod.Statistical(num_features=1.0, method="maximal_info")
-        selector = Selective(method)
-        selector.fit(data, label)
-        subset = selector.transform(data)
-
-        # Reduced columns
-        self.assertEqual(data.shape[1], subset.shape[1])
-        self.assertListEqual(list(data.columns), list(subset.columns))
-
-    def test_maximal_classif_top_k(self):
-        data, label = get_data_label(load_iris())
-
-        method = SelectionMethod.Statistical(num_features=2, method="maximal_info")
-        selector = Selective(method)
-        selector.fit(data, label)
-        subset = selector.transform(data)
-
-        # Reduced columns
-        self.assertEqual(subset.shape[1], 2)
-        self.assertListEqual(list(subset.columns), ['petal length (cm)', 'petal width (cm)'])
-
-    def test_maximal_classif_top_percentile(self):
-        data, label = get_data_label(load_iris())
-
-        method = SelectionMethod.Statistical(num_features=0.5, method="maximal_info")
-        selector = Selective(method)
-        selector.fit(data, label)
-        subset = selector.transform(data)
-
-        # Reduced columns
-        self.assertEqual(subset.shape[1], 2)
-        self.assertListEqual(list(subset.columns), ['petal length (cm)', 'petal width (cm)'])
-
-    def test_maximal_classif_top_percentile_all(self):
-        data, label = get_data_label(load_iris())
-
-        method = SelectionMethod.Statistical(num_features=1.0, method="maximal_info")
-        selector = Selective(method)
-        selector.fit(data, label)
-        subset = selector.transform(data)
-
-        # Reduced columns
-        self.assertEqual(subset.shape[1], 4)
-        self.assertListEqual(list(subset.columns),
-                             ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'])
-
-    def test_maximal_classif_top_k_all(self):
-        data, label = get_data_label(load_iris())
-
-        method = SelectionMethod.Statistical(num_features=4, method="maximal_info")
-        selector = Selective(method)
-        selector.fit(data, label)
-        subset = selector.transform(data)
-
-        # Reduced columns
-        self.assertEqual(subset.shape[1], 4)
-        self.assertListEqual(list(subset.columns),
-                             ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'])
+    def test_maximal(self):
+        pass
+
+    # def test_maximal_regress_top_k(self):
+    #     data, label = get_data_label(load_boston())
+    #     data = data.drop(columns=["CHAS", "NOX", "RM", "DIS", "RAD", "TAX", "PTRATIO", "INDUS"])
+    #
+    #     method = SelectionMethod.Statistical(num_features=3, method="maximal_info")
+    #     selector = Selective(method)
+    #     selector.fit(data, label)
+    #     subset = selector.transform(data)
+    #
+    #     # Reduced columns
+    #     self.assertEqual(subset.shape[1], 3)
+    #     self.assertListEqual(list(subset.columns), ['CRIM', 'AGE', 'LSTAT'])
+    #
+    # def test_maximal_regress_top_percentile(self):
+    #     data, label = get_data_label(load_boston())
+    #     data = data.drop(columns=["CHAS", "NOX", "RM", "DIS", "RAD", "TAX", "PTRATIO", "INDUS"])
+    #
+    #     method = SelectionMethod.Statistical(num_features=0.6, method="maximal_info")
+    #     selector = Selective(method)
+    #     selector.fit(data, label)
+    #     subset = selector.transform(data)
+    #
+    #     # Reduced columns
+    #     self.assertEqual(subset.shape[1], 3)
+    #     self.assertListEqual(list(subset.columns), ['CRIM', 'AGE', 'LSTAT'])
+    #
+    # def test_maximal_regress_top_k_all(self):
+    #     data, label = get_data_label(load_boston())
+    #     data = data.drop(columns=["CHAS", "NOX", "RM", "DIS", "RAD", "TAX", "PTRATIO", "INDUS"])
+    #
+    #     method = SelectionMethod.Statistical(num_features=5, method="maximal_info")
+    #     selector = Selective(method)
+    #     selector.fit(data, label)
+    #     subset = selector.transform(data)
+    #
+    #     # Reduced columns
+    #     self.assertEqual(data.shape[1], subset.shape[1])
+    #     self.assertListEqual(list(data.columns), list(subset.columns))
+    #
+    # def test_maximal_regress_top_percentile_all(self):
+    #     data, label = get_data_label(load_boston())
+    #     data = data.drop(columns=["CHAS", "NOX", "RM", "DIS", "RAD", "TAX", "PTRATIO", "INDUS"])
+    #
+    #     method = SelectionMethod.Statistical(num_features=1.0, method="maximal_info")
+    #     selector = Selective(method)
+    #     selector.fit(data, label)
+    #     subset = selector.transform(data)
+    #
+    #     # Reduced columns
+    #     self.assertEqual(data.shape[1], subset.shape[1])
+    #     self.assertListEqual(list(data.columns), list(subset.columns))
+    #
+    # def test_maximal_classif_top_k(self):
+    #     data, label = get_data_label(load_iris())
+    #
+    #     method = SelectionMethod.Statistical(num_features=2, method="maximal_info")
+    #     selector = Selective(method)
+    #     selector.fit(data, label)
+    #     subset = selector.transform(data)
+    #
+    #     # Reduced columns
+    #     self.assertEqual(subset.shape[1], 2)
+    #     self.assertListEqual(list(subset.columns), ['petal length (cm)', 'petal width (cm)'])
+    #
+    # def test_maximal_classif_top_percentile(self):
+    #     data, label = get_data_label(load_iris())
+    #
+    #     method = SelectionMethod.Statistical(num_features=0.5, method="maximal_info")
+    #     selector = Selective(method)
+    #     selector.fit(data, label)
+    #     subset = selector.transform(data)
+    #
+    #     # Reduced columns
+    #     self.assertEqual(subset.shape[1], 2)
+    #     self.assertListEqual(list(subset.columns), ['petal length (cm)', 'petal width (cm)'])
+    #
+    # def test_maximal_classif_top_percentile_all(self):
+    #     data, label = get_data_label(load_iris())
+    #
+    #     method = SelectionMethod.Statistical(num_features=1.0, method="maximal_info")
+    #     selector = Selective(method)
+    #     selector.fit(data, label)
+    #     subset = selector.transform(data)
+    #
+    #     # Reduced columns
+    #     self.assertEqual(subset.shape[1], 4)
+    #     self.assertListEqual(list(subset.columns),
+    #                          ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'])
+    #
+    # def test_maximal_classif_top_k_all(self):
+    #     data, label = get_data_label(load_iris())
+    #
+    #     method = SelectionMethod.Statistical(num_features=4, method="maximal_info")
+    #     selector = Selective(method)
+    #     selector.fit(data, label)
+    #     subset = selector.transform(data)
+    #
+    #     # Reduced columns
+    #     self.assertEqual(subset.shape[1], 4)
+    #     self.assertListEqual(list(subset.columns),
+    #                          ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'])
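Design note: instead of a pass placeholder plus a commented-out body, the standard-library unittest.skip decorator would keep the disabled tests visible in reports and record the reason. A sketch (the decorator is plain unittest, not part of this commit; the method body stays as in 1.1.0):

# Hypothetical alternative: mark the disabled tests as skipped rather than
# commenting them out, so test reports still list them with a reason.
import unittest

from tests.test_base import BaseTest

class TestMaximalInfo(BaseTest):

    @unittest.skip("maximal_info dropped in 1.1.1: minepy backend is "
                   "inactive and incompatible with setuptools>=58")
    def test_maximal_regress_top_k(self):
        ...  # original test body unchanged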
