Commit af96f79

Merge pull request #517 from Kimoby/refactor-hp-repo-automl
Refactor of the AutoML Loop, Trainer, and HyperparamsRepos
2 parents: c5627e6 + 98d68f7

110 files changed: +10389 -11207 lines changed

.github/pull_request_template.md (+1 -1)

@@ -45,7 +45,7 @@ Things to check each time you contribute:
 - [ ] Your local Git username is set to your GitHub username, and your local Git email is set to your GitHub email. This is important to avoid breaking the cla-bot and for your contributions to be linked to your profile. More info: https://github.com/settings/emails
 - [ ] Argument's dimensions and types are specified for new steps (important), with examples in docstrings when needed.
 - [ ] Class names and argument / API variables are very clear: there is no possible ambiguity. They also respect the existing code style (avoid duplicating words for the same concept) and are intuitive.
-- [ ] Use typing like `variable: Typing = ...` as much as possible. Also use typing for function arguments and return values like `def my_func(self, my_list: Dict[int, List[str]]) -> OrderedDict[int, str]:`.
+- [ ] Use typing like `variable: Typing = ...` as much as possible. Also use typing for function arguments and return values like `def my_func(self, my_list: Dict[int, List[str]]) -> 'OrderedDict[int, str]':`.
 - [ ] Classes are documented: their behavior is explained beyond just the title of the class. You may even use the description written in your pull request above to fill some docstrings accurately.
 - [ ] If a numpy array is used, it is important to remember that these arrays are a special type that must be documented accordingly, and that numpy array should not be abused. This is because Neuraxle is a library that is not only limited to transforming numpy arrays. To this effect, numpy steps should probably be located in the existing numpy python files as much as possible, and not be all over the place. The same applies to Pandas DataFrames.
 - [ ] Code coverage is above 90% for the added code for the unit tests.
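The changed checklist line quotes the example's return annotation. That detail matters on older Pythons: `collections.OrderedDict` only supports subscripting like `OrderedDict[int, str]` at runtime from Python 3.9 on (PEP 585), so writing the annotation as a string defers its evaluation. A minimal sketch of the convention (function name and body are illustrative, not from the repository):

    from collections import OrderedDict
    from typing import Dict, List

    def join_values(my_list: Dict[int, List[str]]) -> 'OrderedDict[int, str]':
        # The quoted annotation is never evaluated at runtime, so this also
        # works on Pythons where OrderedDict[int, str] would raise a TypeError.
        return OrderedDict((k, ', '.join(v)) for k, v in sorted(my_list.items()))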

.github/workflows/license_checker_v2.py (-1)

@@ -63,7 +63,6 @@ def is_license_in_list(license, license_list):
     library_license_dict[library_name] = library_license
     print(f"{library_name}: {library_license}")
     # First checks if its refused_licenses, then if its in accepted_licenses, else add in the maybe list
-    # TODO : Should use regex instead?

     if is_license_in_list(library_license, args.forbidden_licenses):
         refused_libraries.append(library_name)
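The deleted TODO asked whether `is_license_in_list` should match with regex. For reference only, a hypothetical regex-based variant (the function's real body is not shown in this diff) could look like:

    import re

    def is_license_in_list(license: str, license_list: list) -> bool:
        # Hypothetical: treat each list entry as a pattern and match it
        # case-insensitively anywhere in the license string.
        return any(re.search(pattern, license, re.IGNORECASE)
                   for pattern in license_list)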

.gitignore (+8 -1)

@@ -46,6 +46,7 @@ coverage.xml
 *.cover
 .hypothesis/
 .pytest_cache/
+prof/

 # Translations
 *.mo
@@ -82,6 +83,7 @@ celerybeat-schedule
 *.sage.py

 # Environments
+venv
 .env
 .venv
 env/
@@ -106,11 +108,15 @@ venv.bak/
 # IDEs
 .idea
 .vscode
+.style.yapf
 vandelay-py.js
+appmap.yml
+tmp

 # Other
+.DS_Store
 ___*
-
+todo.txt
 **cache/**
 **caching/**
 cache/**
@@ -119,4 +125,5 @@ testing/examples/cache/**
 testing/cache/**
 testing/cache/*
 cov.xml
+profile.sh

README.rst (+4 -6)

@@ -61,7 +61,7 @@ For example, you can build a time series processing pipeline as such:
 .. code:: python

     p = Pipeline([
-        TrainOnly(DataShuffler()),
+        TrainOnlyWrapper(DataShuffler()),
         WindowTimeSeries(),
         MiniBatchSequentialPipeline([
             Tensorflow2ModelStep(
@@ -113,7 +113,7 @@ You can also tune your hyperparameters using AutoML algorithms such as the TPE:
             use_linear_forgetting_weights=False,
             number_recent_trial_at_full_weights=25
         ),
-        validation_splitter=ValidationSplitter(test_size=0.20),
+        validation_splitter=ValidationSplitter(validation_size=0.20),
         scoring_callback=ScoringCallback(accuracy_score, higher_score_is_better=True),
         callbacks=[
             MetricCallback(f1_score, higher_score_is_better=True),
@@ -122,17 +122,15 @@ You can also tune your hyperparameters using AutoML algorithms such as the TPE:
         ],
         n_trials=7,
         epochs=10,
-        hyperparams_repository=HyperparamsJSONRepository(cache_folder='cache'),
-        refit_trial=True,
+        refit_best_trial=True,
     )

     # Load data, and launch AutoML loop !
     X_train, y_train, X_test, y_test = generate_classification_data()
     auto_ml = auto_ml.fit(X_train, y_train)

     # Get the model from the best trial, and make predictions using predict.
-    best_pipeline = auto_ml.get_best_model()
-    y_pred = best_pipeline.predict(X_test)
+    y_pred = auto_ml.predict(X_test)


 --------------
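Taken together, these hunks move the README's AutoML example to the refactored API: `ValidationSplitter` takes `validation_size` instead of `test_size`, the explicit `HyperparamsJSONRepository` argument is dropped, `refit_trial` becomes `refit_best_trial`, and predictions come straight from the fitted loop. A condensed sketch of the new flow (pipeline and callback setup elided):

    auto_ml = AutoML(
        pipeline=pipeline,
        validation_splitter=ValidationSplitter(validation_size=0.20),
        scoring_callback=ScoringCallback(accuracy_score, higher_score_is_better=True),
        n_trials=7,
        epochs=10,
        refit_best_trial=True,  # refit the best trial so predict() can be called directly
    )
    auto_ml = auto_ml.fit(X_train, y_train)
    y_pred = auto_ml.predict(X_test)  # replaces get_best_model().predict(X_test)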

coverage.sh (+2 -1)

@@ -1,4 +1,5 @@
 #!/usr/bin/env bash
 ./flake8.sh
-pytest --cov-report html --cov-report xml:cov.xml --cov=neuraxle testing
+pytest -n 7 --cov-report html --cov-report xml:cov.xml --cov=neuraxle testing
+# pytest --cov-report html --cov=neuraxle testing; open htmlcov/index.html
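Note that the new `-n 7` flag comes from the pytest-xdist plugin and splits the test run across 7 worker processes; pytest-xdist therefore needs to be installed (e.g. `pip install pytest-xdist`) for the updated script to run.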

examples/Handler Methods.ipynb (+13 -20)

@@ -54,7 +54,6 @@
 "* Edit the [DataContainer](https://www.neuraxle.org/stable/api/neuraxle.data_container.html#neuraxle.data_container.DataContainer)\n",
 "* Call a method on a step\n",
 "* Mini-Batching (see [MiniBatchSequentialPipeline](https://www.neuraxle.org/stable/api/neuraxle.pipeline.html#neuraxle.pipeline.MiniBatchSequentialPipeline))\n",
-"* Caching (see [neuraxle.checkpoint](https://www.neuraxle.org/stable/api/neuraxle.checkpoints.html) package)\n",
 "* etc.\n",
 "\n",
 "### [HandleOnlyMixin](https://www.neuraxle.org/stable/api/neuraxle.base.html#neuraxle.base.HandleOnlyMixin)\n",
@@ -97,14 +96,13 @@
 "def _transform_data_container(self, data_container: DataContainer, context: ExecutionContext) -> DataContainer:\n",
 "    output_data_container: ListDataContainer = ListDataContainer.empty()\n",
 "\n",
-"    for current_id, di, eo in data_container:\n",
+"    for _id, di, eo in data_container:\n",
 "        output: DataContainer = self.wrapped.handle_transform(\n",
-"            DataContainer(summary_id=data_container.summary_id, current_ids=None, data_inputs=di, expected_outputs=eo),\n",
+"            DataContainer(data_inputs=di, expected_outputs=eo),\n",
 "            context\n",
 "        )\n",
 "\n",
-"        output_data_container.append(current_id, output.data_inputs, output.expected_outputs)\n",
-"        output_data_container.summary_id = data_container.summary_id\n",
+"        output_data_container.append(_id, output.data_inputs, output.expected_outputs)\n",
 "\n",
 "    return output_data_container"
 ]
@@ -160,10 +158,10 @@
 "\n",
 "\n",
 "class OutputTransformerWrapper(ForceHandleOnlyMixin, MetaStepMixin, BaseStep):\n",
-"    def __init__(self, wrapped, cache_folder_when_no_handle=None):\n",
+"    def __init__(self, wrapped):\n",
 "        BaseStep.__init__(self)\n",
 "        MetaStepMixin.__init__(self, wrapped)\n",
-"        ForceHandleOnlyMixin.__init__(self, cache_folder_when_no_handle)"
+"        ForceHandleOnlyMixin.__init__(self)"
 ]
 },
 {
@@ -185,8 +183,7 @@
 "    new_expected_outputs_data_container = self.wrapped.handle_transform(\n",
 "        DataContainer(\n",
 "            data_inputs=data_container.expected_outputs, \n",
-"            current_ids=data_container.current_ids, \n",
-"            expected_outputs=None\n",
+"            ids=data_container.ids, \n",
 "        ), \n",
 "        context\n",
 "    )\n",
@@ -214,8 +211,7 @@
 "    self.wrapped = self.wrapped.handle_fit(\n",
 "        DataContainer(\n",
 "            data_inputs=data_container.expected_outputs, \n",
-"            current_ids=data_container.current_ids, \n",
-"            expected_outputs=None),\n",
+"            ids=data_container.ids),\n",
 "        context\n",
 "    )\n",
 "\n",
@@ -242,8 +238,7 @@
 "    self.wrapped, new_expected_outputs_data_container = self.wrapped.handle_fit_transform(\n",
 "        DataContainer(\n",
 "            data_inputs=data_container.expected_outputs, \n",
-"            current_ids=data_container.current_ids,\n",
-"            expected_outputs=None\n",
+"            ids=data_container.ids\n",
 "        ),\n",
 "        context\n",
 "    )\n",
@@ -270,15 +265,14 @@
 "\n",
 "\n",
 "class OutputTransformerWrapper(ForceHandleOnlyMixin, MetaStepMixin, BaseStep):\n",
-"    def __init__(self, wrapped, cache_folder_when_no_handle=None):\n",
+"    def __init__(self, wrapped):\n",
 "        BaseStep.__init__(self)\n",
 "        MetaStepMixin.__init__(self, wrapped)\n",
-"        ForceHandleOnlyMixin.__init__(self, cache_folder_when_no_handle)\n",
+"        ForceHandleOnlyMixin.__init__(self)\n",
 "\n",
 "    def _transform_data_container(self, data_container: DataContainer, context: ExecutionContext) -> DataContainer:\n",
 "        new_expected_outputs_data_container = self.wrapped.handle_transform(\n",
-"            DataContainer(data_inputs=data_container.expected_outputs, current_ids=data_container.current_ids,\n",
-"                          expected_outputs=None),\n",
+"            DataContainer(data_inputs=data_container.expected_outputs, ids=data_container.ids),\n",
 "            context\n",
 "        )\n",
 "        data_container.set_expected_outputs(new_expected_outputs_data_container.data_inputs)\n",
@@ -287,16 +281,15 @@
 "\n",
 "    def _fit_data_container(self, data_container: DataContainer, context: ExecutionContext) -> (BaseStep, DataContainer):\n",
 "        self.wrapped = self.wrapped.handle_fit(\n",
-"            DataContainer(data_inputs=data_container.expected_outputs, current_ids=data_container.current_ids,\n",
-"                          expected_outputs=None),\n",
+"            DataContainer(data_inputs=data_container.expected_outputs, ids=data_container.ids),\n",
 "            context\n",
 "        )\n",
 "\n",
 "        return self, data_container\n",
 "\n",
 "    def _fit_transform_data_container(self, data_container: DataContainer, context: ExecutionContext) -> (BaseStep, DataContainer):\n",
 "        self.wrapped, new_expected_outputs_data_container = self.wrapped.handle_fit_transform(\n",
-"            DataContainer(data_inputs=data_container.expected_outputs, current_ids=data_container.current_ids, expected_outputs=None),\n",
+"            DataContainer(data_inputs=data_container.expected_outputs, ids=data_container.ids),\n",
 "            context\n",
 "        )\n",
 "        data_container.set_expected_outputs(new_expected_outputs_data_container.data_inputs)\n",

examples/Hyperparams And Distributions.ipynb (-1)

@@ -305,7 +305,6 @@
 "\n",
 "hd = ScipyDistributionWrapper(\n",
 "    scipy_distribution=randint(low=0, high=10),\n",
-"    is_continuous=False,\n",
 "    null_default_value=0\n",
 ")"
 ]
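After this removal, `ScipyDistributionWrapper` no longer takes `is_continuous` even for a discrete distribution like `randint`; presumably the wrapper now infers this from the wrapped scipy object. The resulting call, as the notebook now reads (import paths assumed, since they are not shown in this hunk):

    from scipy.stats import randint
    from neuraxle.hyperparams.scipy_distributions import ScipyDistributionWrapper  # path assumed

    hd = ScipyDistributionWrapper(
        scipy_distribution=randint(low=0, high=10),
        null_default_value=0
    )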

examples/Introduction to Automatic Hyperparameter Tuning.ipynb (+9 -9)

(Large diff not rendered by default.)

examples/Introduction to Time Series Processing.ipynb (+3 -6)

@@ -657,22 +657,19 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"from neuraxle.metaopt.auto_ml import AutoML, InMemoryHyperparamsRepository, ValidationSplitter, \\\n",
-"    RandomSearchHyperparameterSelectionStrategy\n",
-"#from neuraxle.metaopt.tpe import TreeParzenEstimatorSelectionStrategy\n",
-"#from neuraxle.metaopt.auto_ml import HyperparamsJSONRepository\n",
+"from neuraxle.metaopt.auto_ml import AutoML, ValidationSplitter\n",
+"from neuraxle.metaopt.validation import RandomSearchSampler\n",
 "from neuraxle.metaopt.callbacks import ScoringCallback\n",
 "from sklearn.metrics import accuracy_score\n",
 "\n",
 "\n",
 "auto_ml = AutoML(\n",
 "    pipeline=pipeline,\n",
-"    hyperparams_optimizer=RandomSearchHyperparameterSelectionStrategy(),\n",
+"    hyperparams_optimizer=RandomSearchSampler(),\n",
 "    validation_splitter=ValidationSplitter(test_size=0.20),\n",
 "    scoring_callback=ScoringCallback(accuracy_score, higher_score_is_better=True),\n",
 "    n_trials=10,\n",
 "    epochs=1,\n",
-"    hyperparams_repository=InMemoryHyperparamsRepository(cache_folder=cache_folder),\n",
 "    refit_trial=True,\n",
 "    # callbacks=[MetricCallbacks(...)]\n",
 ")"

examples/auto_ml/plot_automl_loop_clean_kata.py (+24 -26)

@@ -5,7 +5,7 @@
 This demonstrates how you can build an AutoML loop that finds the best possible sklearn classifier.
 It also shows you how to add hyperparams to sklearn steps using SKLearnWrapper.
 This example has been derived and simplified from the following repository: https://github.com/Neuraxio/Kata-Clean-Machine-Learning-From-Dirty-Code
-Here, 2D data is fitted, whereas in the original example 3D (sequential / time series) data is preprocessed and then fitted with the same models. 
+Here, 2D data is fitted, whereas in the original example 3D (sequential / time series) data is preprocessed and then fitted with the same models.

 ..
     Copyright 2019, Neuraxio Inc.
@@ -25,28 +25,29 @@
 """
 import shutil

-from sklearn.datasets import make_classification
-from sklearn.ensemble import RandomForestClassifier
-from sklearn.linear_model import RidgeClassifier, LogisticRegression
-from sklearn.metrics import accuracy_score
-from sklearn.model_selection import train_test_split
-from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
-
-from neuraxle.hyperparams.distributions import Choice, RandInt, Boolean, LogUniform
+from neuraxle.base import ExecutionContext as CX
+from neuraxle.hyperparams.distributions import (Boolean, Choice, LogUniform,
+                                                RandInt)
 from neuraxle.hyperparams.space import HyperparameterSpace
-from neuraxle.metaopt.auto_ml import AutoML, RandomSearchHyperparameterSelectionStrategy, ValidationSplitter, \
-    HyperparamsJSONRepository
+from neuraxle.metaopt.auto_ml import (AutoML, RandomSearchSampler,
+                                      ValidationSplitter)
 from neuraxle.metaopt.callbacks import ScoringCallback
+from neuraxle.metaopt.data.json_repo import HyperparamsOnDiskRepository
 from neuraxle.pipeline import Pipeline
 from neuraxle.steps.flow import ChooseOneStepOf
 from neuraxle.steps.numpy import NumpyRavel
 from neuraxle.steps.output_handlers import OutputTransformerWrapper
 from neuraxle.steps.sklearn import SKLearnWrapper
+from sklearn.datasets import make_classification
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.linear_model import LogisticRegression, RidgeClassifier
+from sklearn.metrics import accuracy_score
+from sklearn.model_selection import train_test_split
+from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier


-def main():
+def main(tmpdir: str):
     # Define classification models, and hyperparams.
-    # See also HyperparameterSpace documentation : https://www.neuraxle.org/stable/api/neuraxle.hyperparams.space.html#neuraxle.hyperparams.space.HyperparameterSpace

     decision_tree_classifier = SKLearnWrapper(
         DecisionTreeClassifier(),
@@ -97,7 +98,7 @@ def main():
     ]).set_name('RandomForestClassifier')

     # Define a classification pipeline that lets the AutoML loop choose one of the classifier.
-    # See also ChooseOneStepOf documentation : https://www.neuraxle.org/stable/api/neuraxle.steps.flow.html#neuraxle.steps.flow.ChooseOneStepOf
+    # See also ChooseOneStepOf documentation: https://www.neuraxle.org/stable/api/neuraxle.steps.flow.html#neuraxle.steps.flow.ChooseOneStepOf

     pipeline = Pipeline([
         ChooseOneStepOf([
@@ -110,17 +111,17 @@ def main():
     ])

     # Create the AutoML loop object.
-    # See also AutoML documentation : https://www.neuraxle.org/stable/api/neuraxle.metaopt.auto_ml.html#neuraxle.metaopt.auto_ml.AutoML
+    # See also AutoML documentation: https://www.neuraxle.org/stable/api/neuraxle.metaopt.auto_ml.html#neuraxle.metaopt.auto_ml.AutoML

     auto_ml = AutoML(
         pipeline=pipeline,
-        hyperparams_optimizer=RandomSearchHyperparameterSelectionStrategy(),
-        validation_splitter=ValidationSplitter(test_size=0.20),
+        hyperparams_optimizer=RandomSearchSampler(),
+        validation_splitter=ValidationSplitter(validation_size=0.20),
         scoring_callback=ScoringCallback(accuracy_score, higher_score_is_better=True),
         n_trials=7,
         epochs=1,
-        hyperparams_repository=HyperparamsJSONRepository(cache_folder='cache'),
-        refit_trial=True,
+        hyperparams_repository=HyperparamsOnDiskRepository(cache_folder=tmpdir),
+        refit_best_trial=True,
         continue_loop_on_error=False
     )

@@ -129,16 +130,13 @@ def main():
     X_train, y_train, X_test, y_test = generate_classification_data()
     auto_ml = auto_ml.fit(X_train, y_train)

-    # Get the model from the best trial, and make predictions using predict.
-    # See also predict documentation : https://www.neuraxle.org/stable/api/neuraxle.base.html#neuraxle.base.BaseStep.predict
-
-    best_pipeline = auto_ml.get_best_model()
-    y_pred = best_pipeline.predict(X_test)
+    # Get the model from the best trial, and make predictions using predict, as per the `refit_best_trial=True` argument to AutoML.
+    y_pred = auto_ml.predict(X_test)

     accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
     print("Test accuracy score:", accuracy)

-    shutil.rmtree('cache')
+    shutil.rmtree(tmpdir)


 def generate_classification_data():
@@ -163,4 +161,4 @@ def generate_classification_data():


 if __name__ == '__main__':
-    main()
+    main(CX.get_new_cache_folder())
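Since `main()` now receives its cache folder and cleans it up itself with `shutil.rmtree(tmpdir)`, any scratch directory works in place of `CX.get_new_cache_folder()`. For example, using only the standard library (an alternative, not from this commit):

    import tempfile

    # mkdtemp() creates the directory; main() deletes it when it finishes.
    main(tempfile.mkdtemp())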
