Improving preprocessing #1320

Merged: 86 commits, Nov 5, 2024
Commits
6d60801
Adding logs & the ability to specify categorical data
aPovidlo Aug 7, 2024
057c4d2
Fixes categorical features
aPovidlo Aug 9, 2024
4b4536a
Changing getsizeof to nbytes
aPovidlo Aug 9, 2024
ae6eb42
Delete _clean_extra_spaces
aPovidlo Aug 9, 2024
f0df60c
Adding more logs, adding OptimisedFeature storage, refactoring fittin…
aPovidlo Aug 13, 2024
e4c13f5
@Lopa10ko requested changes
aPovidlo Aug 14, 2024
c0f7ff3
Fix bug with nbytes
aPovidlo Aug 14, 2024
6d7bf97
Fix bug with cat_features_names if there aren't exists features_names
aPovidlo Aug 14, 2024
705529a
Adding reduce_memory_size to pipeline._preprocess
aPovidlo Aug 14, 2024
4c7d281
Return to Pandas for nan_matrix
aPovidlo Aug 14, 2024
75901ae
Change logic of _into_categorical_features_transformation_for_fit
aPovidlo Aug 14, 2024
426dbd9
Adding convert to np.array
aPovidlo Aug 14, 2024
9ab9f99
Update ImputationImplementation
aPovidlo Aug 14, 2024
b679660
Fix bug in BinaryCategorical
aPovidlo Aug 15, 2024
119bca8
Fix bug with test_data_from_csv_load_correctly
aPovidlo Aug 15, 2024
7a3946a
Fix bug with test_api_fit_predict_with_pseudo_large_dataset_with_labe…
aPovidlo Aug 15, 2024
3134fc6
Fix bug with test_pipeline_preprocessing_through_api_correctly
aPovidlo Aug 15, 2024
e5db54d
Fix bug with test_default_forecast (add new TODO for ts_forecasting)
aPovidlo Aug 15, 2024
ebab7f2
Fix bug with test_cv_multiple_metrics_evaluated_correct by adding cop…
aPovidlo Aug 15, 2024
c123779
Fix bug with test_regression_pipeline_with_data_operation_fit_predict…
aPovidlo Aug 15, 2024
2e168dc
Fix bug in test_default_train_test_simple with nbytes
aPovidlo Aug 16, 2024
f6d539a
Fix bugs with str* types in features
aPovidlo Aug 16, 2024
9290d82
Fix bug with test_inf_and_nan_absence_after_imputation_implementation…
aPovidlo Aug 16, 2024
2f59466
Fix bug with test_pipeline_objective_evaluate_with_different_metrics …
aPovidlo Aug 16, 2024
1be317f
Fix bug with test_order_by_data_flow_len_correct
aPovidlo Aug 16, 2024
16285df
Fix bug with test_pipeline_with_imputer (finally)
aPovidlo Aug 16, 2024
36f994c
Fix bug with test_correct_api_dataset_with_text_preprocessing by upda…
aPovidlo Aug 16, 2024
0b8c41c
Update for OneHotImplementation
aPovidlo Aug 19, 2024
c3a8069
Update for subset_features and post_init
aPovidlo Aug 19, 2024
1d5ecfe
Update data_has_categorical_features
aPovidlo Aug 19, 2024
eb14784
Adding bool to numerical
aPovidlo Aug 19, 2024
af00955
Update for ImputationImplementation
aPovidlo Aug 19, 2024
600d12c
Fix data for tests
aPovidlo Aug 19, 2024
91c24a4
Fix test with adding new types
aPovidlo Aug 19, 2024
313ad8a
Update test with deleting extra spaces
aPovidlo Aug 19, 2024
fa11d8b
Update test with adding extra types_encountered
aPovidlo Aug 19, 2024
e76cd93
Fixes different tests
aPovidlo Aug 19, 2024
4085f55
Update expected_values for test_metrics test
aPovidlo Aug 20, 2024
f9f8acf
pep8 fixes
aPovidlo Aug 20, 2024
fca7ef6
Adding preprocessing copying to predefined models
aPovidlo Aug 20, 2024
5a7cd7a
Adding docstring to reduce memory and optimisedfeatures
aPovidlo Aug 20, 2024
25cbe7a
Automated autopep8 fixes
github-actions[bot] Aug 20, 2024
9053f9f
Fix bug with unhashable np
aPovidlo Aug 21, 2024
aaba291
Merge branch 'preproc_refactoring' of https://github.com/aimclub/FEDO…
aPovidlo Aug 21, 2024
8411c6e
Temp update
aPovidlo Aug 21, 2024
5cf8d1b
Fix tests
aPovidlo Aug 21, 2024
3544636
Fix test_regression_data_operations with inf data after poly_features
aPovidlo Aug 21, 2024
40aabd7
Fix bug in tests with IndexError
aPovidlo Aug 21, 2024
7429381
Adding take by indecies method and to_numpy() in OptimisedFeatures
aPovidlo Aug 21, 2024
b58993f
Update train_test_split for OptimisedFeatures
aPovidlo Aug 21, 2024
936635c
Transform target to numpy array during memory_reduce
aPovidlo Aug 21, 2024
47f214c
PR#1318 migration
aPovidlo Aug 22, 2024
b1cfadc
Fixing for test_metrics with py3.10
aPovidlo Aug 22, 2024
888f484
Fix test_from_ ... with broadcast
aPovidlo Aug 23, 2024
f963d09
Hide preprocessing messages under debug logging (2)
aPovidlo Aug 23, 2024
a542088
Fix TypeError with float16, rejection from this type
aPovidlo Aug 25, 2024
776d7f5
Refactoring OptimisedFeatures - _columns: np.ndarray -> _columns: pd.…
aPovidlo Sep 3, 2024
4cc8a3d
Revert changes with features property
aPovidlo Sep 4, 2024
762f892
Fixes various tests
aPovidlo Sep 4, 2024
4efdad5
Global refactoring - Rejection from separate class
aPovidlo Sep 8, 2024
bfe617d
Fix pep8, wrong code correction & test
aPovidlo Sep 8, 2024
68e7610
Fixes bug with memory_usage & test
aPovidlo Sep 8, 2024
bef6bf2
Fixes bug with invalid slice
aPovidlo Sep 8, 2024
bc1681d
pep8 fix
aPovidlo Sep 8, 2024
4843f7b
test fixes
aPovidlo Sep 8, 2024
a066f31
pep8 fix
aPovidlo Sep 8, 2024
8aac969
fix bug with memory_usage
aPovidlo Sep 8, 2024
1039392
reduce_memory_usage in utils, fix test with operations
aPovidlo Sep 8, 2024
9a0ccab
fix tests
aPovidlo Sep 8, 2024
0d8796d
fix tests in main api
aPovidlo Sep 8, 2024
f2096f7
fix: fix ambiguous value in integration test
DRMPN Oct 22, 2024
87195c1
fix: fix typing error
DRMPN Oct 24, 2024
0537fe9
fix: fix arrays used as indices must be of integer
DRMPN Oct 24, 2024
e74393a
fix: fix NoneType object isn't subscriptable error
DRMPN Oct 24, 2024
4929ea2
fix: copy input_data to prevent modification
DRMPN Oct 29, 2024
3441877
fix: fix fedot input_data transform to h2o_frame for regression
DRMPN Oct 31, 2024
2ae6faf
fix: update the type of ids attributes to np.ndarray
DRMPN Nov 1, 2024
9156f26
Merge branch 'master' into preproc_refactoring
DRMPN Nov 1, 2024
769e4a0
Automated autopep8 fixes
github-actions[bot] Nov 1, 2024
36a1063
chore: change the logging levels of new messages
DRMPN Nov 5, 2024
0c81310
chore: fix pep8 style problems
DRMPN Nov 5, 2024
122f909
Merge branch 'preproc_refactoring' of https://github.com/aimclub/FEDO…
DRMPN Nov 5, 2024
e29d1a9
Automated autopep8 fixes
github-actions[bot] Nov 5, 2024
53da44f
fix: cannot concatenate ndarray
DRMPN Nov 5, 2024
fe8369f
Merge branch 'preproc_refactoring' of https://github.com/aimclub/FEDO…
DRMPN Nov 5, 2024
3b03169
fix: preserve single ndarray type for num_features
DRMPN Nov 5, 2024
32 changes: 24 additions & 8 deletions fedot/api/api_utils/api_data.py
@@ -1,4 +1,3 @@
-import sys
 from datetime import datetime
 from typing import Dict, Union
 from typing import Optional
@@ -34,14 +33,19 @@ def __init__(self, task: Task, use_input_preprocessing: bool = True):
         self.task = task

         self._recommendations = {}
-        self.preprocessor = DummyPreprocessor()

         if use_input_preprocessing:
             self.preprocessor = DataPreprocessor()

             # Dictionary with recommendations (e.g. 'cut' for cutting dataset, 'label_encoded'
             # to encode features using label encoder). Parameters for transformation provided also
-            self._recommendations = {'cut': self.preprocessor.cut_dataset,
-                                     'label_encoded': self.preprocessor.label_encoding_for_fit}
+            self._recommendations = {
+                'cut': self.preprocessor.cut_dataset,
+                'label_encoded': self.preprocessor.label_encoding_for_fit
+            }
+
+        else:
+            self.preprocessor = DummyPreprocessor()
+
         self.log = default_log(self)

@@ -133,18 +137,28 @@ def accept_and_apply_recommendations(self, input_data: Union[InputData, MultiMod
     def fit_transform(self, train_data: InputData) -> InputData:
         start_time = datetime.now()
         self.log.message('Preprocessing data')
-        memory_usage = convert_memory_size(sys.getsizeof(train_data.features))
+        memory_usage = convert_memory_size(train_data.memory_usage)
         features_shape = train_data.features.shape
         target_shape = train_data.target.shape
         self.log.message(
             f'Train Data (Original) Memory Usage: {memory_usage} Data Shapes: {features_shape, target_shape}')

+        self.log.debug('- Obligatory preprocessing started')
         train_data = self.preprocessor.obligatory_prepare_for_fit(data=train_data)
+
+        self.log.debug('- Optional preprocessing started')
         train_data = self.preprocessor.optional_prepare_for_fit(pipeline=Pipeline(), data=train_data)
+
+        self.log.debug('- Converting indexes for fitting started')
         train_data = self.preprocessor.convert_indexes_for_fit(pipeline=Pipeline(), data=train_data)
+
+        self.log.debug('- Reducing memory started')
+        train_data = self.preprocessor.reduce_memory_size(data=train_data)
+
         train_data.supplementary_data.is_auto_preprocessed = True

-        memory_usage = convert_memory_size(sys.getsizeof(train_data.features))
+        memory_usage = convert_memory_size(train_data.memory_usage)
+
         features_shape = train_data.features.shape
         target_shape = train_data.target.shape
         self.log.message(
@@ -156,7 +170,7 @@ def fit_transform(self, train_data: InputData) -> InputData:
     def transform(self, test_data: InputData, current_pipeline) -> InputData:
         start_time = datetime.now()
         self.log.message('Preprocessing data')
-        memory_usage = convert_memory_size(sys.getsizeof(test_data))
+        memory_usage = convert_memory_size(test_data.memory_usage)
         features_shape = test_data.features.shape
         target_shape = test_data.target.shape
         self.log.message(
@@ -168,7 +182,9 @@ def transform(self, test_data: InputData, current_pipeline) -> InputData:
         test_data = self.preprocessor.update_indices_for_time_series(test_data)
         test_data.supplementary_data.is_auto_preprocessed = True

-        memory_usage = convert_memory_size(sys.getsizeof(test_data))
+        test_data = self.preprocessor.reduce_memory_size(data=test_data)
+
+        memory_usage = convert_memory_size(test_data.memory_usage)
         features_shape = test_data.features.shape
         target_shape = test_data.target.shape
         self.log.message(
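
The api_data.py hunks swap `sys.getsizeof(...)` for a `memory_usage` property and add a `reduce_memory_size` step. The sketch below illustrates why the measurement matters and what a size reduction could look like, assuming NumPy-backed features. `Holder` and `downcast_floats` are hypothetical names used only for illustration; they are not FEDOT's `InputData` or its actual reduction logic, which this diff does not show.

```python
import sys
from dataclasses import dataclass

import numpy as np


@dataclass
class Holder:
    """Hypothetical stand-in for a data container; not FEDOT's InputData."""
    features: np.ndarray


def downcast_floats(arr: np.ndarray) -> np.ndarray:
    """Illustrative reduction (an assumption): store float64 as float32, losing precision."""
    if arr.dtype == np.float64:
        return arr.astype(np.float32)
    return arr


features = np.zeros((1000, 100), dtype=np.float64)
holder = Holder(features=features)

# sys.getsizeof is shallow: it measures the container object,
# not the array that the container references.
shallow_size = sys.getsizeof(holder)

# ndarray.nbytes counts the bytes taken by the array elements.
before = holder.features.nbytes  # 1000 * 100 elements * 8 bytes

holder.features = downcast_floats(holder.features)
after = holder.features.nbytes  # 1000 * 100 elements * 4 bytes

print(shallow_size, before, after)
```

The shallow size stays tiny no matter how large the array is, which is one plausible reason a log line based on `sys.getsizeof(test_data)` would be uninformative.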
16 changes: 13 additions & 3 deletions fedot/api/api_utils/predefined_model.py
@@ -8,26 +8,36 @@
 from fedot.core.pipelines.node import PipelineNode
 from fedot.core.pipelines.pipeline import Pipeline
 from fedot.core.pipelines.verification import verify_pipeline
+from fedot.preprocessing.base_preprocessing import BasePreprocessor


 class PredefinedModel:
     def __init__(self, predefined_model: Union[str, Pipeline], data: InputData, log: LoggerAdapter,
-                 use_input_preprocessing: bool = True):
+                 use_input_preprocessing: bool = True, api_preprocessor: BasePreprocessor = None):
         self.predefined_model = predefined_model
         self.data = data
         self.log = log
-        self.pipeline = self._get_pipeline(use_input_preprocessing)
+        self.pipeline = self._get_pipeline(use_input_preprocessing, api_preprocessor)

-    def _get_pipeline(self, use_input_preprocessing: bool = True) -> Pipeline:
+    def _get_pipeline(self, use_input_preprocessing: bool = True,
+                      api_preprocessor: BasePreprocessor = None) -> Pipeline:
         if isinstance(self.predefined_model, Pipeline):
             pipelines = self.predefined_model
         elif self.predefined_model == 'auto':
             # Generate initial assumption automatically
             pipelines = AssumptionsBuilder.get(self.data).from_operations().build(
                 use_input_preprocessing=use_input_preprocessing)[0]
+
+            if use_input_preprocessing and api_preprocessor is not None:
+                pipelines.preprocessor = api_preprocessor
+
         elif isinstance(self.predefined_model, str):
             model = PipelineNode(self.predefined_model)
             pipelines = Pipeline(model, use_input_preprocessing=use_input_preprocessing)
+
+            if use_input_preprocessing and api_preprocessor is not None:
+                pipelines.preprocessor = api_preprocessor
+
         else:
             raise ValueError(f'{type(self.predefined_model)} is not supported as Fedot model')

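
The predefined_model.py change passes the API-level preprocessor into the pipeline, so the pipeline no longer keeps the fresh one it was built with. Below is a toy sketch of that pattern. `ToyPreprocessor`, `ToyPipeline` and `build_pipeline` are hypothetical and only mirror the shape of the guard in the diff; they are not FEDOT classes, and the fitted state they hold is an assumption made for the example.

```python
from typing import List, Optional


class ToyPreprocessor:
    """Hypothetical preprocessor that remembers which columns it saw during fit."""

    def __init__(self) -> None:
        self.fitted_columns: Optional[List[str]] = None

    def fit(self, columns: List[str]) -> None:
        self.fitted_columns = list(columns)


class ToyPipeline:
    """Hypothetical pipeline that owns a preprocessor, fresh by default."""

    def __init__(self) -> None:
        self.preprocessor = ToyPreprocessor()


def build_pipeline(use_input_preprocessing: bool = True,
                   api_preprocessor: Optional[ToyPreprocessor] = None) -> ToyPipeline:
    pipeline = ToyPipeline()
    # Same guard as in the diff: reuse the given preprocessor when one is supplied.
    if use_input_preprocessing and api_preprocessor is not None:
        pipeline.preprocessor = api_preprocessor
    return pipeline


api_preprocessor = ToyPreprocessor()
api_preprocessor.fit(['age', 'city'])

shared = build_pipeline(api_preprocessor=api_preprocessor)
fresh = build_pipeline()

print(shared.preprocessor.fitted_columns)  # ['age', 'city']
print(fresh.preprocessor.fitted_columns)   # None
```

In this sketch the shared pipeline sees the state fitted at the API level, while the default pipeline starts from an unfitted preprocessor.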
8 changes: 5 additions & 3 deletions fedot/api/main.py
@@ -176,9 +176,11 @@ def fit(self,
         with fedot_composer_timer.launch_fitting():
             if predefined_model is not None:
                 # Fit predefined model and return it without composing
-                self.current_pipeline = PredefinedModel(predefined_model, self.train_data, self.log,
-                                                        use_input_preprocessing=self.params.get(
-                                                            'use_input_preprocessing')).fit()
+                self.current_pipeline = PredefinedModel(
+                    predefined_model, self.train_data, self.log,
+                    use_input_preprocessing=self.params.get('use_input_preprocessing'),
+                    api_preprocessor=self.data_processor.preprocessor,
+                ).fit()
             else:
                 self.current_pipeline, self.best_models, self.history = self.api_composer.obtain_model(self.train_data)
