Improving preprocessing #1320

aPovidlo · 2024-08-13T15:01:33Z

This is a 🔨 code refactoring.

Summary

Significant Updates in Data Storage and Preprocessing

Major Updates:

Enhanced logging: Added more detailed logs in DEBUG mode during preprocessing.
New functionality: You can now mark categorical features in data when using InputData.from_numpy(...), InputData.from_dataframe(...), and InputData.from_csv(...) methods.
New class: Introduced OptimizedFeatures, which stores data with optimal dtypes for improved efficiency.
Preprocessing improvement: Added a new stage called reduce_memory_size to optimize memory usage.
API enhancements: Updated PredefinedModel to allow copying parameters from DataPreprocessor.

Minor Updates:

Improved logic for detecting categorical data.
Updated encoders and imputers to align with the new changes.
Revised tests to incorporate the new features.

Context

closes #1337
closes #1329

…g BinaryCategoricalPreprocessor, fix bugs, adding reduce memory size, delete clean_extra_spaces

pep8speaks · 2024-08-13T15:01:43Z

Hello @aPovidlo! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-11-05 14:28:55 UTC

github-actions · 2024-08-13T15:02:32Z

All PEP8 errors has been fixed, thanks ❤️

Comment last updated at

fedot/core/data/data.py

fedot/api/api_utils/api_data.py

fedot/preprocessing/data_types.py

fedot/preprocessing/preprocessing.py

fedot/preprocessing/categorical.py

fedot/preprocessing/preprocessing.py

…l_correct

…y method to OptimisedFeature

…_correct by adding check for target

…_fit_transform by adding cat and num idx in get_dataset func

…by switching Xgboost to Catboost, due to "Experimental support for categorical data is not implemented for current tree method yet." for XgBoost and checking feat ids with size

Lopa10ko · 2024-10-31T16:34:47Z

Wanted to point out that all preprocessing information is recorded in the logs in a very detailed form, which can cause the logs to become too large and difficult to read. Therefore, I propose to distinguish between different logging zones and present messages from the DataPreprocessor and TableTypesCorrector in separate contexts relative to the main context.

Lopa10ko · 2024-10-31T16:36:49Z

fedot/core/operations/evaluation/operation_implementations/models/boostings_implementations.py

+    def convert_to_dataframe(input_data: Optional[InputData], identify_cats: bool):
+        copied_input_data = deepcopy(input_data)
+
+        dataframe = pd.DataFrame(data=copied_input_data.features)
+        if copied_input_data.target is not None and copied_input_data.target.size > 0:
+            dataframe['target'] = np.ravel(copied_input_data.target)
        else:
            # TODO: temp workaround in case data.target is set to None intentionally
            #  for test.integration.models.test_model.check_predict_correct
-            dataframe['target'] = np.zeros(len(data.features))
+            dataframe['target'] = np.zeros(len(copied_input_data.features))

-        if identify_cats and data.categorical_idx is not None:
-            for col in dataframe.columns[data.categorical_idx]:
+        if identify_cats and copied_input_data.categorical_idx is not None:
+            for col in dataframe.columns[copied_input_data.categorical_idx]:
                dataframe[col] = dataframe[col].astype('category')

-        if data.numerical_idx is not None:
-            for col in dataframe.columns[data.numerical_idx]:
+        if copied_input_data.numerical_idx is not None:
+            for col in dataframe.columns[copied_input_data.numerical_idx]:


why do we need additional copying of InputData instance?

The reason for adding an additional copy of the InputData instance is to avoid unintended mutations. Currently, within that function, the input_data parameter is modified so that the features field ends up containing both feature values and target values. This causes issues later, particularly during the prediction phase, where these unintended modifications lead to failures.

To address this, we create a copy of input_data to preserve the original data and prevent any side effects. Here's an illustration of the problem:

I also tried an alternative approach:

target = dataframe["target"] return dataframe.drop(columns=["target"], inplace=True), target

However, this approach led to runtime errors, likely due to issues with dynamic memory allocation when using inplace=True with drop. To ensure stability, I opted to stick with the copy-based solution.

DRMPN · 2024-10-31T17:15:46Z

Wanted to point out that all preprocessing information is recorded in the logs in a very detailed form, which can cause the logs to become too large and difficult to read. Therefore, I propose to distinguish between different logging zones and present messages from the DataPreprocessor and TableTypesCorrector in separate contexts relative to the main context.

I agree with your suggestion. I’ll plan to implement separate logging contexts for DataPreprocessor and TableTypesCorrector after addressing the remaining issues.

…into preproc_refactoring

DRMPN · 2024-11-05T03:04:12Z

✅ Интеграционные: https://github.com/aimclub/FEDOT/actions/runs/11636387365
✅ Интеграционные: https://github.com/aimclub/FEDOT/actions/runs/11676872713
✅ Интеграционные: https://github.com/aimclub/FEDOT/actions/runs/11680895700
✅ Интеграционные: https://github.com/aimclub/FEDOT/actions/runs/11686246179

…into preproc_refactoring

aPovidlo added 5 commits August 7, 2024 18:47

Adding logs & the ability to specify categorical data

6d60801

Fixes categorical features

057c4d2

Changing getsizeof to nbytes

4b4536a

Delete _clean_extra_spaces

ae6eb42

Adding more logs, adding OptimisedFeature storage, refactoring fittin…

f0df60c

…g BinaryCategoricalPreprocessor, fix bugs, adding reduce memory size, delete clean_extra_spaces

nicl-nno reviewed Aug 13, 2024

View reviewed changes

fedot/core/data/data.py Outdated Show resolved Hide resolved

Lopa10ko reviewed Aug 14, 2024

View reviewed changes

aPovidlo added 21 commits August 14, 2024 17:14

@Lopa10ko requested changes

e4c13f5

Fix bug with nbytes

c0f7ff3

Fix bug with cat_features_names if there aren't exists features_names

6d7bf97

Adding reduce_memory_size to pipeline._preprocess

705529a

Return to Pandas for nan_matrix

4c7d281

Change logic of _into_categorical_features_transformation_for_fit

75901ae

Adding convert to np.array

426dbd9

Update ImputationImplementation

9ab9f99

Fix bug in BinaryCategorical

b679660

Fix bug with test_data_from_csv_load_correctly

119bca8

Fix bug with test_api_fit_predict_with_pseudo_large_dataset_with_labe…

7a3946a

…l_correct

Fix bug with test_pipeline_preprocessing_through_api_correctly

3134fc6

Fix bug with test_default_forecast (add new TODO for ts_forecasting)

e5db54d

Fix bug with test_cv_multiple_metrics_evaluated_correct by adding cop…

ebab7f2

…y method to OptimisedFeature

Fix bug with test_regression_pipeline_with_data_operation_fit_predict…

c123779

…_correct by adding check for target

Fix bug in test_default_train_test_simple with nbytes

2e168dc

Fix bugs with str* types in features

f6d539a

Fix bug with test_inf_and_nan_absence_after_imputation_implementation…

9290d82

…_fit_transform by adding cat and num idx in get_dataset func

Fix bug with test_pipeline_objective_evaluate_with_different_metrics …

2f59466

…by switching Xgboost to Catboost, due to "Experimental support for categorical data is not implemented for current tree method yet." for XgBoost and checking feat ids with size

Fix bug with test_order_by_data_flow_len_correct

1be317f

Fix bug with test_pipeline_with_imputer (finally)

16285df

DRMPN added 5 commits October 24, 2024 16:21

fix: fix typing error

87195c1

fix: fix arrays used as indices must be of integer

0537fe9

fix: fix NoneType object isn't subscriptable error

e74393a

fix: copy input_data to prevent modification

4929ea2

fix: fix fedot input_data transform to h2o_frame for regression

3441877

DRMPN requested a review from Lopa10ko October 31, 2024 16:06

Lopa10ko reviewed Oct 31, 2024

View reviewed changes

This comment was marked as resolved.

Sign in to view

DRMPN added 2 commits November 1, 2024 23:45

fix: update the type of ids attributes to np.ndarray

2ae6faf

Merge branch 'master' into preproc_refactoring

9156f26

This comment was marked as resolved.

Sign in to view

github-actions bot and others added 4 commits November 1, 2024 21:08

Automated autopep8 fixes

769e4a0

chore: change the logging levels of new messages

36a1063

chore: fix pep8 style problems

0c81310

Merge branch 'preproc_refactoring' of https://github.com/aimclub/FEDOT …

122f909

…into preproc_refactoring

This comment was marked as resolved.

Sign in to view

DRMPN requested a review from Lopa10ko November 5, 2024 03:03

Automated autopep8 fixes

e29d1a9

Lopa10ko approved these changes Nov 5, 2024

View reviewed changes

DRMPN added 2 commits November 5, 2024 11:59

fix: cannot concatenate ndarray

53da44f

Merge branch 'preproc_refactoring' of https://github.com/aimclub/FEDOT …

fe8369f

…into preproc_refactoring

nicl-nno approved these changes Nov 5, 2024

View reviewed changes

fix: preserve single ndarray type for num_features

3b03169

DRMPN merged commit a2c6746 into master Nov 5, 2024
10 checks passed

nicl-nno deleted the preproc_refactoring branch November 12, 2024 13:41

Lopa10ko mentioned this pull request Jan 31, 2025

[Bug]: array_to_input_data function signature conflict #1361

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improving preprocessing #1320

Improving preprocessing #1320

Uh oh!

aPovidlo commented Aug 13, 2024 •

edited by DRMPN

Loading

Uh oh!

pep8speaks commented Aug 13, 2024 •

edited

Loading

Uh oh!

github-actions bot commented Aug 13, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Lopa10ko commented Oct 31, 2024

Uh oh!

Lopa10ko Oct 31, 2024

Uh oh!

DRMPN Oct 31, 2024

Uh oh!

This comment was marked as resolved.

DRMPN commented Oct 31, 2024

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

DRMPN commented Nov 5, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Improving preprocessing #1320

Improving preprocessing #1320

Uh oh!

Conversation

aPovidlo commented Aug 13, 2024 • edited by DRMPN Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Uh oh!

pep8speaks commented Aug 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2024-11-05 14:28:55 UTC

Uh oh!

github-actions bot commented Aug 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Lopa10ko commented Oct 31, 2024

Uh oh!

Lopa10ko Oct 31, 2024

Choose a reason for hiding this comment

Uh oh!

DRMPN Oct 31, 2024

Choose a reason for hiding this comment

Uh oh!

This comment was marked as resolved.

DRMPN commented Oct 31, 2024

Uh oh!

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

DRMPN commented Nov 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aPovidlo commented Aug 13, 2024 •

edited by DRMPN

Loading

pep8speaks commented Aug 13, 2024 •

edited

Loading

github-actions bot commented Aug 13, 2024 •

edited

Loading

DRMPN commented Nov 5, 2024 •

edited

Loading