Improving preprocessing #1320

Merged
merged 86 commits into from
Nov 5, 2024

Conversation

aPovidlo
Collaborator

@aPovidlo aPovidlo commented Aug 13, 2024

This is a 🔨 code refactoring.

Summary

Significant Updates in Data Storage and Preprocessing

Major Updates:

  • Enhanced logging: Added more detailed logs in DEBUG mode during preprocessing.
  • New functionality: You can now mark categorical features in data when using InputData.from_numpy(...), InputData.from_dataframe(...), and InputData.from_csv(...) methods.
  • New class: Introduced OptimizedFeatures, which stores data with optimal dtypes for improved efficiency.
  • Preprocessing improvement: Added a new stage called reduce_memory_size to optimize memory usage.
  • API enhancements: Updated PredefinedModel to allow copying parameters from DataPreprocessor.
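
To make the idea of storing data with optimal dtypes more concrete, here is a minimal sketch in plain pandas. It is not FEDOT's implementation of OptimizedFeatures or reduce_memory_size: the helper name downcast_numeric and the sample frame are hypothetical and only illustrate the general technique of downcasting numeric columns to the smallest dtype that still holds their values.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with numeric columns downcast (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Pick the smallest signed integer dtype that fits all values.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # pandas never goes below float32 when downcasting floats.
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out


# Sample data: both columns start out as 64-bit types.
frame = pd.DataFrame({
    "count": np.arange(1000, dtype=np.int64),
    "score": np.arange(1000, dtype=np.float64) * 0.5,
})
small = downcast_numeric(frame)

print(frame.dtypes.to_dict(), frame.memory_usage(deep=True).sum())
print(small.dtypes.to_dict(), small.memory_usage(deep=True).sum())
```

The values are unchanged; only the storage width shrinks, which is the kind of saving a memory-reduction stage aims for.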

Minor Updates:

  • Improved logic for detecting categorical data.
  • Updated encoders and imputers to align with the new changes.
  • Revised tests to incorporate the new features.

Context

closes #1337
closes #1329

@pep8speaks

pep8speaks commented Aug 13, 2024

Hello @aPovidlo! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-11-05 14:28:55 UTC

Contributor

github-actions bot commented Aug 13, 2024

All PEP8 errors have been fixed, thanks ❤️

Comment last updated at

fedot/core/data/data.py (outdated, resolved)
fedot/core/data/data.py (outdated, resolved)
fedot/api/api_utils/api_data.py (outdated, resolved)
fedot/preprocessing/data_types.py (outdated, resolved)
fedot/preprocessing/preprocessing.py (outdated, resolved)
fedot/preprocessing/categorical.py (outdated, resolved)
fedot/preprocessing/preprocessing.py (outdated, resolved)
…_fit_transform by adding cat and num idx in get_dataset func
…by switching Xgboost to Catboost, due to "Experimental support for categorical data is not implemented for current tree method yet." for XgBoost and checking feat ids with size
@DRMPN DRMPN requested a review from Lopa10ko October 31, 2024 16:06
@Lopa10ko
Collaborator

Wanted to point out that all preprocessing information is recorded in the logs in a very detailed form, which can cause the logs to become too large and difficult to read. Therefore, I propose to distinguish between different logging zones and present messages from the DataPreprocessor and TableTypesCorrector in separate contexts relative to the main context.

Comment on lines +96 to +112
  def convert_to_dataframe(input_data: Optional[InputData], identify_cats: bool):
      copied_input_data = deepcopy(input_data)

      dataframe = pd.DataFrame(data=copied_input_data.features)
      if copied_input_data.target is not None and copied_input_data.target.size > 0:
          dataframe['target'] = np.ravel(copied_input_data.target)
      else:
          # TODO: temp workaround in case data.target is set to None intentionally
          # for test.integration.models.test_model.check_predict_correct
-         dataframe['target'] = np.zeros(len(data.features))
+         dataframe['target'] = np.zeros(len(copied_input_data.features))

-     if identify_cats and data.categorical_idx is not None:
-         for col in dataframe.columns[data.categorical_idx]:
+     if identify_cats and copied_input_data.categorical_idx is not None:
+         for col in dataframe.columns[copied_input_data.categorical_idx]:
              dataframe[col] = dataframe[col].astype('category')

-     if data.numerical_idx is not None:
-         for col in dataframe.columns[data.numerical_idx]:
+     if copied_input_data.numerical_idx is not None:
+         for col in dataframe.columns[copied_input_data.numerical_idx]:

Why do we need an additional copy of the InputData instance?


The reason for adding an additional copy of the InputData instance is to avoid unintended mutations. Currently, within that function, the input_data parameter is modified so that the features field ends up containing both feature values and target values. This causes issues later, particularly during the prediction phase, where these unintended modifications lead to failures.

To address this, we create a copy of input_data to preserve the original data and prevent any side effects. Here's an illustration of the problem:

[image]
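
The side effect described above can be sketched in isolation. This is a toy reproduction, not the FEDOT code: ToyData, to_table_mutating, and to_table_copying are hypothetical names, and the sketch only assumes that the conversion function overwrites the features field of the object it receives.

```python
from copy import deepcopy
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyData:
    """Hypothetical stand-in for a container with features and target."""
    features: np.ndarray
    target: np.ndarray


def to_table_mutating(data: ToyData) -> np.ndarray:
    # Overwrites the caller's object: features now also contains the target.
    data.features = np.column_stack([data.features, data.target])
    return data.features


def to_table_copying(data: ToyData) -> np.ndarray:
    # Works on a private copy, so the caller's object stays untouched.
    copied = deepcopy(data)
    copied.features = np.column_stack([copied.features, copied.target])
    return copied.features


sample = ToyData(features=np.ones((3, 2)), target=np.zeros(3))

to_table_copying(sample)
print(sample.features.shape)  # (3, 2): original preserved

to_table_mutating(sample)
print(sample.features.shape)  # (3, 3): target leaked into features
```

A later prediction step that expects two feature columns would receive three after the mutating call, which matches the kind of failure described in the comment.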

I also tried an alternative approach:

target = dataframe["target"]
return dataframe.drop(columns=["target"], inplace=True), target

However, this approach led to runtime errors, likely due to issues with dynamic memory allocation when using inplace=True with drop. To ensure stability, I opted to stick with the copy-based solution.
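
One thing worth checking for the alternative above: in pandas, DataFrame.drop returns None when inplace=True, so the first element of the returned tuple would be None rather than a frame. Whether that was the actual cause of the runtime errors mentioned here is an assumption. The sketch below uses a hypothetical sample frame to show the behaviour and a non-mutating variant.

```python
import pandas as pd

frame = pd.DataFrame({"f0": [1.0, 2.0], "f1": [3.0, 4.0], "target": [0, 1]})

# With inplace=True, drop modifies the frame it is called on and returns None.
scratch = frame.copy()
returned = scratch.drop(columns=["target"], inplace=True)
print(returned)  # None

# Without inplace, drop returns a new frame and leaves the original intact.
features = frame.drop(columns=["target"])
target = frame["target"]
print(list(features.columns), list(frame.columns))
```

The non-mutating form also avoids touching the caller's data, which is the same goal the copy-based fix pursues.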

@DRMPN
Collaborator

DRMPN commented Oct 31, 2024

Wanted to point out that all preprocessing information is recorded in the logs in a very detailed form, which can cause the logs to become too large and difficult to read. Therefore, I propose to distinguish between different logging zones and present messages from the DataPreprocessor and TableTypesCorrector in separate contexts relative to the main context.

I agree with your suggestion. I plan to implement separate logging contexts for DataPreprocessor and TableTypesCorrector after addressing the remaining issues.
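
As a rough illustration of what separate logging contexts could look like, here is a sketch built only on Python's standard logging module. It does not reflect how FEDOT's own logger is organised: the logger names under "fedot_sketch" are hypothetical, and the chosen levels are arbitrary.

```python
import logging

# Parent logger for the main context (the name is hypothetical).
main_log = logging.getLogger("fedot_sketch")
main_log.setLevel(logging.INFO)

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))
main_log.addHandler(handler)

# Child loggers give each component its own context and its own level.
preproc_log = logging.getLogger("fedot_sketch.DataPreprocessor")
types_log = logging.getLogger("fedot_sketch.TableTypesCorrector")

preproc_log.setLevel(logging.DEBUG)    # verbose while debugging preprocessing
types_log.setLevel(logging.WARNING)    # keep type-correction output short

main_log.info("pipeline started")
preproc_log.debug("imputation applied to 3 columns")
types_log.info("this message is filtered out")
types_log.warning("column 2 converted to string")
```

Because each record carries its logger name, the detailed preprocessing messages can be filtered or silenced without touching the main context.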

@DRMPN DRMPN requested a review from Lopa10ko November 5, 2024 03:03
@DRMPN
Collaborator

DRMPN commented Nov 5, 2024

  1. ✅ Integration tests: https://github.com/aimclub/FEDOT/actions/runs/11636387365
  2. ✅ Integration tests: https://github.com/aimclub/FEDOT/actions/runs/11676872713
  3. ✅ Integration tests: https://github.com/aimclub/FEDOT/actions/runs/11680895700
  4. ✅ Integration tests: https://github.com/aimclub/FEDOT/actions/runs/11686246179

@DRMPN DRMPN merged commit a2c6746 into master Nov 5, 2024
10 checks passed
@nicl-nno nicl-nno deleted the preproc_refactoring branch November 12, 2024 13:41
5 participants