Improving preprocessing #1320
Conversation
…g BinaryCategoricalPreprocessor, fix bugs, adding reduce memory size, delete clean_extra_spaces
All PEP8 errors have been fixed, thanks ❤️
…y method to OptimisedFeature
…_correct by adding check for target
…_fit_transform by adding cat and num idx in get_dataset func
…by switching XGBoost to CatBoost, due to "Experimental support for categorical data is not implemented for current tree method yet." in XGBoost, and checking feat ids with size
Wanted to point out that all preprocessing information is recorded in the logs in a very detailed form, which can cause the logs to become too large and difficult to read. Therefore, I propose to distinguish between different logging zones and present messages from the …
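For illustration, a minimal sketch of such a per-zone split using Python's standard logging; the logger names and level choices below are assumptions, not part of this PR:

import logging

logging.basicConfig(level=logging.INFO)

# Hypothetical zone names; the real module paths may differ
preproc_log = logging.getLogger('fedot.preprocessing')
core_log = logging.getLogger('fedot.core')

# Verbose per-column details go to DEBUG, so the default INFO output stays readable
preproc_log.debug('column 3: 12 unique values, encoded as category')
preproc_log.info('preprocessing finished: 5 columns encoded')
core_log.info('pipeline fitting started')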
from copy import deepcopy
from typing import Optional

import numpy as np
import pandas as pd

from fedot.core.data.data import InputData


def convert_to_dataframe(input_data: Optional[InputData], identify_cats: bool):
    # Work on a deep copy so the caller's InputData is never mutated
    copied_input_data = deepcopy(input_data)

    dataframe = pd.DataFrame(data=copied_input_data.features)
    if copied_input_data.target is not None and copied_input_data.target.size > 0:
        dataframe['target'] = np.ravel(copied_input_data.target)
    else:
        # TODO: temp workaround in case data.target is set to None intentionally
        # for test.integration.models.test_model.check_predict_correct
        dataframe['target'] = np.zeros(len(copied_input_data.features))

    if identify_cats and copied_input_data.categorical_idx is not None:
        for col in dataframe.columns[copied_input_data.categorical_idx]:
            dataframe[col] = dataframe[col].astype('category')

    if copied_input_data.numerical_idx is not None:
        for col in dataframe.columns[copied_input_data.numerical_idx]:
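For reference, a standalone toy example of the dtype handling above (hypothetical data, no fedot dependency):

import numpy as np
import pandas as pd

features = np.array([[0, 1.5], [1, 2.5], [0, 3.5]])
dataframe = pd.DataFrame(data=features)
dataframe['target'] = np.ravel(np.array([[1], [0], [1]]))

# Mark the first column as categorical, as the loop above does
categorical_idx = [0]
for col in dataframe.columns[categorical_idx]:
    dataframe[col] = dataframe[col].astype('category')

print(dataframe.dtypes)  # column 0 -> category, column 1 stays float64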
Why do we need additional copying of the InputData instance?
The reason for adding an additional copy of the InputData instance is to avoid unintended mutations. Currently, within that function, the input_data parameter is modified so that the features field ends up containing both feature values and target values. This causes issues later, particularly during the prediction phase, where these unintended modifications lead to failures. To address this, we create a copy of input_data to preserve the original data and prevent any side effects. Here's an illustration of the problem:
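A self-contained sketch of the side effect, using a hypothetical FakeInputData stand-in rather than the real InputData:

import numpy as np
from copy import deepcopy
from dataclasses import dataclass

@dataclass
class FakeInputData:  # hypothetical stand-in for fedot's InputData
    features: np.ndarray
    target: np.ndarray

def to_frame_mutating(data: FakeInputData) -> np.ndarray:
    # Emulates the pre-fix behaviour: the target column leaks into features
    data.features = np.column_stack([data.features, data.target])
    return data.features

data = FakeInputData(np.ones((3, 2)), np.zeros(3))
to_frame_mutating(data)
print(data.features.shape)   # (3, 3) -- the caller now sees a phantom feature column

safe = FakeInputData(np.ones((3, 2)), np.zeros(3))
to_frame_mutating(deepcopy(safe))
print(safe.features.shape)   # (3, 2) -- deepcopy keeps the caller's data intact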
I also tried an alternative approach:
target = dataframe["target"]
return dataframe.drop(columns=["target"], inplace=True), target
However, this approach led to runtime errors: drop with inplace=True modifies the frame in place and returns None, so the function ends up returning (None, target). To ensure stability, I opted to stick with the copy-based solution.
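A non-mutating alternative would be pandas' pop, which removes the column and returns it in one step (a sketch of the idea, not the solution adopted in this PR):

import pandas as pd

def split_target(dataframe: pd.DataFrame):
    # pop() removes 'target' from the frame and returns it, so nothing
    # depends on the None returned by drop(..., inplace=True)
    target = dataframe.pop('target')
    return dataframe, target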
I agree with your suggestion. I’ll plan to implement separate logging contexts for …
…into preproc_refactoring
This is a 🔨 code refactoring.
Summary
Significant Updates in Data Storage and Preprocessing
Major Updates:
- Added InputData.from_numpy(...), InputData.from_dataframe(...), and InputData.from_csv(...) methods (a usage sketch follows after this list).
- Introduced OptimizedFeatures, which stores data with optimal dtypes for improved efficiency.
- Added reduce_memory_size to optimize memory usage.
- Updated PredefinedModel to allow copying parameters from DataPreprocessor.
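A hedged usage sketch of the new constructors; the argument names and import path below are assumptions inferred from the method names, not verified signatures:

import numpy as np
import pandas as pd

from fedot.core.data.data import InputData  # assumed import path

features = np.random.rand(100, 5)
target = np.random.randint(0, 2, size=100)

# Assumed call shapes; check the diff for the exact parameters
data_from_np = InputData.from_numpy(features, target)
data_from_df = InputData.from_dataframe(pd.DataFrame(features), pd.Series(target))
data_from_csv = InputData.from_csv('train.csv')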
Minor Updates:
Context
closes #1337
closes #1329