This functionality is multi-part:
Currently, machine learning data splits are performed after normalization and feature selection.
This creates a risk of data leakage into the models, because the statistics used for normalization and feature selection are computed on samples that later end up in the held-out splits.
The proposed fix is to implement functions that perform data splits prior to normalization.
To implement this, normalization and feature selection would need to be fit on the training split and then propagated to the validation, testing, and [*holdout] splits.
I envision most of this functionality needing to be carried out by the user.
To do so, I suggest updating the normalize and feature selection functions and implementing two more functions:
Normalize function: Save the transformation fitted on the training data.
Implement a normalize transform function that propagates the saved transformation to the test data splits.
Feature select function: Save the list of selected feature columns.
Implement a function that propagates the selected column list to each test data split.
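As a rough illustration of the normalization half of this proposal, here is a minimal sketch that fits a standardization on the training split and reuses the saved parameters on a test split. The names `fit_standardizer` and `apply_standardizer`, the choice of z-score standardization, and the list-of-rows data layout are all assumptions made for the example; they are not the project's existing API.

```python
from statistics import mean, pstdev


def fit_standardizer(train_rows):
    """Compute per-column mean and standard deviation from the training split only."""
    columns = list(zip(*train_rows))  # transpose rows into columns
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return {"mean": means, "std": stds}


def apply_standardizer(rows, xform):
    """Standardize any split using parameters saved from the training split."""
    return [
        [(value - m) / s for value, m, s in zip(row, xform["mean"], xform["std"])]
        for row in rows
    ]


# Hypothetical toy data: two features, split before any normalization.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[5.0, 20.0]]

xform = fit_standardizer(train)               # fitted on the training split only
train_norm = apply_standardizer(train, xform)
test_norm = apply_standardizer(test, xform)   # test rows never influence xform
```

The saved `xform` dictionary plays the role of the "saved transformation" described above; in a real implementation it would more likely be a fitted transformer object or a serialized file.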
Thanks for adding this issue, @MikeLippincott! No special approval is required before opening a PR; please feel free to propose changes and open one when ready.
On the enhancement development: based on how you outlined the issue, would it be possible to add a test that checks for the leakage you mentioned? I imagine this would help prove the new capabilities and make sure future changes don't reintroduce leakage. Totally open to your thoughts here (this isn't a requirement).
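One possible shape for such a test, sketched under assumptions: a leak-free pipeline should produce the same transformed training split no matter what the test split contains. The functions `leak_free_center` and `leaky_center` are hypothetical stand-ins (simple mean-centering) rather than the project's real normalization code, and the leaky variant is included only as a counter-example that the check should reject.

```python
from statistics import mean


def _center(rows, means):
    """Subtract the given column means from every row."""
    return [[value - m for value, m in zip(row, means)] for row in rows]


def leak_free_center(train, test):
    """Center both splits using column means from the training split only."""
    means = [mean(col) for col in zip(*train)]
    return _center(train, means), _center(test, means)


def leaky_center(train, test):
    """Counter-example: column means are computed over train and test together."""
    means = [mean(col) for col in zip(*(train + test))]
    return _center(train, means), _center(test, means)


def train_output_ignores_test(pipeline):
    """Return True if the transformed training split is unchanged when the test split changes."""
    train = [[1.0, 10.0], [3.0, 30.0]]
    ordinary_test = [[2.0, 20.0]]
    extreme_test = [[1000.0, -1000.0]]  # would shift any statistic that includes test rows
    train_a, _ = pipeline(train, ordinary_test)
    train_b, _ = pipeline(train, extreme_test)
    return train_a == train_b


def test_no_leakage():
    assert train_output_ignores_test(leak_free_center)
    assert not train_output_ignores_test(leaky_center)


test_no_leakage()  # a test runner such as pytest would collect this function automatically
```

The same invariance check could be pointed at the real normalize and feature selection functions once they accept separate splits.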
Feature type
Add new functionality
Change existing functionality
General description of the proposed functionality
Feature example
change function:
def normalize(*args, save_xform: bool = False, **kwargs):
    # existing normalization logic goes here; when save_xform is True,
    # also save the transformation fitted on the training split
    ...
new function:
def apply_xform(*args, **kwargs):
    # apply the saved transformation to the test data splits here
    ...
change function:
def feature_select(*args, save_selected_feature_list: bool = False, **kwargs):
    # existing feature selection logic goes here; when the flag is True,
    # also save the list of selected feature columns
    ...
new function:
def apply_selected_features(*args, **kwargs):
    # apply the features selected on the train split to a test split
    ...
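To make the feature selection half concrete, here is a minimal sketch of saving a selected column list from the training split and applying it to another split. The variance-threshold criterion, the `select_features` name, the `min_variance` parameter, and the list-of-dicts data layout are assumptions for illustration only; `apply_selected_features` borrows its name from the proposal above but its signature here is a guess.

```python
from statistics import pvariance


def select_features(train_rows, min_variance=0.0):
    """Return the names of columns whose training-split variance exceeds min_variance."""
    column_names = list(train_rows[0])
    return [
        name
        for name in column_names
        if pvariance([row[name] for row in train_rows]) > min_variance
    ]


def apply_selected_features(rows, selected):
    """Keep only the saved columns, in the saved order, for any split."""
    return [{name: row[name] for name in selected} for row in rows]


# Hypothetical toy data: "constant" has zero variance in the training split.
train = [
    {"area": 1.0, "intensity": 5.0, "constant": 7.0},
    {"area": 3.0, "intensity": 9.0, "constant": 7.0},
]
test = [{"area": 2.0, "intensity": 5.0, "constant": 8.0}]

selected = select_features(train)  # decided on the training split only
test_selected = apply_selected_features(test, selected)
```

Because the column list is fixed before the test split is touched, a column that happens to vary only in the test data (like `constant` here) stays excluded.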
Alternative Solutions
No response
Additional information
No response