FeatureRequest: Machine Learning Data Split Functionality #427

MikeLippincott · 2024-08-22T20:03:32Z

Feature type

Add new functionality
Change existing functionality

General description of the proposed functionality

This functionality is multi-part.
Currently, machine learning data splits are performed after normalization and feature selection.
This creates a risk of data leakage into the models, because statistics computed on the full dataset carry information from the held-out splits into training.
The proposed fix is to implement functions that perform data splits prior to normalization.
To implement this, normalization and feature selection would need to be fit on the training split and then propagated to the validation, testing, and (optionally) holdout splits.
I envision most of this functionality needing to be carried out by the user.
To do so, I suggest updating the normalize and feature selection functions and implementing two more functions:

Normalize function: Save the transformation fit on the training data
Implement a normalize transform function to propagate the saved transformation to the test data splits
Feature select function: Save the list of selected feature columns
Implement a function that propagates the selected feature columns to each test data split
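The four pieces above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not the existing normalize or feature selection code: the helper names (`fit_standardizer`, `apply_standardizer`, `fit_feature_selection`, `apply_feature_selection`), the dict-of-lists data layout, the z-score method, and the variance threshold are all assumptions made for the example.

```python
from statistics import mean, pstdev, pvariance


def fit_standardizer(train):
    """Learn per-feature (mean, std) from the training split only."""
    return {name: (mean(values), pstdev(values)) for name, values in train.items()}


def apply_standardizer(split, params):
    """Standardize any split with parameters learned on the training split."""
    out = {}
    for name, values in split.items():
        mu, sigma = params[name]
        # Guard against constant training features (sigma == 0).
        out[name] = [(v - mu) / sigma if sigma else 0.0 for v in values]
    return out


def fit_feature_selection(train, min_variance=1e-8):
    """Return the names of features whose training variance exceeds a threshold."""
    return [name for name, values in train.items() if pvariance(values) > min_variance]


def apply_feature_selection(split, selected):
    """Keep only the columns that were selected on the training split."""
    return {name: split[name] for name in selected}


# Toy data: feature name -> list of values (hypothetical layout).
train = {"area": [1.0, 2.0, 3.0], "constant": [5.0, 5.0, 5.0]}
test = {"area": [4.0], "constant": [6.0]}

# Fit everything on the training split...
params = fit_standardizer(train)
selected = fit_feature_selection(apply_standardizer(train, params))

# ...then only apply the saved results to the test split.
test_out = apply_feature_selection(apply_standardizer(test, params), selected)
```

The point of the split into "fit" and "apply" steps is that the test split never contributes to the saved mean, standard deviation, or selected column list.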

Feature example

change function:
def normalize(*args, save_xform: bool = False, **kwargs):
    # other existing parameters elided
    # existing normalization logic runs here
    ...
    if save_xform:
        # save the transform (probably as a numpy array); depending on the
        # method it could be parameters too (mean, std)
        ...

new function:
def apply_xform(*args, **kwargs):
    # apply the saved transform to the test data splits here
    ...

change function:
def feature_select(*args, save_selected_feature_list: bool = False, **kwargs):
    # existing feature selection logic runs here
    ...
    if save_selected_feature_list:
        # save the list of features to apply to a dataset
        ...

new function:
def apply_selected_features(*args, **kwargs):
    # apply the selected features from the train split to the test split
    ...

Alternative Solutions

No response

Additional information

No response


MikeLippincott · 2024-08-22T20:35:32Z

This is something that I can implement. Do I need approval prior to opening a PR?

d33bs · 2024-08-23T14:23:14Z

Thanks for adding this issue @MikeLippincott! No special approval is required before opening a PR; please feel free to propose changes and open one when ready.

On the enhancement development: based on how you outlined the issue, would it be possible to add a test which checks for the leakage you mentioned? I imagine this would help prove the new capabilities in addition to making sure future changes don't reintroduce leakage. Totally open to your thoughts here (this isn't a requirement).
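One possible shape for such a test, as a minimal sketch rather than a proposal for the actual test suite: a leak-free workflow should give the same normalized value for a test row no matter what the other test rows contain. The function names below (`normalize_splits`, `leaky_normalize_splits`, `depends_on_other_test_rows`) are hypothetical stand-ins, and z-scoring with the standard library is assumed purely for illustration.

```python
from statistics import mean, pstdev


def normalize_splits(train, test):
    """Hypothetical stand-in for the proposed workflow: fit on train, apply to test."""
    mu, sigma = mean(train), pstdev(train)
    return [(v - mu) / sigma for v in test]


def leaky_normalize_splits(train, test):
    """Counter-example: statistics are computed on the pooled data."""
    pooled = train + test
    mu, sigma = mean(pooled), pstdev(pooled)
    return [(v - mu) / sigma for v in test]


def depends_on_other_test_rows(normalizer):
    """Return True if a test row's output changes when other test rows change."""
    train = [1.0, 2.0, 3.0, 4.0]
    first = normalizer(train, [2.5, 10.0])[0]
    second = normalizer(train, [2.5, -50.0])[0]
    return abs(first - second) > 1e-12


def test_no_leakage():
    # The train-only fit must be insensitive to the rest of the test split.
    assert not depends_on_other_test_rows(normalize_splits)
    # Sanity check: the check itself does detect a pooled (leaky) fit.
    assert depends_on_other_test_rows(leaky_normalize_splits)
```

Including the deliberately leaky variant keeps the test honest, since it shows the check fails when statistics are computed on pooled data.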

gwaybio · 2024-08-23T15:24:14Z

related to #154

MikeLippincott added the enhancement (New feature or request) label on Aug 22, 2024

d33bs assigned MikeLippincott Aug 23, 2024
