
FeatureRequest: Machine Learning Data Split Functionality #427

Open
2 tasks done
MikeLippincott opened this issue Aug 22, 2024 · 3 comments
Labels: enhancement (New feature or request)

Comments
@MikeLippincott

Feature type

  • Add new functionality

  • Change existing functionality

General description of the proposed functionality

This functionality is multi-part:
Currently, machine learning data splits are performed after normalization and feature selection.
This creates the potential for data leakage into the models.
The proposed fix is to implement functions that perform data splits prior to normalization.
To implement this, normalization and feature selection would need to be fit on the training split and then propagated to the validation, testing, and (optional) holdout splits.
I envision most of this functionality being carried out by the user.
To do so, I suggest updating the normalize and feature selection functions and implementing two more functions:

  1. Normalize function: save the transformation fitted on the training data
  2. Implement a normalize-transform function that propagates the saved transformation to the test data splits
  3. Feature-select function: save the list of selected feature columns
  4. Implement a function that propagates the selected feature columns to each test data split
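The four steps above can be sketched end to end. This is a minimal illustration of the split-then-fit pattern, not an existing pycytominer API: the function names, signatures, and the `save_xform` flag are hypothetical, and standardization (mean/std) stands in for whatever normalization method is chosen.

```python
import numpy as np

def normalize(train: np.ndarray, save_xform: bool = False):
    """Standardize the training split; optionally return the fitted parameters."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    normalized = (train - mean) / std
    if save_xform:
        # The saved transform is just the fit parameters (here: mean and std).
        return normalized, {"mean": mean, "std": std}
    return normalized

def apply_xform(data: np.ndarray, xform: dict) -> np.ndarray:
    """Propagate a transform fitted on the training split to another split."""
    return (data - xform["mean"]) / xform["std"]

# Split BEFORE fitting: the statistics come from the training split only,
# so nothing about the test split leaks into the transform.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
train, test = data[:80], data[80:]

train_norm, xform = normalize(train, save_xform=True)
test_norm = apply_xform(test, xform)
```

Because the test split is transformed with parameters estimated purely from the training split, re-splitting or swapping the test data cannot change the fitted transform.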

Feature example

change function:

def normalize(*args, save_xform: bool = False, **kwargs):
    # ... existing normalization logic ...
    if save_xform:
        # save the transform fitted on the training data (probably as a
        # numpy array; depending on the method it could be parameters
        # instead, e.g. mean and std)
        ...

new function:

def apply_xform(*args, **kwargs):
    # apply the saved transform to the test data splits here
    ...

change function:

def feature_select(*args, save_selected_feature_list: bool = False, **kwargs):
    # ... existing feature selection logic ...
    if save_selected_feature_list:
        # save the list of selected features so it can be applied
        # to the other data splits
        ...

new function:

def apply_selected_features(*args, **kwargs):
    # apply the features selected on the train split to the test splits
    ...
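A concrete sketch of the feature-selection pair above, again with hypothetical signatures (these are not pycytominer's actual APIs), using a trivial zero-variance filter as the stand-in selection method:

```python
import pandas as pd

def feature_select(train: pd.DataFrame, save_selected_feature_list: bool = False):
    """Drop zero-variance columns, deciding which to keep from the training split only."""
    kept = [col for col in train.columns if train[col].nunique() > 1]
    selected = train[kept]
    if save_selected_feature_list:
        return selected, kept
    return selected

def apply_selected_features(data: pd.DataFrame, feature_list: list) -> pd.DataFrame:
    """Subset any other split to the columns selected on the training split."""
    return data[feature_list]

# Column "b" is constant in train, so it is dropped everywhere,
# even though it happens to vary in the test split.
train = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7], "c": [4, 5, 6]})
test = pd.DataFrame({"a": [9, 8], "b": [7, 1], "c": [0, 2]})

train_sel, features = feature_select(train, save_selected_feature_list=True)
test_sel = apply_selected_features(test, features)
```

The key property is that the column list is decided once, on the training split, and merely replayed on the other splits.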

Alternative Solutions

No response

Additional information

No response

@MikeLippincott MikeLippincott added the enhancement New feature or request label Aug 22, 2024
@MikeLippincott (Author)

This is something that I can implement. Do I need approval prior to opening a PR?

@d33bs (Member)

d33bs commented Aug 23, 2024

Thanks for adding this issue @MikeLippincott ! No special approval required before opening a PR, please feel free to propose changes and open when ready.

On the enhancement development: based on how you outlined the issue, would it be possible to add a test which checks for the leakage you mentioned? I imagine this would help prove the new capabilities, in addition to making sure future changes don't reintroduce leakage. Totally open to your thoughts here (this isn't a requirement).
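One way the suggested leakage regression test could look, as a hedged sketch (the `fit_standardizer` helper is hypothetical, not an existing pycytominer function): fit the transform on the training split only, then assert its parameters match train-only statistics and differ from statistics that include the test split.

```python
import numpy as np

def fit_standardizer(data: np.ndarray) -> dict:
    """Hypothetical helper: compute standardization parameters from `data`."""
    return {"mean": data.mean(axis=0), "std": data.std(axis=0)}

def test_transform_ignores_test_split():
    rng = np.random.default_rng(42)
    train = rng.normal(size=(50, 4))
    test = rng.normal(loc=10.0, size=(20, 4))  # deliberately shifted

    correct = fit_standardizer(train)                    # fit on train only
    leaky = fit_standardizer(np.vstack([train, test]))   # what leakage looks like

    # The saved transform must match train-only statistics...
    assert np.allclose(correct["mean"], train.mean(axis=0))
    # ...and must NOT match statistics contaminated by the test split.
    assert not np.allclose(correct["mean"], leaky["mean"])
```

A test in this shape would fail if a future change accidentally re-fit the transform on pooled data, which is exactly the regression worth guarding against.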

@gwaybio
Copy link
Member

gwaybio commented Aug 23, 2024

related to #154
