Describe the bug
There are two related bugs:
- A variable that is one-hot encoded in the training dataset has categorical values that do not exist in the testing dataset.
- A variable that is one-hot encoded in the testing dataset has categorical values that do not exist in the **training dataset**.
In both cases the transformed training and testing dataframes end up with different numbers of columns, which causes errors in a pipeline.
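To make the mismatch concrete, here is a minimal sketch on toy data. It uses `pandas.get_dummies` as a stand-in that encodes each dataframe independently; it is not feature-engine's encoder, and the `colour` column and its values are invented for illustration.

```python
import pandas as pd

# Toy data (invented): "green" and "yellow" appear only in train, "blue" only in test.
X_train = pd.DataFrame({"colour": ["red", "green", "yellow"]})
X_test = pd.DataFrame({"colour": ["red", "blue"]})

# Stand-in encoding: each frame is encoded on its own, so the resulting
# columns depend on the categories present in that frame.
train_enc = pd.get_dummies(X_train, columns=["colour"])
test_enc = pd.get_dummies(X_test, columns=["colour"])

# Columns the training frame has that the test frame lacks (first bug) ...
missing_in_test = sorted(set(train_enc.columns) - set(test_enc.columns))
# ... and columns the test frame has that training never saw (second bug).
extra_in_test = sorted(set(test_enc.columns) - set(train_enc.columns))

print(train_enc.shape, test_enc.shape)
print(missing_in_test, extra_in_test)
```

Any downstream step that checks the column count against the fitted dataframe would reject `test_enc` here.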
To Reproduce
Steps to reproduce the behavior:
Expected behavior
OneHotEncoder should ensure that the transformed training and testing dataframes have the same columns.
Proposed solution for Issue #1:
- Reindex the test dataset using `get_feature_names_out()`. Something like:
expected_columns = X_train_prcsd.get_feature_names_out()
X_test_prcsd = X_test_prcsd.reindex(columns=expected_columns, fill_value=0)
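A runnable sketch of the reindexing idea, under some assumptions: `X_train_prcsd` and `X_test_prcsd` are taken to be already-encoded pandas DataFrames (the values below are invented), and the expected columns are read from the training frame's `.columns`. In the scikit-learn style API, `get_feature_names_out()` is a method of the fitted transformer rather than of a DataFrame, so inside the encoder the same list would presumably come from there.

```python
import pandas as pd

# Already-encoded toy frames (invented values). The test frame lacks
# "colour_green" because that category never occurred in the test data.
X_train_prcsd = pd.DataFrame({"colour_green": [0, 1], "colour_red": [1, 0]})
X_test_prcsd = pd.DataFrame({"colour_red": [1, 1]})

# Assumption: the training frame's columns define the expected layout.
expected_columns = list(X_train_prcsd.columns)

# reindex adds any missing column filled with 0 and drops any column
# that is not in expected_columns, so the layouts match afterwards.
X_test_aligned = X_test_prcsd.reindex(columns=expected_columns, fill_value=0)

print(X_test_aligned)
```

Because `reindex` also drops columns that are not in `expected_columns`, the same call would silently discard indicator columns for categories seen only in the test data.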
Proposed solution for Issue #2:
Add a `handle_unknown` parameter. If the user sets it to `ignore`, new categorical values in the test dataset will not be encoded.
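One way the proposed behaviour could look, sketched as a standalone helper. `one_hot_known` and its arguments are hypothetical and not part of feature-engine; the option names loosely follow scikit-learn's `OneHotEncoder`, which accepts `handle_unknown="ignore"`.

```python
import pandas as pd

def one_hot_known(X, variable, categories, handle_unknown="ignore"):
    """Hypothetical helper: encode `variable` using only the categories
    learned at fit time, so the output columns are always the same."""
    unknown = set(X[variable].dropna()) - set(categories)
    if unknown and handle_unknown == "error":
        raise ValueError(f"Unknown categories in {variable!r}: {sorted(unknown)}")
    out = X.drop(columns=[variable])
    for category in categories:
        # Rows holding an unseen category get 0 in every indicator column.
        out[f"{variable}_{category}"] = (X[variable] == category).astype(int)
    return out

# "blue" was never seen during fit (categories are invented for illustration).
X_test = pd.DataFrame({"colour": ["red", "blue"]})
encoded = one_hot_known(X_test, "colour", categories=["green", "red"])
print(encoded)
```

With `ignore`, the row containing the unseen value is encoded as all zeros and the column layout stays fixed; with `error`, the helper raises instead.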
Screenshots
feature-engine error that is returned:
    def _check_X_matches_training_df(X: pd.DataFrame, reference: int) -> None:
        """
        Checks that DataFrame to transform has the same number of columns that the
        DataFrame used with the fit() method.

        Parameters
        ----------
        X : Pandas DataFrame
            The df to be checked
        reference : int
            The number of columns in the dataframe that was used with the fit() method.

        Raises
        ------
        ValueError
            If the number of columns does not match.

        Returns
        -------
        None
        """
        if X.shape[1] != reference:
>           raise ValueError(
                "The number of columns in this dataset is different from the one used to "
                "fit this transformer (when using the fit() method)."
            )
E           ValueError: The number of columns in this dataset is different from the one used to fit this transformer (when using the fit() method).

venv/lib/python3.11/site-packages/feature_engine/dataframe_checks.py:239: ValueError
Desktop (please complete the following information):
- OS: macOS
- Browser: N/A
- Version: Latest version
Additional context
feature-engine rulesss!