How does OneHotEncoder handle difference in categories in variable names? Currently a mismatch in shape? #832

@Morgan-Sell

Describe the bug
There are two related bugs:

  1. A variable that is one-hot encoded in the training dataset has categorical values that do not exist in the testing dataset.
  2. A variable that is one-hot encoded in the testing dataset has categorical values that do not exist in the **training dataset**.

In both cases the transformed dataframes end up with different shapes, which causes errors later in a pipeline.
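
To illustrate, here is a minimal sketch of one way the mismatch can arise, assuming each dataset is encoded independently. It uses pandas.get_dummies as a stand-in rather than feature-engine's OneHotEncoder, and the toy data and variable names are made up for the example:

```python
import pandas as pd

# Hypothetical toy data: "green" and "yellow" appear only in train,
# "red" appears only in test.
X_train = pd.DataFrame({"colour": ["blue", "green", "yellow"]})
X_test = pd.DataFrame({"colour": ["blue", "red"]})

# Encoding each dataset on its own derives the dummy columns from
# whatever categories happen to be present in that dataset.
train_encoded = pd.get_dummies(X_train, columns=["colour"])
test_encoded = pd.get_dummies(X_test, columns=["colour"])

# Columns that exist on one side only.
train_only = sorted(set(train_encoded.columns) - set(test_encoded.columns))
test_only = sorted(set(test_encoded.columns) - set(train_encoded.columns))

print(train_encoded.shape, test_encoded.shape)
print("only in train:", train_only)
print("only in test:", test_only)
```

The two encoded frames differ in both the number and the names of their columns, so any downstream step that checks the column count against the training data will fail.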

To Reproduce
Steps to reproduce the behavior:

Expected behavior
OneHotEncoder should ensure that the transformed training and testing dataframes have the same number of columns.

Proposed solution for Issue #1:

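  • Reindex the test dataset using get_feature_names_out() from the fitted encoder (encoder below is a placeholder name). Something like:
expected_columns = encoder.get_feature_names_out()  # called on the fitted encoder, not on the dataframe
X_test_prcsd = X_test_prcsd.reindex(columns=expected_columns, fill_value=0)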
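
A small runnable sketch of the reindexing idea, using only pandas. The two encoded frames are hand-built and hypothetical, and the expected columns are taken from the encoded training frame instead of an encoder, so this shows the mechanics only:

```python
import pandas as pd

# Hypothetical already-encoded frames; in practice these would come
# from the encoder's transform() output.
X_train_prcsd = pd.DataFrame({"colour_blue": [1, 0], "colour_green": [0, 1]})
X_test_prcsd = pd.DataFrame({"colour_blue": [1, 0], "colour_red": [0, 1]})

# Use the training columns as the reference layout.
expected_columns = X_train_prcsd.columns

# Missing columns are created and filled with 0; columns that are not
# in expected_columns are dropped.
X_test_aligned = X_test_prcsd.reindex(columns=expected_columns, fill_value=0)

print(X_test_aligned)
```

Note that reindex also drops test-only columns such as colour_red, so the same call silently discards categories that were never seen in training.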

Proposed solution for Issue #2:
Add a handle_unknown parameter. If the user sets it to ignore, categorical values that appear only in the test dataset will not be encoded.
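
A rough sketch of what the ignore behaviour could look like. The helper one_hot_known_categories and its signature are hypothetical and are not part of feature-engine; the idea is similar in spirit to handle_unknown="ignore" in scikit-learn's OneHotEncoder, where an unseen category produces an all-zero row:

```python
import pandas as pd


def one_hot_known_categories(X, variable, categories, handle_unknown="ignore"):
    """Hypothetical helper: encode `variable` using only `categories`.

    `categories` stands in for the values learned during fit().
    """
    unseen = sorted(set(X[variable].dropna()) - set(categories))
    if unseen and handle_unknown != "ignore":
        raise ValueError(f"Unseen categories in '{variable}': {unseen}")

    # Start from the other columns, then add one dummy per known category.
    encoded = X.drop(columns=[variable])
    for category in categories:
        encoded[f"{variable}_{category}"] = (X[variable] == category).astype(int)
    return encoded


# "red" was not seen in training, so its row gets zeros in every dummy.
X_test = pd.DataFrame({"colour": ["blue", "red"], "size": [1, 2]})
encoded = one_hot_known_categories(X_test, "colour", ["blue", "green"])
print(encoded)
```

Because the output columns depend only on the known categories, the encoded test frame always has the same layout as the training one.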

Screenshots
feature-engine error that is returned:


    def _check_X_matches_training_df(X: pd.DataFrame, reference: int) -> None:
        """
        Checks that DataFrame to transform has the same number of columns that the
        DataFrame used with the fit() method.
    
        Parameters
        ----------
        X : Pandas DataFrame
            The df to be checked
        reference : int
            The number of columns in the dataframe that was used with the fit() method.
    
        Raises
        ------
        ValueError
            If the number of columns does not match.
    
        Returns
        -------
        None
        """
    
        if X.shape[1] != reference:
>           raise ValueError(
                "The number of columns in this dataset is different from the one used to "
                "fit this transformer (when using the fit() method)."
            )
E           ValueError: The number of columns in this dataset is different from the one used to fit this transformer (when using the fit() method).

venv/lib/python3.11/site-packages/feature_engine/dataframe_checks.py:239: ValueError

Desktop (please complete the following information):

  • OS: macOS
  • Browser: N/A
  • Version: Latest version

Additional context
feature-engine rulesss!
