@Reinaldo-Kn

This pull request introduces a set of functions in _best_columns.py that evaluate each column of a dataset as a candidate target variable, using statistical measures and predictive performance. The functions rely on correlation metrics, regression error metrics, feature importance, and mutual information to indicate which columns are most relevant for predictive modeling.

Functions

  • bestColumn_pearson_spearman( )

Calculates the Pearson and Spearman correlation coefficients between all columns in the provided DataFrame. For each method, it identifies the column with the highest average positive correlation with the other columns. Pearson measures linear relationships, while Spearman measures monotonic relationships.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing the dataset.
    Returns:
        dict: A dictionary containing the best column for each correlation method:
            pearson: The column with the highest average Pearson correlation.
            spearman: The column with the highest average Spearman correlation.
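
As a rough illustration of the idea described above, and not the code from this pull request, the selection could be sketched with pandas alone. The function name `best_column_by_correlation` and the demo data are hypothetical. The sketch also assumes that "average correlation" means the signed mean of a column's correlations with every other column, with the self-correlation excluded.

```python
import numpy as np
import pandas as pd


def best_column_by_correlation(df: pd.DataFrame) -> dict:
    """Hypothetical helper: pick the column with the highest mean correlation."""
    numeric = df.select_dtypes(include="number")
    best = {}
    for method in ("pearson", "spearman"):
        corr = numeric.corr(method=method)
        # Mask the diagonal so a column's self-correlation (always 1.0)
        # does not inflate its average.
        off_diagonal = corr.where(~np.eye(len(corr), dtype=bool))
        # Assumption: signed mean; the PR may treat negative values differently.
        best[method] = off_diagonal.mean(axis=1).idxmax()
    return best


# Synthetic demo data: y and z both depend on x, "noise" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.5, size=200),
    "z": x + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})
result = best_column_by_correlation(demo)
print(result)
```

On this data the independent `noise` column should never be selected, since its correlations with the other columns are close to zero.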
  • bestColumn_with_least_mae_or_r2( )

Evaluates each column as a target variable for regression and calculates the Mean Absolute Error (MAE) and R-squared (R²) scores for predictions made by an XGBoost regressor. It identifies which column minimizes MAE and which maximizes R², indicating the best target variable in terms of predictive performance.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing the dataset.
    Returns:
        dict: A dictionary with sorted results for MAE and R²:
            mae: Sorted list of columns minimizing Mean Absolute Error.
            r2: Sorted list of columns maximizing R-squared.
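
The per-column evaluation described above might look roughly like the following sketch. It is not the implementation from this pull request: the name `rank_targets_by_mae_and_r2`, the 75/25 train/test split, and `n_estimators=50` are assumptions made for illustration. The sketch uses `xgboost.XGBRegressor` when it is installed and otherwise falls back to scikit-learn's `GradientBoostingRegressor`, which is a different model and is used here only so the example runs without xgboost.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Stand-in model (not XGBoost) so the sketch still runs.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor


def rank_targets_by_mae_and_r2(df: pd.DataFrame, random_state: int = 0) -> dict:
    """Hypothetical helper: score every numeric column as a regression target."""
    numeric = df.select_dtypes(include="number").dropna()
    scores = []
    for target in numeric.columns:
        X = numeric.drop(columns=target)
        y = numeric[target]
        # Assumed 75/25 hold-out split; the PR may evaluate differently.
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=random_state
        )
        model = Regressor(n_estimators=50, random_state=random_state)
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores.append((
            target,
            float(mean_absolute_error(y_test, pred)),
            float(r2_score(y_test, pred)),
        ))
    return {
        # Lower MAE is better, so sort ascending.
        "mae": sorted(((c, m) for c, m, _ in scores), key=lambda t: t[1]),
        # Higher R² is better, so sort descending.
        "r2": sorted(((c, r) for c, _, r in scores), key=lambda t: t[1], reverse=True),
    }


# Synthetic demo data: x and y predict each other, "noise" is unpredictable.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
demo = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=300),
    "noise": rng.normal(size=300),
})
result = rank_targets_by_mae_and_r2(demo)
print(result["mae"])
print(result["r2"])
```

One caveat when reading the two rankings: MAE is expressed in the units of the target column, so columns on different scales are not directly comparable by MAE, whereas R² is scale-free.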
  • bestColumn_feature_importance( )

Evaluates the importance of each feature by training an XGBoost regressor with each column in turn as the target and averaging the resulting feature importances. This helps to identify which columns contribute most to predicting the others.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing the dataset.
    Returns:
        list: A sorted list of feature importances for each column, indicating their contribution to predictive modeling.
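
One way to read the averaging described above is sketched below, under stated assumptions rather than as the code in this pull request. The name `average_feature_importance` is hypothetical. Each column is used once as the target, a model is fit on the remaining columns, and every feature's importance is averaged over the models in which it appeared. As before, scikit-learn's `GradientBoostingRegressor` is only a stand-in when xgboost is not installed.

```python
import numpy as np
import pandas as pd

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:
    # Stand-in model (not XGBoost) so the sketch still runs.
    from sklearn.ensemble import GradientBoostingRegressor as Regressor


def average_feature_importance(df: pd.DataFrame, random_state: int = 0) -> list:
    """Hypothetical helper: mean importance of each column across all targets."""
    numeric = df.select_dtypes(include="number").dropna()
    collected = {col: [] for col in numeric.columns}
    for target in numeric.columns:
        X = numeric.drop(columns=target)
        model = Regressor(n_estimators=50, random_state=random_state)
        model.fit(X, numeric[target])
        # feature_importances_ follows the column order of X.
        for feature, importance in zip(X.columns, model.feature_importances_):
            collected[feature].append(float(importance))
    averaged = {col: float(np.mean(vals)) for col, vals in collected.items()}
    # Most important column first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Synthetic demo data: y and z depend on x, "noise" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
demo = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=300),
    "z": -x + rng.normal(scale=0.1, size=300),
    "noise": rng.normal(size=300),
})
result = average_feature_importance(demo)
for column, importance in result:
    print(f"{column}: {importance:.3f}")
```

The importance values depend on the model and its importance type, so the ranking is more meaningful than the absolute numbers.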

  • bestColumn_mutual_information( )

Calculates the mutual information scores between each column and the other columns in the DataFrame. Mutual information quantifies how much information one variable provides about another, which helps to determine which features are most informative for predicting the target variable.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing the dataset.
    Returns:
        list: A sorted list of mutual information scores for each column, providing insights into their informational value.
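
A minimal sketch of this ranking, assuming scikit-learn's `mutual_info_regression` as the estimator (the pull request text does not say which one is used): the function name `rank_columns_by_mutual_information` is hypothetical, and averaging the scores per column is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression


def rank_columns_by_mutual_information(df: pd.DataFrame, random_state: int = 0) -> list:
    """Hypothetical helper: mean mutual information of each column with the rest."""
    numeric = df.select_dtypes(include="number").dropna()
    scores = {}
    for target in numeric.columns:
        X = numeric.drop(columns=target)
        # One score per remaining column, estimated against the target.
        mi = mutual_info_regression(X, numeric[target], random_state=random_state)
        # Assumption: summarize a column by its mean score with the others.
        scores[target] = float(mi.mean())
    # Most informative column first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Synthetic demo data: y is linear in x, z is nonlinear in x, "noise" is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
demo = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.1, size=300),
    "z": x ** 2 + rng.normal(scale=0.1, size=300),
    "noise": rng.normal(size=300),
})
result = rank_columns_by_mutual_information(demo)
for column, score in result:
    print(f"{column}: {score:.3f}")
```

The `z` column is included because mutual information can pick up nonlinear dependence such as `x ** 2`, which Pearson correlation largely misses when `x` is symmetric around zero.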

You can view the new functions in Colab
