
@Reinaldo-Kn commented on Oct 17, 2024

These functions were implemented to address the requests in issue #45:

  • time_based_windowing()

This function applies a time-based rolling window to the dataset based on the timestamp index, allowing you to aggregate data over defined time intervals.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.
    window_size (string, optional): The size of the time window (e.g., '30min' for 30 minutes or '1H' for 1 hour).
    agg_func (string, optional): The aggregation function to apply within each window ('mean', 'sum', 'max', 'min').

Returns:

   (pandas.DataFrame): The DataFrame with aggregated values for each time window.

Raises:

   ValueError: If the DataFrame index is not a datetime type or if an unsupported aggregation function is provided.
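
A minimal sketch of how time_based_windowing() could look (the actual implementation is not shown in this PR; the rolling aggregation and the defaults below are assumptions based on the description):

```python
import pandas as pd

def time_based_windowing(df, train_or_test="train", window_size="30min", agg_func="mean"):
    """Aggregate data over time windows defined on the datetime index."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("DataFrame index must be a datetime type.")
    if agg_func not in ("mean", "sum", "max", "min"):
        raise ValueError(f"Unsupported aggregation function: {agg_func}")
    # Time-based rolling window keyed on the timestamp index;
    # df.resample(window_size).agg(agg_func) would instead return one row per window.
    return df.rolling(window_size).agg(agg_func)
```
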
  • remove_constant_difference()

This function computes the cumulative sum of the input DataFrame and removes columns where the differences between consecutive rows are constant. This can be useful for identifying and dropping features that do not provide significant variability.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.

Returns:

    (pandas.DataFrame): The DataFrame after removing columns with constant differences.

Raises:

    TypeError: If any column in the DataFrame is not numeric.
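
A sketch of remove_constant_difference() under the same assumptions (detecting a constant difference via diff().nunique() is one reading of the description, not necessarily the exact implementation):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def remove_constant_difference(df, train_or_test="train"):
    """Drop columns whose consecutive differences are constant after a cumulative sum."""
    if not all(is_numeric_dtype(df[col]) for col in df.columns):
        raise TypeError("All columns in the DataFrame must be numeric.")
    cumsum_df = df.cumsum()
    # A column has a constant difference if diff() yields a single unique value
    # (ignoring the NaN produced for the first row).
    constant_cols = [
        col for col in cumsum_df.columns
        if cumsum_df[col].diff().dropna().nunique() <= 1
    ]
    return cumsum_df.drop(columns=constant_cols)
```
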
  • differencing()

This function applies differencing to the dataset, a technique used to remove trends or seasonal patterns by subtracting the previous observation from the current one.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.
    lag (int, optional): The lag value used for differencing, with a default of 1.

Returns:

    (pandas.DataFrame): The DataFrame after applying differencing, with NaN values dropped.
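
A sketch of differencing(), assuming it is a thin wrapper around pandas diff():

```python
def differencing(df, train_or_test="train", lag=1):
    """Subtract the observation `lag` steps back from each value, dropping the resulting NaNs."""
    return df.diff(periods=lag).dropna()
```
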
  • log_transform()

This function applies a logarithmic transformation to the dataset to reduce variability and help normalize the data. The transformation used is log(1 + x) to handle values close to zero safely.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.

Returns:

    (pandas.DataFrame): The DataFrame after applying the log transformation.
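
And a sketch of log_transform(), assuming it applies NumPy's log1p element-wise:

```python
import numpy as np

def log_transform(df, train_or_test="train"):
    """Apply log(1 + x) to every value to reduce variability and skew."""
    return np.log1p(df)
```
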

More methods for _gridsearch.py

A new function has also been added for GridSearch (#52). It uses the XGBoost library to search for the best hyperparameters more quickly by running on the GPU.

  • xgboost_best()

This function performs a grid search to identify the best hyperparameters for the XGBRegressor model using a predefined parameter grid. It splits the dataset into training and testing sets, then applies GridSearchCV to tune the model for optimal performance.

Parameters:

    df (pandas.DataFrame): The dataset to be used for training and testing.
    target_column (str): The name of the target column that the model will predict.

Returns:

    (XGBRegressor): The best XGBoost estimator found during the grid search.

Key Attributes:

    self.best_params_: Stores the best parameters found by the grid search.
    self.best_estimator_: The best estimator (model) found based on the grid search results.

Grid Search Default Parameters:

    n_estimators: [100, 200]
    learning_rate: [0.05, 0.1]
    max_depth: [None, 6]
    subsample: [0.1, 0.5]
    colsample_bytree: [0.1, 0.5]
    gamma: [0, 0.1]
    reg_alpha: [0, 0.1]
    reg_lambda: [0.1, 0.5]

Notes:

  • The grid search utilizes the GPU for faster computation by setting tree_method="gpu_hist" and predictor="gpu_predictor".
  • The function is designed to optimize the model based on the scoring method and cross-validation (cv) specified during class initialization.
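
A sketch of how xgboost_best() could look as a method of the grid-search class (the class name, the scoring/cv defaults, the 80/20 split, and random_state=42 are assumptions; the parameter grid and GPU settings come from this PR):

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

class GridSearch:
    def __init__(self, scoring="neg_mean_squared_error", cv=3):
        # Scoring and cross-validation settings are fixed at class initialization.
        self.scoring = scoring
        self.cv = cv

    def xgboost_best(self, df, target_column):
        """Grid-search an XGBRegressor on the GPU and return the best estimator."""
        X = df.drop(columns=[target_column])
        y = df[target_column]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42  # split sizes assumed, not stated in the PR
        )

        param_grid = {
            "n_estimators": [100, 200],
            "learning_rate": [0.05, 0.1],
            "max_depth": [None, 6],
            "subsample": [0.1, 0.5],
            "colsample_bytree": [0.1, 0.5],
            "gamma": [0, 0.1],
            "reg_alpha": [0, 0.1],
            "reg_lambda": [0.1, 0.5],
        }
        model = XGBRegressor(tree_method="gpu_hist", predictor="gpu_predictor")
        search = GridSearchCV(model, param_grid, scoring=self.scoring, cv=self.cv)
        search.fit(X_train, y_train)

        self.best_params_ = search.best_params_
        self.best_estimator_ = search.best_estimator_
        return self.best_estimator_
```
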

You can view the new functions in Colab.
