
Enhance Dataset Preprocessing with Feature Indices and Bipartite Graph Support #39


Merged

haru-256 merged 14 commits into main from feat/bipartite-graph-preprocess-dataset on Jul 19, 2025

Conversation

@haru-256 (Owner) commented on Jun 7, 2025

User description

This pull request refactors and enhances the preprocessing functions for the Amazon Reviews dataset, introducing modularity and new functionality. Key changes include splitting the preprocessing logic into reusable components, adding a preprocessing function for bipartite graph construction, and improving the handling of edge cases such as missing categories. These updates make the codebase more maintainable and extensible.

Refactoring and Modularity Improvements:

  • Introduced _common_preprocess_dataset to encapsulate shared preprocessing logic, such as schema overrides, metadata joining, and history filtering. This modular function is reused across other preprocessing functions.
  • Refactored seq_rec_preprocess_dataset to delegate common preprocessing steps to _common_preprocess_dataset, improving code reuse and clarity.

New Functionality:

  • Added build_feature_indices, a dedicated function for generating user, item, and category indices, along with a mapping between item indices and category indices. This separates index-building logic from dataset preprocessing for better maintainability.
  • Created bipartite_graph_preprocess_dataset, a new function that prepares the dataset specifically for bipartite graph analysis. It aggregates user-item interactions and computes additional features such as the number of ratings; a usage sketch follows below.
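
As a rough usage sketch (not code from this PR), calling the bipartite entry point might look like the following; dataset_dict and metadata stand in for objects loaded with the datasets library, and the return shape follows the corrected docstring discussed in the review below:

```python
from ml_sandbox_libs.data.amazon_reviews_dataset import bipartite_graph_preprocess_dataset

# dataset_dict / metadata are placeholders for DatasetDict / Dataset objects
# loaded via the `datasets` library.
(
    train_df,
    val_df,
    test_df,
    user2index,
    item2index,
    category2index,
    item_index_2_category_index,
) = bipartite_graph_preprocess_dataset(dataset_dict, metadata)

# Each row of the returned frames is one aggregated (user, item) interaction,
# with num_ratings recording how many raw events were collapsed into it.
```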

Robustness Enhancements:

  • Improved handling of missing or null categories during category indexing, including logging warnings and creating minimal indices when necessary. This ensures the preprocessing pipeline is resilient to edge cases.

PR Type

Enhancement


Description

  • Extracts feature index building into new function.

  • Introduces common dataset preprocessing utility.

  • Refactors sequential recommendation preprocessing.

  • Adds new bipartite graph preprocessing function.


Changes walkthrough 📝

Relevant files:

Enhancement: amazon_reviews_dataset.py (+300/-100)
libs/ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py
Refactor and add new dataset preprocessing functions:

  • Extracted feature index generation into a new build_feature_indices function, handling user, item, and category indices, including special UNK/PAD tokens and robust null category handling.
  • Introduced _common_preprocess_dataset to encapsulate shared data loading, schema overrides, metadata joining, and history filtering logic.
  • Refactored seq_rec_preprocess_dataset to utilize _common_preprocess_dataset for initial processing, improving modularity.
  • Added a new bipartite_graph_preprocess_dataset function for bipartite graph analysis, which aggregates user-item interactions and calculates num_ratings.

Tests: test_amazon_reviews_dataset.py (+470/-1)
libs/ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py
Add comprehensive tests for new and refactored preprocessing functions:

  • Added a new test test_build_feature_indices to validate the functionality of the extracted build_feature_indices function.
  • Implemented test_seq_rec_preprocess_dataset to verify the refactored sequential recommendation preprocessing, including history processing.
  • Introduced test_bipartite_graph_preprocess_dataset to ensure the correct aggregation and output of the new bipartite graph preprocessing.
  • Updated imports to include the newly added functions for testing.

@haru-256 requested a review from Copilot on June 7, 2025 at 15:15

@gemini-code-assist bot (Contributor) left a comment:

Hello @haru-256, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

    Summary of Changes

    Hello! Gemini here, providing a summary of this pull request. This PR focuses on enhancing the dataset preprocessing pipeline for the Amazon Reviews dataset. The primary goals are to improve modularity by extracting common logic, introduce a dedicated function for building feature indices (users, items, categories), and add support for preparing the dataset specifically for bipartite graph analysis. This refactoring aims to make the preprocessing code more maintainable and adaptable for different downstream tasks.

    Highlights

    • Modularity Refactoring: Common preprocessing steps, such as converting to Polars, joining metadata, and filtering history, have been extracted into a new internal helper function _common_preprocess_dataset to reduce code duplication.
    • Dedicated Index Building: A new public function build_feature_indices has been created to specifically handle the generation of user, item, and category index mappings, including the item-to-category index mapping. This separates the index creation logic from the main dataset preprocessing flow.
    • Bipartite Graph Preprocessing: A new public function bipartite_graph_preprocess_dataset has been added. This function leverages the common preprocessing steps and then aggregates the data by user-item pairs, computing features like the number of ratings per pair, suitable for bipartite graph representations.
    • Improved Robustness: The index building logic now includes more explicit handling and logging for cases where items might be missing category information, ensuring the category index mapping is robust.

    Changelog

    • libs/ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py
      • Extracted user, item, and category index building logic into a new function build_feature_indices (lines 99-261).
      • Introduced an internal helper function _common_preprocess_dataset to encapsulate shared data loading, metadata joining, and history filtering steps (lines 264-341).
      • Refactored seq_rec_preprocess_dataset to call _common_preprocess_dataset for initial processing (lines 344-382).
      • Added a new function bipartite_graph_preprocess_dataset for preparing data for bipartite graphs, including aggregation by user-item pairs and counting interactions (lines 807-879).
      • Enhanced category index building in build_feature_indices to handle null categories and ensure special indices are mapped correctly (lines 183-195, 242-254).
    • libs/ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py
      • Imported new and refactored functions (build_feature_indices, seq_rec_preprocess_dataset, bipartite_graph_preprocess_dataset) and SpecialIndex (lines 4-10).
      • Added a new test test_build_feature_indices to verify the functionality of the new index building function (lines 21-82).
      • Added a new test test_seq_rec_preprocess_dataset to verify the refactored sequential recommendation preprocessing function, including its use of the common helper (lines 84-249).
      • Added a new test test_bipartite_graph_preprocess_dataset to verify the new bipartite graph preprocessing function, including its aggregation logic and use of the common helper with filter_no_history=False (lines 252-480).

@github-actions bot changed the title from "feat: enhance dataset preprocessing with build_feature_indices and bipartite graph support" to "Enhance Dataset Preprocessing with Feature Indices and Bipartite Graph Support" on Jun 7, 2025

github-actions bot commented on Jun 7, 2025:

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    🔀 No multiple PR themes
    ⚡ Recommended focus areas for review

    Category Indexing

    The new logic in build_feature_indices for handling cases where no non-null categories are found in the training data seems robust, creating a minimal index and logging a warning. However, it's important to confirm that this behavior is acceptable for all downstream uses, especially if category information is critical for the models.

```python
train_df_for_category_count = train_df.filter(pl.col("category").is_not_null())
if train_df_for_category_count.is_empty() and not train_df.is_empty():
    logger.warning(
        "No non-null categories found in train_df for category indexing. Category index will be minimal."
    )
    # Create a minimal category2index if no categories are found to prevent errors downstream
    category2index: dict[str, int] = {"#UNK": SpecialIndex.UNK, "#PAD": SpecialIndex.PAD}
    train_categories = pl.DataFrame({"category": []})  # Empty DataFrame
elif train_df_for_category_count.is_empty() and train_df.is_empty():
    logger.warning("train_df is empty. Category index will be minimal.")
    category2index = {"#UNK": SpecialIndex.UNK, "#PAD": SpecialIndex.PAD}
    train_categories = pl.DataFrame({"category": []})  # Empty DataFrame
```
    Aggregation Strategy

    In bipartite_graph_preprocess_dataset, when grouping by user_id and parent_asin, the aggregation uses pl.max("rating") and pl.min("timestamp"). This means for multiple interactions between the same user and item, the highest rating and earliest timestamp are kept. Please confirm if this specific aggregation strategy (max rating, min timestamp) is the intended and most appropriate choice for the bipartite graph analysis, as other aggregations (e.g., average rating, latest timestamp) might also be valid depending on the model's requirements.

```python
df = df.group_by("user_id", "parent_asin").agg(
    pl.max("user_index").alias("user_index"),
    pl.max("item_index").alias("item_index"),
    pl.max("category").alias("category"),
    pl.max("category_index").alias("category_index"),
    pl.max("rating").alias("rating"),
    pl.min("timestamp").alias("timestamp"),
    pl.len().alias("num_ratings"),
)
```
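
For comparison, here is a hedged sketch of one alternative the guide mentions (average rating, latest timestamp); which aggregation is right depends on the downstream model:

```python
# Illustrative alternative only (not code from this PR): keep the mean rating
# and the most recent interaction per (user, item) pair.
df = df.group_by("user_id", "parent_asin").agg(
    pl.max("user_index").alias("user_index"),
    pl.max("item_index").alias("item_index"),
    pl.mean("rating").alias("rating"),       # average instead of max
    pl.max("timestamp").alias("timestamp"),  # latest instead of earliest
    pl.len().alias("num_ratings"),
)
```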

Copilot AI left a comment (marked as outdated).

Comment on lines 818 to 828:

```python
"""Preprocess the dataset for bipartite graph

Args:
    dataset_dict: dataset from the datasets library(transformers)
    metadata: metadata dataset from the datasets library(transformers)

Returns:
    bipartite_df: bipartite graph dataframe
    user2index: user to index dictionary
    item2index: item to index dictionary
"""
```

    Suggestion: The docstring for bipartite_graph_preprocess_dataset does not match the actual return values. It currently lists only bipartite_df, user2index, and item2index, but the function returns train_df, val_df, test_df, user2index, item2index, category2index, and item_index_2_category_index. Update the docstring to accurately reflect all returned values for clarity and correctness. [possible issue, importance: 7]

Suggested change (corrected docstring):

```python
"""Preprocess the dataset for bipartite graph

Args:
    dataset_dict: dataset from the datasets library(transformers)
    metadata: metadata dataset from the datasets library(transformers)

Returns:
    train_df: train dataset for bipartite graph
    val_df: validation dataset for bipartite graph
    test_df: test dataset for bipartite graph
    user2index: user to index dictionary
    item2index: item to index dictionary
    category2index: category to index dictionary
    item_index_2_category_index: item index to category index dictionary
"""
```

Comment on lines 109 to 123:

```python
"""Builds indices for users, items, and categories from the provided datasets.

Args:
    dataset: dataset from the datasets library(transformers)
    metadata: metadata dataset from the datasets library(transformers)
    filter_no_history: whether to filter out the dataset which has no history. Defaults to True.
    train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering).
    meta_df: The Polars DataFrame containing metadata for items, including categories.
    threshold: The threshold for `unk_filter_by_count` to determine frequent users/items/categories. Defaults to 0.95.

Returns:
    train_df: train dataset
    val_df: validation dataset
    test_df: test dataset
    user2index: user to index dictionary
    item2index: item to index dictionary
    category2index: category to index dictionary
    A tuple containing:
    - user2index: Mapping from user ID string to integer index.
    - item2index: Mapping from item ID (parent_asin) string to integer index.
    - category2index: Mapping from category string to integer index.
    - item_index_2_category_index: Mapping from item integer index to category integer index.
    - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
"""
```

    Suggestion: The docstring for build_feature_indices incorrectly lists processed_train_df as a return value. This DataFrame is not returned by the function. Remove this entry from the docstring to align with the actual function signature and prevent confusion. [general, importance: 7]

Suggested change (corrected docstring):

```python
"""Builds indices for users, items, and categories from the provided datasets.

Args:
    train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering).
    meta_df: The Polars DataFrame containing metadata for items, including categories.
    threshold: The threshold for `unk_filter_by_count` to determine frequent users/items/categories. Defaults to 0.95.

Returns:
    A tuple containing:
    - user2index: Mapping from user ID string to integer index.
    - item2index: Mapping from item ID (parent_asin) string to integer index.
    - category2index: Mapping from category string to integer index.
    - item_index_2_category_index: Mapping from item integer index to category integer index.
"""
```

@gemini-code-assist bot (Contributor) left a comment:

    Code Review

    This pull request introduces excellent refactoring and new features for dataset preprocessing. The modularity is greatly improved with _common_preprocess_dataset and build_feature_indices. The new bipartite_graph_preprocess_dataset is a valuable addition. Error handling for categories in build_feature_indices is also a good enhancement. The accompanying tests are thorough and well-written.

    Overall, the changes are robust and significantly improve the codebase. My feedback primarily focuses on ensuring documentation (docstrings) accurately reflects the function signatures and return values.

    Summary of Findings

    • Docstring Accuracy for build_feature_indices: The build_feature_indices function's docstring incorrectly lists processed_train_df as a return value.
    • Docstring Accuracy for seq_rec_preprocess_dataset: The seq_rec_preprocess_dataset docstring has a mismatch in an argument name (dataset vs dataset_dict) and omits item_index_2_category_index from its returned values description.
    • Docstring Accuracy for bipartite_graph_preprocess_dataset: The bipartite_graph_preprocess_dataset docstring needs updating to correctly list the returned DataFrames (train, val, test instead of a single bipartite_df) and all four index dictionaries.

    Merge Readiness

    The pull request is well-structured and the introduced changes significantly improve the codebase. The new functionalities are valuable and the refactoring enhances modularity. The unit tests are comprehensive.

    I recommend addressing the minor documentation issues highlighted in the review comments to ensure clarity and accuracy. Once these are resolved, the PR appears to be in excellent shape for merging. As I am an AI assistant, I am not authorized to approve pull requests; this decision rests with the repository maintainers.

Comment on lines 117 to 122:

```python
    A tuple containing:
    - user2index: Mapping from user ID string to integer index.
    - item2index: Mapping from item ID (parent_asin) string to integer index.
    - category2index: Mapping from category string to integer index.
    - item_index_2_category_index: Mapping from item integer index to category integer index.
    - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
```

Severity: medium

The docstring for build_feature_indices in the Returns section lists - processed_train_df. However, this DataFrame is not actually returned by the function, and the function's return type hint tuple[dict[str, int], dict[str, int], dict[str, int], dict[int, int]] confirms this.

Could you please update the docstring to accurately reflect the returned values?

    """preprocess the dataset

    Args:
    dataset: dataset from the datasets library(transformers)
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    medium

    In the docstring for seq_rec_preprocess_dataset, the Args section lists dataset: dataset from the datasets library(transformers). The actual parameter name is dataset_dict.

    Could you update the docstring to use dataset_dict for consistency with the function signature?

Comment on lines 363 to 368:

```python
    train_df: train dataset
    val_df: validation dataset
    test_df: test dataset
    user2index: user to index dictionary
    item2index: item to index dictionary
    category2index: category to index dictionary
```

Severity: medium

The Returns section of the docstring for seq_rec_preprocess_dataset lists category2index but omits item_index_2_category_index, which is part of the returned tuple as indicated by the type hint dict[int, int].

To ensure the documentation is complete, could you add item_index_2_category_index to the list of returned dictionaries in the docstring?

Comment on lines 824 to 827:

```python
Returns:
    bipartite_df: bipartite graph dataframe
    user2index: user to index dictionary
    item2index: item to index dictionary
```

Severity: medium

The docstring for bipartite_graph_preprocess_dataset has a couple of discrepancies:

1. It states bipartite_df: bipartite graph dataframe as a return value, but the function actually returns three DataFrames: train_df, val_df, test_df.
2. It lists user2index and item2index as returned dictionaries, but it also returns category2index and item_index_2_category_index.

Could you update the Returns section to accurately describe all returned components, similar to how seq_rec_preprocess_dataset documents its returns?

haru-256 added 6 commits on June 14, 2025 at 20:36:

- Implemented `bipartite_graph.py` for preprocessing the Amazon Reviews dataset into a bipartite graph format.
- Created `common.py` for shared functions, including dataset fetching and index building.
- Developed `seq_rec.py` for sequential recommendation processing, including data preparation and dataset classes.
- Introduced `__init__.py` to expose key functionalities from the Amazon Reviews dataset module.

…e Bipartite Graph Processing (truncated commit title)

- Added SpecialUserIndex, SpecialItemIndex, and SpecialCategoryIndex to replace the previous SpecialIndex.
- Updated `bipartite_graph.py` to use the new special indices and introduced a create_bipartite_graph function for graph creation.
- Enhanced AmazonReviewsBipartiteGraphDataModule to include data loading and transformation logic.
- Modified `common.py` to reflect the changes in special indices and updated the build_feature_indices function accordingly.
- Updated `seq_rec.py` to use the new special indices in preprocessing.
- Refactored tests in test_amazon_reviews_dataset.py to accommodate the special index changes and ensure correct functionality.
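
The commit above replaces the single SpecialIndex with per-feature enums. A minimal sketch of what these might look like; the concrete values are assumptions, since the PR only shows that UNK and PAD indices exist:

```python
from enum import IntEnum

# Hypothetical shape of the per-feature special indices; the actual values
# defined in common.py may differ.
class SpecialUserIndex(IntEnum):
    UNK = 0  # unknown / infrequent user
    PAD = 1  # padding

class SpecialItemIndex(IntEnum):
    UNK = 0
    PAD = 1

class SpecialCategoryIndex(IntEnum):
    UNK = 0
    PAD = 1
```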
@haru-256 requested a review from Copilot on July 19, 2025 at 11:00

Copilot AI (Contributor) left a comment:

    Pull Request Overview

    This pull request enhances the Amazon Reviews dataset preprocessing functionality by introducing modularity and new capabilities for bipartite graph analysis. The changes improve code reusability through extracted common preprocessing logic and add robust handling of edge cases.

    Key changes include:

    • Refactored preprocessing logic into reusable components with _common_preprocess_dataset and build_feature_indices functions
    • Added new bipartite_graph_preprocess_dataset function for graph-based analysis workflows
    • Enhanced robustness with better handling of missing categories and null values

    Reviewed Changes

    Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.

Summary per file:

amazon_reviews_dataset.py: Major refactoring to extract common preprocessing logic and add bipartite graph preprocessing functionality
test_amazon_reviews_dataset.py: Comprehensive test coverage for new and refactored preprocessing functions
seq_rec.py: Refactored to use common preprocessing utilities and updated special index references
common.py: New shared utilities including feature index building and common preprocessing functions
bipartite_graph.py: New module for bipartite graph preprocessing and data handling

    pl.min("timestamp").alias("timestamp"),
    pl.len().alias("num_ratings"),
    )
    if (main_df["num_ratings"] != 1).any():
    Copy link
    Preview

    Copilot AI Jul 19, 2025

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    This check evaluates the entire column for every batch. Consider moving this validation outside the _preprocess function or making it conditional (e.g., only in debug mode) to improve performance.
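
One hedged way to act on this suggestion (illustrative only; main_df and the num_ratings column come from the PR, while the environment flag and helper name are invented here):

```python
import os

import polars as pl

def _validate_num_ratings(main_df: pl.DataFrame) -> None:
    """Full-column sanity check, gated so it does not run on every batch."""
    if os.environ.get("ML_SANDBOX_DEBUG_VALIDATE") != "1":  # hypothetical flag
        return
    if (main_df["num_ratings"] != 1).any():
        raise ValueError("expected num_ratings == 1 for every row in main_df")
```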


```python
    - item2index: Mapping from item ID (parent_asin) string to integer index.
    - category2index: Mapping from category string to integer index.
    - item_index_2_category_index: Mapping from item integer index to category integer index.
    - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
```

Copilot AI commented on Jul 19, 2025:

The docstring mentions 'processed_train_df' in the returns section, but this function does not return this value. Remove this line from the docstring.

Suggested change: delete the `- processed_train_df: ...` line, since it is not returned by the function.

@haru-256 (Owner, Author) commented:

/gemini review

@gemini-code-assist bot (Contributor) left a comment:

    Code Review

    This pull request introduces enhancements to the dataset preprocessing logic, improving modularity and maintainability with a common preprocessing utility and a dedicated function for building feature indices. It adds support for bipartite graph construction and includes comprehensive tests. I've identified some critical bugs in the new AmazonReviewsBipartiteGraphDataModule where dataloaders were using incorrect data and hardcoded parameters, and I've provided suggestions to improve code clarity and efficiency.

Comment on lines +67 to +69:

```python
# indexファイルが存在しないため、以下のlinkから明示的にURLを指定
# https://data.pyg.org/whl/torch-2.7.0+cu126.html
# https://data.pyg.org/whl/torch-2.7.0+cpu.html
```

Severity: medium

The comment explaining the need for hardcoded URLs is in Japanese (roughly: "no index file exists, so specify the URLs explicitly via the links below"). For the benefit of a wider audience of contributors, it would be helpful to have this comment in English.

Additionally, relying on hardcoded URLs can be brittle. If these URLs change, the build will break. While this might be the only option for now, it's a maintainability risk to be aware of.

```python
# The following packages do not have a proper index file, so we specify the URLs explicitly.
# See: https://data.pyg.org/whl/torch-2.7.0+cu126.html
# See: https://data.pyg.org/whl/torch-2.7.0+cpu.html
```
    

Comment on prepare_data in AmazonReviewsBipartiteGraphDataModule:

```python
    self.filter_no_history = filter_no_history
    self.transform = T.Compose([T.RemoveIsolatedNodes(), T.RemoveSelfLoops()])

def prepare_data(self) -> None:
```

Severity: medium

The prepare_data method in AmazonReviewsBipartiteGraphDataModule re-processes the data every time it's called. This is inefficient for development and repeated runs.

Consider adding a caching mechanism to save and load the preprocessed all_df and index dictionaries to disk, similar to the implementation in AmazonReviewsSeqRecDataModule.
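
A minimal caching sketch under stated assumptions: Parquet for the DataFrame and JSON for the index dictionaries; the cache location and helper name are invented for illustration:

```python
import json
from pathlib import Path

import polars as pl

CACHE_DIR = Path(".cache/amazon_reviews_bipartite")  # hypothetical location

def load_or_build(build_fn) -> tuple[pl.DataFrame, dict]:
    """Return cached (all_df, indices) if present; otherwise build and cache."""
    df_path = CACHE_DIR / "all_df.parquet"
    idx_path = CACHE_DIR / "indices.json"
    if df_path.exists() and idx_path.exists():
        # Note: JSON stringifies integer keys (e.g. item_index_2_category_index);
        # convert them back if the consumer expects int keys.
        return pl.read_parquet(df_path), json.loads(idx_path.read_text())
    all_df, indices = build_fn()  # e.g. the existing preprocessing pipeline
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    all_df.write_parquet(df_path)
    idx_path.write_text(json.dumps(indices))
    return all_df, indices
```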

Comment on lines +269 to +276:

```python
def common_preprocess_dataset(
    dataset_dict: D.DatasetDict, metadata: D.Dataset, filter_no_history: bool = True
) -> tuple[
    tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame],
    pl.DataFrame,
    tuple[dict[str, int], dict[str, int], dict[str, int], dict[int, int]],
    tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame],
]:
```

Severity: medium

The function common_preprocess_dataset returns a complex nested tuple. This makes the code harder to read and maintain, as consumers of this function need to unpack it using numeric indices.

To improve readability and maintainability, consider using a NamedTuple or a dataclass for the return value. This would allow accessing fields by name, making the code more self-documenting.
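
One possible shape for such a NamedTuple, mirroring the return annotation above; the field names are assumptions rather than the PR's:

```python
from typing import NamedTuple

import polars as pl

class FeatureIndices(NamedTuple):
    user2index: dict[str, int]
    item2index: dict[str, int]
    category2index: dict[str, int]
    item_index_2_category_index: dict[int, int]

class CommonPreprocessResult(NamedTuple):
    # Field meanings are guesses from the annotation: two train/val/test
    # triples, the joined metadata frame, and the four index mappings.
    raw_dfs: tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]
    meta_df: pl.DataFrame
    indices: FeatureIndices
    indexed_dfs: tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]

# Consumers can then write result.indices.user2index instead of result[2][0].
```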

haru-256 and others added 5 commits on July 19, 2025 at 20:06:

- …ataset/bipartite_graph.py (truncated commit title)
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- …ataset/bipartite_graph.py (truncated commit title)
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- …ataset/bipartite_graph.py (truncated commit title)
  Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@haru-256 merged commit 265c0ff into main on Jul 19, 2025 (2 checks passed)
@haru-256 deleted the feat/bipartite-graph-preprocess-dataset branch on July 19, 2025 at 11:30