Enhance Dataset Preprocessing with Feature Indices and Bipartite Graph Support #39
Conversation
…partite graph support
Hello @haru-256, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR focuses on enhancing the dataset preprocessing pipeline for the Amazon Reviews dataset. The primary goals are to improve modularity by extracting common logic, introduce a dedicated function for building feature indices (users, items, categories), and add support for preparing the dataset specifically for bipartite graph analysis. This refactoring aims to make the preprocessing code more maintainable and adaptable for different downstream tasks.
Highlights
- Modularity Refactoring: Common preprocessing steps, such as converting to Polars, joining metadata, and filtering history, have been extracted into a new internal helper function `_common_preprocess_dataset` to reduce code duplication.
- Dedicated Index Building: A new public function `build_feature_indices` has been created to specifically handle the generation of user, item, and category index mappings, including the item-to-category index mapping. This separates the index creation logic from the main dataset preprocessing flow.
- Bipartite Graph Preprocessing: A new public function `bipartite_graph_preprocess_dataset` has been added. This function leverages the common preprocessing steps and then aggregates the data by user-item pairs, computing features like the number of ratings per pair, suitable for bipartite graph representations (see the sketch after this list).
- Improved Robustness: The index building logic now includes more explicit handling and logging for cases where items might be missing category information, ensuring the category index mapping is robust.
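To make the aggregation step concrete, here is a minimal Polars sketch of grouping interactions into one edge per user-item pair. The column names (`user_id`, `parent_asin`, `rating`, `timestamp`) follow the Amazon Reviews schema used in this PR, but the exact expressions inside `bipartite_graph_preprocess_dataset` may differ; treat this as an illustration, not the function's actual body.

```python
import polars as pl

# Toy interaction log; in the PR the data comes from the datasets library
# after _common_preprocess_dataset has joined metadata and filtered history.
interactions = pl.DataFrame(
    {
        "user_id": ["u1", "u1", "u2"],
        "parent_asin": ["a1", "a1", "a2"],
        "rating": [4.0, 5.0, 3.0],
        "timestamp": [100, 200, 150],
    }
)

# One edge per user-item pair, with the interaction count (num_ratings)
# kept as an edge feature for the bipartite graph.
bipartite_df = interactions.group_by(["user_id", "parent_asin"]).agg(
    pl.mean("rating").alias("rating"),
    pl.min("timestamp").alias("timestamp"),
    pl.len().alias("num_ratings"),
)
print(bipartite_df.sort("user_id"))  # u1-a1 has num_ratings == 2
```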
Changelog
- libs/ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py
  - Extracted user, item, and category index building logic into a new function `build_feature_indices` (lines 99-261).
  - Introduced an internal helper function `_common_preprocess_dataset` to encapsulate shared data loading, metadata joining, and history filtering steps (lines 264-341).
  - Refactored `seq_rec_preprocess_dataset` to call `_common_preprocess_dataset` for initial processing (lines 344-382).
  - Added a new function `bipartite_graph_preprocess_dataset` for preparing data for bipartite graphs, including aggregation by user-item pairs and counting interactions (lines 807-879).
  - Enhanced category index building in `build_feature_indices` to handle null categories and ensure special indices are mapped correctly (lines 183-195, 242-254).
- libs/ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py
  - Imported new and refactored functions (`build_feature_indices`, `seq_rec_preprocess_dataset`, `bipartite_graph_preprocess_dataset`) and `SpecialIndex` (lines 4-10).
  - Added a new test `test_build_feature_indices` to verify the functionality of the new index building function (lines 21-82).
  - Added a new test `test_seq_rec_preprocess_dataset` to verify the refactored sequential recommendation preprocessing function, including its use of the common helper (lines 84-249).
  - Added a new test `test_bipartite_graph_preprocess_dataset` to verify the new bipartite graph preprocessing function, including its aggregation logic and use of the common helper with `filter_no_history=False` (lines 252-480).
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | `/gemini review` | Performs a code review for the current pull request in its current state.
Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state.
Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | `/gemini help` | Displays a list of available commands.
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
"""Preprocess the dataset for bipartite graph | ||
|
||
Args: | ||
dataset_dict: dataset from the datasets library(transformers) | ||
metadata: metadata dataset from the datasets library(transformers) | ||
|
||
Returns: | ||
bipartite_df: bipartite graph dataframe | ||
user2index: user to index dictionary | ||
item2index: item to index dictionary | ||
""" |
Suggestion: The docstring for `bipartite_graph_preprocess_dataset` does not match the actual return values. It currently lists only `bipartite_df`, `user2index`, and `item2index`, but the function returns `train_df`, `val_df`, `test_df`, `user2index`, `item2index`, `category2index`, and `item_index_2_category_index`. Update the docstring to accurately reflect all returned values for clarity and correctness. [possible issue, importance: 7]
"""Preprocess the dataset for bipartite graph | |
Args: | |
dataset_dict: dataset from the datasets library(transformers) | |
metadata: metadata dataset from the datasets library(transformers) | |
Returns: | |
bipartite_df: bipartite graph dataframe | |
user2index: user to index dictionary | |
item2index: item to index dictionary | |
""" | |
"""Preprocess the dataset for bipartite graph | |
Args: | |
dataset_dict: dataset from the datasets library(transformers) | |
metadata: metadata dataset from the datasets library(transformers) | |
Returns: | |
train_df: train dataset for bipartite graph | |
val_df: validation dataset for bipartite graph | |
test_df: test dataset for bipartite graph | |
user2index: user to index dictionary | |
item2index: item to index dictionary | |
category2index: category to index dictionary | |
item_index_2_category_index: item index to category index dictionary | |
""" |
"""Builds indices for users, items, and categories from the provided datasets. | ||
|
||
Args: | ||
dataset: dataset from the datasets library(transformers) | ||
metadata: metadata dataset from the datasets library(transformers) | ||
filter_no_history: whether to filter out the dataset which has no history. Defaults to True. | ||
train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering). | ||
meta_df: The Polars DataFrame containing metadata for items, including categories. | ||
threshold: The threshold for `unk_filter_by_count` to determine frequent users/items/categories. Defaults to 0.95. | ||
|
||
Returns: | ||
train_df: train dataset | ||
val_df: validation dataset | ||
test_df: test dataset | ||
user2index: user to index dictionary | ||
item2index: item to index dictionary | ||
category2index: category to index dictionary | ||
A tuple containing: | ||
- user2index: Mapping from user ID string to integer index. | ||
- item2index: Mapping from item ID (parent_asin) string to integer index. | ||
- category2index: Mapping from category string to integer index. | ||
- item_index_2_category_index: Mapping from item integer index to category integer index. | ||
- processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived. | ||
""" |
Suggestion: The docstring for `build_feature_indices` incorrectly lists `processed_train_df` as a return value. This DataFrame is not returned by the function. Remove this entry from the docstring to align with the actual function signature and prevent confusion. [general, importance: 7]
"""Builds indices for users, items, and categories from the provided datasets. | |
Args: | |
dataset: dataset from the datasets library(transformers) | |
metadata: metadata dataset from the datasets library(transformers) | |
filter_no_history: whether to filter out the dataset which has no history. Defaults to True. | |
train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering). | |
meta_df: The Polars DataFrame containing metadata for items, including categories. | |
threshold: The threshold for `unk_filter_by_count` to determine frequent users/items/categories. Defaults to 0.95. | |
Returns: | |
train_df: train dataset | |
val_df: validation dataset | |
test_df: test dataset | |
user2index: user to index dictionary | |
item2index: item to index dictionary | |
category2index: category to index dictionary | |
A tuple containing: | |
- user2index: Mapping from user ID string to integer index. | |
- item2index: Mapping from item ID (parent_asin) string to integer index. | |
- category2index: Mapping from category string to integer index. | |
- item_index_2_category_index: Mapping from item integer index to category integer index. | |
- processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived. | |
""" | |
"""Builds indices for users, items, and categories from the provided datasets. | |
Args: | |
train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering). | |
meta_df: The Polars DataFrame containing metadata for items, including categories. | |
threshold: The threshold for `unk_filter_by_count` to determine frequent users/items/categories. Defaults to 0.95. | |
Returns: | |
A tuple containing: | |
- user2index: Mapping from user ID string to integer index. | |
- item2index: Mapping from item ID (parent_asin) string to integer index. | |
- category2index: Mapping from category string to integer index. | |
- item_index_2_category_index: Mapping from item integer index to category integer index. | |
""" |
Code Review
This pull request introduces excellent refactoring and new features for dataset preprocessing. Modularity is greatly improved with `_common_preprocess_dataset` and `build_feature_indices`, and the new `bipartite_graph_preprocess_dataset` is a valuable addition. The error handling for categories in `build_feature_indices` is also a good enhancement. The accompanying tests are thorough and well written.
Overall, the changes are robust and significantly improve the codebase. My feedback primarily focuses on ensuring that documentation (docstrings) accurately reflects the function signatures and return values.
Summary of Findings
- Docstring Accuracy for `build_feature_indices`: The `build_feature_indices` function's docstring incorrectly lists `processed_train_df` as a return value.
- Docstring Accuracy for `seq_rec_preprocess_dataset`: The `seq_rec_preprocess_dataset` docstring has a mismatch in an argument name (`dataset` vs `dataset_dict`) and omits `item_index_2_category_index` from its returned values description.
- Docstring Accuracy for `bipartite_graph_preprocess_dataset`: The `bipartite_graph_preprocess_dataset` docstring needs updating to correctly list the returned DataFrames (train, val, test instead of a single `bipartite_df`) and all four index dictionaries.
Merge Readiness
The pull request is well-structured and the introduced changes significantly improve the codebase. The new functionalities are valuable and the refactoring enhances modularity. The unit tests are comprehensive.
I recommend addressing the minor documentation issues highlighted in the review comments to ensure clarity and accuracy. Once these are resolved, the PR appears to be in excellent shape for merging. As I am an AI assistant, I am not authorized to approve pull requests; this decision rests with the repository maintainers.
```python
    A tuple containing:
        - user2index: Mapping from user ID string to integer index.
        - item2index: Mapping from item ID (parent_asin) string to integer index.
        - category2index: Mapping from category string to integer index.
        - item_index_2_category_index: Mapping from item integer index to category integer index.
        - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
```
The docstring for `build_feature_indices` lists `- processed_train_df` in the `Returns` section. However, this DataFrame is not actually returned by the function, and the function's return type hint `tuple[dict[str, int], dict[str, int], dict[str, int], dict[int, int]]` confirms this.
Could you please update the docstring to accurately reflect the returned values?
"""preprocess the dataset | ||
|
||
Args: | ||
dataset: dataset from the datasets library(transformers) |
```python
    train_df: train dataset
    val_df: validation dataset
    test_df: test dataset
    user2index: user to index dictionary
    item2index: item to index dictionary
    category2index: category to index dictionary
```
The `Returns` section of the docstring for `seq_rec_preprocess_dataset` lists `category2index` but omits `item_index_2_category_index`, which is part of the returned tuple as indicated by the type hint `dict[int, int]`.
To ensure the documentation is complete, could you add `item_index_2_category_index` to the list of returned dictionaries in the docstring?
```python
Returns:
    bipartite_df: bipartite graph dataframe
    user2index: user to index dictionary
    item2index: item to index dictionary
```
The docstring for `bipartite_graph_preprocess_dataset` has a couple of discrepancies:
- It states `bipartite_df: bipartite graph dataframe` as a return value, but the function actually returns three DataFrames: `train_df`, `val_df`, and `test_df`.
- It lists `user2index` and `item2index` as returned dictionaries, but the function also returns `category2index` and `item_index_2_category_index`.
Could you update the `Returns` section to accurately describe all returned components, similar to how `seq_rec_preprocess_dataset` documents its returns?
- Implemented `bipartite_graph.py` for preprocessing the Amazon Reviews dataset into a bipartite graph format.
- Created `common.py` for shared functions, including dataset fetching and index building.
- Developed `seq_rec.py` for sequential recommendation processing, including data preparation and dataset classes.
- Introduced `__init__.py` to expose key functionalities from the Amazon Reviews dataset module.
…e Bipartite Graph Processing
- Added `SpecialUserIndex`, `SpecialItemIndex`, and `SpecialCategoryIndex` to replace the previous `SpecialIndex`.
- Updated `bipartite_graph.py` to utilize the new special indices and introduced a `create_bipartite_graph` function for graph creation.
- Enhanced `AmazonReviewsBipartiteGraphDataModule` to include data loading and transformation logic.
- Modified `common.py` to reflect changes in special indices and updated the `build_feature_indices` function accordingly.
- Updated `seq_rec.py` to use the new special indices in preprocessing.
- Refactored tests in `test_amazon_reviews_dataset.py` to accommodate changes in special indices and ensure correct functionality.
Pull Request Overview
This pull request enhances the Amazon Reviews dataset preprocessing functionality by introducing modularity and new capabilities for bipartite graph analysis. The changes improve code reusability through extracted common preprocessing logic and add robust handling of edge cases.
Key changes include:
- Refactored preprocessing logic into reusable components with the `_common_preprocess_dataset` and `build_feature_indices` functions
- Added a new `bipartite_graph_preprocess_dataset` function for graph-based analysis workflows
- Enhanced robustness with better handling of missing categories and null values
Reviewed Changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.
File | Description
---|---
`amazon_reviews_dataset.py` | Major refactoring to extract common preprocessing logic and add bipartite graph preprocessing functionality
`test_amazon_reviews_dataset.py` | Comprehensive test coverage for new and refactored preprocessing functions
`seq_rec.py` | Refactored to use common preprocessing utilities and updated special index references
`common.py` | New shared utilities including feature index building and common preprocessing functions
`bipartite_graph.py` | New module for bipartite graph preprocessing and data handling
```python
    pl.min("timestamp").alias("timestamp"),
    pl.len().alias("num_ratings"),
)
if (main_df["num_ratings"] != 1).any():
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check evaluates the entire column for every batch. Consider moving this validation outside the `_preprocess` function or making it conditional (e.g., only in debug mode) to improve performance.
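A minimal sketch of one way to follow this suggestion, gating the column-wide scan behind an explicit flag; the `validate` parameter and the shape of `_preprocess` here are hypothetical, not the PR's actual signature.

```python
import logging

import polars as pl

logger = logging.getLogger(__name__)


def _preprocess(df: pl.DataFrame, *, validate: bool = False) -> pl.DataFrame:
    # Hypothetical aggregation mirroring the snippet under review.
    main_df = df.group_by(["user_id", "parent_asin"]).agg(
        pl.min("timestamp").alias("timestamp"),
        pl.len().alias("num_ratings"),
    )
    # Scan the whole num_ratings column only when explicitly requested
    # (tests, debug runs) instead of paying the cost on every batch.
    if validate and (main_df["num_ratings"] != 1).any():
        logger.warning("some user-item pairs have more than one rating")
    return main_df
```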
```python
    - item2index: Mapping from item ID (parent_asin) string to integer index.
    - category2index: Mapping from category string to integer index.
    - item_index_2_category_index: Mapping from item integer index to category integer index.
    - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
```
The docstring mentions 'processed_train_df' in the returns section, but this function does not return this value. Remove this line from the docstring.
Suggested change:

```diff
-        - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
+        # Removed processed_train_df as it is not returned by the function.
```
/gemini review
…et.py Co-authored-by: Copilot <[email protected]>
…et.py Co-authored-by: Copilot <[email protected]>
Code Review
This pull request introduces enhancements to the dataset preprocessing logic, improving modularity and maintainability with a common preprocessing utility and a dedicated function for building feature indices. It adds support for bipartite graph construction and includes comprehensive tests. I've identified some critical bugs in the new `AmazonReviewsBipartiteGraphDataModule` where dataloaders were using incorrect data and hardcoded parameters, and I've provided suggestions to improve code clarity and efficiency.
```python
# indexファイルが存在しないため、以下のlinkから明示的にURLを指定
# https://data.pyg.org/whl/torch-2.7.0+cu126.html
# https://data.pyg.org/whl/torch-2.7.0+cpu.html
```
The comment explaining the need for hardcoded URLs is in Japanese. For the benefit of a wider audience of contributors, it would be helpful to have this comment in English.
Additionally, relying on hardcoded URLs can be brittle. If these URLs change, the build will break. While this might be the only option for now, it's a maintainability risk to be aware of.

```python
# The following packages do not have a proper index file, so we specify the URLs explicitly.
# See: https://data.pyg.org/whl/torch-2.7.0+cu126.html
# See: https://data.pyg.org/whl/torch-2.7.0+cpu.html
```
```python
        self.filter_no_history = filter_no_history
        self.transform = T.Compose([T.RemoveIsolatedNodes(), T.RemoveSelfLoops()])

    def prepare_data(self) -> None:
```
The `prepare_data` method in `AmazonReviewsBipartiteGraphDataModule` re-processes the data every time it's called. This is inefficient for development and repeated runs.
Consider adding a caching mechanism to save and load the preprocessed `all_df` and index dictionaries to disk, similar to the implementation in `AmazonReviewsSeqRecDataModule`.
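A rough sketch of the kind of caching this suggests, persisting the preprocessed DataFrame as Parquet and the index dictionaries with pickle. The cache layout, attribute names, and the `_build_all` helper are illustrative assumptions, not the module's real API.

```python
import pickle
from pathlib import Path

import polars as pl


class CachedPrepareDataMixin:
    """Illustrative caching for prepare_data; names are assumptions."""

    cache_dir: str  # assumed to be set by the DataModule constructor

    def prepare_data(self) -> None:
        cache_dir = Path(self.cache_dir)
        df_path = cache_dir / "all_df.parquet"
        idx_path = cache_dir / "indices.pkl"

        if df_path.exists() and idx_path.exists():
            # Fast path: reuse the preprocessed DataFrame and index dicts.
            self.all_df = pl.read_parquet(df_path)
            self.indices = pickle.loads(idx_path.read_bytes())
            return

        # Slow path: preprocess once, then persist for later runs.
        self.all_df, self.indices = self._build_all()  # hypothetical helper
        cache_dir.mkdir(parents=True, exist_ok=True)
        self.all_df.write_parquet(df_path)
        idx_path.write_bytes(pickle.dumps(self.indices))
```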
```python
def common_preprocess_dataset(
    dataset_dict: D.DatasetDict, metadata: D.Dataset, filter_no_history: bool = True
) -> tuple[
    tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame],
    pl.DataFrame,
    tuple[dict[str, int], dict[str, int], dict[str, int], dict[int, int]],
    tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame],
]:
```
The function `common_preprocess_dataset` returns a complex nested tuple. This makes the code harder to read and maintain, as consumers of this function need to unpack it using numeric indices.
To improve readability and maintainability, consider using a `NamedTuple` or a `dataclass` for the return value. This would allow accessing fields by name, making the code more self-documenting.
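A small sketch of the `NamedTuple` approach proposed here, with fields matching the types in the signature quoted above; the field names themselves are assumptions.

```python
from typing import NamedTuple

import polars as pl


class FeatureIndices(NamedTuple):
    user2index: dict[str, int]
    item2index: dict[str, int]
    category2index: dict[str, int]
    item_index_2_category_index: dict[int, int]


class CommonPreprocessResult(NamedTuple):
    """Named fields instead of a nested tuple unpacked by position."""

    raw_dfs: tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]
    meta_df: pl.DataFrame
    indices: FeatureIndices
    split_dfs: tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]


# Consumers can then write result.indices.user2index instead of result[2][0].
```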
…and maintainability
…ataset/bipartite_graph.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ataset/bipartite_graph.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ataset/bipartite_graph.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
User description
This pull request refactors and enhances preprocessing functions for the Amazon Reviews dataset, introducing modularity and new functionality. Key changes include splitting preprocessing logic into reusable components, adding a new preprocessing function for bipartite graph construction, and improving the handling of edge cases like missing categories. These updates aim to make the codebase more maintainable and extendable.
Refactoring and Modularity Improvements:
- Introduced `_common_preprocess_dataset` to encapsulate shared preprocessing logic, such as schema overrides, metadata joining, and history filtering. This modular function is reused across other preprocessing functions.
- Refactored `seq_rec_preprocess_dataset` to delegate common preprocessing steps to `_common_preprocess_dataset`, improving code reuse and clarity.

New Functionality:
- Added `build_feature_indices`, a dedicated function for generating user, item, and category indices, along with mappings between item indices and category indices. This separates index-building logic from dataset preprocessing for better maintainability (a minimal sketch follows below).
- Added `bipartite_graph_preprocess_dataset`, a new function for preparing the dataset specifically for bipartite graph analysis. It aggregates user-item interactions and computes additional features like the number of ratings.

Robustness Enhancements:
- Improved handling of edge cases such as missing categories and null values during index building.
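To make the index-building idea concrete, here is a minimal sketch of mapping IDs to integer indices with reserved special slots. The `PAD`/`UNK` values and the helper are illustrative; the PR's actual `SpecialUserIndex`/`SpecialItemIndex` enums and frequency-threshold logic (`unk_filter_by_count`) are not reproduced here.

```python
import polars as pl

# Reserved indices; the PR defines SpecialUserIndex / SpecialItemIndex /
# SpecialCategoryIndex enums for this, with the values assumed here.
PAD, UNK = 0, 1


def build_id_index(train_df: pl.DataFrame, col: str) -> dict[str, int]:
    """Map each ID seen in training data to an index after the specials."""
    ids = train_df[col].unique().sort().to_list()
    return {id_: i + 2 for i, id_ in enumerate(ids)}  # 0/1 reserved


train_df = pl.DataFrame(
    {"user_id": ["u2", "u1", "u2"], "parent_asin": ["a1", "a1", "a3"]}
)
user2index = build_id_index(train_df, "user_id")  # {"u1": 2, "u2": 3}
item2index = build_id_index(train_df, "parent_asin")

# IDs unseen at lookup time (or null categories) fall back to UNK.
idx = user2index.get("u999", UNK)
```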
PR Type
Enhancement

Description
- Extracts feature index building into a new function.
- Introduces a common dataset preprocessing utility.
- Refactors sequential recommendation preprocessing.
- Adds a new bipartite graph preprocessing function.
Changes walkthrough 📝

amazon_reviews_dataset.py: Refactor and add new dataset preprocessing functions
libs/ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py
- Extracted the `build_feature_indices` function, handling user, item, and category indices, including special UNK/PAD tokens and robust null category handling.
- Introduced `_common_preprocess_dataset` to encapsulate shared data loading, schema overrides, metadata joining, and history filtering logic.
- Refactored `seq_rec_preprocess_dataset` to utilize `_common_preprocess_dataset` for initial processing, improving modularity.
- Added the `bipartite_graph_preprocess_dataset` function for bipartite graph analysis, which aggregates user-item interactions and calculates `num_ratings`.

test_amazon_reviews_dataset.py: Add comprehensive tests for new and refactored preprocessing functions
libs/ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py
- Added `test_build_feature_indices` to validate the functionality of the extracted `build_feature_indices` function.
- Added `test_seq_rec_preprocess_dataset` to verify the refactored sequential recommendation preprocessing, including history processing.
- Added `test_bipartite_graph_preprocess_dataset` to ensure the correct aggregation and output of the new bipartite graph preprocessing.