Enhance Dataset Preprocessing with Feature Indices and Bipartite Graph Support #39
Conversation
…partite graph support
Hello @haru-256, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request. This PR focuses on enhancing the dataset preprocessing pipeline for the Amazon Reviews dataset. The primary goals are to improve modularity by extracting common logic, introduce a dedicated function for building feature indices (users, items, categories), and add support for preparing the dataset specifically for bipartite graph analysis. This refactoring aims to make the preprocessing code more maintainable and adaptable for different downstream tasks.
Highlights
- Modularity Refactoring: Common preprocessing steps, such as converting to Polars, joining metadata, and filtering history, have been extracted into a new internal helper function `_common_preprocess_dataset` to reduce code duplication.
- Dedicated Index Building: A new public function `build_feature_indices` has been created to specifically handle the generation of user, item, and category index mappings, including the item-to-category index mapping. This separates the index creation logic from the main dataset preprocessing flow.
- Bipartite Graph Preprocessing: A new public function `bipartite_graph_preprocess_dataset` has been added. This function leverages the common preprocessing steps and then aggregates the data by user-item pairs, computing features like the number of ratings per pair, suitable for bipartite graph representations (see the sketch after this list).
- Improved Robustness: The index building logic now includes more explicit handling and logging for cases where items might be missing category information, ensuring the category index mapping is robust.
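To make the aggregation step concrete, here is a minimal Polars sketch of grouping interactions into one edge per user-item pair. The column names (`user_id`, `parent_asin`, `rating`, `timestamp`) follow the Amazon Reviews schema used in this PR, but the exact expressions inside `bipartite_graph_preprocess_dataset` may differ; treat this as an illustration, not the function's actual body.

```python
import polars as pl

# Toy interaction log; in the PR the data comes from the datasets library
# after _common_preprocess_dataset has joined metadata and filtered history.
interactions = pl.DataFrame(
    {
        "user_id": ["u1", "u1", "u2"],
        "parent_asin": ["a1", "a1", "a2"],
        "rating": [4.0, 5.0, 3.0],
        "timestamp": [100, 200, 150],
    }
)

# One edge per user-item pair, with the interaction count (num_ratings)
# kept as an edge feature for the bipartite graph.
bipartite_df = interactions.group_by(["user_id", "parent_asin"]).agg(
    pl.mean("rating").alias("rating"),
    pl.min("timestamp").alias("timestamp"),
    pl.len().alias("num_ratings"),
)
print(bipartite_df.sort("user_id"))  # u1-a1 has num_ratings == 2
```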
Changelog
- libs/ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py
  - Extracted user, item, and category index building logic into a new function `build_feature_indices` (lines 99-261).
  - Introduced an internal helper function `_common_preprocess_dataset` to encapsulate shared data loading, metadata joining, and history filtering steps (lines 264-341).
  - Refactored `seq_rec_preprocess_dataset` to call `_common_preprocess_dataset` for initial processing (lines 344-382).
  - Added a new function `bipartite_graph_preprocess_dataset` for preparing data for bipartite graphs, including aggregation by user-item pairs and counting interactions (lines 807-879).
  - Enhanced category index building in `build_feature_indices` to handle null categories and ensure special indices are mapped correctly (lines 183-195, 242-254).
- libs/ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py
  - Imported new and refactored functions (`build_feature_indices`, `seq_rec_preprocess_dataset`, `bipartite_graph_preprocess_dataset`) and `SpecialIndex` (lines 4-10).
  - Added a new test `test_build_feature_indices` to verify the functionality of the new index building function (lines 21-82).
  - Added a new test `test_seq_rec_preprocess_dataset` to verify the refactored sequential recommendation preprocessing function, including its use of the common helper (lines 84-249).
  - Added a new test `test_bipartite_graph_preprocess_dataset` to verify the new bipartite graph preprocessing function, including its aggregation logic and use of the common helper with `filter_no_history=False` (lines 252-480).
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | `/gemini review` | Performs a code review for the current pull request in its current state.
Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state.
Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | `/gemini help` | Displays a list of available commands.
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
"""Preprocess the dataset for bipartite graph | ||
|
||
Args: | ||
dataset_dict: dataset from the datasets library(transformers) | ||
metadata: metadata dataset from the datasets library(transformers) | ||
|
||
Returns: | ||
bipartite_df: bipartite graph dataframe | ||
user2index: user to index dictionary | ||
item2index: item to index dictionary | ||
""" |
Suggestion: The docstring for `bipartite_graph_preprocess_dataset` does not match the actual return values. It currently lists only `bipartite_df`, `user2index`, and `item2index`, but the function returns `train_df`, `val_df`, `test_df`, `user2index`, `item2index`, `category2index`, and `item_index_2_category_index`. Update the docstring to accurately reflect all returned values for clarity and correctness. [possible issue, importance: 7]
"""Preprocess the dataset for bipartite graph | |
Args: | |
dataset_dict: dataset from the datasets library(transformers) | |
metadata: metadata dataset from the datasets library(transformers) | |
Returns: | |
bipartite_df: bipartite graph dataframe | |
user2index: user to index dictionary | |
item2index: item to index dictionary | |
""" | |
"""Preprocess the dataset for bipartite graph | |
Args: | |
dataset_dict: dataset from the datasets library(transformers) | |
metadata: metadata dataset from the datasets library(transformers) | |
Returns: | |
train_df: train dataset for bipartite graph | |
val_df: validation dataset for bipartite graph | |
test_df: test dataset for bipartite graph | |
user2index: user to index dictionary | |
item2index: item to index dictionary | |
category2index: category to index dictionary | |
item_index_2_category_index: item index to category index dictionary | |
""" |
"""Builds indices for users, items, and categories from the provided datasets. | ||
|
||
Args: | ||
dataset: dataset from the datasets library(transformers) | ||
metadata: metadata dataset from the datasets library(transformers) | ||
filter_no_history: whether to filter out the dataset which has no history. Defaults to True. | ||
train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering). | ||
meta_df: The Polars DataFrame containing metadata for items, including categories. | ||
threshold: The threshold for `unk_filter_by_count` to determine frequent users/items/categories. Defaults to 0.95. | ||
|
||
Returns: | ||
train_df: train dataset | ||
val_df: validation dataset | ||
test_df: test dataset | ||
user2index: user to index dictionary | ||
item2index: item to index dictionary | ||
category2index: category to index dictionary | ||
A tuple containing: | ||
- user2index: Mapping from user ID string to integer index. | ||
- item2index: Mapping from item ID (parent_asin) string to integer index. | ||
- category2index: Mapping from category string to integer index. | ||
- item_index_2_category_index: Mapping from item integer index to category integer index. | ||
- processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived. | ||
""" |
Suggestion: The docstring for `build_feature_indices` incorrectly lists `processed_train_df` as a return value. This DataFrame is not returned by the function. Remove this entry from the docstring to align with the actual function signature and prevent confusion. [general, importance: 7]
"""Builds indices for users, items, and categories from the provided datasets. | |
Args: | |
dataset: dataset from the datasets library(transformers) | |
metadata: metadata dataset from the datasets library(transformers) | |
filter_no_history: whether to filter out the dataset which has no history. Defaults to True. | |
train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering). | |
meta_df: The Polars DataFrame containing metadata for items, including categories. | |
threshold: The threshold for `unk_filter_by_count` to determine frequent users/items/categories. Defaults to 0.95. | |
Returns: | |
train_df: train dataset | |
val_df: validation dataset | |
test_df: test dataset | |
user2index: user to index dictionary | |
item2index: item to index dictionary | |
category2index: category to index dictionary | |
A tuple containing: | |
- user2index: Mapping from user ID string to integer index. | |
- item2index: Mapping from item ID (parent_asin) string to integer index. | |
- category2index: Mapping from category string to integer index. | |
- item_index_2_category_index: Mapping from item integer index to category integer index. | |
- processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived. | |
""" | |
"""Builds indices for users, items, and categories from the provided datasets. | |
Args: | |
train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering). | |
meta_df: The Polars DataFrame containing metadata for items, including categories. | |
threshold: The threshold for `unk_filter_by_count` to determine frequent users/items/categories. Defaults to 0.95. | |
Returns: | |
A tuple containing: | |
- user2index: Mapping from user ID string to integer index. | |
- item2index: Mapping from item ID (parent_asin) string to integer index. | |
- category2index: Mapping from category string to integer index. | |
- item_index_2_category_index: Mapping from item integer index to category integer index. | |
""" |
Code Review
This pull request introduces excellent refactoring and new features for dataset preprocessing. Modularity is greatly improved with `_common_preprocess_dataset` and `build_feature_indices`, and the new `bipartite_graph_preprocess_dataset` is a valuable addition. The error handling for categories in `build_feature_indices` is also a good enhancement. The accompanying tests are thorough and well written.
Overall, the changes are robust and significantly improve the codebase. My feedback primarily focuses on ensuring that documentation (docstrings) accurately reflects the function signatures and return values.
Summary of Findings
- Docstring Accuracy for `build_feature_indices`: The `build_feature_indices` function's docstring incorrectly lists `processed_train_df` as a return value.
- Docstring Accuracy for `seq_rec_preprocess_dataset`: The `seq_rec_preprocess_dataset` docstring has a mismatch in an argument name (`dataset` vs `dataset_dict`) and omits `item_index_2_category_index` from its returned values description.
- Docstring Accuracy for `bipartite_graph_preprocess_dataset`: The `bipartite_graph_preprocess_dataset` docstring needs updating to correctly list the returned DataFrames (train, val, test instead of a single `bipartite_df`) and all four index dictionaries.
Merge Readiness
The pull request is well-structured and the introduced changes significantly improve the codebase. The new functionalities are valuable and the refactoring enhances modularity. The unit tests are comprehensive.
I recommend addressing the minor documentation issues highlighted in the review comments to ensure clarity and accuracy. Once these are resolved, the PR appears to be in excellent shape for merging. As I am an AI assistant, I am not authorized to approve pull requests; this decision rests with the repository maintainers.
```python
    A tuple containing:
        - user2index: Mapping from user ID string to integer index.
        - item2index: Mapping from item ID (parent_asin) string to integer index.
        - category2index: Mapping from category string to integer index.
        - item_index_2_category_index: Mapping from item integer index to category integer index.
        - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
```
The docstring for `build_feature_indices` lists `- processed_train_df` in the `Returns` section. However, this DataFrame is not actually returned by the function, and the function's return type hint `tuple[dict[str, int], dict[str, int], dict[str, int], dict[int, int]]` confirms this.
Could you please update the docstring to accurately reflect the returned values?
"""preprocess the dataset | ||
|
||
Args: | ||
dataset: dataset from the datasets library(transformers) |
```python
    train_df: train dataset
    val_df: validation dataset
    test_df: test dataset
    user2index: user to index dictionary
    item2index: item to index dictionary
    category2index: category to index dictionary
```
The `Returns` section of the docstring for `seq_rec_preprocess_dataset` lists `category2index` but omits `item_index_2_category_index`, which is part of the returned tuple as indicated by the type hint `dict[int, int]`.
To ensure the documentation is complete, could you add `item_index_2_category_index` to the list of returned dictionaries in the docstring?
```python
Returns:
    bipartite_df: bipartite graph dataframe
    user2index: user to index dictionary
    item2index: item to index dictionary
```
The docstring for `bipartite_graph_preprocess_dataset` has a couple of discrepancies:
- It states `bipartite_df: bipartite graph dataframe` as a return value, but the function actually returns three DataFrames: `train_df`, `val_df`, and `test_df`.
- It lists `user2index` and `item2index` as returned dictionaries, but the function also returns `category2index` and `item_index_2_category_index`.
Could you update the `Returns` section to accurately describe all returned components, similar to how `seq_rec_preprocess_dataset` documents its returns?
- Implemented `bipartite_graph.py` for preprocessing the Amazon Reviews dataset into a bipartite graph format.
- Created `common.py` for shared functions, including dataset fetching and index building.
- Developed `seq_rec.py` for sequential recommendation processing, including data preparation and dataset classes.
- Introduced `__init__.py` to expose key functionalities from the Amazon Reviews dataset module.
…e Bipartite Graph Processing
- Added `SpecialUserIndex`, `SpecialItemIndex`, and `SpecialCategoryIndex` to replace the previous `SpecialIndex`.
- Updated `bipartite_graph.py` to utilize the new special indices and introduced a `create_bipartite_graph` function for graph creation.
- Enhanced `AmazonReviewsBipartiteGraphDataModule` to include data loading and transformation logic.
- Modified `common.py` to reflect changes in special indices and updated the `build_feature_indices` function accordingly.
- Updated `seq_rec.py` to use the new special indices in preprocessing.
- Refactored tests in `test_amazon_reviews_dataset.py` to accommodate changes in special indices and ensure correct functionality.
Pull Request Overview
This pull request enhances the Amazon Reviews dataset preprocessing functionality by introducing modularity and new capabilities for bipartite graph analysis. The changes improve code reusability through extracted common preprocessing logic and add robust handling of edge cases.
Key changes include:
- Refactored preprocessing logic into reusable components with the `_common_preprocess_dataset` and `build_feature_indices` functions
- Added a new `bipartite_graph_preprocess_dataset` function for graph-based analysis workflows
- Enhanced robustness with better handling of missing categories and null values
Reviewed Changes
Copilot reviewed 9 out of 11 changed files in this pull request and generated 4 comments.
File | Description
---|---
`amazon_reviews_dataset.py` | Major refactoring to extract common preprocessing logic and add bipartite graph preprocessing functionality
`test_amazon_reviews_dataset.py` | Comprehensive test coverage for new and refactored preprocessing functions
`seq_rec.py` | Refactored to use common preprocessing utilities and updated special index references
`common.py` | New shared utilities including feature index building and common preprocessing functions
`bipartite_graph.py` | New module for bipartite graph preprocessing and data handling
```python
    pl.min("timestamp").alias("timestamp"),
    pl.len().alias("num_ratings"),
)
if (main_df["num_ratings"] != 1).any():
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check evaluates the entire column for every batch. Consider moving this validation outside the `_preprocess` function or making it conditional (e.g., only in debug mode) to improve performance.
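A minimal sketch of one way to follow this suggestion, gating the column-wide scan behind an explicit flag; the `validate` parameter and the shape of `_preprocess` here are hypothetical, not the PR's actual signature.

```python
import logging

import polars as pl

logger = logging.getLogger(__name__)


def _preprocess(df: pl.DataFrame, *, validate: bool = False) -> pl.DataFrame:
    # Hypothetical aggregation mirroring the snippet under review.
    main_df = df.group_by(["user_id", "parent_asin"]).agg(
        pl.min("timestamp").alias("timestamp"),
        pl.len().alias("num_ratings"),
    )
    # Scan the whole num_ratings column only when explicitly requested
    # (tests, debug runs) instead of paying the cost on every batch.
    if validate and (main_df["num_ratings"] != 1).any():
        logger.warning("some user-item pairs have more than one rating")
    return main_df
```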
```python
    - item2index: Mapping from item ID (parent_asin) string to integer index.
    - category2index: Mapping from category string to integer index.
    - item_index_2_category_index: Mapping from item integer index to category integer index.
    - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
```
The docstring mentions 'processed_train_df' in the returns section, but this function does not return this value. Remove this line from the docstring.
Suggested change:

```diff
-        - processed_train_df: The Polars DataFrame for training data after initial processing (conversion, metadata join, optional history filtering) from which indices were derived.
+        # Removed processed_train_df as it is not returned by the function.
```
/gemini review
…et.py Co-authored-by: Copilot <[email protected]>
…et.py Co-authored-by: Copilot <[email protected]>
Code Review
This pull request introduces enhancements to the dataset preprocessing logic, improving modularity and maintainability with a common preprocessing utility and a dedicated function for building feature indices. It adds support for bipartite graph construction and includes comprehensive tests. I've identified some critical bugs in the new `AmazonReviewsBipartiteGraphDataModule` where dataloaders were using incorrect data and hardcoded parameters, and I've provided suggestions to improve code clarity and efficiency.
```python
# indexファイルが存在しないため、以下のlinkから明示的にURLを指定
# https://data.pyg.org/whl/torch-2.7.0+cu126.html
# https://data.pyg.org/whl/torch-2.7.0+cpu.html
```
The comment explaining the need for hardcoded URLs is in Japanese. For the benefit of a wider audience of contributors, it would be helpful to have this comment in English.
Additionally, relying on hardcoded URLs can be brittle. If these URLs change, the build will break. While this might be the only option for now, it's a maintainability risk to be aware of.

```python
# The following packages do not have a proper index file, so we specify the URLs explicitly.
# See: https://data.pyg.org/whl/torch-2.7.0+cu126.html
# See: https://data.pyg.org/whl/torch-2.7.0+cpu.html
```
```python
        self.filter_no_history = filter_no_history
        self.transform = T.Compose([T.RemoveIsolatedNodes(), T.RemoveSelfLoops()])

    def prepare_data(self) -> None:
```
The `prepare_data` method in `AmazonReviewsBipartiteGraphDataModule` re-processes the data every time it's called. This is inefficient for development and repeated runs.
Consider adding a caching mechanism to save and load the preprocessed `all_df` and index dictionaries to disk, similar to the implementation in `AmazonReviewsSeqRecDataModule`.
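A rough sketch of the kind of caching this suggests, persisting the preprocessed DataFrame as Parquet and the index dictionaries with pickle. The cache layout, attribute names, and the `_build_all` helper are illustrative assumptions, not the module's real API.

```python
import pickle
from pathlib import Path

import polars as pl


class CachedPrepareDataMixin:
    """Illustrative caching for prepare_data; names are assumptions."""

    cache_dir: str  # assumed to be set by the DataModule constructor

    def prepare_data(self) -> None:
        cache_dir = Path(self.cache_dir)
        df_path = cache_dir / "all_df.parquet"
        idx_path = cache_dir / "indices.pkl"

        if df_path.exists() and idx_path.exists():
            # Fast path: reuse the preprocessed DataFrame and index dicts.
            self.all_df = pl.read_parquet(df_path)
            self.indices = pickle.loads(idx_path.read_bytes())
            return

        # Slow path: preprocess once, then persist for later runs.
        self.all_df, self.indices = self._build_all()  # hypothetical helper
        cache_dir.mkdir(parents=True, exist_ok=True)
        self.all_df.write_parquet(df_path)
        idx_path.write_bytes(pickle.dumps(self.indices))
```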
```python
def common_preprocess_dataset(
    dataset_dict: D.DatasetDict, metadata: D.Dataset, filter_no_history: bool = True
) -> tuple[
    tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame],
    pl.DataFrame,
    tuple[dict[str, int], dict[str, int], dict[str, int], dict[int, int]],
    tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame],
]:
```
The function `common_preprocess_dataset` returns a complex nested tuple. This makes the code harder to read and maintain, as consumers of this function need to unpack it using numeric indices.
To improve readability and maintainability, consider using a `NamedTuple` or a `dataclass` for the return value. This would allow accessing fields by name, making the code more self-documenting.
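A small sketch of the `NamedTuple` approach proposed here, with fields matching the types in the signature quoted above; the field names themselves are assumptions.

```python
from typing import NamedTuple

import polars as pl


class FeatureIndices(NamedTuple):
    user2index: dict[str, int]
    item2index: dict[str, int]
    category2index: dict[str, int]
    item_index_2_category_index: dict[int, int]


class CommonPreprocessResult(NamedTuple):
    """Named fields instead of a nested tuple unpacked by position."""

    raw_dfs: tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]
    meta_df: pl.DataFrame
    indices: FeatureIndices
    split_dfs: tuple[pl.DataFrame, pl.DataFrame, pl.DataFrame]


# Consumers can then write result.indices.user2index instead of result[2][0].
```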
…and maintainability
…ataset/bipartite_graph.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ataset/bipartite_graph.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ataset/bipartite_graph.py Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
User description
This pull request refactors and enhances preprocessing functions for the Amazon Reviews dataset, introducing modularity and new functionality. Key changes include splitting preprocessing logic into reusable components, adding a new preprocessing function for bipartite graph construction, and improving the handling of edge cases like missing categories. These updates aim to make the codebase more maintainable and extendable.
Refactoring and Modularity Improvements:
- Introduced `_common_preprocess_dataset` to encapsulate shared preprocessing logic, such as schema overrides, metadata joining, and history filtering. This modular function is reused across other preprocessing functions.
- Refactored `seq_rec_preprocess_dataset` to delegate common preprocessing steps to `_common_preprocess_dataset`, improving code reuse and clarity.

New Functionality:
- Added `build_feature_indices`, a dedicated function for generating user, item, and category indices, along with mappings between item indices and category indices. This separates index-building logic from dataset preprocessing for better maintainability (a minimal sketch follows below).
- Added `bipartite_graph_preprocess_dataset`, a new function for preparing the dataset specifically for bipartite graph analysis. It aggregates user-item interactions and computes additional features like the number of ratings.

Robustness Enhancements:
- Improved handling of edge cases such as missing categories and null values during index building.
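To make the index-building idea concrete, here is a minimal sketch of mapping IDs to integer indices with reserved special slots. The `PAD`/`UNK` values and the helper are illustrative; the PR's actual `SpecialUserIndex`/`SpecialItemIndex` enums and frequency-threshold logic (`unk_filter_by_count`) are not reproduced here.

```python
import polars as pl

# Reserved indices; the PR defines SpecialUserIndex / SpecialItemIndex /
# SpecialCategoryIndex enums for this, with the values assumed here.
PAD, UNK = 0, 1


def build_id_index(train_df: pl.DataFrame, col: str) -> dict[str, int]:
    """Map each ID seen in training data to an index after the specials."""
    ids = train_df[col].unique().sort().to_list()
    return {id_: i + 2 for i, id_ in enumerate(ids)}  # 0/1 reserved


train_df = pl.DataFrame(
    {"user_id": ["u2", "u1", "u2"], "parent_asin": ["a1", "a1", "a3"]}
)
user2index = build_id_index(train_df, "user_id")  # {"u1": 2, "u2": 3}
item2index = build_id_index(train_df, "parent_asin")

# IDs unseen at lookup time (or null categories) fall back to UNK.
idx = user2index.get("u999", UNK)
```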
PR Type
Enhancement

Description
- Extracts feature index building into a new function.
- Introduces a common dataset preprocessing utility.
- Refactors sequential recommendation preprocessing.
- Adds a new bipartite graph preprocessing function.
Changes walkthrough 📝

amazon_reviews_dataset.py: Refactor and add new dataset preprocessing functions
libs/ml_sandbox_libs/src/ml_sandbox_libs/data/amazon_reviews_dataset.py
- Extracted the `build_feature_indices` function, handling user, item, and category indices, including special UNK/PAD tokens and robust null category handling.
- Introduced `_common_preprocess_dataset` to encapsulate shared data loading, schema overrides, metadata joining, and history filtering logic.
- Refactored `seq_rec_preprocess_dataset` to utilize `_common_preprocess_dataset` for initial processing, improving modularity.
- Added the `bipartite_graph_preprocess_dataset` function for bipartite graph analysis, which aggregates user-item interactions and calculates `num_ratings`.

test_amazon_reviews_dataset.py: Add comprehensive tests for new and refactored preprocessing functions
libs/ml_sandbox_libs/tests/test_data/test_amazon_reviews_dataset.py
- Added `test_build_feature_indices` to validate the functionality of the extracted `build_feature_indices` function.
- Added `test_seq_rec_preprocess_dataset` to verify the refactored sequential recommendation preprocessing, including history processing.
- Added `test_bipartite_graph_preprocess_dataset` to ensure the correct aggregation and output of the new bipartite graph preprocessing.