10 changes: 10 additions & 0 deletions examples/00_quick_start/nn_embedding_ranker_movielens.ipynb
@@ -0,0 +1,10 @@
{
Collaborator: not sure if there was an error in the submission of the notebook, but I can't see it

"cells": [],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
4 changes: 3 additions & 1 deletion recommenders/evaluation/spark_evaluation.py
Expand Up @@ -2,6 +2,7 @@
# Licensed under the MIT License.

import numpy as np
import warnings # Added for R-Precision warning
Collaborator: this file is still from the other PR, it needs to be removed


try:
from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics
@@ -213,6 +214,7 @@ def __init__(
self.col_rating = col_rating
self.col_prediction = col_prediction
self.threshold = threshold
self.rating_pred_raw = rating_pred # Store raw predictions before processing
Collaborator: rating_pred is already stored in self.rating_pred, and self.rating_pred_raw is not used in this class. Any reason you want to store it?


# Check if inputs are Spark DataFrames.
if not isinstance(self.rating_true, DataFrame):
@@ -831,7 +833,7 @@ def user_item_serendipity(self):
Y.C. Zhang, D.Ó. Séaghdha, D. Quercia and T. Jambor, Auralist:
introducing serendipity into music recommendation, WSDM 2012

- Eugene Yan, Serendipity: Accuracy’s unpopular best friend in Recommender Systems,
+ Eugene Yan, Serendipity's unpopular best friend in Recommender Systems,
eugeneyan.com, April 2020

Returns:
2 changes: 2 additions & 0 deletions recommenders/models/embedding_ranker/__init__.py
@@ -0,0 +1,2 @@
# Copyright (c) Recommenders contributors.
# Licensed under the MIT License.
208 changes: 208 additions & 0 deletions recommenders/models/embedding_ranker/embedding_ranker_utils.py
@@ -0,0 +1,208 @@
# Copyright (c) Recommenders contributors.
# Licensed under the MIT License.

import os
import numpy as np
import pandas as pd
import torch

from recommenders.utils.constants import (
DEFAULT_USER_COL,
DEFAULT_ITEM_COL,
DEFAULT_RATING_COL,
DEFAULT_PREDICTION_COL,
DEFAULT_K,
)

def predict_rating(
model,
test_df,
col_user=DEFAULT_USER_COL,
col_item=DEFAULT_ITEM_COL,
col_rating=DEFAULT_RATING_COL,
col_prediction=DEFAULT_PREDICTION_COL,
batch_size=1024,
):
"""Predict ratings for user-item pairs in test data.
Args:
model (NNEmbeddingRanker): Trained embedding ranker model.
test_df (pandas.DataFrame): Test dataframe containing user-item pairs.
col_user (str): User column name.
col_item (str): Item column name.
col_rating (str): Rating column name.
Collaborator, on lines +1 to +33:
I think all this can become the notebook.

Some thoughts for the notebook:

  • Here is a good example of a useful notebook: https://github.com/recommenders-team/recommenders/blob/main/examples/02_model_collaborative_filtering/lightgcn_deep_dive.ipynb it explains both the math behind it and an implementation
  • Think of the notebook as a way to showcase how to use the ranker and what the ranker is about.
  • The objective of the notebook is to be useful. That's the most important metric.
  • Ideally, a person could go to this notebook, add their data, run it, and understand how to use it.
  • For the notebook, you can just showcase the content of these functions directly inside the notebook.
  • Something very important is that we follow the principle of explicit is better than implicit. For example, we don't do something like for metric_name, metric_func in metrics.items(), because it adds a layer of complexity. Instead we show the metrics explicitly: rmse(true, pred), precision_at_k(true, pred, params), etc. Anyone with a quick glance can see what is going on.
  • Feel free to come to our Monday meeting if you want to understand better how we do the notebooks @demoncoder-crypto

col_prediction (str): Prediction column name.
batch_size (int): Batch size for predictions to avoid memory issues.
Returns:
pandas.DataFrame: Dataframe with user, item, prediction columns.
"""
# Create a copy of the test data with only the needed columns
# (.copy() avoids a SettingWithCopyWarning when assigning the prediction column below)
test_copy = test_df[[col_user, col_item]].copy()

# Process in batches to avoid memory issues
predictions = []
for i in range(0, len(test_copy), batch_size):
batch = test_copy.iloc[i:i+batch_size]
users = batch[col_user].values
items = batch[col_item].values
batch_predictions = model.predict(users, items, is_list=True)
predictions.extend(batch_predictions)

# Add predictions to the dataframe
test_copy[col_prediction] = predictions

return test_copy
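A minimal sketch of how predict_rating's batching loop behaves, using a stand-in model. StubModel and its deterministic scores are assumptions for illustration; the real NNEmbeddingRanker is only expected to expose a compatible predict(users, items, is_list=True).

```python
import pandas as pd

class StubModel:
    """Stand-in for NNEmbeddingRanker: deterministic score per (user, item) pair."""
    def predict(self, users, items, is_list=True):
        return [float(u * 10 + i) for u, i in zip(users, items)]

test_df = pd.DataFrame({"userID": [1, 1, 2], "itemID": [3, 4, 3]})

# Same batching pattern as predict_rating, inlined:
batch_size = 2
test_copy = test_df[["userID", "itemID"]].copy()
predictions = []
for start in range(0, len(test_copy), batch_size):
    batch = test_copy.iloc[start:start + batch_size]
    predictions.extend(
        StubModel().predict(batch["userID"].values, batch["itemID"].values, is_list=True)
    )
test_copy["prediction"] = predictions
print(test_copy["prediction"].tolist())  # [13.0, 14.0, 23.0]
```

Because predictions are appended batch by batch in row order, the result is identical to scoring all pairs at once, regardless of batch_size.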

def generate_recommendations(
model,
train_df,
test_df=None,
col_user=DEFAULT_USER_COL,
col_item=DEFAULT_ITEM_COL,
col_prediction=DEFAULT_PREDICTION_COL,
top_k=DEFAULT_K,
remove_seen=True,
batch_size=1024,
):
"""Generate top-k recommendations for all users or users in test data.
Args:
model (NNEmbeddingRanker): Trained embedding ranker model.
train_df (pandas.DataFrame): Training dataframe used to identify seen items if remove_seen=True.
test_df (pandas.DataFrame, optional): Test dataframe to identify users to generate recommendations for.
If None, recommendations are generated for all users in training.
col_user (str): User column name.
col_item (str): Item column name.
col_prediction (str): Prediction column name.
top_k (int): Number of top items to recommend.
remove_seen (bool): Whether to remove items that appear in the training data.
batch_size (int): Batch size for predictions to avoid memory issues.
Returns:
pandas.DataFrame: Dataframe with user, item, prediction columns for top-k items per user.
"""
# If test_df is provided, use users from test set
# Otherwise, use all users from training
if test_df is not None:
users = test_df[col_user].unique()
else:
users = train_df[col_user].unique()

# Get all items
all_items = list(model.item_id_map.keys())

# Filter users that are not in the training set
valid_users = [user for user in users if user in model.user_id_map]

if not valid_users:
raise ValueError("No valid users found in the dataset")

# Create combinations of users and items to predict
all_pairs = []
for user in valid_users:
for item in all_items:
all_pairs.append((user, item))
Collaborator:
all_pairs = [(u, i) for u in valid_users for i in all_items]
or

from itertools import product
all_pairs = list(product(valid_users, all_items))
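A quick sketch confirming the reviewer's two alternatives produce exactly the same pairs as the nested loop; the user and item values here are illustrative.

```python
from itertools import product

valid_users = ["u1", "u2"]
all_items = ["i1", "i2", "i3"]

# Original nested-loop construction
pairs_loop = []
for user in valid_users:
    for item in all_items:
        pairs_loop.append((user, item))

# Reviewer's alternatives
pairs_comp = [(u, i) for u in valid_users for i in all_items]
pairs_product = list(product(valid_users, all_items))

assert pairs_loop == pairs_comp == pairs_product
print(len(pairs_product))  # 6 = 2 users x 3 items
```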


# Process in batches to avoid memory issues
all_predictions = []
for i in range(0, len(all_pairs), batch_size):
batch_pairs = all_pairs[i:i+batch_size]
batch_users = [pair[0] for pair in batch_pairs]
batch_items = [pair[1] for pair in batch_pairs]
batch_predictions = model.predict(batch_users, batch_items, is_list=True)

# Create batch results
for j, (user, item) in enumerate(batch_pairs):
all_predictions.append({
col_user: user,
col_item: item,
col_prediction: batch_predictions[j]
})

# Convert to dataframe
result_df = pd.DataFrame(all_predictions)

# If remove_seen is True, remove items that appear in the training data
if remove_seen:
# Create a set of user-item pairs from the training data
seen_pairs = set(zip(train_df[col_user], train_df[col_item]))

# Filter out seen pairs
result_df = result_df[~result_df.apply(lambda row: (row[col_user], row[col_item]) in seen_pairs, axis=1)]
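The row-wise apply above is O(n) Python-level calls over the full user-item grid. An alternative sketch (not in the PR, column names illustrative) filters seen pairs with a merge-based anti-join, which stays vectorized:

```python
import pandas as pd

train_df = pd.DataFrame({"userID": [1, 1], "itemID": [10, 11]})
result_df = pd.DataFrame(
    {"userID": [1, 1, 1], "itemID": [10, 11, 12], "prediction": [0.9, 0.8, 0.7]}
)

# Anti-join: keep only rows of result_df with no match in train_df
merged = result_df.merge(
    train_df[["userID", "itemID"]].drop_duplicates(),
    on=["userID", "itemID"],
    how="left",
    indicator=True,
)
unseen = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(unseen["itemID"].tolist())  # [12]
```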

# Get top-k recommendations for each user
Collaborator: can you reuse predict_rating if generate_recommendations is using the same logic under the hood, and sort & cut top_k at the end?

top_k_df = (
result_df
.sort_values([col_user, col_prediction], ascending=[True, False])
.groupby(col_user).head(top_k)
.reset_index(drop=True)
)

return top_k_df
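The reviewer's suggestion above (reuse the pairwise scoring path, then sort and cut top_k at the end) could be sketched like this; recommend_top_k, score_pairs, and fake_scorer are hypothetical names, with score_pairs playing the role of predict_rating:

```python
from itertools import product
import pandas as pd

def recommend_top_k(score_pairs, valid_users, all_items, top_k=2):
    # Build the candidate grid, score it via the shared pairwise path,
    # then sort per user and keep the top_k rows.
    candidates = pd.DataFrame(
        list(product(valid_users, all_items)), columns=["userID", "itemID"]
    )
    scored = score_pairs(candidates)
    return (
        scored.sort_values(["userID", "prediction"], ascending=[True, False])
        .groupby("userID")
        .head(top_k)
        .reset_index(drop=True)
    )

def fake_scorer(df):
    # Stand-in for predict_rating: higher item id gets a higher score.
    out = df.copy()
    out["prediction"] = out["itemID"].astype(float)
    return out

recs = recommend_top_k(fake_scorer, valid_users=[1, 2], all_items=[10, 11, 12], top_k=2)
print(recs["itemID"].tolist())  # [12, 11, 12, 11]
```

This keeps one scoring code path, so batching and any future changes to prediction logic apply to both functions.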

def evaluate_model(
model,
test_df,
metrics,
col_user=DEFAULT_USER_COL,
col_item=DEFAULT_ITEM_COL,
col_rating=DEFAULT_RATING_COL,
col_prediction=DEFAULT_PREDICTION_COL,
k=DEFAULT_K,
batch_size=1024,
):
"""Evaluate model using rating and ranking metrics.
Args:
model (NNEmbeddingRanker): Trained embedding ranker model.
test_df (pandas.DataFrame): Test dataframe.
metrics (dict): Dictionary of metric functions to use for evaluation.
Keys should be metric names and values should be functions that take
test_df, predictions_df, and other parameters.
col_user (str): User column name.
col_item (str): Item column name.
col_rating (str): Rating column name.
col_prediction (str): Prediction column name.
k (int): K value for ranking metrics.
batch_size (int): Batch size for predictions to avoid memory issues.
Returns:
dict: Dictionary with metric name as key and metric value as value.
"""
# Generate predictions for test data
predictions_df = predict_rating(
model,
test_df,
col_user=col_user,
col_item=col_item,
col_rating=col_rating,
col_prediction=col_prediction,
batch_size=batch_size,
)

# Calculate metrics
results = {}
for metric_name, metric_func in metrics.items():
# Different metrics may have different required parameters
Collaborator: if the only difference is k, you may:

results[metric_name] = metric_func(
    test_df,
    predictions_df,
    col_user=col_user,
    col_item=col_item,
    col_rating=col_rating,
    col_prediction=col_prediction,
    k=k if 'k' in metric_func.__code__.co_varnames else None,
 )

if 'k' in metric_func.__code__.co_varnames:
results[metric_name] = metric_func(
test_df,
predictions_df,
col_user=col_user,
col_item=col_item,
col_rating=col_rating,
col_prediction=col_prediction,
k=k
)
else:
results[metric_name] = metric_func(
test_df,
predictions_df,
col_user=col_user,
col_item=col_item,
col_rating=col_rating,
col_prediction=col_prediction
)

return results
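One caveat with the `'k' in metric_func.__code__.co_varnames` check used above: co_varnames lists every local variable, not just parameters, so a metric whose body happens to use a local named k would be misclassified. A sketch (an assumption, not part of the PR) using inspect.signature, which inspects only the parameter list; the two metric functions are illustrative stubs:

```python
import inspect

def ranking_metric(true, pred, k=10):
    # Stub: a metric that genuinely takes a k parameter.
    return k

def rating_metric(true, pred):
    k = 5  # local variable named k, NOT a parameter
    return k

def accepts_k(func):
    # Only real parameters appear in the signature.
    return "k" in inspect.signature(func).parameters

print(accepts_k(ranking_metric))  # True
print(accepts_k(rating_metric))   # False
print("k" in rating_metric.__code__.co_varnames)  # True (false positive)
```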