Problem statement
Merlin models should allow you to train models with different libraries (TF/PT/lightfm/xgboost/implicit, etc.) and then compare them in a consistent, apples-to-apples way. Right now each library has its own evaluation metrics, which aren't directly comparable across frameworks. During the paper experiments, for instance, we had to implement custom metric code for implicit to match the output from TF. Customers also want to be able to compare against their own metrics.
Goals
- Customers are able to compare models across libraries.
- Coverage of metrics: Precision, NDCG, MAP, AUC
- Support for both retrieval and ranking models
- Used only during the final evaluation process, not during training (training-time evaluation is a non-goal)
- Ability to slice on key features from user/item
- Example
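To make the target metric set concrete, here is a minimal sketch of per-user Precision@k, NDCG@k, and AP@k with binary relevance. The function names and signatures are illustrative, not the eventual Merlin API, and NumPy stands in for the cuDF/cupy implementation:

```python
import numpy as np

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k recommended items that are relevant."""
    topk = ranked_ids[:k]
    return len(set(topk) & set(relevant_ids)) / k

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: discounted gain over ideal discounted gain."""
    relevant = set(relevant_ids)
    gains = [1.0 / np.log2(i + 2)
             for i, item in enumerate(ranked_ids[:k]) if item in relevant]
    ideal = [1.0 / np.log2(i + 2) for i in range(min(len(relevant), k))]
    return sum(gains) / sum(ideal) if ideal else 0.0

def average_precision_at_k(ranked_ids, relevant_ids, k):
    """AP@k: mean of precision at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    hits, score = 0, 0.0
    for i, item in enumerate(ranked_ids[:k]):
        if item in relevant:
            hits += 1
            score += hits / (i + 1)
    return score / min(len(relevant), k) if relevant else 0.0
```

Having one reference implementation like this, rather than each framework's own variant, is exactly what the single-implementation constraint below asks for.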
Constraints
- Scoring should happen over all items, not just negative samples, so that ranking metrics are directly comparable
- Only one implementation of these metrics, so that the calculations are consistent. This implies converting each framework's outputs into a common representation.
- We want to be able to compare across multiple frameworks.
- Pre-requisites:
- [RMP] Support Offline Batch processing of Recs Generation Pipelines #419
- Support for external libraries and systems level evaluation
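A small illustration of why the full-catalog scoring constraint matters: a true item's rank against sampled negatives can never look worse than its rank against the whole catalog, and is usually optimistic. All values below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

n_items = 1000
scores = rng.normal(size=n_items)        # model scores for every item in the catalog
true_item = 42
scores[true_item] = scores.max() - 0.1   # true item scores near, but not at, the top

# Rank of the true item against the FULL catalog (1 = best).
full_rank = int((scores > scores[true_item]).sum()) + 1

# Rank against 100 sampled negatives only -- a subset, so systematically optimistic.
neg = rng.choice(np.delete(np.arange(n_items), true_item), size=100, replace=False)
sampled_rank = int((scores[neg] > scores[true_item]).sum()) + 1

assert sampled_rank <= full_rank  # sampled ranks never look worse than full ranks
```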
Starting Point
- Input is an ordered list of recommendations that gets scored/sliced
- Evaluation is computed from a list of predictions
- cuDF implementation
- Integration with visualization tooling (Marc to add options)
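A rough sketch of this starting point, including the goal of slicing on a user feature, using pandas as a stand-in for cuDF (whose DataFrame API largely mirrors pandas). The frame layout, column names, and `hit_rate_at_k` helper are hypothetical:

```python
import pandas as pd

# Hypothetical evaluation frame: one row per (user, recommended item), ordered by rank.
preds = pd.DataFrame({
    "user_id":      [1, 1, 1, 2, 2, 2],
    "item_id":      [10, 11, 12, 10, 13, 14],
    "rank":         [1, 2, 3, 1, 2, 3],
    "relevant":     [0, 1, 0, 1, 0, 0],  # ground-truth interaction
    "user_segment": ["new", "new", "new", "returning", "returning", "returning"],
})

def hit_rate_at_k(df, k):
    # A user counts as a hit if any relevant item appears in their top-k.
    top = df[df["rank"] <= k]
    return top.groupby("user_id")["relevant"].max().mean()

overall = hit_rate_at_k(preds, k=2)           # both users hit within top-2 -> 1.0
by_segment = (preds.groupby("user_segment")   # slice the metric by a user feature
                   .apply(lambda g: hit_rate_at_k(g, k=1)))
```

Because the input is already an ordered list of scored recommendations, any framework that can emit such a frame can be evaluated and sliced the same way.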
Notes
- Gabriel has a POC of evaluation framework that takes a cudf dataframe with predictions and computes using cupy popular top-k metrics (recall, precision, mrr, map, ndcg) - code: https://github.com/rapidsai/recsys/tree/main/benchmark_recsys/code/evaluation
- Marc built a system like this at Spotify which converted to specific framework metrics
Tasks
(Groundwork for cross-framework evaluation)
- Create a mechanism for transferring data across frameworks
- Add Merlin Array classes core#102
- Design document
- PoC for Cross framework model evaluation metrics
- Prepare presentation ( TBC ) and collect feedback from team
( Enter the goal here )
- Establish a standard set of metrics (to be used in conjunction with cross-framework data transfer)
- Create an example for cross-framework evaluation metrics
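The data-transfer groundwork could start from a conversion shim like the one below: it normalizes arrays from different frameworks into one representation by trying the standard interchange protocols (DLPack, the NumPy array interface) before framework-specific fallbacks. The dispatch order and fallbacks are illustrative, not the Merlin Array design from core#102:

```python
import numpy as np

def to_numpy(arr):
    """Normalize an array from any framework to a NumPy ndarray.

    Sketch of what a Merlin Array-style conversion layer might do:
    try zero-copy interchange protocols first, then framework-specific escapes.
    """
    if isinstance(arr, np.ndarray):
        return arr
    if hasattr(arr, "__dlpack__"):          # DLPack: PyTorch, TF, CuPy, JAX, ...
        return np.from_dlpack(arr)
    if hasattr(arr, "__array_interface__") or hasattr(arr, "__array__"):
        return np.asarray(arr)              # NumPy array-interface protocol
    if hasattr(arr, "numpy"):               # e.g. eager tf.Tensor / torch.Tensor
        return arr.numpy()
    raise TypeError(f"don't know how to convert {type(arr)!r}")
```

With every framework's predictions funneled through one such conversion point, the single metric implementation above can score them all identically.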