[RMP] Cross-framework model evaluation metrics #407

@karlhigley

Description

Problem statement

Merlin Models should allow you to train models with different libraries (TF, PyTorch, LightFM, XGBoost, implicit, etc.) and then compare them in a consistent, apples-to-apples way. Right now each library has its own evaluation metrics, which aren't directly comparable across frameworks. During the paper experiments, for instance, we had to implement custom metric code for implicit to match the output from TF. Customers also want to be able to compare against their own metrics.

Goals

  • Customers are able to compare models across libraries.
  • Coverage of metrics: Precision, NDCG, MAP, AUC
  • Support for both retrieval and ranking models
  • (Non-goal) Use during training: these metrics are only computed during the final evaluation step.
  • Ability to slice on key features from the user/item
  • Example
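
To make the goal concrete, here is a minimal sketch of framework-agnostic ranking metrics computed on plain NumPy arrays: because the inputs are just ordered item IDs and a relevant-item set, the same code can score output from any of the libraries above. The function names and binary-relevance formulation are illustrative, not an existing Merlin API.

```python
import numpy as np

def precision_at_k(ranked_item_ids, relevant_item_ids, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = ranked_item_ids[:k]
    hits = np.isin(top_k, list(relevant_item_ids)).sum()
    return hits / k

def ndcg_at_k(ranked_item_ids, relevant_item_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    top_k = ranked_item_ids[:k]
    gains = np.isin(top_k, list(relevant_item_ids)).astype(float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # discount for positions 1..k
    dcg = (gains * discounts).sum()
    ideal_hits = min(len(relevant_item_ids), k)     # best case: all hits ranked first
    idcg = discounts[:ideal_hits].sum()
    return dcg / idcg if idcg > 0 else 0.0
```

A single implementation like this is what makes the numbers comparable: every framework's recommendations are reduced to the same ordered-list representation before scoring.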

Constraints

  • Scoring should happen for all items, not just negative samples, so that ranking metrics are directly comparable
  • Only one implementation of these metrics, so that the calculations are consistent. This implies converting each framework's outputs to a common representation.
  • We want to be able to compare across multiple frameworks.
  • Pre-requisites:
    • [RMP] Support Offline Batch processing of Recs Generation Pipelines #419
    • Support for external libraries and systems-level evaluation
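
The full-catalog scoring constraint can be sketched as follows. The embedding setup is hypothetical (a dot-product retrieval model); the point is that the rank of a held-out positive is its position in an ordering over every item, not over a handful of sampled negatives, so the resulting metrics don't depend on a framework's negative-sampling scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical retrieval-model state: one user vector, full catalog of item vectors.
n_items, dim = 1000, 32
item_embeddings = rng.normal(size=(n_items, dim))
user_embedding = rng.normal(size=(dim,))

# Score EVERY item in the catalog, not just sampled negatives.
scores = item_embeddings @ user_embedding
ranked_item_ids = np.argsort(-scores)  # best-scoring item first

# The rank of a held-out positive is its position in the full ordering.
positive_item = 42
rank_of_positive = int(np.where(ranked_item_ids == positive_item)[0][0])
```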

Starting Point

  • Input is an ordered list of recommendations that gets scored/sliced
  • Evaluation is computed from a list of predictions
  • cuDF implementation
  • Integration with visualization tooling (Marc to add options)
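
Slicing metrics on key user/item features then becomes a groupby-aggregate over per-query results. This sketch uses pandas as a stand-in (cuDF mirrors the pandas API, so the same code runs on GPU by swapping the import); the column names and metric values are made up for illustration.

```python
import pandas as pd

# Hypothetical per-query evaluation output: one row per user,
# with an already-computed metric value and a user feature to slice on.
results = pd.DataFrame({
    "user_id":      [1, 2, 3, 4],
    "user_country": ["US", "US", "DE", "DE"],
    "ndcg_at_10":   [0.9, 0.7, 0.5, 0.3],
})

# Slicing a metric on a key feature is a groupby-aggregate.
sliced = results.groupby("user_country")["ndcg_at_10"].mean()
```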

Notes

Tasks

( Groundwork for cross-framework evaluation )

  • Create a mechanism for transferring data across frameworks
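
One possible shape for that mechanism is a small conversion layer that normalizes framework tensors to host NumPy arrays before metrics are computed. This is a minimal duck-typed sketch, not a proposed API: real frameworks need more cases (e.g. detaching PyTorch tensors from the autograd graph, or moving GPU tensors to host first), and `np.from_dlpack` requires NumPy >= 1.22.

```python
import numpy as np

def to_numpy(tensor):
    """Convert a framework tensor to a NumPy array on the host (sketch)."""
    if isinstance(tensor, np.ndarray):
        return tensor
    if hasattr(tensor, "numpy"):       # e.g. TF eager tensors, PyTorch CPU tensors
        return tensor.numpy()
    if hasattr(tensor, "__dlpack__"):  # anything speaking the DLPack protocol
        return np.from_dlpack(tensor)
    return np.asarray(tensor)          # lists, scalars, other array-likes
```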

