[RMP] Quick-start for retrieval models training pipeline #828

## Problem:
Merlin provides documentation and a number of example notebooks on how to use tools such as NVTabular, Dataloader, and Merlin Models. To build a pipeline for training and evaluation, a Data Scientist needs to analyze that material, copy and paste the code snippets that demonstrate the API, and glue that code together into scripts for experimentation and benchmarking.
It may also be unclear to users which advanced API options in Merlin Models can be exposed as hyperparameters and potentially improve model accuracy.

## Goal:
This RMP provides a quick-start for building training pipelines for retrieval models.
It addresses the retrieval models part of the larger RMP #732, in particular steps 4-7 of the Data Scientist journey when experimenting with Merlin Models.

The quick-start for retrieval consists of:


### Template scripts
- Generic template script for preprocessing
- Generic template script for training retrieval models (MF, TwoTower, YouTubeDNN), exposing the main hyperparameters.
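
As a rough illustration of what "exposing the main hyperparameters" could look like, here is a minimal sketch of a command-line interface built with Python's standard `argparse` module. Every flag name, default, and model identifier below is a hypothetical placeholder chosen for this example, not the interface of the actual template scripts, and the training step itself is left out.

```python
import argparse

# Hypothetical identifiers for the three retrieval models named above.
MODEL_CHOICES = ("mf", "two_tower", "youtube_dnn")


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI; every flag name and default here is illustrative."""
    parser = argparse.ArgumentParser(description="Train a retrieval model (sketch)")
    parser.add_argument("--train-path", required=True, help="Preprocessed training parquet files")
    parser.add_argument("--eval-path", required=True, help="Preprocessed evaluation parquet files")
    parser.add_argument("--model-type", choices=MODEL_CHOICES, default="two_tower")
    parser.add_argument("--embedding-dim", type=int, default=64)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--batch-size", type=int, default=4096)
    parser.add_argument("--epochs", type=int, default=1)
    return parser


def parse_hparams(argv=None) -> dict:
    """Parse arguments into a plain dict that a training function could consume."""
    return vars(build_parser().parse_args(argv))


# Example invocation with an explicit argument list instead of sys.argv.
hparams = parse_hparams(
    ["--train-path", "train/", "--eval-path", "eval/", "--model-type", "mf", "--lr", "0.01"]
)
print(hparams)
```

Returning a plain dict keeps the parsed hyperparameters easy to log alongside the evaluation metrics of each run.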

### Documentation

- Documentation of the scripts' command-line arguments
- Documentation of best practices learned from our experimentation:
    - Hyperparameter tuning: the search space, the most important hyperparameters, and the best hyperparameters found on the public Tenrec dataset for single-task (STL) and multi-task (MTL) models
    - Intuitions about API options (building blocks, arguments) that can improve model accuracy
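
To make the idea of a documented search space concrete, the sketch below declares one as a dictionary and draws random trial configurations from it. The hyperparameter names and ranges are assumptions made up for this example, not the search space or the best values from our experiments, and random search stands in here for whatever tuning strategy is actually used.

```python
import math
import random

# Hypothetical search space: names and ranges are placeholders, not tuned values.
SEARCH_SPACE = {
    "model_type": ("choice", ["mf", "two_tower", "youtube_dnn"]),
    "embedding_dim": ("choice", [32, 64, 128, 256]),
    "lr": ("log_uniform", 1e-4, 1e-1),
    "dropout": ("uniform", 0.0, 0.3),
}


def sample_config(space, rng):
    """Draw one configuration from the search space."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "choice":
            config[name] = rng.choice(spec[1])
        elif kind == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "log_uniform":
            # Sample the exponent uniformly so each decade is equally likely.
            low, high = math.log10(spec[1]), math.log10(spec[2])
            config[name] = 10 ** rng.uniform(low, high)
        else:
            raise ValueError(f"Unknown distribution kind: {kind}")
    return config


rng = random.Random(0)  # fixed seed so the trials are reproducible
trials = [sample_config(SEARCH_SPACE, rng) for _ in range(5)]
for trial in trials:
    print(trial)
```

Sampling the learning rate on a log scale is a common choice because its useful values typically span several orders of magnitude.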


## Constraints:
- Preprocessing - The preprocessing template notebook will perform basic feature encoding for categorical variables (e.g. categorify) and continuous variables (e.g. standardization). The customer can extend the template with the advanced preprocessing ops demonstrated in our examples.
- Training - The training and evaluation script for Merlin Models should be fully configurable, taking as input the parquet files and schema, plus a number of hyperparameters exposed via command-line arguments. The output of this script should be the evaluation metrics, logged to a CSV file and to Weights & Biases.
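
The two constraints can be illustrated in plain Python. The sketch below is a conceptual stand-in, not the NVTabular or Merlin Models API: it shows what categorify and standardization do to a column, and how one row of metrics per run could be appended to a CSV file. Reserving id 0 for unseen categories, the function names, and the column names are all assumptions of this example, and Weights & Biases logging is omitted.

```python
import csv
import os
import statistics
import tempfile


def fit_categorify(values):
    """Map each distinct category to a contiguous integer id, reserving 0 for unseen values."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


def categorify(values, mapping):
    """Encode categories with a fitted mapping; unseen categories become 0."""
    return [mapping.get(v, 0) for v in values]


def standardize(values):
    """Rescale to zero mean and unit variance (population standard deviation)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero for constant columns
    return [(v - mean) / std for v in values]


def log_metrics_csv(path, row):
    """Append one row of hyperparameters and metrics, writing the header on first use."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if write_header:
            writer.writeheader()
        writer.writerow(row)


item_ids = ["b", "a", "b", "c"]
mapping = fit_categorify(item_ids)
encoded = categorify(item_ids + ["z"], mapping)  # "z" was not seen during fit
prices = standardize([10.0, 20.0, 30.0])

# The metric value below is a placeholder, not a real result.
metrics_path = os.path.join(tempfile.mkdtemp(), "metrics.csv")
log_metrics_csv(metrics_path, {"model_type": "two_tower", "lr": 0.001, "recall_at_10": 0.0})
```

Appending one row per run keeps results from many hyperparameter trials in a single file that is easy to compare afterwards.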

## Starting Point:
The starting point is the set of retrieval training scripts we developed for the [Retrieval](https://github.com/NVIDIA-Merlin/research/tree/main/retrieval_exp) and [Two-stage](https://github.com/NVIDIA-Merlin/research/tree/main/two_stage) research projects.

## Tasks:

- [ ] Create a generic dataset preprocessing template for retrieval models
- [ ] https://github.com/NVIDIA-Merlin/Merlin/issues/688

#### Tenrec dataset experiments
- [ ] NVIDIA-Merlin/models#803

#### Research & Documentation
- [ ] NVIDIA-Merlin/models#802  
- [ ] NVIDIA-Merlin/models#804

#### Testing
- [x] NVIDIA-Merlin/models#805
