This repository provides a modular template for building recommender systems in Python using implicit feedback data. It is designed to streamline experimentation with recommendation models on top of a modern ML stack. Two neural models are implemented, Matrix Factorization and an MLP, each built on one of two user representations: a learned user embedding, or the user's history of clicked items (aggregated item embeddings).
- PyTorch Lightning – for scalable and structured model training
- Hydra – for flexible configuration management
- ClearML – for experiment tracking and ML workflow orchestration
- (Optional) AWS S3 – for storing datasets and models remotely
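To make the two modeling ideas concrete, here is a minimal PyTorch sketch of a Matrix Factorization scorer and a history-based MLP scorer. Class names, dimensions, and the mean-pooling of history embeddings are illustrative assumptions, not the repository's actual code:

```python
import torch
from torch import nn


class MatrixFactorization(nn.Module):
    """Hypothetical sketch: score = dot product of user and item embeddings."""

    def __init__(self, n_users: int, n_items: int, dim: int = 64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        # Implicit-feedback score (logit) for each (user, item) pair in the batch.
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(dim=-1)


class HistoryMLP(nn.Module):
    """Hypothetical sketch: the user is represented by the embeddings of clicked items."""

    def __init__(self, n_items: int, dim: int = 64):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, history_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool the history item embeddings into a single user vector (assumed aggregation).
        user_vec = self.item_emb(history_ids).mean(dim=1)
        x = torch.cat([user_vec, self.item_emb(item_ids)], dim=-1)
        return self.mlp(x).squeeze(-1)
```

In this template, training of models like these is orchestrated with PyTorch Lightning, configured via Hydra, and tracked in ClearML, as listed above.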
As an example, this template uses the ContentWise Impressions dataset - a collection of implicit interactions and impressions of movies and TV series from an Over-The-Top media service, which delivers its content over the Internet. In the preprocessing phase the dataset is limited to movie content only.
Exploratory data analysis can be found in `contentwise_eda.ipynb`.
- Rapid prototyping of recommender systems
- Benchmarking implicit models
- Educational purposes (learning modern ML tools in practice)
More details about setup, usage, and customization can be found in the sections below.
To make use of this repository, follow these steps:
- Download the dataset
  - Download the ContentWise Impressions dataset, specifically the `CW10M` directory.
  - Place it in the following path: `cache/data-cw10m/`
- Set up external services
  - Configure your connection to a ClearML server for experiment tracking.
  - (Optional) Set up access to AWS S3 if you want to use remote storage for data and/or models.
Prepare environment variables related to ClearML and AWS in `.env` (see `.env.example`):

```
CLEARML_CONFIG_FILE=clearml.conf
CLEARML_WEB_HOST=<your-clearml-web-host>
...
```
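One way to make these variables available to the scripts is to load them at startup, for example with python-dotenv. This is only a sketch of the idea; the repository may load its environment differently:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read key=value pairs from .env into the process environment (existing values are kept).
load_dotenv()

required = ("CLEARML_CONFIG_FILE", "CLEARML_WEB_HOST")
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing ClearML environment variables: {missing}")
```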
Create and activate a virtual environment with conda:

```bash
conda create --name <env_name> python=3.13.2
conda activate <env_name>
```
Install with pip:
```bash
pip install .  # Add flag -e to install in editable mode
```

(Optional) Using docker compose:

```bash
docker compose up -d  # Run container based on docker-compose.yml
```

(Optional) Using plain docker:

```bash
docker build -t ds-image .                                # Build image defined in Dockerfile
docker run -dit --gpus all --name ds-container ds-image   # Run container based on that image
```

Process the data:

```bash
python steps/process_data.py
```

After running this script the following datasets are generated:
- `train.parquet` - behavioral data about movie consumption for training (implicit feedback)
- `validation.parquet` - behavioral data for validation
- `user_mapper.parquet` - user name to user index mapper
- `item_mapper.parquet` - item name to item index mapper
- `last_user_histories.parquet` - histories of the last n consumed items per user, computed on the train data
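As a quick sanity check, the generated files can be inspected with pandas. The output directory and column layout below are assumptions; adjust them to wherever `steps/process_data.py` actually writes its results:

```python
from pathlib import Path

import pandas as pd

# Assumed output location; change to the directory used by steps/process_data.py.
data_dir = Path("cache")

train = pd.read_parquet(data_dir / "train.parquet")
user_mapper = pd.read_parquet(data_dir / "user_mapper.parquet")
item_mapper = pd.read_parquet(data_dir / "item_mapper.parquet")
histories = pd.read_parquet(data_dir / "last_user_histories.parquet")

print(train.shape, list(train.columns))
print(f"{len(user_mapper)} users, {len(item_mapper)} items")
print(histories.head())
```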
Evaluate the baselines:

```bash
python steps/evaluate_baselines.py
```

The script reports offline metrics (AUROC & NDCG) for the baseline solutions.
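For reference, the two metrics can be computed per user roughly as follows. This is a sketch using scikit-learn for AUROC and a hand-rolled binary-relevance NDCG@k, not the repository's evaluation code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def ndcg_at_k(relevance: np.ndarray, scores: np.ndarray, k: int = 10) -> float:
    """Binary-relevance NDCG@k for a single user."""
    order = np.argsort(-scores)[:k]
    gains = relevance[order]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float((gains * discounts).sum())
    ideal = np.sort(relevance)[::-1][:k]
    idcg = float((ideal * discounts[: len(ideal)]).sum())
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: 1 = item consumed in validation, 0 = not consumed.
relevance = np.array([1, 0, 0, 1, 0])
scores = np.array([0.9, 0.4, 0.3, 0.8, 0.1])  # e.g. scores from a popularity baseline
print(roc_auc_score(relevance, scores), ndcg_at_k(relevance, scores, k=5))
```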
Training the MLP based on user histories for 20 epochs:
```bash
python steps/train.py experiment=mlp_with_history trainer.max_epochs=20
```

Hyperparameter optimization:

```bash
python steps/optimize_hparams.py
```
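The key=value arguments passed to `train.py` above are standard Hydra overrides: each pair replaces a field of the composed configuration. A minimal, hypothetical entry point illustrating the mechanism (the actual scripts and config layout may differ):

```python
import hydra
from omegaconf import DictConfig, OmegaConf


# config_path and config_name are placeholders; point them at the repository's Hydra configs.
@hydra.main(config_path="configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # `experiment=mlp_with_history` selects a config group option,
    # while `trainer.max_epochs=20` overrides a single field of the composed config.
    print(OmegaConf.to_yaml(cfg))


if __name__ == "__main__":
    main()
```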
Inference:

```bash
python steps/infer.py
```
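Conceptually, inference for an implicit-feedback recommender means scoring candidate items for a user and keeping the top-k. The toy illustration below reuses the hypothetical `HistoryMLP` sketch from earlier in this README and is not what `steps/infer.py` actually does:

```python
import torch

# Assumes the HistoryMLP class from the earlier sketch is in scope; weights here are random.
model = HistoryMLP(n_items=1000)
model.eval()

history = torch.tensor([[3, 17, 42, 56]])   # one user's last clicked items
candidates = torch.arange(1000)             # score every item in the catalogue
with torch.no_grad():
    scores = model(history.expand(len(candidates), -1), candidates)

top_k = torch.topk(scores, k=10).indices    # indices of the 10 highest-scoring items
print(top_k.tolist())
```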
Serving:

```bash
python steps/serve.py
```
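A serving layer for such a model usually exposes an endpoint that takes a user identifier and returns the top-k recommendations. The FastAPI sketch below is hypothetical and only illustrates the shape of such an API; `steps/serve.py` may use a different framework or interface:

```python
from fastapi import FastAPI

app = FastAPI()

# Placeholder scores; in practice these would come from the trained model or a precomputed index.
FAKE_RECOMMENDATIONS = {"user_1": [("item_42", 0.93), ("item_7", 0.88)]}


@app.get("/recommend/{user_id}")
def recommend(user_id: str, k: int = 10):
    # Return the top-k (item, score) pairs for the requested user.
    return {"user_id": user_id, "items": FAKE_RECOMMENDATIONS.get(user_id, [])[:k]}
```

Such an app can be served locally with uvicorn while prototyping.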
Running the full pipeline:

```bash
python steps/run_pipeline.py
```

Useful docker commands:

```bash
docker exec -it ds-container bash  # Execute bash in a running container
docker compose start/stop/down     # Start, stop, or tear down the compose services
docker builder prune               # Remove build cache
```

