A machine learning-based movie recommendation system that provides personalized movie suggestions to users. The system integrates collaborative filtering, content-based filtering, and popularity-based models within a microservices architecture. It features real-time data ingestion, model training, CI/CD, and a REST API.
The Netflicks recommendation system follows a modular and scalable microservices architecture with the following key components:
- Data Collection: Raw movie and user interaction data is simulated and stored in the
data/
directory. - Kafka Integration: Simulated real-time event streams including movie ratings, watch logs, and recommendation requests.
- Preprocessing: Data cleaning, transformation, genre standardization, and one-hot encoding in the
preprocessing/
module. - Database: PostgreSQL stores normalized and validated movie, user, rating, and watch history data in the
db/
directory.
- Feature Engineering: Constructs user-item sparse matrices and genre vectors.
- Model Training: Implements collaborative filtering (ALS), content-based filtering (genre similarity), and popularity models in
model_training/
. - Model Storage: Trained models and vectors saved as versioned
.pkl
files in themodels/
directory using MLflow. - Model Evaluation: Performance metrics tracked using offline (RMSE, HitRate) and online evaluation.
- Flask Server: REST API implemented in
api/
to serve recommendations. - Model Serving: Dynamically serves predictions using CF, CBF, or popularity models depending on user history.
- Docker Containerization: API, training, and validation workflows containerized for consistent deployment.
- Load Balancing: Designed to handle concurrent user requests.
- Unit Tests: Component-level testing in
testing/
. - Pipeline Testing: Simulates API calls and validates recommendation quality.
- CI/CD: Automated retraining, validation, and deployment using Jenkins.
- Kafka → Preprocessing → PostgreSQL
- PostgreSQL → Model Training → Model Artifacts
- API Request → Inference Engine → Recommendations
.
├── api/ # API server implementation
├── data/ # Simulated and processed data storage
├── db/ # PostgreSQL schema and database manager
├── kafka_import/ # Kafka stream simulator
├── model_training/ # Model training scripts for CF, CBF, Popularity
├── models/ # Trained model and vector storage
├── pipeline/ # Offline and online evaluation scripts
├── preprocessing/ # Data transformation, formatting, and loading
└── testing/ # Unit and integration test scripts
- Python 3.10
- Docker
- pip
- Clone the repository:
git clone <repository-url>
cd Netflicks
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
pip install -r requirements.run.txt # For running the API
pip install -r requirements.train.txt # For training the model
pip install -r requirements.data.txt # For data processing
Create a .env
file in the root directory with the following variables:
# Add your environment variables here
- Build the Docker image:
docker build -f Dockerfile.run -t netflicks-api .
- Run the container:
docker run -p 8082:8082 -v $(pwd)/models:/app/models netflicks-api
- Start the API server:
python api/server.py
The API will be available at http://localhost:8082
GET /recommend/{user_id}
: Get movie recommendations for a specific user
- Data Preprocessing:
python preprocessing/preprocess.py
- Model Training:
python model_training/train.py
Run tests using:
python -m pytest testing/
This project uses Jenkins for CI/CD. The pipeline performs the following steps:
- Pulls the latest data version using DVC
- Trains models and logs them to MLflow
- Validates and deploys models via Docker containers
- Starts monitoring services (Prometheus, Grafana)
Configuration:
- Jenkinsfile defines stages for preprocessing, training, validation, API testing, and monitoring.
- Containers:
Dockerfile.train
,Dockerfile.validate
,Dockerfile.run
- Monitoring: Prometheus scrapes metrics; Grafana displays dashboards
Benefits:
- Builds triggered on code commits or daily schedule
- Secrets handled via Jenkins credentials
- Consistent environments across dev, test, and prod
Monitoring Dashboards:
- Request latency histogram
- Recommendation count per user
- Recommendation item count
- User segment activity breakdown
MLflow & DVC:
- MLflow tracks metrics, parameters, and artifacts for each model
- DVC manages dataset versions
Check out the demo: Netflicks Demo