This project transforms a basic employee attrition prediction model into a full-fledged MLOps system, implementing industry best practices for machine learning operations. The system focuses on predicting employee attrition - a critical HR analytics problem that helps organizations identify employees at risk of leaving and take proactive measures to improve retention.
**Production-Ready ML System**
- Robust data pipeline for consistent data processing
- Automated model training and validation workflows
- Reliable model deployment and serving infrastructure
- Comprehensive monitoring and drift detection
- Continuous Integration/Continuous Deployment (CI/CD)
**Responsible AI Implementation**
- Fairness assessment and bias mitigation
- Model explainability through SHAP values
- Transparent decision-making process
- Ethical considerations in predictions
**End-to-End MLOps Pipeline**
- Data versioning and lineage tracking
- Experiment tracking with MLflow
- Automated model retraining triggers
- Performance monitoring and alerting
- Model versioning and promotion
**Data Management**
- Automated data preprocessing pipeline
- Feature engineering and validation
- Data quality monitoring
- Reference data management
**Model Development**
- Automated model training pipeline
- Hyperparameter optimization
- Model evaluation and selection
- Cross-validation and testing
**Monitoring & Maintenance**
- Monitoring Strategy: High-level monitoring approach and model governance
- Drift Detection: Technical implementation of drift detection
- Performance metric tracking
- Automated alert system
- Model health monitoring
**Deployment & Serving**
- FastAPI-based prediction service
- Streamlit frontend for predictions
- Model version management
- A/B testing capability
**Responsible AI**
- Fairness metrics calculation
- Bias detection and mitigation
- SHAP-based feature importance
- Prediction explanations
**Tech Stack**
- Backend: Python 3.11+, FastAPI
- ML Libraries: scikit-learn, optuna, shap, evidently
- Monitoring: MLflow, custom drift detection
- Frontend: Streamlit
- Deployment: Docker, Docker Compose
- CI/CD: GitHub Actions
The system predicts the likelihood of employee attrition by analyzing various factors such as:
- Employee demographics
- Job characteristics
- Work environment metrics
- Performance indicators
- Compensation and benefits
- Career development opportunities
This prediction helps organizations:
- Identify at-risk employees
- Understand key factors driving attrition
- Develop targeted retention strategies
- Optimize HR policies and practices
- Improve employee satisfaction and engagement
A full-stack MLOps solution for employee attrition prediction with robust drift detection capabilities, featuring:
- Automated model training, retraining, and promotion
- Drift detection and monitoring
- API and Streamlit frontend
- MLflow tracking and artifact management
- CI/CD with GitHub Actions and Docker Compose
- ML Model Training: Automated training and validation of employee attrition prediction models
- MLflow Integration: Tracking experiments, model registration, and model versioning
- Drift Detection System:
  - Feature drift monitoring with statistical tests
  - Prediction drift monitoring for model outputs
  - Automated alerts when drift is detected
  - Detailed HTML reports with feature-by-feature analysis
  - FastAPI endpoint for on-demand drift detection
  - MLflow integration for tracking drift metrics over time
  - Customizable drift thresholds for different sensitivity levels
- GitHub Actions Workflows:
  - Automated drift detection on schedule
  - Model promotion workflow
  - MLflow metadata maintenance
- Visualization: Comprehensive HTML reports for data and prediction drift
- Frontend: Streamlit app for live predictions and model info
- API: Serves predictions and model info (FastAPI)
- Frontend: Streamlit app for live predictions and model info
- MLflow: Model tracking and artifact storage
- Automation: All workflows managed by GitHub Actions
The drift detection system consists of:
**Reference Data Management:**
- Saving baseline data for comparison
- Storing feature distributions and statistics

**Drift Detection Pipeline:**
- Feature drift detection using statistical tests (a minimal sketch follows below)
- Prediction drift monitoring
- HTML report generation

**Automation:**
- GitHub Actions workflows for scheduled monitoring
- Automatic issue creation for detected drift
- Model retraining triggers

**API Layer:**
- FastAPI endpoints for drift detection
- Report generation and retrieval
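As referenced above, here is a minimal sketch of what statistical feature-drift detection can look like. It is illustrative only: it assumes a two-sample Kolmogorov-Smirnov test via `scipy.stats.ks_2samp`, and the helper name `detect_feature_drift` and reference-data path are hypothetical. The project's own logic (see `check_production_drift.py` and the `evidently` dependency) may differ.

```python
# Hypothetical per-feature drift check using a two-sample KS test.
# Column selection, threshold, and paths are illustrative assumptions.
import pandas as pd
from scipy.stats import ks_2samp

def detect_feature_drift(reference: pd.DataFrame, current: pd.DataFrame,
                         threshold: float = 0.05) -> dict:
    """Return per-feature p-values and an overall drift flag."""
    feature_results = {}
    for col in reference.select_dtypes("number").columns:
        if col not in current.columns:
            continue
        _, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        feature_results[col] = {"p_value": float(p_value), "drift": p_value < threshold}
    return {
        "drift_detected": any(r["drift"] for r in feature_results.values()),
        "features": feature_results,
    }

# Example usage (paths are assumptions):
# reference = pd.read_csv("reference_data/reference.csv")
# current = pd.read_csv("path/to/current.csv")
# print(detect_feature_drift(reference, current))
```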
```bash
# Clone the repository
git clone https://github.com/BTCJULIAN/Employee-Attrition-2.git
cd Employee-Attrition-2

# Install dependencies with Poetry
poetry install

# Run drift detection with default settings
python check_production_drift.py

# Generate HTML report for current data
python scripts/generate_drift_report.py --current-data path/to/data.csv

# Save new reference data (baseline) for drift comparison
python save_reference_data.py
```
The project includes a FastAPI endpoint for drift detection:
```bash
# Start the API server
python drift_api.py

# Test the API with the test client
python test_drift_api_client.py

# Access the API documentation at http://localhost:8000/docs
```
Example API request:
```python
import requests
import pandas as pd

# Load data
df = pd.read_csv("path/to/data.csv")
data = df.to_dict(orient="records")

# Detect drift
response = requests.post(
    "http://localhost:8000/detect-drift",
    json={"data": data, "threshold": 0.05},
)

# Check results
result = response.json()
print(f"Drift detected: {result['drift_detected']}")
```
For comprehensive documentation on drift detection and monitoring:
- Monitoring Strategy: High-level monitoring approach and model governance
- Drift Detection Guide: Technical implementation and usage details
Repair and maintain MLflow metadata with:
```bash
python scripts/mlflow_maintenance.py --fix-run-metadata
```
**Clone the repo and set up `.env`:**
```bash
cp .env.example .env  # Edit .env with your configuration
```

**Build and run all services:**
```bash
docker-compose up --build
```

**Access:**
- API: http://localhost:8000
- Frontend: http://localhost:8501
- MLflow: http://localhost:5001
```
/
├── .github/workflows/            # GitHub Actions workflows
├── scripts/                      # Automation and utility scripts
├── src/                          # Source code
│   ├── employee_attrition_mlops/ # Core ML logic
│   ├── monitoring/               # Drift detection
│   └── frontend/                 # Streamlit app
├── tests/                        # Test files
├── docs/                         # Documentation
├── mlruns/                       # MLflow experiment tracking data
├── mlartifacts/                  # MLflow model artifacts and metadata
├── reference_data/               # Baseline data for drift detection
├── reference_predictions/        # Reference model predictions
├── reports/                      # Generated reports
└── test_artifacts/               # Test output files
```
The project uses two main MLflow directories (the sketch after this list shows how a training run writes to both):
- `mlruns/`: Contains experiment tracking data, including:
  - Run metadata
  - Metrics
  - Parameters
  - Tags
  - Run history
- `mlartifacts/`: Stores model artifacts and metadata, including:
  - Saved models
  - Model configurations
  - Feature importance plots
  - SHAP explanations
  - Drift detection reports
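To show how these directories get populated, here is an illustrative (not project-specific) MLflow snippet; the experiment name, metric names, and artifact path are assumptions.

```python
# Illustrative sketch of how runs and artifacts land in mlruns/ and mlartifacts/.
import mlflow

mlflow.set_tracking_uri("http://localhost:5001")
mlflow.set_experiment("employee_attrition")   # experiment name is an assumption

with mlflow.start_run(run_name="example_training_run"):
    mlflow.log_param("model_type", "logistic_regression")   # -> mlruns/ (parameters)
    mlflow.log_metric("roc_auc", 0.87)                       # -> mlruns/ (metrics)
    mlflow.log_artifact("reports/feature_importance.png")    # -> mlartifacts/ (artifacts)
```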
The project includes two main drift detection scripts:
- `check_production_drift.py`:
  - Main drift detection script
  - Runs statistical tests on production data
  - Generates HTML reports
  - Updates MLflow metrics
  - Used in automated workflows
- `check_drift_via_api.py`:
  - API-based drift detection
  - Used for on-demand drift checks
  - Supports custom thresholds
  - Returns JSON results
  - Used in the drift detection API
The project maintains two main log files:
- `production_automation.log`: Logs from production automation workflows
- `test_production_automation.log`: Logs from test runs of production automation
These logs are used for:
- Debugging automation issues
- Monitoring workflow execution
- Tracking drift detection results
- Auditing model changes
This section provides a detailed explanation of the repository's organization and the purpose of each directory and key files.
Contains GitHub Actions workflows for CI/CD automation:
- `production_automation.yml`: Main workflow for production deployment
- `drift_detection.yml`: Scheduled drift monitoring
- `model_promotion.yml`: Model version promotion pipeline
- `testing.yml`: Automated testing workflow
Project documentation and guides:
- Architecture diagrams and system design
- Setup and installation guides
- API documentation
- User manuals
- Development guidelines
MLflow tracking directories (typically in `.gitignore`):
- `mlruns/`: Experiment tracking data
- `mlartifacts/`: Model artifacts and metadata
- Contains run histories, metrics, and model versions
Local model storage (outside MLflow registry):
- Saved model files
- Model checkpoints
- Pre-trained models
- Model configurations
Generated analysis and monitoring reports:
- Drift detection reports
- Model performance metrics
- Fairness analysis reports
- Confusion matrices
- Feature importance visualizations
Standalone Python scripts for various tasks:
- `optimize_train_select.py`: Model training and selection
- `batch_predict.py`: Batch prediction processing
- `create_drift_reference.py`: Reference data generation
- `mlflow_maintenance.py`: MLflow metadata management
- `generate_reports.py`: Report generation utilities
Core Python package containing reusable code:
- `config.py`: Configuration management
- `data_processing.py`: Data preprocessing and feature engineering
- `pipelines.py`: ML pipeline implementation
- `utils.py`: Utility functions and helpers
- `api.py`: FastAPI endpoints
Streamlit-based user interface:
- `app.py`: Main Streamlit application
- UI components and layouts
- Visualization utilities
- User interaction handlers
Monitoring and drift detection logic:
- Drift detection algorithms
- Alert generation
- Performance monitoring
- Statistical testing
Test suite for the codebase:
- Unit tests for individual components
- Integration tests for system workflows
- Test fixtures and utilities
- Performance benchmarks
Project configuration and dependency management:
- Package metadata
- Dependencies and versions
- Development tools configuration
- Build settings
Lock file for Poetry dependency management:
- Exact dependency versions
- Hash verification
- Dependency resolution
Containerization configurations:
- Base image setup
- Dependency installation
- Application configuration
- Environment setup
Multi-container application orchestration:
- Service definitions
- Network configuration
- Volume mappings
- Environment variables
Environment configuration (should be gitignored):
- Database credentials
- API keys and secrets
- Service endpoints
- Feature flags
All automation is managed by GitHub Actions workflows:
- Testing and linting
- Drift detection
- Model retraining
- Batch prediction
- Model promotion
- API redeployment
Create a `.env` file in the root directory with the following variables:

```bash
# Database Configuration
DATABASE_URL_PYMSSQL=mssql+pymssql://username:password@hostname:1433/database

# MLflow Configuration
MLFLOW_TRACKING_URI=http://localhost:5001  # MLflow server
MLFLOW_MODEL_NAME=employee_attrition_model
MLFLOW_MODEL_STAGE=Production

# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
```
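For illustration only, one way to load these variables in Python is with `python-dotenv`; the project's actual configuration logic lives in `src/employee_attrition_mlops/config.py` and may differ.

```python
# Illustrative sketch of reading the variables above; not the project's real config module.
import os
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env from the current working directory

DATABASE_URL = os.environ["DATABASE_URL_PYMSSQL"]
MLFLOW_TRACKING_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5001")
MLFLOW_MODEL_NAME = os.getenv("MLFLOW_MODEL_NAME", "employee_attrition_model")
MLFLOW_MODEL_STAGE = os.getenv("MLFLOW_MODEL_STAGE", "Production")
API_HOST = os.getenv("API_HOST", "0.0.0.0")
API_PORT = int(os.getenv("API_PORT", "8000"))
```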
This guide provides detailed setup instructions for macOS, including solutions for common challenges.
Before starting, ensure you have the following installed:
**Git**
```bash
# Verify Git installation
git --version
```

**Python 3.11**
```bash
# Install Python 3.11 using Homebrew
brew install python@3.11
# Verify Python version
python3.11 --version
```

**Homebrew**
```bash
# Install Homebrew if not already installed
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
**ODBC Driver Installation**
- Challenge: Microsoft ODBC driver installation can be tricky on macOS
- Solution: See the Setup Details Guide

**Pydantic v2 Dependency Conflicts**
- Challenge: Some packages may require specific Pydantic versions
- Solution: See the Setup Details Guide

**Python Version Management**
- Challenge: Multiple Python versions can cause conflicts
- Solution: See the Setup Details Guide
**Clone the Repository**
```bash
git clone https://github.com/BTCJULIAN/Employee-Attrition-2.git
cd Employee-Attrition-2
```

**Install Poetry**
```bash
# Install Poetry
curl -sSL https://install.python-poetry.org | python3.11 -
# Add Poetry to your PATH (add to ~/.zshrc or ~/.bash_profile)
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
```

**Install ODBC Drivers**
```bash
# Install unixodbc
brew install unixodbc
# Install Microsoft ODBC Driver for SQL Server
brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
brew update
brew install msodbcsql17 mssql-tools
```

**Install Project Dependencies**
```bash
# Install all dependencies including development tools
poetry install --with dev
```

**Configure Environment Variables**

Create a `.env` file in the project root directory:
```bash
cp .env.example .env
```
Edit the `.env` file with your configuration:
```bash
# Database Configuration
DATABASE_URL_PYMSSQL=mssql+pymssql://username:password@hostname:1433/database
# MLflow Configuration
MLFLOW_TRACKING_URI=http://localhost:5001
MLFLOW_MODEL_NAME=employee_attrition_model
MLFLOW_MODEL_STAGE=Production
# API Configuration
API_HOST=0.0.0.0
API_PORT=8000
```
Note: The `.env` file is included in `.gitignore` for security reasons. Never commit sensitive information.

**Start MLflow Server**

In a new terminal window, run:
```bash
# Start MLflow UI server
poetry run mlflow ui --host 0.0.0.0 --port 5001
```
Access the MLflow UI at: http://localhost:5001
**Check Poetry Environment**
```bash
poetry env info
```

**Test Database Connection**
```bash
poetry run python -c "from src.employee_attrition_mlops.config import get_db_connection; get_db_connection()"
```

**Verify MLflow Connection**
```bash
poetry run python -c "import mlflow; print(mlflow.get_tracking_uri())"
```
**ODBC Driver Issues**
```bash
# Verify ODBC driver installation
odbcinst -q -d
# Check ODBC configuration
odbcinst -q -s
```

**Python Version Issues**
```bash
# Ensure correct Python version
poetry env use python3.11
```

**Dependency Installation Issues**
```bash
# Clear Poetry cache and retry
poetry cache clear . --all
poetry install --with dev
```
After completing the setup:
- Run the test suite:
  ```bash
  poetry run pytest
  ```
- Start the API server:
  ```bash
  poetry run python src/employee_attrition_mlops/api.py
  ```
- Launch the frontend:
  ```bash
  poetry run streamlit run src/frontend/app.py
  ```
This section describes the end-to-end MLOps workflow implemented in this project, highlighting key MLOps principles and best practices for production-grade machine learning systems.
**Comprehensive Logging & Governance**
- MLflow tracking for experiment reproducibility
- Detailed model metadata and lineage tracking
- Version control for data, code, and models
- Audit trails for model changes and deployments
**Automated Testing & Validation**
- Unit tests for individual components
- Integration tests for pipeline workflows
- Data validation at each processing stage
- Model performance validation
**Monitoring & Baselining**
- Reference data generation for drift detection
- Statistical baselines for feature distributions
- Performance metric tracking over time
- Automated alerting for anomalies
**Continuous Integration/Deployment**
- Automated testing on code changes
- Model versioning and promotion
- Containerized deployment
- Environment consistency
**Responsible AI Implementation**
- Fairness assessment and monitoring
- Model explainability tracking
- Bias detection and mitigation
- Ethical considerations in predictions
The data pipeline begins with loading and processing employee data:
```mermaid
graph LR
    A[Database] --> B[Data Loading]
    B --> C[Data Cleaning]
    C --> D[Feature Engineering]
    D --> E[Processed Data]
```
**Initial Data Setup:**
- Use `scripts/seed_database_from_csv.py` to populate the database with initial data
- Supports both CSV and direct database connections

**Data Processing Pipeline:**
- Data loaded via `src/employee_attrition_mlops/data_processing.py`
- Automated cleaning and preprocessing
- Feature engineering and validation (a preprocessing sketch follows below)
- Train/test/validation splits
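To make the preprocessing stage concrete, here is a hypothetical sketch of the kind of scikit-learn pipeline this step could build; the column names are assumptions, and the real logic lives in `data_processing.py` and `pipelines.py`.

```python
# Hypothetical preprocessing pipeline sketch; column names are illustrative.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["Age", "MonthlyIncome", "YearsAtCompany"]      # assumption
categorical_features = ["Department", "JobRole", "OverTime"]        # assumption

preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_features),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_features),
    ]
)
```

Such a preprocessor is typically chained with the estimator so that the same transformations are applied at training time and at prediction time.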
The model development pipeline is orchestrated by `scripts/optimize_train_select.py`:
```mermaid
graph TD
    A[Data] --> B[HPO with Optuna]
    B --> C[Model Selection]
    C --> D[Final Training]
    D --> E[Validation]
    E --> F[MLflow Logging]
```
**Hyperparameter Optimization:**
- Uses Optuna for efficient HPO (a minimal sketch follows after this list)
- Cross-validation for robust evaluation
- Multiple model architectures considered
**Model Selection & Training:**
- Best model selected based on validation metrics
- Final training on full training set
- Comprehensive validation suite
**MLflow Artifacts:**
- Performance metrics and plots
- Fairness analysis reports
- SHAP explanations
- Drift detection baselines
- Model registered to 'Staging' stage
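As referenced above, the following is a minimal, hypothetical sketch of Optuna-driven HPO followed by registering the selected model and moving it to the 'Staging' stage. The estimator, search space, and names are assumptions rather than the actual `optimize_train_select.py` logic.

```python
# Hypothetical HPO + registration sketch (not the real optimize_train_select.py).
import mlflow
import mlflow.sklearn
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def objective(trial, X, y):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 400),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 2, 6),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

def run_hpo_and_register(X, y, model_name="employee_attrition_model"):
    study = optuna.create_study(direction="maximize")
    study.optimize(lambda t: objective(t, X, y), n_trials=50)

    best_model = GradientBoostingClassifier(**study.best_params, random_state=42).fit(X, y)
    with mlflow.start_run(run_name="final_training"):
        mlflow.log_params(study.best_params)
        mlflow.log_metric("cv_roc_auc", study.best_value)
        # Log and register the model in the MLflow Model Registry
        mlflow.sklearn.log_model(best_model, artifact_path="model",
                                 registered_model_name=model_name)

    # Move the newly registered version to the Staging stage
    client = mlflow.tracking.MlflowClient()
    latest = client.get_latest_versions(model_name, stages=["None"])[0]
    client.transition_model_version_stage(model_name, latest.version, stage="Staging")
```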
Automated workflows managed by GitHub Actions:
```mermaid
graph LR
    A[PR Created] --> B[Linting/Testing]
    B --> C[Training Pipeline]
    C --> D[Model Promotion]
    D --> E[API Deployment]
```
**Pull Request Workflow:**
- Code linting and formatting
- Unit and integration tests
- Documentation validation
**Production Automation:**
- Triggered on main branch updates
- Runs full training pipeline
- Builds and deploys API service
- Updates model registry
The trained model is served through a FastAPI service:
```mermaid
graph LR
    A[Model Registry] --> B[API Service]
    B --> C[Batch Prediction]
    B --> D[Real-time Prediction]
```
**API Service:**
- FastAPI implementation in `src/employee_attrition_mlops/api.py`
- Docker containerization
- Health checks and monitoring
- Swagger documentation

**Prediction Modes:**
- Real-time predictions via API (a request example follows below)
- Batch predictions using `scripts/batch_predict.py`
- Support for both single and bulk requests
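For illustration, a real-time request might look like the sketch below; the `/predict` route and the payload fields are assumptions, so consult the Swagger docs at http://localhost:8000/docs for the actual schema.

```python
# Hypothetical real-time prediction request; route and fields are assumptions.
import requests

employee = {
    "Age": 34,
    "Department": "Research & Development",
    "JobRole": "Laboratory Technician",
    "MonthlyIncome": 3200,
    "OverTime": "Yes",
    "YearsAtCompany": 2,
}

response = requests.post("http://localhost:8000/predict", json=employee)
print(response.json())  # e.g. an attrition probability (illustrative)
```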
Continuous monitoring and automated retraining:
```mermaid
graph LR
    A[Production Data] --> B[Drift Detection]
    B --> C{Drift Detected?}
    C -->|Yes| D[Retrain Trigger]
    C -->|No| E[Continue Monitoring]
    D --> F[Training Pipeline]
```
**Drift Detection:**
- Reference data generation via `scripts/create_drift_reference.py`
- Statistical tests for feature drift
- Prediction drift monitoring
- Automated alert system

**Retraining Triggers:**
- Significant drift detection
- Scheduled retraining
- Performance degradation
- Manual override capability
Ethical considerations and model transparency:
```mermaid
graph TD
    A[Model] --> B[Fairness Analysis]
    A --> C[Explainability]
    A --> D[Bias Detection]
    B --> E[Reports]
    C --> E
    D --> E
```
**Fairness Assessment:**
- Multiple fairness metrics
- Protected attribute analysis
- Bias mitigation strategies
**Explainability:**
- SHAP value generation (a minimal sketch follows after this list)
- Feature importance analysis
- Prediction explanations
- Decision boundary visualization
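As referenced above, a minimal SHAP sketch might look like this; the explainer choice and plots are illustrative rather than the project's exact implementation.

```python
# Hypothetical SHAP explanation sketch; the explainer and plots are illustrative.
import shap

def explain_predictions(model, X_background, X_to_explain):
    """Compute SHAP values for a fitted model on a batch of rows."""
    # shap.Explainer picks a suitable algorithm for the given model type
    explainer = shap.Explainer(model, X_background)
    shap_values = explainer(X_to_explain)
    # Global view: mean |SHAP| per feature
    shap.plots.bar(shap_values, show=False)
    # Local view: contribution breakdown for a single employee
    shap.plots.waterfall(shap_values[0], show=False)
    return shap_values
```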
All components are integrated through:
**MLflow Tracking:**
- Experiment management
- Model versioning
- Artifact storage
- Metric tracking
**Docker Compose:**
- Service orchestration
- Environment consistency
- Easy deployment
- Scalability
**GitHub Actions:**
- Automated workflows
- Environment management
- Deployment coordination
- Monitoring integration
This section explains the key technology choices made for this MLOps pipeline and their benefits.
Why MLflow?
- Comprehensive Tracking: MLflow provides a unified platform for tracking experiments, parameters, metrics, and artifacts
- Model Registry: Built-in model versioning and stage management (Staging/Production)
- Reproducibility: Detailed logging of environment, code versions, and dependencies
- Integration: Seamless integration with popular ML frameworks and cloud providers
Key Benefits:
- Centralized experiment management
- Easy model versioning and promotion
- Detailed model lineage tracking
- Built-in UI for experiment visualization
Why Poetry?
- Deterministic Builds: Lock file ensures consistent dependency versions
- Virtual Environment Management: Automatic environment creation and activation
- Dependency Resolution: Efficient resolution of complex dependency trees
- Development Workflow: Built-in commands for building, publishing, and testing
Key Benefits:
- Consistent development environments
- Simplified dependency management
- Better security through version pinning
- Streamlined development workflow
Why Fairlearn?
- Comprehensive Fairness Metrics: Multiple fairness definitions and metrics
- Bias Mitigation: Built-in algorithms for bias reduction
- Protected Attributes: Support for analyzing multiple protected groups
- Integration: Works well with scikit-learn pipelines
Why SHAP?
- Model-Agnostic: Works with any ML model
- Local & Global Explanations: Individual and overall feature importance
- Visualization: Rich visualization capabilities
- Trust: Widely accepted in industry and research
Key Benefits:
- Ethical model development
- Transparent decision-making
- Regulatory compliance
- Stakeholder trust
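As a small illustration of how Fairlearn typically slots into such a pipeline, the sketch below computes group-wise metrics over a protected attribute; the attribute and metric choices are assumptions, not the project's actual fairness report.

```python
# Hypothetical fairness-assessment sketch with Fairlearn; the protected
# attribute (e.g. a "Gender" column) and metric choices are assumptions.
from fairlearn.metrics import MetricFrame, demographic_parity_difference, selection_rate
from sklearn.metrics import accuracy_score, recall_score

def fairness_report(y_true, y_pred, sensitive):
    frame = MetricFrame(
        metrics={
            "accuracy": accuracy_score,
            "recall": recall_score,
            "selection_rate": selection_rate,
        },
        y_true=y_true,
        y_pred=y_pred,
        sensitive_features=sensitive,
    )
    print(frame.by_group)  # per-group metric table
    dpd = demographic_parity_difference(y_true, y_pred, sensitive_features=sensitive)
    print(f"Demographic parity difference: {dpd:.3f}")
    return frame
```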
Why GitHub Actions?
- Native Integration: Tight integration with GitHub repositories
- Flexible Workflows: Customizable pipeline definitions
- Matrix Testing: Parallel testing across environments
- Artifact Management: Built-in artifact storage and sharing
Key Benefits:
- Automated testing and deployment
- Consistent build environments
- Easy integration with other tools
- Cost-effective for open-source projects
Why FastAPI?
- Performance: High-performance async framework
- Type Safety: Built-in data validation and type checking
- Documentation: Automatic OpenAPI/Swagger documentation
- Modern Features: Async support, dependency injection
Key Benefits:
- Fast and scalable API service
- Self-documenting endpoints
- Easy integration with ML models
- Strong type safety
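To illustrate the type-safety point, a FastAPI endpoint in this style can validate inputs with a Pydantic model, as in the hypothetical sketch below (the route and field names are assumptions, not the project's actual schema).

```python
# Hypothetical sketch of FastAPI's built-in request validation.
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class EmployeeFeatures(BaseModel):
    Age: int = Field(ge=18, le=80)
    MonthlyIncome: float = Field(gt=0)
    OverTime: str  # "Yes" / "No"

@app.post("/predict")
def predict(features: EmployeeFeatures):
    # Invalid payloads are rejected with a 422 response before any model code runs
    return {"received": features}
```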
Why Docker?
- Isolation: Consistent runtime environments
- Portability: Run anywhere with Docker installed
- Scalability: Easy horizontal scaling
- Versioning: Container version control
Key Benefits:
- Environment consistency
- Simplified deployment
- Easy scaling
- Version control for deployments
This section provides quick reference commands for common development and deployment tasks.
**Activate Poetry Environment**
```bash
poetry shell
```

**Install Dependencies**
```bash
# Install all dependencies
poetry install
# Install with development tools
poetry install --with dev
```

**Run Training Pipeline**
```bash
# Run full training pipeline with HPO
poetry run python scripts/optimize_train_select.py
# Run with specific configuration
poetry run python scripts/optimize_train_select.py --config config/training_config.yaml
```

**Generate Drift Reference Data**
```bash
# Create new reference dataset
poetry run python scripts/create_drift_reference.py
# Specify custom reference data path
poetry run python scripts/create_drift_reference.py --output data/reference/new_reference.csv
```

**Run Batch Predictions**
```bash
# Process batch predictions
poetry run python scripts/batch_predict.py --input data/predictions/input.csv --output data/predictions/results.csv
```

**Run Test Suite**
```bash
# Run all tests
poetry run pytest
# Run specific test file
poetry run pytest tests/test_model.py
# Run with coverage report
poetry run pytest --cov=src --cov-report=html
```

**Code Quality Checks**
```bash
# Run linter
poetry run ruff check .
# Run type checker
poetry run mypy src/
# Format code
poetry run black .
```

**Start MLflow UI**
```bash
# Start MLflow UI server
poetry run mlflow ui --host 0.0.0.0 --port 5001
# Access at: http://localhost:5001
```

**Run Drift Detection**
```bash
# Check for data drift
poetry run python scripts/check_production_drift.py
# Generate drift report
poetry run python scripts/generate_drift_report.py --current-data data/current.csv
```

**Build Docker Images**
```bash
# Build API service
docker build -t employee-attrition-api -f Dockerfile .
# Build frontend
docker build -t employee-attrition-frontend -f Dockerfile.frontend .
```

**Run with Docker Compose**
```bash
# Start all services
docker-compose up --build
# Start in detached mode
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
```
**Access Services**
- API: http://localhost:8000
- Frontend: http://localhost:8501
- MLflow: http://localhost:5001
- API Documentation: http://localhost:8000/docs
**Seed Database**
```bash
# Seed from CSV
poetry run python scripts/seed_database_from_csv.py --input data/raw/employees.csv
# Verify database connection
poetry run python -c "from src.employee_attrition_mlops.config import get_db_connection; get_db_connection()"
```

**Run GitHub Actions Locally**
```bash
# Install act
brew install act
# Run specific workflow
act -W .github/workflows/production_automation.yml
```

**Common Issues**
```bash
# Clear Poetry cache
poetry cache clear . --all
# Rebuild Docker images
docker-compose build --no-cache
# Check service logs
docker-compose logs -f [service_name]
```

**Environment Verification**
```bash
# Check Python version
poetry run python --version
# Verify dependencies
poetry show --tree
# Check MLflow connection
poetry run python -c "import mlflow; print(mlflow.get_tracking_uri())"
```
Note: All commands assume you're in the project root directory and have Poetry installed. For Docker commands, ensure Docker and Docker Compose are installed and running.
License: MIT
The project documentation is organized into several key areas:
- Architecture: System design and components
- Setup Guide: Installation and configuration
- Getting Started: Quick start guide
- API Documentation: API reference
- MLflow Usage: Experiment tracking and model management
- Monitoring Strategy: High-level monitoring approach and model governance
- Drift Detection Guide: Technical implementation of drift detection
- MLOps Workflow: End-to-end pipeline guide
- CI/CD Workflow: Continuous integration and deployment
- Responsible AI: Fairness assessment and bias mitigation
- Troubleshooting Guide: Common issues and solutions