This project simulates work done within Scania’s Data Analytics & Fleet Management Division, where the goal is to predict component failures in trucks before they occur. Using real-world operational data from Scania’s APS (Air Pressure System), this analysis enables predictive maintenance, reducing unplanned downtime, maintenance costs, and safety risks.
Scania’s Fleet Management System (FMS) and Service 360 Pro leverage telematics data and AI to forecast potential failures. This project aims to replicate that data-driven approach by building a predictive model capable of classifying faulty (pos) and non-faulty (neg) vehicles.
- Build a data-driven predictive maintenance model using Scania’s APS dataset.
- Handle severe class imbalance effectively.
- Compare performance between Logistic Regression (from scratch) and Gaussian Naive Bayes (from scratch).
- Provide actionable insights for fleet managers and maintenance teams.
- Dataset Source: Scania APS Failure dataset
- Features: 170+ anonymized operational and histogram variables
- Target:
  - `pos` → trucks with failure
  - `neg` → healthy trucks
- Missing Values: Represented as `"na"`, filled using median imputation.
- Imbalance: Extremely high, with healthy trucks (~98%) vs faulty (~2%).
- Missing Value Handling: Replaced with column-wise median.
- Highly Correlated Features: Removed where correlation > 0.95.
- Feature Scaling: Standardized using z-score normalization.
- Train-Validation Split: 80:20 stratified split.
- Balancing Strategy:
  - No Downsampling (retain all majority samples).
  - SMOTE Oversampling applied to minority class (10× increase).
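The preprocessing steps listed above can be sketched in plain NumPy. This is a minimal illustration rather than the project's actual code: the function names, the greedy keep-first rule for correlated columns, the random seed, and the toy arrays are all assumptions made for the example.

```python
import numpy as np

def impute_median(X):
    """Fill NaNs in each column with that column's median (NaNs ignored)."""
    X = np.array(X, dtype=float)  # copy, so the caller's array is untouched
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X

def drop_correlated(X, threshold=0.95):
    """Greedily keep columns whose |Pearson r| with every kept column is <= threshold."""
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.abs(np.corrcoef(X, rowvar=False))
    corr = np.nan_to_num(corr)  # constant columns yield NaN; treat them as uncorrelated
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

def stratified_split(y, val_fraction=0.2, seed=0):
    """Return train/validation indices that preserve the class ratio."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.where(y == label)[0])
        n_val = int(round(len(idx) * val_fraction))
        val_idx.extend(idx[:n_val])
        train_idx.extend(idx[n_val:])
    return np.array(train_idx), np.array(val_idx)

def standardize(X_train, X_val):
    """Z-score both splits using statistics from the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (X_train - mean) / std, (X_val - mean) / std

# Toy data (hypothetical): column 1 is column 0 scaled by 2, column 2 has a gap.
X_raw = np.array([[1.0, 2.0, 5.0],
                  [2.0, 4.0, np.nan],
                  [3.0, 6.0, 1.0],
                  [4.0, 8.0, 3.0]])
X_imp = impute_median(X_raw)          # the NaN becomes the column median
X_red, kept = drop_correlated(X_imp)  # the redundant column is dropped

y_demo = np.array([1] * 10 + [0] * 90)
train_idx, val_idx = stratified_split(y_demo)
X_tr_std, X_va_std = standardize(X_red, X_red)
```

Fitting the scaling statistics on the training split alone is a design choice made here to keep validation data out of the fit; the original notebook may order these steps differently.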
📊 A pie chart and missing value bar plots were created to visualize data imbalance and completeness.
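To make the SMOTE oversampling step in the balancing strategy more concrete, here is a simplified NumPy sketch of the core idea: each synthetic sample is an interpolation between a minority sample and one of its nearest minority neighbours. The project itself uses `imblearn` for this step; the function below (`smote_sketch`, a hypothetical name) is not that library's implementation, and reading "10× increase" as `n_new = 9 * len(X_min)` is an assumption.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise squared Euclidean distances between minority samples.
    sq = (X_min ** 2).sum(axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X_min @ X_min.T
    np.fill_diagonal(dist2, np.inf)              # a sample is not its own neighbour
    neighbours = np.argsort(dist2, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)        # random starting samples
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Toy minority class (hypothetical data): 20 samples, 3 features.
X_min_demo = np.random.default_rng(1).normal(size=(20, 3))
X_synth = smote_sketch(X_min_demo, n_new=9 * len(X_min_demo))
X_min_balanced = np.vstack([X_min_demo, X_synth])  # ten times the original count
```

Oversampling is normally applied to the training split only, so that the validation set keeps the original class ratio.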
- Built using NumPy only (no scikit-learn).
- Includes sigmoid function, gradient descent optimization, and binary cross-entropy loss.
- Trained for 1000 epochs with validation loss tracking.
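A minimal sketch of what such a NumPy-only logistic regression might look like is shown below. The 1000 epochs come from the description above; the learning rate, zero initialisation, clipping constants, function names, and toy data are assumptions for illustration, not the project's actual settings.

```python
import numpy as np

def sigmoid(z):
    # Clip the input so np.exp cannot overflow for large |z|.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def bce_loss(y, p, eps=1e-12):
    # Binary cross-entropy; clip probabilities to avoid log(0).
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logistic(X, y, X_val, y_val, lr=0.1, epochs=1000):
    """Full-batch gradient descent, recording validation loss each epoch."""
    w = np.zeros(X.shape[1])
    b = 0.0
    val_losses = []
    for _ in range(epochs):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / len(y)  # gradient of the mean BCE w.r.t. w
        grad_b = np.mean(p - y)          # gradient of the mean BCE w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
        val_losses.append(bce_loss(y_val, sigmoid(X_val @ w + b)))
    return w, b, val_losses

# Toy, linearly separable data (hypothetical) to exercise the training loop.
rng = np.random.default_rng(0)
X_all = rng.normal(size=(200, 2))
y_all = (X_all[:, 0] + X_all[:, 1] > 0).astype(float)
X_tr, y_tr, X_va, y_va = X_all[:160], y_all[:160], X_all[160:], y_all[160:]

w, b, val_losses = fit_logistic(X_tr, y_tr, X_va, y_va)
val_pred = (sigmoid(X_va @ w + b) >= 0.5).astype(float)
val_accuracy = float(np.mean(val_pred == y_va))
```

The 0.5 decision threshold is also an assumption; on imbalanced data, lowering it trades precision for recall.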
- Assumes features follow normal (Gaussian) distribution.
- Calculates mean, variance, and prior probability for each class.
- Uses log probabilities for numerical stability.
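The three points above map onto a short NumPy sketch: fit per-class means, variances, and priors, then predict by summing log-likelihoods instead of multiplying probabilities. The function names, the small variance-smoothing constant, and the toy clusters are assumptions for illustration and not taken from the project.

```python
import numpy as np

def fit_gnb(X, y, var_smoothing=1e-9):
    """Estimate the per-class mean, variance, and log prior."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # A small constant keeps zero-variance features from causing division by zero.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + var_smoothing
    log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
    return classes, means, variances, log_priors

def predict_gnb(X, classes, means, variances, log_priors):
    """Pick the class with the highest log posterior (up to a shared constant)."""
    diff = X[:, None, :] - means[None, :, :]  # shape: (samples, classes, features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)
    return classes[np.argmax(log_lik + log_priors[None, :], axis=1)]

# Toy data (hypothetical): two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(5.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

params = fit_gnb(X_demo, y_demo)
preds = predict_gnb(np.array([[0.0, 0.0], [5.0, 5.0]]), *params)
```

Summing log-densities matters here because the product of 170+ small per-feature probabilities would underflow to zero in floating point.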
Each model was evaluated on:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix (heatmap)
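All four scores follow from the confusion-matrix counts. A small sketch, assuming labels are encoded as 1 for `pos` and 0 for `neg` (the function name and toy labels are illustrative):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and the derived scores for binary labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"confusion": [[tn, fp], [fn, tp]],
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels (hypothetical): 3 faulty and 5 healthy trucks.
scores = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0],
                        [1, 1, 0, 1, 0, 0, 0, 0])
```

With a roughly 98:2 class ratio, a model that always predicts `neg` would already score about 98% accuracy, which is why precision, recall, and F1 are reported alongside it.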
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression (from scratch) | 0.9871 | 0.6764 | 0.8640 | 0.7588 |
| Gaussian Naive Bayes (from scratch) | 0.9639 | 0.3848 | 0.9040 | 0.5398 |
- Logistic Regression achieved a balanced trade-off between precision and recall, making it suitable for operational deployment.
- Gaussian Naive Bayes achieved higher recall, useful for initial screening, but generated more false positives.
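- For Scania’s fleet management system, recall is crucial: missing a faulty vehicle (false negative) could result in costly breakdowns.
- Logistic Regression gives up only a little recall relative to Gaussian Naive Bayes (0.8640 vs 0.9040) while achieving much higher precision (0.6764 vs 0.3848), so it provides the more reliable, balanced performance for predictive maintenance.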
- Missing Value Feature Visualization:
  Bar plots of the missing-value percentage per feature, used to decide which features to drop.
- Confusion Matrix Heatmap:
  Visualizes the model’s classification power and error spread.
- Future Work:
  - Try Ensemble Learning (e.g., Random Forest).
  - Integrate real-time sensor data for continuous learning.
  - Deploy the model and monitor it through CI/CD pipelines.
- Class Imbalance Pie Chart
- Missing Values Bar Plot
- Confusion Matrix Heatmap
If you’re presenting this project:
- Slide 1–2: Scania company intro & predictive maintenance goal.
- Slide 3–5: Data challenges — missing values, imbalance, correlations.
- Slide 6–8: Model development (Logistic & GNB).
- Slide 9: Model performance table.
- Slide 10–11: Key insights & business impact.
- Slide 12: Conclusion.
- Languages: Python (NumPy, Pandas, Matplotlib, Seaborn)
- ML Libraries: `imblearn` (SMOTE only)
- Visualization: Matplotlib
- Environment: Jupyter Notebook / Colab
This project demonstrates how Scania’s data-driven approach can be replicated to predict component failures, improving fleet reliability, maintenance efficiency, and safety.
The final model — Logistic Regression — provides a strong foundation for Scania’s predictive maintenance analytics pipeline.