This project develops a robust, explainable, and risk-aware pricing model for auto insurance policies. It incorporates statistical analysis, machine learning, and reproducible data practices to predict insurance claim severity and optimize premium pricing.
Completed as part of the Week 3 Challenge at 10Academy.
- Understand and explore insurance data to uncover actionable insights.
- Establish a reproducible data pipeline using Git, GitHub, and DVC.
- Statistically validate hypotheses related to insurance risk.
- Build predictive models to estimate:
  - Claim Severity: how much we might pay.
  - Claim Probability: how likely a customer is to claim.
- Construct a dynamic pricing formula that incorporates business margins.
Area | Tools Used |
---|---|
Programming | Python, Jupyter |
Data Handling | Pandas, NumPy, DVC |
Visualization | Matplotlib, Seaborn, Plotly |
Modeling | Scikit-learn, XGBoost, SHAP, LIME |
Version Control | Git, GitHub |
CI/CD | GitHub Actions |
Environment | venv + requirements.txt |
.
├── data/                  # Raw and processed data (tracked via DVC)
├── models/                # Saved models
├── notebooks/             # Jupyter notebooks for EDA, testing, modeling
├── src/                   # Core source code
│   ├── preprocessing/     # Cleaning, transformation, encoding
│   ├── task_3/            # Hypothesis testing modules
│   └── task_4/            # Modeling pipeline and interpretation
├── tests/                 # Unit tests
├── .dvc/                  # DVC metadata
├── .github/workflows/     # GitHub Actions CI pipeline
├── dvc.yaml               # DVC pipeline definition
├── requirements.txt       # Python dependencies
└── README.md              # Project overview (this file)
- Configured Git and GitHub, created the `task-1` branch
- Performed EDA on claims, premiums, and customer demographics
- Visualized insights across provinces, genders, and vehicle types
- Identified key drivers of loss ratio and risk
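The loss ratio mentioned above is total claims divided by total premium. As a rough illustration, the sketch below computes it per group with the standard library only. The field names (`province`, `total_premium`, `total_claims`) and the sample records are hypothetical and are not taken from the project data.

```python
from collections import defaultdict


def loss_ratio_by_group(records, group_key):
    """Return total claims / total premium for each value of group_key."""
    claims = defaultdict(float)
    premiums = defaultdict(float)
    for row in records:
        claims[row[group_key]] += row["total_claims"]
        premiums[row[group_key]] += row["total_premium"]
    # Skip groups with no premium to avoid dividing by zero.
    return {g: claims[g] / premiums[g] for g in premiums if premiums[g] > 0}


# Made-up policies for illustration only (hypothetical field names).
policies = [
    {"province": "A", "total_premium": 1000.0, "total_claims": 400.0},
    {"province": "A", "total_premium": 1000.0, "total_claims": 0.0},
    {"province": "B", "total_premium": 500.0, "total_claims": 600.0},
]
ratios = loss_ratio_by_group(policies, "province")
for group, ratio in sorted(ratios.items()):
    print(f"{group}: {ratio:.2f}")
```

A ratio above 1.0 means the group paid out more in claims than it collected in premium.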
- Installed DVC and initialized version control
- Added data files to DVC tracking
- Set up a local remote storage and pushed data
- Ensured reproducibility and auditability of datasets
- Formulated and tested statistical hypotheses:
  - Risk varies across provinces and zip codes
  - Gender differences in claim frequency and severity
  - Profitability margins vary by region
- Used t-tests, z-tests, and chi-squared tests where applicable
- Provided a business interpretation for each result
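As an illustration of the kind of test listed above, here is a minimal sketch of a two-sided, two-proportion z-test that compares claim frequency between two groups. It uses only the Python standard library. The function name and the counts are hypothetical, and the project's own test code may differ.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(claims_a, policies_a, claims_b, policies_b):
    """Two-sided z-test of H0: both groups have the same claim frequency."""
    p_a = claims_a / policies_a
    p_b = claims_b / policies_b
    # Pooled claim rate under the null hypothesis of no difference.
    pooled = (claims_a + claims_b) / (policies_a + policies_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / policies_a + 1 / policies_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts for illustration only; not results from the project data.
z_stat, p_val = two_proportion_z_test(
    claims_a=120, policies_a=4000, claims_b=90, policies_b=4000
)
print(f"z = {z_stat:.3f}, p = {p_val:.4f}")

# Reject the null hypothesis at the 5% level only if p < 0.05.
reject_null = p_val < 0.05
```

The normal approximation behind the z-test is reasonable when each group has many policies and claims; for small counts an exact test is a safer choice.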
- Built severity regression models: Linear, Random Forest, XGBoost
- Evaluated using RMSE and R²
- Used SHAP and LIME for feature importance
- Modeled claim probability (classification) for pricing
- Final pricing formula:
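One common way to combine the two model outputs with business margins is to take the expected loss (claim probability times predicted severity), add an expense loading, and scale the result by a profit margin. The sketch below assumes that form. The function name, the parameters, and the example numbers are hypothetical, and this is not confirmed as the project's own formula.

```python
def risk_based_premium(claim_probability, predicted_severity,
                       expense_loading=0.0, profit_margin_rate=0.0):
    """Assumed form: (expected loss + expenses) * (1 + margin rate)."""
    if not 0.0 <= claim_probability <= 1.0:
        raise ValueError("claim_probability must be in [0, 1]")
    if predicted_severity < 0:
        raise ValueError("predicted_severity must be non-negative")
    # Expected loss = probability of a claim times its predicted size.
    expected_loss = claim_probability * predicted_severity
    return (expected_loss + expense_loading) * (1.0 + profit_margin_rate)


# Illustrative inputs only: 5% claim probability, severity of 20,000,
# a flat expense loading of 150, and a 10% profit margin.
premium = risk_based_premium(0.05, 20000.0, expense_loading=150.0,
                             profit_margin_rate=0.10)
print(f"{premium:.2f}")  # prints 1265.00
```

Keeping the expense loading and margin as explicit parameters makes it easy to see how much of the premium comes from modeled risk and how much from business assumptions.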
Insight | Impact on Pricing Strategy |
---|---|
Vehicles with ≤4 cylinders → higher severity risk | Apply loading to small-engine vehicles |
Non-VAT registered → higher risk | Raise base rate for unregistered customers |
Converted/modified vehicles → higher risk | Apply higher risk surcharge |
Alarm/immobilizer → lower risk | Provide discount for security features |
New vehicles → lower risk | Discount for newer vehicles |
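The loadings and discounts in the table can be expressed as multiplicative rating factors applied to a base premium. The sketch below shows one way to do that. The factor names and every numeric value are placeholders chosen for illustration, not figures estimated by the project.

```python
# Hypothetical relativities: values above 1.0 are loadings, values below
# 1.0 are discounts. The numbers are placeholders, not project results.
RELATIVITIES = {
    "small_engine": 1.10,           # 4 cylinders or fewer
    "non_vat_registered": 1.05,
    "converted_or_modified": 1.15,
    "alarm_immobilizer": 0.95,
    "new_vehicle": 0.97,
}


def adjust_premium(base_premium, policy_flags):
    """Multiply the base premium by the relativity of every flag that is set."""
    unknown = set(policy_flags) - set(RELATIVITIES)
    if unknown:
        raise KeyError(f"unknown rating factors: {sorted(unknown)}")
    premium = base_premium
    for name, is_set in policy_flags.items():
        if is_set:
            premium *= RELATIVITIES[name]
    return premium


# A small-engine vehicle with an alarm: one loading and one discount apply.
example = adjust_premium(
    1000.0,
    {"small_engine": True, "alarm_immobilizer": True, "new_vehicle": False},
)
print(f"{example:.2f}")
```

Multiplicative factors keep each adjustment proportional to the base premium; an additive surcharge per factor would be an equally valid design with different behavior for large premiums.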
Clone the repository:
git clone https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git
cd End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling
Install dependencies:
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
Run notebooks:
jupyter notebook
Run tests:
pytest
Data versioning with DVC:
dvc init
dvc add data/raw/insurance_data.csv
dvc remote add -d localstorage /path/to/your/storage
dvc push
To reproduce the data pipeline:
dvc pull

CI/CD: GitHub Actions is configured for:
- Code linting
- Unit tests
- Model validation (optional step)
Workflow defined in .github/workflows/deploy.yml.
Results summary:
- Best severity model: XGBoost
- RMSE improvement: +Δ% vs. baseline
- Top features: Engine Size, Vehicle Age, Province, Conversion Status
- Classification accuracy: ~X%
- Enables dynamic, fair, and risk-adjusted premium pricing
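For reference, RMSE, R², and the percentage RMSE improvement over a baseline can be computed as in the sketch below. It uses plain Python so it runs without extra packages. The toy values are invented for illustration and say nothing about the project's actual results.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse_improvement_pct(baseline_rmse, model_rmse):
    """Positive values mean the model has a lower RMSE than the baseline."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Toy claim amounts, invented for illustration.
actual = [1000.0, 2000.0, 3000.0, 4000.0]
baseline_pred = [2500.0] * 4  # a baseline that always predicts the mean
model_pred = [1100.0, 1900.0, 3200.0, 3900.0]

improvement = rmse_improvement_pct(rmse(actual, baseline_pred),
                                   rmse(actual, model_pred))
print(f"R2 = {r_squared(actual, model_pred):.3f}, "
      f"RMSE improvement = {improvement:.1f}%")
```

A mean-only baseline has an R² of zero by construction, so it is a convenient reference point for judging whether a severity model adds any predictive value.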
Contributors:
- Ayana Samuel
  - Role: Full Data Science Workflow
  - Skills: EDA, DVC, Statistical Testing, Machine Learning, GitOps
  - GitHub: https://github.com/ayanasamuel8/End-to-End-Insurance-Risk-Analytics-and-Predictive-Modeling.git
License: This project is licensed for academic and demonstration use. Contact the author for commercial usage rights.