A step-by-step notebook that engineers a proxy “bad-client” label, performs feature-rich EDA, and builds a leakage-free XGBoost credit scorecard with cross-validation and threshold tuning on roughly 438,000 loan-applicant records.
- Raw CSV → missing-value audit
- Days-to-years conversion + weighted heuristic risk score (see the sketch after this list)
- EDA (KDEs, lollipops, correlation, Chi-Square)
- One-hot encoding → 50 numeric features
- Baseline Logistic Regression (AUC 0.98)
- Leakage demo + fix → XGBoost (AUC 0.9997)
- Precision–Recall tuning for business cut-offs
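To make the feature-engineering and labelling bullets concrete, here is a minimal sketch of the day-count conversion, the weighted heuristic risk score, and the proxy label. The column names (`DAYS_BIRTH`, `DAYS_EMPLOYED`, `AMT_INCOME_TOTAL`, `CNT_CHILDREN`, `OCCUPATION_TYPE`), the file name, and the five rule weights are illustrative assumptions rather than the notebook's exact definitions; only the 0.20 cut-off is taken from the pipeline table below.

```python
# Sketch of the Feature Eng. and Proxy Label stages (column names, file name
# and rule weights are assumed for illustration; only the 0.20 cut-off is
# taken from the notebook).
import pandas as pd

df = pd.read_csv("application_record.csv")   # raw CSV, ~438k applicant rows

# Day counts are stored as negative offsets from the application date;
# convert them to positive years.
df["AGE_YEARS"] = -df["DAYS_BIRTH"] / 365.25
df["YEARS_EMPLOYED"] = -df["DAYS_EMPLOYED"].clip(upper=0) / 365.25

# Weighted heuristic risk score: each rule adds its weight when it fires.
rules = [
    (df["AMT_INCOME_TOTAL"] < df["AMT_INCOME_TOTAL"].quantile(0.25), 0.30),  # low income
    (df["YEARS_EMPLOYED"] < 1,                                       0.25),  # short tenure
    (df["CNT_CHILDREN"] >= 3,                                        0.20),  # many dependents
    (df["AGE_YEARS"] < 25,                                           0.15),  # young applicant
    (df["OCCUPATION_TYPE"].isna(),                                   0.10),  # occupation missing
]
df["risk_score"] = sum(cond.astype(float) * weight for cond, weight in rules)

# Proxy label: flag applicants whose heuristic score exceeds 0.20.
df["BAD_CLIENT"] = (df["risk_score"] > 0.20).astype(int)
print(df["BAD_CLIENT"].value_counts(normalize=True))
```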
Goal: classify applicants as high-risk or low-risk when no historical default flag exists, using only profile data (income, dependents, occupation, etc.).
Stage | Detail |
---|---|
Feature Eng. | Convert day counts to years; build 5-rule risk score |
Proxy Label | BAD_CLIENT = 1 if risk_score > 0.20 |
EDA | KDE + lollipop plots to visualise class separation |
Stats | Pearson heat-map, Chi-Square on categoricals |
Pre-proc | Drop ID, impute, binary map, one-hot (50 features); see the sketch after this table |
Baseline | Class-weighted Logistic Regression |
Tree Model | XGBoost + scale_pos_weight, 5-fold CV |
Threshold | PR curve to pick 0.30 vs 0.70 cut-offs |
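A rough sketch of the Pre-proc and Baseline stages, continuing from the DataFrame built in the sketch above. The imputation and mapping choices are assumptions; dropping `risk_score` (and the raw day columns) before training reflects the leakage fix mentioned in the summary.

```python
# Sketch of the Pre-proc and Baseline stages (continues from `df` above;
# imputation and mapping choices are assumptions).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Drop the identifier and everything used to construct the label;
# keeping risk_score would leak the answer straight into the features.
X = df.drop(columns=["ID", "BAD_CLIENT", "risk_score", "DAYS_BIRTH", "DAYS_EMPLOYED"])
y = df["BAD_CLIENT"]

# Impute: median for numeric columns, a sentinel category for the rest.
num_cols = X.select_dtypes(include="number").columns
X[num_cols] = X[num_cols].fillna(X[num_cols].median())
X = X.fillna("Unknown")

# Binary map Y/N flags, then one-hot encode the remaining categoricals (~50 features).
X = X.replace({"Y": 1, "N": 0}).infer_objects()
X = pd.get_dummies(X, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class-weighted logistic regression as the baseline scorecard.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
baseline.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
print(classification_report(y_test, baseline.predict(X_test)))
```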
Metric | Logistic Reg. | XGBoost (Refined) |
---|---|---|
ROC AUC | 0.9772 | 0.9997 |
Precision (Bad) | 0.1085 | 0.9738 |
Recall (Bad) | 0.9456 | 1.0000 |
F1 (Bad) | 0.1947 | 0.9867 |
Why almost perfect? The proxy label is a deterministic function of the input features, so a tree model can rediscover the five labelling rules almost exactly. Swap in real repayment outcomes and retrain to obtain realistic scores.
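The refined tree model can be sketched as follows, continuing from the train/test split above. The hyperparameters are placeholders rather than the notebook's tuned values; `scale_pos_weight` and stratified 5-fold cross-validation follow the pipeline table.

```python
# Sketch of the Tree Model stage: XGBoost with scale_pos_weight to offset
# class imbalance, scored with stratified 5-fold CV (hyperparameters are
# illustrative placeholders).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Ratio of good to bad clients up-weights the minority "bad" class.
spw = (y_train == 0).sum() / (y_train == 1).sum()

xgb = XGBClassifier(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=spw,
    eval_metric="auc",
    n_jobs=-1,
    random_state=42,
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc = cross_val_score(xgb, X_train, y_train, cv=cv, scoring="roc_auc")
print("5-fold ROC AUC:", np.round(auc, 4), "mean:", round(auc.mean(), 4))

# Refit on the full training split for threshold tuning below.
xgb.fit(X_train, y_train)
```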
Threshold | Precision | Recall | When to Use |
---|---|---|---|
0.30 | 0.94 | 1.00 | Risk-averse – catch every defaulter, accept more manual reviews |
0.70 | 0.99 | 0.997 | Cost-focused – minimise false positives, tolerate a ~0.3% miss rate |
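A minimal sketch of the threshold-tuning step, continuing from the fitted model above: trace the precision-recall curve on held-out data, then score applicants against a business-chosen cut-off instead of the default 0.5. The 0.30 and 0.70 values come from the table; everything else is illustrative.

```python
# Sketch of the Threshold stage: inspect the precision-recall trade-off and
# evaluate the two candidate business cut-offs.
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

proba = xgb.predict_proba(X_test)[:, 1]

# Full curve (can be plotted to visualise the trade-off).
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# Metrics at the two candidate business cut-offs.
for cutoff in (0.30, 0.70):
    preds = (proba >= cutoff).astype(int)
    print(f"cut-off {cutoff:.2f}: "
          f"precision={precision_score(y_test, preds):.3f}, "
          f"recall={recall_score(y_test, preds):.3f}")
```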
PS: The dataset is available here: https://drive.google.com/file/d/1WeIIugqArR0fdVt7Wpcyj_3KmD4h8Jzb/view?usp=sharing
Shawn Waringu
Data Scientist & Analyst