This project involves analyzing and building predictive models on the German Credit dataset. The goal is to identify the factors that most influence credit decisions and to build a robust pipeline for predicting creditworthiness. The workflow spans data cleaning, exploratory data analysis (EDA), feature engineering, model building, and testing, all integrated into an interactive Streamlit app. The app lets users explore the visualizations and findings of the analysis and use the deployed model for real-time classification and prediction.
Link to Website: German Credit Risk Analysis and Modeling
Conclusion: Achieved 78% accuracy and 77.36% precision using a Support Vector Machine (C=1, kernel='rbf').
The data cleaning and EDA notebook includes:
- Data Loading: Encoding and decoding non-numeric data types for consistency.
- Understanding Data: Analysis of data types and unique values in each column.
  - Adjustments made to ensure uniqueness in `credit_history`, and splitting of `personal_status` into `gender` and `marital_status` columns (sketched below).
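  A minimal sketch of that split, assuming pandas and illustrative category labels (the actual decoded labels in the notebook may differ):

  ```python
  import pandas as pd

  # Illustrative labels; the decoded values in the notebook may differ.
  df = pd.DataFrame({"personal_status": ["male single", "female div/dep/mar"]})

  # Split the combined attribute into separate gender and marital_status columns.
  parts = df["personal_status"].str.split(" ", n=1, expand=True)
  df["gender"], df["marital_status"] = parts[0], parts[1]
  df = df.drop(columns=["personal_status"])
  ```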
- Outlier Detection: Visual exploration of critical attributes (box-plot sketch below).
  - Observations for `duration` and `credit_amount` revealed their importance in distinguishing the target classes.
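  An illustrative box-plot sketch, assuming a hypothetical cleaned CSV and `target` column name:

  ```python
  import matplotlib.pyplot as plt
  import pandas as pd

  df = pd.read_csv("german_credit_cleaned.csv")  # assumed file name

  # Box plots split by target class make class-separating spread
  # and outliers visible at a glance.
  fig, axes = plt.subplots(1, 2, figsize=(10, 4))
  df.boxplot(column="duration", by="target", ax=axes[0])
  df.boxplot(column="credit_amount", by="target", ax=axes[1])
  plt.suptitle("duration and credit_amount by target class")
  plt.show()
  ```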
- Exploratory Data Analysis (EDA):
  - Univariate Analysis: Highlighted the target class imbalance.
  - Bivariate Analysis:
    - Numerical Attributes: Identified significant differences in `duration`, `credit_amount`, and `age` between good and bad credit.
    - Categorical Attributes: Proportional analysis, chosen because of the target class imbalance.
  - Multivariate Analysis: Key findings include correlations between `duration` and `credit_amount`, and between `age` and `residence_since` (heatmap sketch below).
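  A sketch of the multivariate check as a correlation heatmap, again assuming hypothetical file and column names:

  ```python
  import matplotlib.pyplot as plt
  import pandas as pd
  import seaborn as sns

  df = pd.read_csv("german_credit_cleaned.csv")  # assumed file name
  num_cols = ["duration", "credit_amount", "age", "residence_since"]

  # Pairwise Pearson correlations over the numerical attributes.
  sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
  plt.title("Correlation between numerical attributes")
  plt.show()
  ```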
- Feature Importance:
  - Techniques used:
    - Random Forest Classifier (Ensemble Learning - Bagging)
    - Gradient Boosting Classifier (Ensemble Learning - Boosting)
    - Chi-Square Test for Independence
    - ANOVA F-test for Numerical-Categorical Relationships
    - Point Biserial Correlation for Numerical-Binary Relationships
    - Mutual Information for Dependency Assessment
  - Cross-validation of EDA findings against the feature importance rankings (scoring sketch below).
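  A condensed sketch of these scoring techniques with scikit-learn and SciPy, assuming an already-encoded frame whose `target` column is binary 0/1 (file and column names hypothetical):

  ```python
  import pandas as pd
  from scipy.stats import pointbiserialr
  from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
  from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

  df = pd.read_csv("german_credit_cleaned.csv")  # assumed file name
  num_cols = ["duration", "credit_amount", "age"]
  X, y = df.drop(columns=["target"]), df["target"]  # assumed binary 0/1 target

  # Ensemble importances (bagging and boosting).
  rf_imp = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_
  gb_imp = GradientBoostingClassifier(random_state=42).fit(X, y).feature_importances_

  # Chi-square needs non-negative encoded categoricals; the ANOVA F-test and
  # point-biserial correlation apply to the numerical attributes.
  chi2_scores, _ = chi2(X.drop(columns=num_cols), y)
  f_scores, _ = f_classif(X[num_cols], y)
  r, p = pointbiserialr(y, df["duration"])

  # Mutual information captures non-linear dependencies as well.
  mi_scores = mutual_info_classif(X, y, random_state=42)
  ```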
- Data Saving: Cleaned data with important attributes stored for subsequent steps.
The modeling notebook includes:
- Loading Data: Preparing the dataset for modeling.
- Normalizing Numerical Attributes: Ensuring numerical attributes are scaled appropriately.
- Encoding Categorical Variables: Transforming categorical data into numerical format.
- Training and Testing Datasets: Splitting the data into train and test sets (preprocessing and split sketched below).
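  A sketch of these three steps, assuming hypothetical file and column names; the stratified split mirrors the stratification noted in the conclusions below:

  ```python
  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import LabelEncoder, StandardScaler

  df = pd.read_csv("german_credit_cleaned.csv")  # assumed file name
  num_cols = ["duration", "credit_amount", "age"]
  cat_cols = [c for c in df.columns if c not in num_cols + ["target"]]

  # Scale numerical attributes and label-encode the categorical ones.
  df[num_cols] = StandardScaler().fit_transform(df[num_cols])
  for col in cat_cols:
      df[col] = LabelEncoder().fit_transform(df[col])

  X, y = df.drop(columns=["target"]), df["target"]
  # stratify=y preserves the good/bad class ratio in both splits.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42
  )
  ```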
- Elimination of Class Imbalance:
  - Techniques explored:
    - Oversampling the minority class (e.g., duplicating data or using SMOTE).
    - Undersampling the majority class.
    - Combined over- and undersampling.
  - Decision: Used oversampling with SMOTE (sketched below), given the limited amount of data.
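  A minimal SMOTE sketch with imbalanced-learn, continuing from the split above:

  ```python
  from imblearn.over_sampling import SMOTE

  # Oversample only the training split so the test set keeps the
  # original class distribution.
  X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
  print(y_train_res.value_counts())  # classes are now balanced
  ```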
- Hyperparameter Tuning and Model Building:
  - Several configurations were tested:
    - HPT with all combined attributes (EDA + other techniques) and SMOTE.
    - HPT with only EDA-based important attributes and SMOTE.
    - HPT with only other techniques-based important attributes and SMOTE.
    - HPT with all combined attributes (EDA + other techniques) without SMOTE.
    - HPT with only EDA-based important attributes without SMOTE.
    - HPT with only other techniques-based important attributes without SMOTE.
  - Final Conclusion:
    - Models without minority-class oversampling (SMOTE), using stratification during the train-test split, yielded higher accuracy than the other methods.
    - Ranking of methods by test accuracy, with the corresponding models (tuning sketch below):
      1. HPT with only other techniques-based important attributes, without SMOTE: SVM (C=1, kernel='rbf') - Accuracy 78%, Precision 77.36%
      2. HPT with all combined attributes (EDA + other techniques), without SMOTE: SVM (C=1, kernel='rbf') - Accuracy 77.6%, Precision 76.98%
      3. HPT with only EDA-based important attributes, without SMOTE: Logistic Regression (C=0.1) - Accuracy 77%, Precision 75.81%
      4. HPT with only EDA-based important attributes, with SMOTE: Gradient Boosting - Accuracy 74.3%, Precision 73.8%
      5. HPT with only other techniques-based important attributes, with SMOTE: Random Forest - Accuracy 74.3%, Precision 73.17%
      6. HPT with all combined attributes (EDA + other techniques), with SMOTE: Logistic Regression (C=0.1) - Accuracy 73.3%, Precision 76.52%
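A sketch of how such a tuning run could look for the winning configuration, continuing from the split above; the grid shown is illustrative, not the exact one used in the notebooks:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid; the best result reported above was C=1, kernel='rbf'.
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)           # expected to favor {'C': 1, 'kernel': 'rbf'}
print(grid.score(X_test, y_test))  # accuracy on the held-out test set
```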
The testing and deployment notebook includes:
- Testing the models on the raw dataset to assess their generalizability.
- Combining preprocessing, feature engineering, and model prediction into a single pipeline for deployment.
- `label_encoder.pkl`: Stores the label encoders for categorical variables.
- `pipeline.pkl`: Serialized pipeline for deployment (sketched below).
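  A sketch of how such artifacts could be produced; the actual preprocessing lives in `pipeline.py`, and the `label_encoders` dict here is an assumed stand-in:

  ```python
  import joblib
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.svm import SVC

  # Assumed composition: scale numerics, then the tuned SVM.
  pipeline = Pipeline([
      ("scaler", StandardScaler()),
      ("model", SVC(C=1, kernel="rbf", probability=True)),
  ])
  pipeline.fit(X_train, y_train)

  label_encoders = {}  # assumed: dict of fitted LabelEncoders from the encoding step
  joblib.dump(pipeline, "pipeline.pkl")
  joblib.dump(label_encoders, "label_encoder.pkl")
  ```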
- `german-credit-app.py`: Streamlit-based application for interacting with the model (minimal sketch below).
- `pipeline.py`: Contains functions for preprocessing and model inference.
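A minimal sketch of how the app might call the serialized pipeline; the real `german-credit-app.py` defines its own inputs and feature handling:

```python
import joblib
import pandas as pd
import streamlit as st

pipeline = joblib.load("pipeline.pkl")

st.title("German Credit Risk Prediction")
duration = st.number_input("Duration (months)", min_value=1, value=12)
credit_amount = st.number_input("Credit amount", min_value=0, value=1000)

if st.button("Predict"):
    # The deployed pipeline expects the full feature set; only two
    # inputs are shown here to keep the sketch short.
    row = pd.DataFrame([{"duration": duration, "credit_amount": credit_amount}])
    st.write("Predicted class:", pipeline.predict(row)[0])
```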
All dependencies are listed in `requirements.txt` for easy installation.