This project involves analyzing and building predictive models on the German Credit dataset. The goal is to identify the factors that most influence credit decisions and to build a robust pipeline for predicting creditworthiness. The workflow spans data cleaning, exploratory data analysis (EDA), feature engineering, model building, and testing, all integrated into an interactive Streamlit app. The app lets users explore the visualizations and findings of the analysis and use the deployed model for real-time classification and prediction.
Link to Website: German Credit Risk Analysis and Modeling
Conclusion: Achieved 78% accuracy and 77.36% precision using a Support Vector Machine (C=1, kernel='rbf').
The data cleaning and EDA notebook includes:
- Data Loading: Encoding and decoding non-numeric data types for consistency.
- Understanding Data: Analysis of data types and unique values in each column.
  - Adjustments made to ensure uniqueness in `credit_history`, and splitting of `personal_status` into `gender` and `marital_status` columns (sketched below).
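  A minimal sketch of that split, assuming pandas and illustrative category labels (the actual decoded labels in the notebook may differ):

  ```python
  import pandas as pd

  # Illustrative labels; the decoded values in the notebook may differ.
  df = pd.DataFrame({"personal_status": ["male single", "female div/dep/mar"]})

  # Split the combined attribute into separate gender and marital_status columns.
  parts = df["personal_status"].str.split(" ", n=1, expand=True)
  df["gender"], df["marital_status"] = parts[0], parts[1]
  df = df.drop(columns=["personal_status"])
  ```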
- Outlier Detection: Visual exploration of critical attributes (box-plot sketch below).
  - Observations for `duration` and `credit_amount` revealed their importance in distinguishing the target classes.
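  An illustrative box-plot sketch, assuming a hypothetical cleaned CSV and `target` column name:

  ```python
  import matplotlib.pyplot as plt
  import pandas as pd

  df = pd.read_csv("german_credit_cleaned.csv")  # assumed file name

  # Box plots split by target class make class-separating spread
  # and outliers visible at a glance.
  fig, axes = plt.subplots(1, 2, figsize=(10, 4))
  df.boxplot(column="duration", by="target", ax=axes[0])
  df.boxplot(column="credit_amount", by="target", ax=axes[1])
  plt.suptitle("duration and credit_amount by target class")
  plt.show()
  ```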
- Exploratory Data Analysis (EDA):
  - Univariate Analysis: Highlighted the target class imbalance.
  - Bivariate Analysis:
    - Numerical Attributes: Identified significant differences in `duration`, `credit_amount`, and `age` between good and bad credit.
    - Categorical Attributes: Proportional analysis, chosen because of the target class imbalance.
  - Multivariate Analysis: Key findings include correlations between `duration` and `credit_amount`, and between `age` and `residence_since` (heatmap sketch below).
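  A sketch of the multivariate check as a correlation heatmap, again assuming hypothetical file and column names:

  ```python
  import matplotlib.pyplot as plt
  import pandas as pd
  import seaborn as sns

  df = pd.read_csv("german_credit_cleaned.csv")  # assumed file name
  num_cols = ["duration", "credit_amount", "age", "residence_since"]

  # Pairwise Pearson correlations over the numerical attributes.
  sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
  plt.title("Correlation between numerical attributes")
  plt.show()
  ```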
- Feature Importance:
  - Techniques used:
    - Random Forest Classifier (Ensemble Learning - Bagging)
    - Gradient Boosting Classifier (Ensemble Learning - Boosting)
    - Chi-Square Test for Independence
    - ANOVA F-test for Numerical-Categorical Relationships
    - Point Biserial Correlation for Numerical-Binary Relationships
    - Mutual Information for Dependency Assessment
  - Cross-validation of EDA findings against the feature importance rankings (scoring sketch below).
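  A condensed sketch of these scoring techniques with scikit-learn and SciPy, assuming an already-encoded frame whose `target` column is binary 0/1 (file and column names hypothetical):

  ```python
  import pandas as pd
  from scipy.stats import pointbiserialr
  from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
  from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

  df = pd.read_csv("german_credit_cleaned.csv")  # assumed file name
  num_cols = ["duration", "credit_amount", "age"]
  X, y = df.drop(columns=["target"]), df["target"]  # assumed binary 0/1 target

  # Ensemble importances (bagging and boosting).
  rf_imp = RandomForestClassifier(random_state=42).fit(X, y).feature_importances_
  gb_imp = GradientBoostingClassifier(random_state=42).fit(X, y).feature_importances_

  # Chi-square needs non-negative encoded categoricals; the ANOVA F-test and
  # point-biserial correlation apply to the numerical attributes.
  chi2_scores, _ = chi2(X.drop(columns=num_cols), y)
  f_scores, _ = f_classif(X[num_cols], y)
  r, p = pointbiserialr(y, df["duration"])

  # Mutual information captures non-linear dependencies as well.
  mi_scores = mutual_info_classif(X, y, random_state=42)
  ```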
- Data Saving: Cleaned data with important attributes stored for subsequent steps.
The modeling notebook includes:
- Loading Data: Preparing the dataset for modeling.
- Normalizing Numerical Attributes: Ensuring numerical attributes are scaled appropriately.
- Encoding Categorical Variables: Transforming categorical data into numerical format.
- Training and Testing Datasets: Splitting the data into train and test sets (preprocessing and split sketched below).
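  A sketch of these three steps, assuming hypothetical file and column names; the stratified split mirrors the stratification noted in the conclusions below:

  ```python
  import pandas as pd
  from sklearn.model_selection import train_test_split
  from sklearn.preprocessing import LabelEncoder, StandardScaler

  df = pd.read_csv("german_credit_cleaned.csv")  # assumed file name
  num_cols = ["duration", "credit_amount", "age"]
  cat_cols = [c for c in df.columns if c not in num_cols + ["target"]]

  # Scale numerical attributes and label-encode the categorical ones.
  df[num_cols] = StandardScaler().fit_transform(df[num_cols])
  for col in cat_cols:
      df[col] = LabelEncoder().fit_transform(df[col])

  X, y = df.drop(columns=["target"]), df["target"]
  # stratify=y preserves the good/bad class ratio in both splits.
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=42
  )
  ```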
- Elimination of Class Imbalance:
  - Techniques explored:
    - Oversampling the minority class (e.g., duplicating data or using SMOTE).
    - Undersampling the majority class.
    - Combined over- and undersampling.
  - Decision: Used oversampling with SMOTE (sketched below), given the limited amount of data.
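  A minimal SMOTE sketch with imbalanced-learn, continuing from the split above:

  ```python
  from imblearn.over_sampling import SMOTE

  # Oversample only the training split so the test set keeps the
  # original class distribution.
  X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
  print(y_train_res.value_counts())  # classes are now balanced
  ```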
- Hyperparameter Tuning and Model Building:
  - Several configurations were tested:
    - HPT with all combined attributes (EDA + other techniques) and SMOTE.
    - HPT with only EDA-based important attributes and SMOTE.
    - HPT with only other techniques-based important attributes and SMOTE.
    - HPT with all combined attributes (EDA + other techniques) without SMOTE.
    - HPT with only EDA-based important attributes without SMOTE.
    - HPT with only other techniques-based important attributes without SMOTE.
  - Final Conclusion:
    - Models without minority-class oversampling (SMOTE), using stratification during the train-test split, yielded higher accuracy than the other methods.
    - Ranking of methods by test accuracy, with the corresponding models (tuning sketch below):
      1. HPT with only other techniques-based important attributes, without SMOTE: SVM (C=1, kernel='rbf') - Accuracy 78%, Precision 77.36%
      2. HPT with all combined attributes (EDA + other techniques), without SMOTE: SVM (C=1, kernel='rbf') - Accuracy 77.6%, Precision 76.98%
      3. HPT with only EDA-based important attributes, without SMOTE: Logistic Regression (C=0.1) - Accuracy 77%, Precision 75.81%
      4. HPT with only EDA-based important attributes, with SMOTE: Gradient Boosting - Accuracy 74.3%, Precision 73.8%
      5. HPT with only other techniques-based important attributes, with SMOTE: Random Forest - Accuracy 74.3%, Precision 73.17%
      6. HPT with all combined attributes (EDA + other techniques), with SMOTE: Logistic Regression (C=0.1) - Accuracy 73.3%, Precision 76.52%
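A sketch of how such a tuning run could look for the winning configuration, continuing from the split above; the grid shown is illustrative, not the exact one used in the notebooks:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative grid; the best result reported above was C=1, kernel='rbf'.
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)           # expected to favor {'C': 1, 'kernel': 'rbf'}
print(grid.score(X_test, y_test))  # accuracy on the held-out test set
```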
The testing and deployment notebook includes:
- Testing the models on the raw dataset to assess their generalizability.
- Combining preprocessing, feature engineering, and model prediction into a single pipeline for deployment.
- `label_encoder.pkl`: Stores the label encoders for categorical variables.
- `pipeline.pkl`: Serialized pipeline for deployment (sketched below).
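  A sketch of how such artifacts could be produced; the actual preprocessing lives in `pipeline.py`, and the `label_encoders` dict here is an assumed stand-in:

  ```python
  import joblib
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler
  from sklearn.svm import SVC

  # Assumed composition: scale numerics, then the tuned SVM.
  pipeline = Pipeline([
      ("scaler", StandardScaler()),
      ("model", SVC(C=1, kernel="rbf", probability=True)),
  ])
  pipeline.fit(X_train, y_train)

  label_encoders = {}  # assumed: dict of fitted LabelEncoders from the encoding step
  joblib.dump(pipeline, "pipeline.pkl")
  joblib.dump(label_encoders, "label_encoder.pkl")
  ```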
- `german-credit-app.py`: Streamlit-based application for interacting with the model (minimal sketch below).
- `pipeline.py`: Contains functions for preprocessing and model inference.
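A minimal sketch of how the app might call the serialized pipeline; the real `german-credit-app.py` defines its own inputs and feature handling:

```python
import joblib
import pandas as pd
import streamlit as st

pipeline = joblib.load("pipeline.pkl")

st.title("German Credit Risk Prediction")
duration = st.number_input("Duration (months)", min_value=1, value=12)
credit_amount = st.number_input("Credit amount", min_value=0, value=1000)

if st.button("Predict"):
    # The deployed pipeline expects the full feature set; only two
    # inputs are shown here to keep the sketch short.
    row = pd.DataFrame([{"duration": duration, "credit_amount": credit_amount}])
    st.write("Predicted class:", pipeline.predict(row)[0])
```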
All dependencies are listed in `requirements.txt` for easy installation.