In 2021, a Nigerian car insurance company ran a three-month competition on Zindi, an African data science competition platform. The organizer wanted to predict whether or not a client would submit a vehicle insurance claim in the next 3 months. More than 600 competitors took part.
Data
The data consisted of Train (12,000 rows), Test (1,200 rows), a sample submission file, and Nigerian_State_LGA_Name.
Metrics
Submissions were evaluated with the F1 score.
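As a reminder of what this metric rewards, here is a minimal sketch of the F1 score for the positive class, written in plain Python. The function name `f1_score_binary` and the toy labels are illustrative assumptions, not part of the competition code; in practice `sklearn.metrics.f1_score` computes the same quantity.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive prediction: F1 is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (made up for illustration): 1 = claim submitted, 0 = no claim
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1]
score = f1_score_binary(y_true, y_pred)

# Predicting "no claim" for everyone scores 0, however high the accuracy is
all_negative = f1_score_binary([1, 0, 0, 0], [0, 0, 0, 0])
```

Because a model that never predicts a claim gets an F1 of 0, the metric forces attention onto the minority class.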
ML Task
Binary Classification task.
Problems
The dataset was imbalanced.
Some columns had missing values.
The Age column had outliers, including negative values.
Duplicated rows existed despite distinct IDs.
Names in the State and LGA columns were incorrect.
Some duplicated rows had different targets.
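The duplicate problems in this list can be surfaced with a few pandas calls. The sketch below is an illustration on a made-up frame: the column names `ID`, `Age`, `Car_Category`, and `target` are assumptions, not necessarily the real competition schema.

```python
import pandas as pd

# Toy data (hypothetical columns): every ID is distinct, but some rows repeat
df = pd.DataFrame({
    "ID": ["a", "b", "c", "d", "e"],
    "Age": [30, 30, 45, 45, 52],
    "Car_Category": ["Saloon", "Saloon", "JEEP", "JEEP", "Truck"],
    "target": [0, 0, 1, 0, 0],
})

# Compare rows on the features only, ignoring the ID and the target
feature_cols = [c for c in df.columns if c not in ("ID", "target")]

# Rows whose features repeat another row (keep=False marks every copy)
dup_mask = df.duplicated(subset=feature_cols, keep=False)

# Number of distinct targets within each group of identical features.
# Note: with missing values in the key columns, groupby needs dropna=False.
n_targets = df.groupby(feature_cols)["target"].transform("nunique")
conflict_mask = dup_mask & (n_targets > 1)

n_duplicated = int(dup_mask.sum())
n_conflicting = int(conflict_mask.sum())
conflicting_ids = df.loc[conflict_mask, "ID"].tolist()
```

Rows flagged by `conflict_mask` carry contradictory labels for identical inputs, so no model can fit all of them at once.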
Solved
Used RandomOverSampler to oversample the minority class.
Tried imputing missing values with IterativeImputer and KNNImputer.
Took the absolute value of Age to fix negative values.
Deleting duplicated rows lowered my F1 score on the public leaderboard (LB), so I kept them. The private LB later showed that I should have deleted them.
Used the Nigerian_State_LGA_Name dataset to correct the names in the LGA and State columns.
Likewise, I did not fix duplicated rows with different targets.
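Two of the fixes above, the absolute-value correction of Age and random oversampling, can be sketched as follows. This is a pandas-only stand-in for imbalanced-learn's `RandomOverSampler`, shown so the example runs without extra dependencies; the toy data and the `target` column name are assumptions.

```python
import pandas as pd

# Toy training data (made up): one negative age, one minority-class row
train = pd.DataFrame({
    "Age": [34, -27, 45, 51, 29, 38],
    "target": [0, 0, 0, 0, 1, 0],
})

# Fix negative ages by taking the absolute value
train["Age"] = train["Age"].abs()

# Random oversampling: resample minority rows with replacement
# until both classes have the same number of rows
counts = train["target"].value_counts()
majority_label = counts.idxmax()
minority_label = counts.idxmin()
n_extra = int(counts[majority_label] - counts[minority_label])

extra = train[train["target"] == minority_label].sample(
    n=n_extra, replace=True, random_state=0
)
balanced = pd.concat([train, extra], ignore_index=True)
```

Oversampling is normally applied to the training split only, so that copies of minority rows do not leak into validation data and inflate the score.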
Unsolved
Did not pay attention to scaling, transformation, or feature selection, which led to overfitting.
Followed what the public LB suggested about duplicated rows rather than sound ML practice.
Did not use stacking or boosting ensembles effectively.
Algorithms Used
CatBoost for binary classification.
IterativeImputer with ExtraTrees for imputing missing values, after label-encoding the categorical columns.
RandomOverSampler for oversampling the minority class.
Others.
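The imputation step listed above can be sketched with scikit-learn. This is a minimal illustration, not the exact competition pipeline: the columns `Age`, `Gender`, and `Premium`, the toy values, and the hyperparameters are all assumptions. Categorical values are label-encoded with missing entries kept as NaN, imputed with `IterativeImputer` driven by an `ExtraTreesRegressor`, then mapped back to their labels.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

# Toy data (hypothetical columns) with gaps in numeric and categorical fields
df = pd.DataFrame({
    "Age": [34, 27, np.nan, 51, 29, 38, 45, np.nan],
    "Gender": ["Male", "Female", "Male", None, "Female", "Male", None, "Female"],
    "Premium": [1200.0, 800.0, 1500.0, 2000.0, np.nan, 1300.0, 1700.0, 900.0],
})

# Label-encode categorical columns while keeping missing values as NaN
cat_cols = ["Gender"]
categories = {}
encoded = df.copy()
for col in cat_cols:
    codes, uniques = pd.factorize(df[col])  # missing values get code -1
    encoded[col] = np.where(codes == -1, np.nan, codes)
    categories[col] = np.asarray(uniques)

# Each column with gaps is modelled from the other columns by an ExtraTrees regressor
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)

# Round imputed codes to the nearest valid label and decode them
for col in cat_cols:
    codes = imputed[col].round().clip(0, len(categories[col]) - 1).astype(int)
    imputed[col] = categories[col][codes.to_numpy()]
```

Treating label codes as numbers imposes an arbitrary order on the categories, which is one reason a tree-based estimator is a reasonable choice here: it splits on thresholds rather than assuming a linear relationship.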
🛠 Tech Tools
👾
⚙️
💻
About
Nigerian car insurance company competition on the African data science platform "Zindi".