No middle ground
In this post I explain the approach I took to a prediction problem in a competition I entered last year.
The competition involved predicting the outcome of short-term loans (N10,000 - N50,000). It was hosted on Kaggle, thanks to Data Science Nigeria and One Finance & Investment Limited.
I don't intend to go into deep detail; instead, I'll highlight the important steps I took in approaching the problem and the results.
Data
The data used to train and test the model had information about customer demographics, previous loans taken by customers, and current loans. For a complete breakdown of the dataset and the features, check out the report.
Observation
After cleaning the data and carrying out exploratory data analysis, I made a few observations, but I'll only talk about the one important observation that influenced how I approached the modelling phase.
The amount borrowed by customers ranged from N10,000 to N50,000. In the plot below it's evident that most of the loans were N10,000. This observation led me to the NO MIDDLE GROUND approach.
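For reference, here's a minimal sketch of how that distribution check could be done with pandas; the file and column names (`trainperf.csv`, `loanamount`) are placeholders, not necessarily the actual names in the competition data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- the competition data may differ
loans = pd.read_csv("trainperf.csv")

# Count how many loans were taken at each amount and plot the counts
amount_counts = loans["loanamount"].value_counts().sort_index()
amount_counts.plot(kind="bar", title="Number of loans per loan amount")
plt.xlabel("Loan amount (Naira)")
plt.ylabel("Number of loans")
plt.show()
```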
Approach
So after observing that the majority of the loans were N10,000, I thought it would be best to create a separate model for the N10,000 loans and another for the rest. I hoped that splitting the problem and treating it as two separate problems would not only make it easier to understand but also improve accuracy. This approach was inspired by Corne Nagel (Chief Data Scientist at OneFi) during a quick chat at one of the breakfast sessions at the bootcamp. He explained how he sometimes splits models between distinctive groups in some of his experiments, so I decided to give it a shot.
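To make the split concrete, here's a rough sketch of what it could look like with scikit-learn; the column names (`loanamount`, `good_bad_flag`) and the feature list are placeholders rather than the actual competition columns.

```python
from sklearn.tree import DecisionTreeClassifier

features = ["feature_1", "feature_2"]  # placeholder feature names
target = "good_bad_flag"               # hypothetical target column

# "No middle ground": N10,000 loans in one group, everything else in the other
mask_10k = loans["loanamount"] == 10_000
loans_10k, loans_rest = loans[mask_10k], loans[~mask_10k]

# Train a separate decision tree for each group
model_10k = DecisionTreeClassifier().fit(loans_10k[features], loans_10k[target])
model_rest = DecisionTreeClassifier().fit(loans_rest[features], loans_rest[target])
```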
Results
I used the decision tree algorithm to build the models. While the evaluation metric for the competition was classification accuracy, I'm going to show a confusion matrix instead because I believe it gives a truer and clearer picture of how the models performed.
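For completeness, here's a minimal sketch of how a confusion matrix for one of the models could be produced with scikit-learn, assuming a held-out test set for that group (the variable names are placeholders).

```python
from sklearn.metrics import confusion_matrix

# Predict on a hypothetical held-out test set for the N10,000 group
y_pred = model_10k.predict(X_test_10k)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test_10k, y_pred, labels=["good", "bad"]))
```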
The N10,000 model predicted 468 good loans as good and misclassified the remaining 59 good loans as bad.
It accurately predicted 85 bad loans as bad and misclassified the remaining 127 bad loans as good.
The success of the model can only be measured properly against the business problem: what are the trade-offs of correctly predicting bad loans as bad?
For the second model, the picture on good loans was even better: all good loans were accurately predicted. But the business problem is identifying potential bad loans, and it only managed to predict 4 bad loans accurately while misclassifying 85 bad loans as good.
Conclusion
The model created doesn't seem to have performed well; it almost always fails when it comes to accurately predicting bad loans.
This could be because of the class imbalance, i.e. there are far more instances of good loans than bad loans. This is a problem that stems from the business side: in real-life scenarios, bad loans are anomalies and don't happen as frequently as good loans.
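For context, a quick way to check the imbalance, and one common (though by no means the only) way a scikit-learn decision tree can be made to account for it, would look something like the sketch below; the target column name is again a placeholder, and this isn't something the competition write-up covers.

```python
from sklearn.tree import DecisionTreeClassifier

# Check how skewed the classes are (hypothetical column name)
print(loans["good_bad_flag"].value_counts(normalize=True))

# One common mitigation: weight classes inversely to their frequency,
# so the rare "bad" class counts for more during training
balanced_model = DecisionTreeClassifier(class_weight="balanced")
```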
Click here for the full report (more specifics), here for my notebook and finally, here for the competition home page.