Extract the data provided by LendingClub and transform it so that it is usable by predictive models.
We are analyzing data from LendingClub, a peer-to-peer lending services company, and creating a machine learning model that will predict an applicant’s credit risk.
The purpose of the model is to help streamline the loan application process. For the project we will employ different sampling techniques, each paired with a logistic regression model, to account for imbalanced classes. We will then test and compare a balanced random forest classifier and an easy ensemble classifier, two ensemble models designed to reduce bias toward the majority class, and determine whether either of them can be used to consistently predict an applicant’s credit risk.
Naïve random oversampling with logistic regression
SMOTE oversampling with logistic regression
Cluster centroids undersampling with logistic regression
SMOTEENN combination sampling with logistic regression
Balanced random forest classifier
Easy ensemble AdaBoost classifier
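To illustrate what the first technique in the list does, here is a minimal standard-library sketch of naïve random oversampling. It is not the project's code: the function name `random_oversample`, the toy rows, and the `low_risk` / `high_risk` labels are assumptions made for illustration. In practice a library implementation such as `RandomOverSampler` from imbalanced-learn would normally be used instead.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class (illustrative sketch)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        # Rows that belong to this class in the original data
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Sample with replacement to close the gap to the majority count
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 6 low-risk loans and 2 high-risk loans
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = ["low_risk"] * 6 + ["high_risk"] * 2

X_res, y_res = random_oversample(X, y)
print(Counter(y_res))
```

After resampling, both classes have the same number of rows, so a model such as logistic regression trained on the result no longer sees the majority class far more often than the minority class. Resampling should be applied to the training split only, so that the test set keeps the original class balance.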
Of the six machine learning models we used, the easy ensemble classifier showed the best results: it had the highest balanced accuracy score, precision, and recall. For a project like this one, which predicts an applicant’s credit risk, recall is more important than precision. Recall answers the question we care about most: if an applicant is high risk, how likely is our model to classify them as such? Because of this, I would recommend moving forward and implementing the easy ensemble classifier to assist with processing loan applications, as its recall and accuracy score are both high.
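The three metrics compared above can be computed from the counts in a confusion matrix. The sketch below shows the arithmetic for the high-risk class; the function name `risk_metrics` and the counts passed to it are hypothetical and are not results from this project. It assumes every denominator is nonzero.

```python
def risk_metrics(tp, fp, fn, tn):
    """Return precision, recall, and balanced accuracy, treating
    high risk as the positive class (illustrative sketch)."""
    precision = tp / (tp + fp)    # of loans flagged high risk, share that truly are
    recall = tp / (tp + fn)       # of truly high-risk loans, share that were flagged
    specificity = tn / (tn + fp)  # of truly low-risk loans, share that were cleared
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, balanced_accuracy


# Hypothetical counts for an imbalanced test set (made up for illustration)
precision, recall, bal_acc = risk_metrics(tp=90, fp=900, fn=10, tn=9000)
print(f"precision={precision:.3f} recall={recall:.3f} balanced_accuracy={bal_acc:.3f}")
```

With counts like these, recall and balanced accuracy can both be high while precision stays low, because the high-risk class is so small that even a modest false-positive rate outnumbers the true positives. This is why recall, rather than precision, is the deciding metric here.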