Predict customer credit card defaults for next month. Keywords: R, logistic regression, decision tree
Credit card issuers in Taiwan have faced a growing default crisis since 2006, after over-issuing credit cards to unqualified applicants in order to expand market share. This project uses credit card datasets from 2000 to 2003 to identify the key drivers of payment default and to improve the credit scoring system so that it better predicts customers’ likelihood of default.
default_payment_next_month: Will the customer default next month (Yes = 1, No = 0)?
limit_bal: Household credit limit
sex: 1 = male; 2 = female
education: 1 = graduate school; 2 = university; 3 = high school; 4 = others
marriage: 1 = married; 2 = single; 3 = divorced; 0 = others
age: Age in years
pay_april through pay_sept: Payment status for each of the previous 6 months (April - Sept. 2006). -2 = no consumption; -1 = paid in full; 0 = use of revolving credit; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
bill_april through bill_sept: Bill statement amount for each of the previous 6 months (April - Sept. 2006).
pay_amt_april through pay_amt_sept: Amount paid in each of the previous 6 months (April - Sept. 2006).
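The payment status codes above are easy to misread, so a small lookup can help when inspecting individual records. The sketch below is for illustration only: it is written in Python rather than the R used for the analysis, and the function name describe_pay_status is hypothetical. The code-to-label mapping follows the data dictionary above.

```python
def describe_pay_status(code: int) -> str:
    """Translate a pay_* status code into a readable label.

    The mapping follows the data dictionary; the function name is a
    hypothetical helper, not part of the original analysis.
    """
    fixed_labels = {
        -2: "no consumption",
        -1: "paid in full",
        0: "use of revolving credit",
    }
    if code in fixed_labels:
        return fixed_labels[code]
    if 1 <= code <= 8:
        unit = "month" if code == 1 else "months"
        return f"payment delayed {code} {unit}"
    if code == 9:
        return "payment delayed 9 months or more"
    raise ValueError(f"unknown payment status code: {code}")


if __name__ == "__main__":
    for status in (-2, -1, 0, 1, 3, 9):
        print(status, "->", describe_pay_status(status))
```

Raising an error on an unknown code, rather than returning a default label, makes unexpected values visible during data cleaning.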
I first conducted a brief exploratory analysis and found that whether a cardholder defaults next month is correlated with sex, bill amount in September, marital status, and credit limit (please see Figures 5 to 8). The main observations are:
The dataset is quite clean: there are no near-zero-variance variables or missing values.
I hypothesized that cardholders who consistently paid off their statement balance in full will also pay next month’s bill. To test this, I created a new feature called “fullpmt_per”, which measures the percentage of the last 6 months in which the bill was paid in full.
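One way such a feature could be computed is sketched below. This is not the original R code: it is an illustrative Python version that assumes a month counts as "paid in full" when its payment status code is -1, and that months with no consumption (-2) do not count. The exact rule used for “fullpmt_per” in the analysis may differ.

```python
from typing import Sequence


def fullpmt_per(statuses: Sequence[int]) -> float:
    """Percentage of months whose status code is -1 (paid in full).

    Assumption: only code -1 counts as a full payment; the original
    feature may have been defined differently.
    """
    if not statuses:
        raise ValueError("at least one monthly status is required")
    full_months = sum(1 for status in statuses if status == -1)
    return 100.0 * full_months / len(statuses)


if __name__ == "__main__":
    # Six monthly status codes, April through September (made-up values).
    example = [-1, -1, 0, 2, -1, -2]
    print(fullpmt_per(example))  # 3 of 6 months paid in full -> 50.0
```

Passing the six status codes as a plain sequence keeps the sketch independent of any particular column naming.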
I built 6 machine learning models and compared their performance by prediction accuracy. The final model is a simple decision tree, which has the highest prediction accuracy at 82.26%.
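For reference, prediction accuracy is the share of cardholders whose predicted outcome matches the actual outcome. The following is a minimal Python sketch of that calculation, not the evaluation code used in the project; the function name and the sample labels are made up for illustration.

```python
from typing import Sequence


def accuracy(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of predictions that match the actual 0/1 outcomes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


if __name__ == "__main__":
    # Made-up labels: 1 = default next month, 0 = no default.
    actual = [1, 0, 0, 1]
    predicted = [1, 0, 1, 1]
    print(accuracy(actual, predicted))  # 3 of 4 correct -> 0.75
```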
Credit card payment default can stem from a variety of factors. Married men are a high default risk group. The higher the credit limit, the lower the default risk. The final model I adopted, a simple decision tree, provides an interesting interpretation of the key factors driving default. It suggests that if cardholders paid their bill in September, there is an 89% chance they won’t default next month. If they paid in both September and July, there is a 99% chance they will pay their bill next month.
The negative correlation between credit limit and default shows that the current credit scoring system is effective to a certain extent, but clearly not effective enough given the default crisis. I suggest that banks investigate cardholders’ age, sex, marital status, credit limit, and payment history, and reevaluate the weights of these factors in the current credit approval and scoring system. They might also need to build an effective credit card suspension system to cancel the accounts of cardholders who clearly do not have the ability to pay off their debt or who repeatedly and deliberately fail to pay.
Figure 1. Distribution of outcome variable and some features
Figure 2. The distribution of cardholders’ age
Figure 3. Correlations between default_next_month, limit_bal, pay_amt_sept, and bill_sept
Figure 4. Correlations between default_next_month, education, sex, age, and marriage
Figure 5. The relationship between default and credit limit
Figure 6. The relationship between default and sex
Figure 7. The relationship between default and marriage
Figure 8. The relationship between default and bill amount in September
Figure 9. Count of default_next_month for cardholders who paid in full