In [1]:
%load_ext rmagic
import rpy2 as Rpy
In this project, we tackle loan default prediction. Our main goal is to determine whether a loan will default, as well as the loss incurred if it does. Whereas most loan default predictions stop at a binary "yes" or "no", we also capture how severe a default is through the loss incurred. This way, a financial investor can evaluate whether the risk is worth taking, based on how much we believe they stand to lose if the loan is not repaid.
Based on our earlier study, we divide the project into two stages. In the first stage, a binary classifier identifies whether a loan will default. In the second stage, we run a regression on the loans classified as defaults to predict the loss. We use data from Kaggle to apply these methods, and we can submit our results directly to Kaggle to evaluate how well we predict on the test set.
Spencer Tung (spencer.zh.tung@gmail.com)
Shi Wen (wens@cs.ucla.edu)
Zijun Xue (xuezijun@ucla.edu)
Huiting Zhang (vicky.ht.zhang@cs.ucla.edu)
We are going to analyze loan default data from Kaggle (http://www.kaggle.com/c/loan-default-prediction). The training data provided indicates not only whether a person defaulted, but also how much of the loan was lost when they did.
A quick glance at the data reveals that each of the features has been anonymized. That is, we are given almost 800 columns of data reflecting transactions and other financial information about each user, but we cannot intuitively decide which columns "should" carry more weight than others. This will probably make isolating obvious trends more difficult, but it also forces us to rely solely on the patterns in the data, independent of any biases we may have.
The training set includes one column that the testing set does not. This extra column is the loss for each user, expressed as a percentage from 0 to 100. An entry of 0 means the user paid back their loan in full, and an entry of 100 means the user paid back none of it. Any entry between 0 and 100 is the percentage of the total loan that the user failed to pay back before defaulting.
In [8]:
%%R -h 1000 -w 1000
Loan_Data = read.csv( "train_v2.csv", header=TRUE, sep=",")
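As a quick check of how the loss column described above is coded, the cell below (added here for illustration) summarizes it: 0 marks a loan repaid in full, and values from 1 to 100 give the percentage lost on default.
In [ ]:
%%R
# Summarize the loss column: 0 = repaid in full, 1-100 = percentage lost.
print(summary(Loan_Data$loss))
print(table(Loan_Data$loss > 0))        # FALSE = repaid, TRUE = defaulted
print(mean(Loan_Data$loss > 0) * 100)   # default rate, in percent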
In [23]:
%%R
loan_colnames = colnames(Loan_Data)
print(loan_colnames)
As such, the only meaningful graph we are able to produce is plotting the user ID against the loss percentages, which we have done below.
In [22]:
%%R -h 1500 -w 1500
plot(Loan_Data$id, Loan_Data$loss, type = "p", main = "Normalized Default Data",
xlab = "ID Numbers", ylab = "Percentage of Loan Defaulted", cex.main = 2.5, cex.lab = 1.5)
We referred to a number of sources [1, 2, 3] when coming up with a strategy to analyze our data. The bulk of our analysis can be separated into two stages: classification, which identifies the loans that default, and regression, which estimates the loss on those that do.
The loans that did not default all have a loss of 0, and including them in our regression model would heavily skew our predictions. By restricting the regression to the loans classified as defaults, whose losses are guaranteed to lie between 1 and 100, we can build a model that predicts the amount lost.
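As a minimal sketch of this split (our own illustration rather than the exact code behind our submissions), the cell below builds the binary default label for the first stage and the subset of defaulted loans, with losses between 1 and 100, for the second stage.
In [ ]:
%%R
# First-stage target: 1 if the loan defaulted (loss > 0), 0 otherwise.
default_label <- as.numeric(Loan_Data$loss > 0)
print(table(default_label))

# Second-stage data: only the defaulted loans, whose losses lie in 1-100.
Defaulted_Data <- Loan_Data[Loan_Data$loss > 0, ]
print(range(Defaulted_Data$loss))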
Our first step is to decide which of the data to use for classification and regression. Across the almost 800 anonymized features, we ran a pairwise comparison: each pair was checked with the Pearson correlation coefficient, a measure of the linear correlation between two variables. When two variables were highly correlated, one of them was eliminated, leaving about a hundred of the original 800.
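The cell below sketches this filtering step. It is our own reconstruction under stated assumptions: the 0.9 cutoff and the use of caret's findCorrelation() are illustrative choices, not necessarily what produced our final feature set.
In [ ]:
%%R
# Sketch of the pairwise Pearson-correlation filter (the 0.9 cutoff is assumed).
# caret::findCorrelation() flags one member of each highly correlated pair.
library(caret)

feature_cols <- setdiff(colnames(Loan_Data), c("id", "loss"))
features     <- Loan_Data[, feature_cols]
features     <- features[, sapply(features, is.numeric)]   # keep numeric columns only

cor_mat <- cor(features, use = "pairwise.complete.obs")    # Pearson by default
cor_mat[is.na(cor_mat)] <- 0                               # guard against all-NA pairs
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
reduced  <- features[, setdiff(seq_along(features), drop_idx)]
print(ncol(reduced))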
We decided to use a gradient boosting classifier to separate those who defaulted from those who did not, training it on a 1000-observation subset and predicting on the test data.
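Below is a sketch of what this classifier stage could look like with the gbm package; it assumes `reduced` is the correlation-filtered feature set from the previous sketch, and the hyperparameters shown are illustrative rather than the settings behind our submissions.
In [ ]:
%%R
# Sketch of the first-stage gradient boosting classifier (gbm package).
library(gbm)

train_df <- cbind(reduced, default = as.numeric(Loan_Data$loss > 0))
train_df <- train_df[seq_len(min(1000, nrow(train_df))), ]   # small subset, mirroring the 1000-observation runs

gbm_clf <- gbm(default ~ ., data = train_df,
               distribution = "bernoulli",   # binary outcome: default vs. no default
               n.trees = 500, interaction.depth = 3, shrinkage = 0.05,
               verbose = FALSE)

# Probability of default on the training subset; a 0.5 threshold (assumed) gives hard labels.
p_default    <- predict(gbm_clf, newdata = train_df, n.trees = 500, type = "response")
default_pred <- as.numeric(p_default > 0.5)
print(table(default_pred))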
For regression, we used a mix of Gaussian process regression, support vector regression, and gradient boosting regression, again trained on a 1000-observation subset and evaluated on the test data.
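The cell below sketches one way to set up such a blend (our reconstruction: the hyperparameters, the crude NA handling, the small subset, and the equal-weight averaging of the three predictions are all assumptions). It uses kernlab for the Gaussian process, e1071 for the support vector regression, and gbm for the gradient boosting regression, fit on the defaulted loans only.
In [ ]:
%%R
# Sketch of the second-stage regression blend, fit on defaulted loans only.
# Assumes `reduced` is the correlation-filtered feature set from above.
library(kernlab)   # Gaussian process regression: gausspr()
library(e1071)     # support vector regression:   svm()
library(gbm)       # gradient boosting regression: gbm.fit()

is_default <- Loan_Data$loss > 0
X <- as.matrix(reduced[is_default, ])
X[is.na(X)] <- 0                          # crude NA handling, for this sketch only
y <- Loan_Data$loss[is_default]

# Keep the sketch small, in the spirit of the 1000-observation runs described above.
keep <- seq_len(min(1000, nrow(X)))
X <- X[keep, , drop = FALSE]
y <- y[keep]

gp_fit  <- gausspr(x = X, y = y)
svr_fit <- svm(x = X, y = y, type = "eps-regression")
gbm_fit <- gbm.fit(x = as.data.frame(X), y = y, distribution = "gaussian",
                   n.trees = 500, interaction.depth = 3, shrinkage = 0.05,
                   verbose = FALSE)

# Equal-weight average of the three predictions, clamped to the 1-100 loss range.
pred_loss <- (as.vector(predict(gp_fit, X)) +
              predict(svr_fit, X) +
              predict(gbm_fit, as.data.frame(X), n.trees = 500)) / 3
pred_loss <- pmin(pmax(pred_loss, 1), 100)
print(summary(pred_loss))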
A benchmark submission of all zeros was also made for comparison.
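The cell below sketches how such a benchmark file can be produced, assuming the test file is named test_v2.csv (following the training file's naming) and that the submission format uses id and loss columns, as in the competition's sample submission.
In [ ]:
%%R
# All-zeros benchmark: predict a loss of 0 for every test id.
Test_Data  <- read.csv("test_v2.csv", header = TRUE, sep = ",")
submission <- data.frame(id = Test_Data$id, loss = 0)
write.csv(submission, "all_zeros_benchmark.csv", row.names = FALSE)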
Using the 1000-observation training set, we ran all three models at once on the test set to obtain our initial placement on the leaderboard.
Upon increasing our training set to 5000 observations, we did the same as before, running all three fitted models on the test data; this time, we saw a substantial improvement in rank.
Now that we have a rudimentary model in the works, we aim to expand our training set to encompass most, if not all, of the observations. At present, training on just 5000 observations takes about 3 hours, so training on all 100000 data points would take an immense amount of time; we have held off on using every observation until we find a faster method.
[1] Josef Feigl, "My Solution to the Loan Default Prediction Competition."
[2] Junchao Lyu, "Loan Default Prediction Challenge: Description of Solution."
[3] Guocong Song, "Loan Default Prediction at Kaggle."