Banks struggle with many issues these days. One of these is gaining a thorough understanding of their clients: not only their background information matters, but also their behaviour. Banks want to know whether clients are involved in possibly unwanted behaviour (e.g. money laundering or terrorism financing) or are increasing their risk of default (e.g. by missing monthly payments).
Machine learning is starting to help banks investigate their clients, provide new and better credit scores and predict which clients may run into payment issues.
In this project we look at a bank's data. We use an old data set (from 1999) which includes almost all the properties real banks record. Even though this data is not as rich as that of a bank nowadays, it provides a nice introduction to the techniques used by data analysts working for large banks.
The dataset is from the PKDD'99 and can be found here:
http://lisp.vse.cz/pkdd99/berka.htm.
The competition had the following original description:
The bank wants to improve their services. For instance, the bank managers have only vague idea, who is a good client (whom to offer some additional services) and who is a bad client (whom to watch carefully to minimize the bank loses). Fortunately, the bank stores data about their clients, the accounts (transactions within several months), the loans already granted, the credit cards issued. The bank managers hope to improve their understanding of customers and seek specific actions to improve services. A mere application of a discovery tool will not be convincing for them.
Our research will try to answer the following question:
Using a customer's properties and transactional behaviour, can we predict whether or not he/she is able to pay future loan payments?
Before we can attempt an answer, we need to make some assumptions. Some of these will not be directly clear, but will become so once we dive deeper into the data.
In this project, we will use the F1 score to evaluate how well our model can predict future defaults. We will use the finished loans for both the training and testing data. As we will see, we have a relatively small number of defaults, and the F1 score gives us a nice balance between precision and recall, which plain accuracy does not. The F1 score is defined as: $$F_1 = 2\cdot \frac{Precision \cdot Recall}{Precision + Recall}$$ where $$ Precision = \frac{tp}{tp+fp}$$ and $$ Recall = \frac{tp}{tp+fn}$$
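The definitions above can be computed directly from a confusion matrix. A minimal sketch (the counts are made up for illustration, not taken from the dataset):

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)  # tp / (tp + fp)
    recall = tp / (tp + fn)     # tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for a default classifier:
# 10 defaults caught, 5 false alarms, 5 defaults missed.
print(f1_from_counts(tp=10, fp=5, fn=5))  # 0.6666...
```

Note how a classifier that never predicts a default gets recall 0 and therefore F1 of 0, regardless of its accuracy on the (large) non-default class.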
The data for this project is provided on the competition website. In this project, it is downloaded and handled in the `Get_data.py` file. The data has the following structure:

In the `Get_data.py` file we deal with the merging and the checking of the quality of the data. Any reader interested in downloading and checking the data for themselves can follow the steps in that file.
The data is provided to us in a flat text file with `;` as delimiter and `\n` as end-of-line character. Many of the headers/cells in the dataset are provided in Czech and are translated to English for further use. The data is then loaded into pandas dataframes and merged into five files:

- `client_info.cv`: all the info on the client
- `demographic_info`: all the info on the demographics of the Czech Republic
- `Transaction_info`: all transactions made for the accounts in the dataset
- `order_info`: info on the ordered products per client
- `loan_info`: the loan information

All of these files are in further steps reduced to the clients and accounts which have actual loans. These steps are described in `Loan_payment_feature_engineering.ipynb`.
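Parsing a `;`-delimited file like the ones above is straightforward; a minimal stdlib sketch (the column names here are illustrative, the real headers are in Czech):

```python
import csv
import io

# A tiny stand-in for one of the raw files: ';' delimiter, '\n' line endings.
raw = "account_id;date;amount\n1;930101;1000\n2;930105;2500\n"

# csv.DictReader handles the delimiter; pandas.read_csv(..., sep=";")
# does the same job on the real files.
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))
print(rows[0]["amount"])  # '1000'
```

In the project itself this is done with pandas, which also takes care of type inference and the later merges.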
For some relevant features, we can provide plots of the data. These allow us to check the distributions and whether there is a connection to missed payments. In these plots, we compare the distributions for loans with past debt and those without.
We can give some quick statistics and plots:
Feature | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|
Amount | 151410 | 113372 | 4980 | 66732 | 116928 | 210654 | 590820 |
Balance | 44023 | 13793 | 6690 | 33725 | 44879 | 54396 | 79272 |
Birthyear | 1958 | 13 | 1935 | 1947 | 1958 | 1969 | 1980 |
Payments | 4190 | 2215 | 304 | 2477 | 3934 | 5813 | 9910 |
We can see the following things:
With these, we hope to be able to predict which loans will be in default, and which will run along just fine.
In this section, we will try to answer the following questions:
To tackle the first, remember that we want to predict future debt using past debt. Our data is highly skewed, however, as we have 606 accounts without any debt and only 76 with debt. Hence, plain accuracy may not be the best choice. We will go with the F1 score, as it provides a nice balance between precision and recall.
Actually configuring the data to do what we want is quite difficult. As we want to predict future missed payments, we need data from which future payments can be predicted (our past_debt flag only tells us whether there is any past debt). We do, however, have the missed-payment counts for all quarters, which we can use. Thus the setup will be as follows:
We will use the following features:
As for the models, we will use the usual suspects:
Naive Bayes: A model that leverages conditional probability, using known distributions of the data to predict related distributions. It is driven by Bayes' rule: $$ p(C_k \mid \mathbf {x}) = \frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )} $$ or in layman's terms: $$ \mbox{posterior} = \frac {{\mbox{prior}}\times {\mbox{likelihood}}}{\mbox{evidence}} $$ In our case, we will use a Gaussian kernel, which assumes the data is approximately normally distributed, to estimate the probability density function.
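The rule above can be made concrete for a single Gaussian feature. A toy sketch (all numbers are invented for illustration; the project uses sklearn's GaussianNB on the real features):

```python
import math

def gaussian_pdf(x, mean, std):
    """Normal density, used as the class-conditional likelihood."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Invented per-class priors and Gaussian parameters for one feature (loan amount).
prior = {"default": 0.1, "ok": 0.9}
mean = {"default": 250_000.0, "ok": 140_000.0}
std = {"default": 90_000.0, "ok": 50_000.0}

x = 350_000.0  # a new loan amount to classify

# posterior ∝ prior × likelihood; divide by the evidence to normalize.
unnorm = {c: prior[c] * gaussian_pdf(x, mean[c], std[c]) for c in prior}
evidence = sum(unnorm.values())
posterior = {c: unnorm[c] / evidence for c in unnorm}
print(max(posterior, key=posterior.get))  # prints: default
```

With several features, naive Bayes simply multiplies one such likelihood per feature, which is exactly the "naive" independence assumption.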
SVM: A linear SVM tries to lay a line (or, in more dimensions, a hyperplane) between the data in order to create a separation. Each side of the line receives a different classification (in our case, payment missed or not). The line is then optimized to be as robust and error-free as possible.
As a benchmark we will use the Naive Bayes (GaussianNB) classifier. As we will see later, it achieves an F1 score of 0.44. Even though this model is already more advanced than the industry standard, it makes a good baseline.
This project took quite some preprocessing. As described earlier, a lot of effort went into wrestling the data into a single dataset and translating the Czech to English.
At first, the idea was to use the transaction network (with accounts as nodes and transactions as edges) to provide us with a lot of information. The dataset, however, does not use the same key for the receiving and sending parties, making this network very sparsely connected. The small exercise in doing this can be found in the file `Network_analysis.ipynb`. As a matter of fact, the largest connected component only has 7 nodes, as seen in the picture below:
Therefore, I focused on the loan problem at hand. For this I performed many processing steps on the transactions, and especially on the missed payments. I started with all loans and checked whether those with missed payments actually had 'flat' months (months in which no loan payment was made), to verify that the loans were correctly identified. The following picture shows that this is true:
As we assume that all loans are equal, we shift them all back to their start date. We see that all prepayments happen in the first month, and we can use this data to construct our features:
After this, we can take the first 75% of the time for each loan to predict future missed payment, and use the last 25% as the input for the target variable.
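The 75% / 25% time split can be sketched as follows, assuming each loan's monthly records are already aligned to a common start (month 0, 1, 2, ...); the real implementation lives in the feature-engineering notebook:

```python
def split_loan_months(months, frac=0.75):
    """Return (feature_months, target_months) for one loan:
    the first `frac` of the timeline feeds the features, the
    remainder defines the target (future missed payments)."""
    cut = int(len(months) * frac)
    return months[:cut], months[cut:]

months = list(range(24))  # a 24-month loan
features, target = split_loan_months(months)
print(len(features), len(target))  # 18 6
```

Crucially, the split is per loan and in time order, so no information from the target window leaks into the features.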
In the end, we chose not to include card/demographic behaviour, due to the lack of variance in that data.
After all the feature engineering and exploratory analysis shown above, I moved on to the global correlation structure. I rebuilt a plot I often use when working in R to get the first insights.
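Underneath such a plot is just a pairwise Pearson correlation matrix. A minimal stand-in (the sample values below are taken from the summary table earlier and only illustrate the computation, not the full dataset):

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Loan amount and monthly payment tend to move together.
amount = [4980, 66732, 116928, 210654, 590820]
payment = [304, 2477, 3934, 5813, 9910]
print(pearson(amount, payment))
```

On a real dataframe, `df.corr()` in pandas produces the full matrix in one call.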
After this came the phase in which we could start doing machine learning. I decided not to go for a CV scheme: with the low number of actual delayed payments in the data, there was a risk of creating subsets containing no accounts with delayed payments at all.
To do a grid search, I wrote my own little implementation (based on GridSearchCV in sklearn) which does the job without cross-validation. It can be found in `gridSearch.py`. I have used the grid search for the SVM, logistic regression and decision trees. This greatly improved performance compared with naive implementations (which often gave an F1 score of 0.0). In the results section, the best-scoring parameter settings are given, using Occam's razor.
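The idea behind a CV-free grid search is simple: exhaustively try every parameter combination on one fixed train/test split and keep the best score. A stripped-down sketch (the actual code is in `gridSearch.py`; `fit_and_score` stands in for "fit on train, compute F1 on test"):

```python
from itertools import product

def grid_search(param_grid, fit_and_score):
    """Try every combination in param_grid, return (best_params, best_score)."""
    best_params, best_score = None, -1.0
    keys = sorted(param_grid)
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = fit_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy scoring function standing in for a real fit-and-evaluate call.
grid = {"C": [1, 10], "kernel": ["linear", "rbf"]}
params, score = grid_search(
    grid, lambda p: 0.67 if p == {"C": 1, "kernel": "rbf"} else 0.4
)
print(params, score)  # {'C': 1, 'kernel': 'rbf'} 0.67
```

Unlike GridSearchCV, nothing is averaged over folds, which avoids folds that contain no delayed-payment accounts at all.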
As stated above, I will test the following algorithms on the data:
I decided to run a standard Naive Bayes to see how it would perform compared with the other algorithms.
I ran the SVM with both the linear and rbf kernel, a C value of either 1 or 10, and balanced and unweighted classes.
With logistic regression I went with both the L1 and L2 penalties, a C value of either 1 or 10, and balanced and unweighted classes.
For the decision tree, I ran with 0.25, 0.5 and 0.75 max features, combined with a max depth of either 1, 2 or 3.
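The three search spaces above can be spelled out as sklearn-style parameter grids (a sketch; `gridSearch.py` holds the authoritative lists):

```python
from itertools import product

# Parameter grids matching the configurations described in the text.
param_grids = {
    "svm": {"kernel": ["linear", "rbf"], "C": [1, 10],
            "class_weight": [None, "balanced"]},
    "logistic_regression": {"penalty": ["l1", "l2"], "C": [1, 10],
                            "class_weight": [None, "balanced"]},
    "decision_tree": {"max_features": [0.25, 0.5, 0.75],
                      "max_depth": [1, 2, 3]},
}

# Count the combinations each grid produces (8, 8 and 9 respectively).
for name, grid in param_grids.items():
    print(name, len(list(product(*grid.values()))))
```

With only 25 combinations in total, an exhaustive search is cheap, which is why no smarter search strategy was needed.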
The ensemble models (bagging, AdaBoost and random forest) all run on basic settings with 1000 estimators.
For all algorithms, I use the sklearn implementations, combined with the grid search function where it may help. Running the models gave no complications, as my MacBook Pro was sufficiently fast. It was, however, hard to find a model that gave a satisfying F1 score; this will be discussed in more detail in the Results part of this report.
The most difficult part of this project was preprocessing the data into a file that we could use. As loans run through time and have different lengths, loan amounts and start/end dates, the data can't simply be fed to the algorithms.
I also had to choose whether or not to use past missed payments in the classifier. I ended up choosing not to (as the model would then not be useful in an onboarding setting), leaving somewhat troublesome model performance.
Below is a table which shows the F1 score for each of our models. We see that the F1 score is generally low. To obtain these scores, I have tried several different train/test combinations. Over all of these combinations, the best F1 score was 0.67.
Model | Config | Features | Score |
---|---|---|---|
Naive Bayes | Gaussian/default | all | 0.44 |
SVM | {'kernel': 'rbf', 'C': 1, 'class_weight': 'balanced'} | Startyear client, Duration, payments | 0.67 |
Logistic Regression | 'C': 1, 'class_weight': 'balanced', 'penalty': 'l2' | Amount, Payments, Startyear Client, Balance | 0.39 |
Decision Tree | {'max_depth': 3, 'max_features': 0.75} | Amount, Balance | 0.33 |
Bagging | base_estimator: CART, n_estimators: 1000 | all | 0.4 |
AdaBoost | n_estimators: 1000 | all | 0.5 |
Random Forest | n_estimators: 1000 | all | 0.4 |
As our SVM seems to perform the best, we choose it as our final model. In other checks, we found it to be a quite robust method, performing well in combination with the grid search function on various subsets.
We chose the SVM with a radial basis function as kernel and a C parameter (the misclassification/simplicity trade-off) of 1. Combined with balanced class weights (due to the skewness), this provides a solid model.
Testing on different train/test splits gives us a reasonably robust model. The score of only 0.67, however, makes the results not trustworthy enough to use in production. We do have a low benchmark (banks often make such predictions without any machine learning), so we may even call it acceptable (but far from optimal).
I would have liked a better model, and this is certainly work for future improvement. More data and more customers would help in this case.
Our simple SVM outperforms the Naive Bayes classifier, with an F1 score of 0.67 vs 0.44. This is a significant improvement over the benchmark, and we can therefore justify switching to the SVM implementation.
Our final question is whether we can predict if running loans will get into trouble. Remember that we did not use past payment issues as a predictor, so loans which lack this information (newly formed ones) cannot be correctly classified. Therefore, we visualize how we classify running loans based on their past debt. We see that the result is not very good: no loans with past debt are classified correctly.
I started this project hoping to find payment issues using network properties. I found, however, that the data (or any openly available data) was not sufficient to provide both a loan/transactional dataset and interesting relationships between parties. Therefore I settled on a more traditional project.
The research question *Using a customer's properties and transactional behaviour, can we predict whether or not he/she is able to pay future loan payments?* has proven difficult to answer with the given dataset.
A lot of work went into manipulating the data at the start of the project. The data crunching was therefore maybe on the heavy side, and I could have spent more time on machine learning. But it was good to go through all the steps in Python for a change (our default language is R) and see where it may be more suitable than our usual go-to tools.
The machine learning part was therefore somewhat straightforward: different models were tested, but none turned out to be a very good predictor. For this, we may need more samples and features.
The final model is somewhat disappointing, but it is a good example of default machine learning not working for every problem.
The model can be improved in several ways. The data can be cut up differently to generate other features. We can account for the skewness better, either by building a skewed training set or by generating more samples (with bootstraps, autoencoders or a GAN). This may improve the model and provide a better prediction for running loans. In the end, the model would benefit from more personal data, often not disclosed in public datasets, which better captures the properties of clients.
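The simplest of these ideas, a bootstrap of the minority class, is easy to sketch. A minimal version (the dicts stand in for real feature rows; the 606/76 counts come from the class skew described earlier):

```python
import random

def oversample_minority(majority, minority, seed=0):
    """Bootstrap (sample with replacement) the minority class up to parity
    with the majority class, and return the combined training rows."""
    rng = random.Random(seed)
    boot = [rng.choice(minority) for _ in range(len(majority))]
    return majority + boot

ok_loans = [{"label": 0}] * 606   # illustrative stand-ins for feature rows
bad_loans = [{"label": 1}] * 76
balanced = oversample_minority(ok_loans, bad_loans)
print(len(balanced), sum(r["label"] for r in balanced))  # 1212 606
```

This should only ever be applied to the training set; oversampling before the train/test split would leak duplicated defaults into the evaluation.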