Loan payment feature engineering

In this notebook, we engineer the features used to predict whether an account will be unable to pay its loan in the future. We use the following characteristics:

  • Loan characteristics (size, count, payments, etc.)
  • Account characteristics
  • Transactional behavior
  • Demographic information
  • Card info

For the finished loans, we will use the first 75% of the running time to predict whether there is a missed payment in the last 25% of the running time.
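
As a rough illustration of that rule, the cutoff date that separates the feature window from the label window can be derived from a loan's start date and duration. This is only a minimal sketch (it assumes a start-date column 'date' in YYYY-MM-DD format and a 'duration' column in months, and uses a made-up example row); the actual split is implemented further down on the transaction data.

import pandas as pd

# Hypothetical example loan: started 1995-03-05, 24-month duration.
example = pd.DataFrame({'date': ['1995-03-05'], 'duration': [24]})
example['date'] = pd.to_datetime(example['date'])
# 75% of 24 months = 18 months; everything after this date belongs to the label window.
example['cutoff'] = example.apply(
    lambda row: row['date'] + pd.DateOffset(months=int(row['duration'] * 3 // 4)),
    axis=1)
print(example)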

Imports:


In [1]:
import pandas as pd
import numpy as np

Load data:


In [2]:
client_info = pd.read_csv('data/client_info.csv')
demographic_info = pd.read_csv('data/demographic_data.csv')
transaction_info = pd.read_csv('data/transction_info.csv')
order_info = pd.read_csv('data/order_info.csv')
loan_info = pd.read_csv('data/loan_info.csv')


/Users/moeben001/anaconda2/lib/python2.7/site-packages/IPython/core/interactiveshell.py:2717: DtypeWarning: Columns (8) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Loan data

Look at the loan data and see which features can be made.


In [3]:
loan_info['status'].value_counts()


Out[3]:
run_no_problem    403
fin_no_problem    203
run_but_debt       45
fin_unpaid         31
Name: status, dtype: int64
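
These four values encode two things at once: whether the loan is still running and whether there is a payment problem. The notebook filters on the raw values directly; purely as an illustrative sketch (not part of the feature engineering below), the two dimensions could be made explicit like this:

# The statuses grouped along the two dimensions they encode.
running_statuses = ['run_no_problem', 'run_but_debt']
problem_statuses = ['run_but_debt', 'fin_unpaid']
# Example: a boolean flag for accounts whose loan has (or had) a payment problem.
has_problem = loan_info['status'].isin(problem_statuses)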

We start by looking at running loans with payment problems to see how we can model their behavior.

Scope on run_but_debt clients (for plots)


In [4]:
scope = loan_info[loan_info['status']=='run_but_debt']['account_id'].unique()
transaction_scope = transaction_info[transaction_info['account_id'].isin(scope)]

Keep only the loan transactions and pivot.


In [5]:
transaction_scope = transaction_scope[transaction_scope['k_symbol']=='loan']
transaction_scope = transaction_scope.loc[:,['account_id','date','amount']]
transaction_scope = transaction_scope.pivot(index = 'date', 
                                            columns = 'account_id', 
                                            values = 'amount')
transaction_scope = transaction_scope.fillna(0)

Plot all payments.


In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')

In [7]:
trans_plot = transaction_scope.cumsum()
plt.figure()
ax = trans_plot.plot(legend = False)
ax.set_ylabel("amount")


Out[7]:
[Figure: cumulative loan payments per account over time; y-axis: amount]

All of these cumulative payment curves indeed show flat months. Since we assume that loans are paid monthly, the flat stretches appear to be missed payments. Because the loan data gives us the size of every monthly installment (the payments column), we can plot the difference between the amount due and the amount actually paid.
Also, all payments occur on the 12th of the month. Many customers sign the loan contract before the 12th, which shows up in the data as a missed payment even though it is artificial. We assume that loan payments start the next month, so we bump the start date by 12 days whenever the day of the month is before the 12th, removing these spurious missed payments.


In [8]:
from datetime import datetime
payment_diff = transaction_scope.copy()
for acct in scope:
    loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
    loan_date = datetime.strptime(loan_start, '%Y-%m-%d')
    if loan_date.day < 12:
        # Payments start on the 12th of the next month; shift the start past the 12th
        # of the signing month so that month is not counted as a missed payment.
        loan_date = datetime(loan_date.year, loan_date.month, loan_date.day + 12)
    loan_start = loan_date.strftime('%Y-%m-%d')
    loan_order = loan_info[loan_info['account_id']==acct]['payments'].unique()[0]
    # Subtract the monthly installment from every month after the start; a month with
    # no payment then shows up as a negative difference.
    payment_diff.loc[payment_diff.index > loan_start, acct] = payment_diff.loc[payment_diff.index > loan_start,acct] - loan_order

Plot all missed payments.


In [9]:
plt.figure()
ax = payment_diff.plot(legend = False)
ax.set_ylabel("amount")


Out[9]:
[Figure: payment difference (amount due minus amount paid) per account over time; y-axis: amount]

On first inspection, there still appear to be some mismatches around the initial payments. This is not a big deal, as we are only interested in missed payments and not so much in early payments (which may or may not have been refunded).
Note also that in this graph, spikes are single missed payments, while plateaus are multiple consecutive missed payments. The bulk of the missed payments sits at the end of the graph because we are only looking at running loans, which generally started later. We therefore go back to looking at all loans and only later restrict the analysis to the running ones. This essentially means running the same code again, with the adjustment that for finished loans a month with no payment after the loan has ended must not be counted as missed.


In [10]:
scope = loan_info['account_id'].unique()
running = loan_info[loan_info['status']=='run_but_debt']['account_id'].unique()

transaction_scope = transaction_info[transaction_info['account_id'].isin(scope)]
transaction_scope = transaction_scope[transaction_scope['k_symbol']=='loan']

transaction_scope = transaction_scope.loc[:,['account_id','date','amount']]
transaction_scope = transaction_scope.pivot(index = 'date', 
                                            columns = 'account_id', 
                                            values = 'amount')
transaction_scope = transaction_scope.fillna(0)

We now need to be able to add months to a date so that we can compute an end date for every loan in the full dataset. The helper function is borrowed from https://stackoverflow.com/questions/4130922/. Note that this code is somewhat messy, but it will do for now.


In [11]:
from datetime import datetime
import calendar
def add_months(sourcedate, months):
    # Add a number of months to a date, clamping the day to the last day of the
    # target month (e.g. Jan 31 + 1 month -> Feb 28).
    month = sourcedate.month - 1 + months
    year = int(sourcedate.year + month // 12)
    month = month % 12 + 1
    day = min(sourcedate.day,calendar.monthrange(year,month)[1])
    return datetime(year,month,day)

payment_diff = transaction_scope.copy()
for acct in scope:
    if acct in running:
        # Running loans: every month after the (adjusted) start date should contain a payment.
        loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
        loan_date = datetime.strptime(loan_start, '%Y-%m-%d')
        if loan_date.day < 12:
            loan_date = datetime(loan_date.year, loan_date.month, loan_date.day + 12)
        loan_start = loan_date.strftime('%Y-%m-%d')
        loan_order = loan_info[loan_info['account_id']==acct]['payments'].unique()[0]
        payment_diff.loc[payment_diff.index > loan_start, acct] = payment_diff.loc[payment_diff.index > loan_start,acct] - loan_order
    else:
        # Finished loans: only months between the (adjusted) start date and the end date
        # should contain a payment, so missing payments after the end date do not count.
        loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
        loan_date = datetime.strptime(loan_start, '%Y-%m-%d')
        if loan_date.day < 12:
            loan_date = datetime(loan_date.year, loan_date.month, loan_date.day + 12)
        loan_start = loan_date.strftime('%Y-%m-%d')
        loan_date_end = add_months(loan_date,loan_info[loan_info['account_id']==acct]['duration'].unique()[0])
        loan_end = loan_date_end.strftime('%Y-%m-%d')
        monthly_payment = loan_info[loan_info['account_id']==acct]['payments'].unique()[0]
        dates = (payment_diff.index > loan_start) & (payment_diff.index < loan_end)
        payment_diff.loc[dates, acct] = payment_diff.loc[dates, acct] - monthly_payment

Now we can make the plot again. It should show more early missed payments (and some more positive differences, which were prepaid).


In [12]:
plt.figure()
ax = payment_diff.plot(legend = False)
ax.set_ylabel("amount")


Out[12]:
[Figure: payment differences for all loans over time; y-axis: amount]

Missed payments by month into the loan

Now that we know this, we can do the final transformation on the missed payments. Even though we have information on the unemployment rates of 1995 and 1996, we treat all periods as equal. We can now start making features.
We will use the following features:

  • Loan amount
  • Loan duration
  • Number of total missed payments
  • Number of missed payments in first, second, third and last quarter of the loan
  • Max number of consecutively missed payments

Consistent with the 75/25 split described above, we first split and then generate the features on the first 75% of the running time. The last 25% only provides a "missed payment" column which is either 0 or 1 (a sketch of deriving that label is given after the quarterly miss counts below).

Let's first remake the above picture for 'months into loan'.

This means moving all loan starts to month 0. My implementation seems suboptimal, but it will do the job. I could also have shifted all transaction dates and redone the pivoting.


In [13]:
max_duration = loan_info['duration'].max()  # 60 months for this data
monthly_missed = pd.DataFrame(index=range(0,60), columns=list(payment_diff))

In [14]:
for acct in scope:
    if acct in running:
        # Running loans: take every month after the start date.
        loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
        tmp = payment_diff.loc[payment_diff.index > loan_start, acct].tolist()
        tmp = [round(i,0) for i in tmp]
        tmp = tmp + [0]*(60 - len(tmp))  # pad with zeros up to 60 months
        tmp = [int(i) for i in tmp]
        monthly_missed[acct] = tmp
    else:
        # Finished loans: take only the months between the start and end date.
        loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
        loan_date = datetime.strptime(loan_start, '%Y-%m-%d')
        loan_date_end = add_months(loan_date,loan_info[loan_info['account_id']==acct]['duration'].unique()[0])
        loan_end = loan_date_end.strftime('%Y-%m-%d')
        tmp = payment_diff.loc[(payment_diff.index > loan_start) & (payment_diff.index < loan_end), acct].tolist()
        tmp = [round(i,0) for i in tmp]
        tmp = tmp + [0]*(60 - len(tmp))  # pad with zeros up to 60 months
        tmp = [int(i) for i in tmp]
        monthly_missed[acct] = tmp

In [15]:
plt.figure()
ax = monthly_missed.plot(legend = False)
ax.set_ylabel("amount")
ax.set_xlabel("month")


Out[15]:
[Figure: payment differences by month into the loan; x-axis: month, y-axis: amount]

This confirms that the positive payment differences occur in the first month. They are the prepayment mismatch noted earlier (but at least we do not flag payments as missed when they were not).

Let's make the loan table.


In [16]:
df = loan_info.copy()
df.columns = ['loan_id','account_id','start_date','amount','duration','payments','status']

Merge the total misses


In [17]:
# Count the months in which the payment fell short by more than 1.
tmp = pd.DataFrame(monthly_missed.apply(lambda column: (column < -1).sum()))
tmp.columns = ['total_missed']
df = pd.merge(left = df, right = tmp, left_on = 'account_id', right_index = True)

Split the dataset on loan duration


In [18]:
durations = df['duration'].value_counts().to_dict()
quarter_missed = {}
misses = pd.DataFrame(columns = ['first_quarter','second_quarter','third_quarter','last_quarter'])
# For every distinct loan duration, count the missed payments in each quarter of the loan.
for key in durations.keys():
    tmp =  df[df['duration']==int(key)]['account_id'].tolist()
    quarter_missed[key] = monthly_missed.loc[:,tmp]
    tmp_df = pd.DataFrame(quarter_missed[key].loc[range(0,key/4),].apply(lambda column: (column < -1).sum()))
    tmp_df['second_quarter'] = quarter_missed[key].loc[range(key/4, 2*key/4),].apply(lambda column: (column < -1).sum())
    tmp_df['third_quarter'] = quarter_missed[key].loc[range(2*key/4, 3*key/4),].apply(lambda column: (column < -1).sum())
    tmp_df['last_quarter'] = quarter_missed[key].loc[range(3*key/4, 4*key/4),].apply(lambda column: (column < -1).sum())
    tmp_df.columns = ['first_quarter','second_quarter','third_quarter','last_quarter']
    misses = pd.concat([misses,tmp_df])
for col in misses.columns:
    misses[col] = misses[col].astype(int)
df = pd.merge(left = df, right = misses, left_on = 'account_id', right_index = True)
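
Although this notebook does not construct it here, the 0/1 target mentioned earlier (a missed payment in the last 25% of the running time) could now be derived from the last_quarter count. A minimal sketch, with a hypothetical column name 'target' that is not part of the original feature set:

# Hypothetical target: 1 if at least one payment was missed in the last quarter of the loan.
df['target'] = (df['last_quarter'] > 0).astype(int)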

In [19]:
quarter_missed.keys()


Out[19]:
[48, 24, 60, 12, 36]

In [20]:
for col in misses.columns:
    misses[col] = misses[col].astype(int)

Max number of consecutively missed payments


In [21]:
consecutive = monthly_missed.copy()
# Binarize: -1 marks a missed month (shortfall greater than 1), 0 otherwise.
consecutive[consecutive > -2] = 0
consecutive[consecutive < -1] = -1

In [22]:
# From: https://stackoverflow.com/questions/36441521
import itertools
tmplist = []
for acct in consecutive.columns:
    a = consecutive[acct].tolist()
    # Group consecutive equal values into (value, run_length) pairs.
    z = [(x[0], len(list(x[1]))) for x in itertools.groupby(a)]
    # Keep only the runs of missed months (value -1).
    z = [x for x in z if x[0] != 0]
    if len(z) == 0:
        tmp = 0
    else:
        # Length of the longest run of consecutive missed payments.
        tmp = max(z, key=lambda x: x[1])[1]
    tmplist.append(tmp)
max_missed = pd.DataFrame()
max_missed['account_id'] = consecutive.columns
max_missed['max_missed'] = tmplist

In [23]:
df = pd.merge(left = df, right = max_missed, left_on = 'account_id', right_on = 'account_id')

Client characteristics

We look at the following characteristics:

  • Number of clients per account
  • Average balance on the account
  • Birth year of the owner
  • Sex of the owner
  • Start year of the account

Note that for finished loans we will need to cut the data at 75% of the running time whenever we aggregate data over time.

Scope only on the clients with a loan:


In [24]:
client_scope = client_info[client_info['account_id'].isin(scope)]
client_scope.describe()


/Users/moeben001/anaconda2/lib/python2.7/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)
Out[24]:
client_id district_id_client disp_id account_id district_id_branch loan_id amount duration payments card_id
count 827.000000 827.000000 827.000000 827.000000 827.000000 827.000000 827.000000 827.000000 827.000000 170.000000
mean 7171.419589 38.215236 7049.634825 5863.874244 37.621524 6180.746070 151801.538089 36.261185 4231.251511 791.076471
std 4053.288033 25.033481 3925.311133 3277.925506 25.195087 681.292733 114577.041848 17.172800 2238.757849 345.872296
min 2.000000 1.000000 2.000000 2.000000 1.000000 4959.000000 4980.000000 12.000000 304.000000 16.000000
25% 3608.500000 14.000000 3608.500000 2988.000000 13.000000 5585.000000 67536.000000 24.000000 2495.500000 NaN
50% 7011.000000 41.000000 7011.000000 5794.000000 39.000000 6190.000000 115992.000000 36.000000 3937.000000 NaN
75% 10795.500000 60.000000 10487.500000 8757.000000 60.000000 6760.500000 210564.000000 48.000000 5916.000000 NaN
max 13971.000000 77.000000 13663.000000 11362.000000 77.000000 7308.000000 590820.000000 60.000000 9910.000000 1247.000000

Get the number of clients


In [25]:
# The client_id column here holds the count of clients per account.
tmp = pd.DataFrame(client_scope.groupby('account_id').agg('count')['client_id'])
df = pd.merge(left = df, right = tmp, left_on = 'account_id', right_index = True)

Get the average balance on the account

For finished loans, cut away all data after 75% of the running time. NB: this is actually computed from the transaction data.


In [27]:
# Get the scope
trans_scope = transaction_info[transaction_info['account_id'].isin(scope)]
trans_scope = pd.merge(left = trans_scope, right = df.loc[:,['account_id','start_date','duration','status']])
trans_scope['start_date'] = pd.to_datetime(trans_scope['start_date'], format='%Y-%m-%d')
trans_scope['date'] = pd.to_datetime(trans_scope['date'], format='%Y-%m-%d')
# Split the data in running and finished
trans_scope_running = trans_scope[trans_scope['status'].isin(['run_no_problem','run_but_debt'])].copy()
trans_scope_finished = trans_scope[trans_scope['status'].isin(['fin_no_problem','fin_unpaid'])].copy()
# Remove the transactions after the third quarter
trans_scope_finished['cutoff_duration'] = trans_scope_finished['duration']*3/4
trans_scope_finished['cutoff_date'] = trans_scope_finished['start_date'] + \
    trans_scope_finished['cutoff_duration'].values.astype("timedelta64[M]")
trans_scope_finished = trans_scope_finished[trans_scope_finished['cutoff_date'] > trans_scope_finished['date']]
# Remove the helpcolumns
del trans_scope_finished['cutoff_duration']
del trans_scope_finished['cutoff_date']

trans_scope = pd.concat([trans_scope_finished, trans_scope_running])

Calculate the average balance on the account.


In [ ]:
avg_balance = pd.DataFrame(trans_scope.groupby('account_id').agg('mean')['balance'])
df = pd.merge(left = df, right = avg_balance, left_on = 'account_id', right_index = True)

Get the birth year of the owner.


In [29]:
owner_scope = client_scope[client_scope['type_client']=="OWNER"].copy()
owner_scope['birth_number'] = pd.to_datetime(owner_scope['birth_number'], 
                                                format='%Y-%m-%d')
owner_scope['birthyear_owner'] = owner_scope['birth_number'].dt.year
df = pd.merge(left = df, 
              right = owner_scope.loc[:,['account_id','birthyear_owner']], 
              left_on = 'account_id', right_on = 'account_id')

Add the sex of the owner.


In [30]:
owner_scope['female'] = 0
owner_scope.loc[owner_scope.sex == 'F', 'female'] = 1
df = pd.merge(left = df, 
              right = owner_scope.loc[:,['account_id','female']], 
              left_on = 'account_id', right_on = 'account_id')

In [31]:
owner_scope['date_client'] = pd.to_datetime(owner_scope['date_client'], 
                                            format='%Y-%m-%d')
owner_scope['startyear_client'] = owner_scope['date_client'].dt.year
df = pd.merge(left = df, 
              right = owner_scope.loc[:,['account_id','startyear_client']], 
              left_on = 'account_id', right_on = 'account_id')

Transactional behavior

Let's look at all the transactions and compute:

  • The total number of transactions
  • The average volume per transaction
  • The total number of cash transactions
  • The average volume per cash transaction
  • The total number of credit card (cc) withdrawals
  • The average volume per cc withdrawal

NB: We already filtered the finished contracts with the 75/25 split above; a compact alternative sketch for these aggregates follows below.
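
As a side note, the per-operation counts and average volumes could also be computed in one go (a sketch only; the notebook builds the same numbers step by step in the cells below):

# Count and mean amount per account and operation type, spread over the columns.
per_op = trans_scope.groupby(['account_id', 'operation'])['amount'].agg(['count', 'mean']).unstack()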


In [32]:
trans_scope['operation'].value_counts()


Out[32]:
cash_withdrawl    59801
outg_bank         34352
cash_credit       22952
inc_bank           6298
cc_withdrawal       639
Name: operation, dtype: int64

In [33]:
average_volume = pd.DataFrame(trans_scope.groupby('account_id').agg('mean')['amount'])
average_volume.columns = ['average_volume']
total_number = pd.DataFrame(trans_scope.groupby('account_id').agg(lambda x: len(x.unique()))['trans_id'])
total_number.columns = ['total_number_trans']

In [34]:
total_number_cash = pd.DataFrame(trans_scope[trans_scope['operation'].isin(['cash_credit','cash_withdrawl'])].groupby('account_id').agg(lambda x: len(x.unique()))['trans_id'])
total_number_cash.columns = ['total_number_cash']
average_cash_volume = pd.DataFrame(trans_scope[trans_scope['operation'].isin(['cash_credit','cash_withdrawl'])].groupby('account_id').agg('mean')['amount'])
average_cash_volume.columns = ['average_volume_cash']

In [35]:
total_number_cc = pd.DataFrame(trans_scope[trans_scope['operation']=="cc_withdrawal"].groupby('account_id').agg(lambda x: len(x.unique()))['trans_id'])
total_number_cc.columns = ['total_number_cc']
average_cc_volume = pd.DataFrame(trans_scope[trans_scope['operation']=="cc_withdrawal"].groupby('account_id').agg('mean')['amount'])
average_cc_volume.columns = ['average_volume_cc']

In [36]:
df = pd.merge(left = df, right = average_volume, 
              left_on = 'account_id', right_index = True, 
              how ='left')
df = pd.merge(left = df, right = total_number, 
              left_on = 'account_id', right_index = True, 
              how ='left')
df = pd.merge(left = df, right = total_number_cash, 
              left_on = 'account_id', right_index = True, 
              how ='left')
df = pd.merge(left = df, right = average_cash_volume, 
              left_on = 'account_id', right_index = True, 
              how ='left')
df = pd.merge(left = df, right = total_number_cc, 
              left_on = 'account_id', right_index = True, 
              how ='left')
df = pd.merge(left = df, right = average_cc_volume, 
              left_on = 'account_id', right_index = True, 
              how ='left')

In [37]:
df = df.fillna(0)  # accounts without cash or cc transactions get 0 instead of NaN

Demographic information

We do not take everything into account (much of it is not account- or client-specific). We will use:

  • inhabitants
  • urban_ratio
  • average unemployment rate (mean of 1995 and 1996)
  • average number of committed crimes (mean of 1995 and 1996), divided by the number of inhabitants

In [38]:
demographic_info['avg_unemployment'] = demographic_info[['unemployment_rate_1995', 'unemployment_rate_1996']].mean(axis=1)
demographic_info['avg_committed_crimes'] = demographic_info[['committed_crimes_1995', 'comitted_crimes_1996']].mean(axis=1)
# Normalise the crime count per inhabitant; the unemployment rate is already a rate.
demographic_info.loc[:,'avg_committed_crimes'] = demographic_info.loc[:,'avg_committed_crimes']/demographic_info.loc[:,'inhabitants']

In [39]:
df = pd.merge(left = df,
           right = owner_scope.loc[:,['account_id','district_id_client']],
           how = 'left',
           left_on = 'account_id',
           right_on = 'account_id')

In [40]:
df = pd.merge(left = df,
           right = demographic_info.loc[:,['district_id','inhabitants','urban_ratio',
                                           'avg_unemployment','avg_committed_crimes']],
           how = 'left',
           left_on = 'district_id_client',
           right_on = 'district_id')

Card info

  • We will only look at the type of card and try to make two groups. We have no information on what 'junior', 'classic' or 'gold' means, so we check whether the card type has any influence on the loan status (see the sketch after this list).
  • We only have 170 cards among all customers with a loan, and only 5 of those belong to loans with payment problems. We therefore decide to drop the card info for now.
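
A quick way to eyeball a relation between card type and loan status would be a cross-tabulation; this is a sketch only, assuming client_info carries both type_card and the loan status (as the cell below suggests):

# Counts of loan status per card type; rows without a card are ignored.
pd.crosstab(client_info['type_card'], client_info['status'])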

In [41]:
card_scope = client_info.loc[:,['status','type_card']].dropna()
card_scope['status'].value_counts()


Out[41]:
run_no_problem    105
fin_no_problem     60
run_but_debt        3
fin_unpaid          2
Name: status, dtype: int64

Save the data frame:


In [42]:
df.to_csv('data/loan_prep.csv', index = False)