In this notebook, we will engineer the futures to predict whether an account will be unable to pay its loan in the future. We will use the following charasteristics:
For the finished loans, we will use the first 75% of the running time to predict whether there is a missed payment in the last 25% of the running time.
In [1]:
import pandas as pd
import numpy as np
In [2]:
client_info = pd.read_csv('data/client_info.csv')
demographic_info = pd.read_csv('data/demographic_data.csv')
transaction_info = pd.read_csv('data/transction_info.csv')
order_info = pd.read_csv('data/order_info.csv')
loan_info = pd.read_csv('data/loan_info.csv')
In [3]:
loan_info['status'].value_counts()
Out[3]:
We start by looking at running loans with payment problems to see how we can define a model their behavior.
In [4]:
scope = loan_info[loan_info['status']=='run_but_debt']['account_id'].unique()
transaction_scope = transaction_info[transaction_info['account_id'].isin(scope)]
In [5]:
transaction_scope = transaction_scope[transaction_scope['k_symbol']=='loan']
transaction_scope = transaction_scope.loc[:,['account_id','date','amount']]
transaction_scope = transaction_scope.pivot(index = 'date',
columns = 'account_id',
values = 'amount')
transaction_scope = transaction_scope.fillna(0)
In [6]:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
In [7]:
trans_plot = transaction_scope.cumsum()
plt.figure()
ax = trans_plot.plot(legend = False)
ax.set_ylabel("amount")
Out[7]:
All these payments seem to have flat months indeed! As we assume that loans are payed monthly, the flat lines on the map appear to be missed payments. As order_info provides us with the heigth of every montly payment, we can plot the difference between the money ordered and eventually paid.
Also, all payments occur on the 12th of the month. Many customers who sign the contract of the loan before the 12th, therefore providing a missing payment in the data which may be artificial. We assume that loan payments start the next month therefore we bump the date by 12 days if the day in the month is before the 12th, removing our missed payments.
In [8]:
from datetime import datetime
payment_diff = transaction_scope.copy()
for acct in scope:
loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
loan_date = datetime.strptime(loan_start, '%Y-%m-%d')
if loan_date.day < 12:
loan_date = datetime(loan_date.year, loan_date.month, loan_date.day + 12)
loan_start = loan_date.strftime('%Y-%m-%d')
loan_order = loan_info[loan_info['account_id']==acct]['payments'].unique()[0]
payment_diff.loc[payment_diff.index > loan_start, acct] = payment_diff.loc[payment_diff.index > loan_start,acct] - loan_order
In [9]:
plt.figure()
ax = payment_diff.plot(legend = False)
ax.set_ylabel("amount")
Out[9]:
On first inspection, it seems that we missed some initial payments. This is not a big deal as we are only intersted in missed payments and not so much in early payments (which may or may not have been refunded).
Note also that in this graph, spikes are one-time missers, where plateaus are multiple consecutive missed payments.
Now, the main bulk of missed payments is at the end of the graph, this is because we look only at running loans, which are generally started later.
Therefore we choose to revert this to once again looking at all loans and later look at only the running ones. This essentially means running the same code again, with the adjustment that for ended loans, we need to not assume that a zero payment after the startdate is a missed one.
In [10]:
scope = loan_info['account_id'].unique()
running = loan_info[loan_info['status']=='run_but_debt']['account_id'].unique()
transaction_scope = transaction_info[transaction_info['account_id'].isin(scope)]
transaction_scope = transaction_scope[transaction_scope['k_symbol']=='loan']
transaction_scope = transaction_scope.loc[:,['account_id','date','amount']]
transaction_scope = transaction_scope.pivot(index = 'date',
columns = 'account_id',
values = 'amount')
transaction_scope = transaction_scope.fillna(0)
We now have to add some months to have dates for every loan in the full dataset. Function borrowed from: https://stackoverflow.com/questions/4130922/ Note that this code is somewhat messy but it will do for now.
In [11]:
from datetime import datetime
import calendar
def add_months(sourcedate, months):
month = sourcedate.month - 1 + months
year = int(sourcedate.year + month / 12 )
month = month % 12 + 1
day = min(sourcedate.day,calendar.monthrange(year,month)[1])
return datetime(year,month,day)
payment_diff = transaction_scope.copy()
for acct in scope:
if acct in running:
loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
loan_date = datetime.strptime(loan_start, '%Y-%m-%d')
if loan_date.day < 12:
loan_date = datetime(loan_date.year, loan_date.month, loan_date.day + 12)
loan_start = loan_date.strftime('%Y-%m-%d')
loan_order = loan_info[loan_info['account_id']==acct]['payments'].unique()[0]
payment_diff.loc[payment_diff.index > loan_start, acct] = payment_diff.loc[payment_diff.index > loan_start,acct] - loan_order
else:
loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
loan_date = datetime.strptime(loan_start, '%Y-%m-%d')
if loan_date.day < 12:
loan_date = datetime(loan_date.year, loan_date.month, loan_date.day + 12)
loan_start = loan_date.strftime('%Y-%m-%d')
loan_date_end = add_months(loan_date,loan_info[loan_info['account_id']==acct]['duration'].unique()[0])
loan_end = loan_date_end.strftime('%Y-%m-%d')
monthly_payment = loan_info[loan_info['account_id']==acct]['payments'].unique()[0]
dates = (payment_diff.index > loan_start) & (payment_diff.index < loan_end)
payment_diff.loc[dates, acct] = payment_diff.loc[dates, acct] - monthly_payment
Now we can again make a plot. Which should show more earlier missed payments (and some more positives which were prepaid).
In [12]:
plt.figure()
ax = payment_diff.plot(legend = False)
ax.set_ylabel("amount")
Out[12]:
Now that we know this, we can do our final transformation on the missed payments. Even though we have information on the unemployment rates of '95 and '96, we act as if all our periods are equal. Hence we can now start making features.
We will use the features:
As we will provide the split into the 80/20 estimate, we will first split and then generate the features on the first 80%. The last 20% will only be provided a "missed payment" column which is either 0 or 1.
In [13]:
max_duration = loan_info['duration'].max()
monthly_missed = pd.DataFrame(index=range(0,60), columns=list(payment_diff))
In [14]:
for acct in scope:
if acct in running:
loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
tmp = payment_diff.loc[payment_diff.index > loan_start, acct].tolist()
tmp = [round(i,0) for i in tmp]
tmp = tmp + [0]*(60 - len(tmp))
tmp = [int(i) for i in tmp]
monthly_missed[acct] = tmp
else:
loan_start = loan_info[loan_info['account_id']==acct]['date'].unique()[0]
loan_date = datetime.strptime(loan_start, '%Y-%m-%d')
loan_date_end = add_months(loan_date,loan_info[loan_info['account_id']==acct]['duration'].unique()[0])
loan_end = loan_date_end.strftime('%Y-%m-%d')
tmp = payment_diff.loc[(payment_diff.index > loan_start) & (payment_diff.index < loan_end), acct].tolist()
tmp = [round(i,0) for i in tmp]
tmp = tmp + [0]*(60 - len(tmp))
tmp = [int(i) for i in tmp]
monthly_missed[acct] = tmp
In [15]:
plt.figure()
ax = monthly_missed.plot(legend = False)
ax.set_ylabel("amount")
ax.set_xlabel("month")
Out[15]:
This confirms the fact that payment differences which are positive occur in the first month. Hence, this is the prepayment mismatch we have (but at least we don't set ones to missed which aren't).
In [16]:
df = loan_info.copy()
df.columns = ['loan_id','account_id','start_date','amount','duration','payments','status']
In [17]:
tmp = pd.DataFrame(monthly_missed.apply(lambda column: (column < -1).sum()))
tmp.columns = ['total_missed']
df = pd.merge(left = df, right = tmp, left_on = 'account_id', right_index = True)
In [18]:
durations = df['duration'].value_counts().to_dict()
quarter_missed = {}
misses = pd.DataFrame(columns = ['first_quarter','second_quarter','third_quarter','last_quarter'])
for key in durations.keys():
tmp = df[df['duration']==int(key)]['account_id'].tolist()
quarter_missed[key] = monthly_missed.loc[:,tmp]
tmp_df = pd.DataFrame(quarter_missed[key].loc[range(0,key/4),].apply(lambda column: (column < -1).sum()))
tmp_df['second_quarter'] = quarter_missed[key].loc[range(key/4, 2*key/4),].apply(lambda column: (column < -1).sum())
tmp_df['third_quarter'] = quarter_missed[key].loc[range(2*key/4, 3*key/4),].apply(lambda column: (column < -1).sum())
tmp_df['last_quarter'] = quarter_missed[key].loc[range(3*key/4, 4*key/4),].apply(lambda column: (column < -1).sum())
tmp_df.columns = ['first_quarter','second_quarter','third_quarter','last_quarter']
misses = pd.concat([misses,tmp_df])
for col in misses.columns:
misses[col] = misses[col].astype(int)
df = pd.merge(left = df, right = misses, left_on = 'account_id', right_index = True)
In [19]:
quarter_missed.keys()
Out[19]:
In [20]:
for col in misses.columns:
misses[col] = misses[col].astype(int)
In [21]:
consecutive = monthly_missed.copy()
consecutive[consecutive > -2] =0
consecutive[consecutive < -1] = -1
In [22]:
# From: https://stackoverflow.com/questions/36441521
import itertools
tmplist = []
for acct in consecutive.columns:
a = consecutive[acct].tolist()
z = [(x[0], len(list(x[1]))) for x in itertools.groupby(a)]
z = [x for x in z if x[0] !=0]
if len(z)==0:
tmp = 0
else:
tmp = max(z, key=lambda x:x[1])[1]
tmplist.append(tmp)
max_missed = pd.DataFrame()
max_missed['account_id'] = consecutive.columns
max_missed['max_missed'] = tmplist
In [23]:
df = pd.merge(left = df, right = max_missed, left_on = 'account_id', right_on = 'account_id')
We look at the following charasteristics:
Note that we will need to cut the data into 75% running time for finished clients when we accumulate different data.
In [24]:
client_scope = client_info[client_info['account_id'].isin(scope)]
client_scope.describe()
Out[24]:
In [25]:
tmp = pd.DataFrame(client_scope.groupby('account_id').agg('count')['client_id'])
df = pd.merge(left = df, right = tmp, left_on = 'account_id', right_index = True)
In [27]:
# Get the scope
trans_scope = transaction_info[transaction_info['account_id'].isin(scope)]
trans_scope = pd.merge(left = trans_scope, right = df.loc[:,['account_id','start_date','duration','status']])
trans_scope['start_date'] = pd.to_datetime(trans_scope['start_date'], format='%Y-%m-%d')
trans_scope['date'] = pd.to_datetime(trans_scope['date'], format='%Y-%m-%d')
# Split the data in running and finished
trans_scope_running = trans_scope[trans_scope['status'].isin(['run_no_problem','run_but_debt'])].copy()
trans_scope_finished = trans_scope[trans_scope['status'].isin(['fin_no_problem','fin_unpaid'])].copy()
# Remove the transactions after the third quarter
trans_scope_finished['cutoff_duration'] = trans_scope_finished['duration']*3/4
trans_scope_finished['cutoff_date'] = trans_scope_finished['start_date'] + \
trans_scope_finished['cutoff_duration'].values.astype("timedelta64[M]")
trans_scope_finished = trans_scope_finished[trans_scope_finished['cutoff_date'] > trans_scope_finished['date']]
# Remove the helpcolumns
del trans_scope_finished['cutoff_duration']
del trans_scope_finished['cutoff_date']
trans_scope = pd.concat([trans_scope_finished, trans_scope_running])
Calculate the average balance on the account.
In [ ]:
avg_balance = pd.DataFrame(trans_scope.groupby('account_id').agg('mean')['balance'])
df = pd.merge(left = df, right = avg_balance, left_on = 'account_id', right_index = True)
In [29]:
owner_scope = client_scope[client_scope['type_client']=="OWNER"].copy()
owner_scope['birth_number'] = pd.to_datetime(owner_scope['birth_number'],
format='%Y-%m-%d')
owner_scope['birthyear_owner'] = owner_scope['birth_number'].dt.year
df = pd.merge(left = df,
right = owner_scope.loc[:,['account_id','birthyear_owner']],
left_on = 'account_id', right_on = 'account_id')
In [30]:
owner_scope['female'] = 0
owner_scope.loc[owner_scope.sex == 'F', 'female'] = 1
df = pd.merge(left = df,
right = owner_scope.loc[:,['account_id','female']],
left_on = 'account_id', right_on = 'account_id')
In [31]:
owner_scope['date_client'] = pd.to_datetime(owner_scope['date_client'],
format='%Y-%m-%d')
owner_scope['startyear_client'] = owner_scope['date_client'].dt.year
df = pd.merge(left = df,
right = owner_scope.loc[:,['account_id','startyear_client']],
left_on = 'account_id', right_on = 'account_id')
Let's look at all the transactions
NB: We already filtered the finsihed contracts on the 75/25% split above.
In [32]:
trans_scope['operation'].value_counts()
Out[32]:
In [33]:
average_volume = pd.DataFrame(trans_scope.groupby('account_id').agg('mean')['amount'])
average_volume.columns = ['average_volume']
total_number = pd.DataFrame(trans_scope.groupby('account_id').agg(lambda x: len(x.unique()))['trans_id'])
total_number.columns = ['total_number_trans']
In [34]:
total_number_cash = pd.DataFrame(trans_scope[trans_scope['operation'].isin(['cash_credit','cash_withdrawl'])].groupby('account_id').agg(lambda x: len(x.unique()))['trans_id'])
total_number_cash.columns = ['total_number_cash']
average_cash_volume = pd.DataFrame(trans_scope[trans_scope['operation'].isin(['cash_credit','cash_withdrawl'])].groupby('account_id').agg('mean')['amount'])
average_cash_volume.columns = ['average_volume_cash']
In [35]:
total_number_cc = pd.DataFrame(trans_scope[trans_scope['operation']=="cc_withdrawal"].groupby('account_id').agg(lambda x: len(x.unique()))['trans_id'])
total_number_cc.columns = ['total_number_cc']
average_cc_volume = pd.DataFrame(trans_scope[trans_scope['operation']=="cc_withdrawal"].groupby('account_id').agg('mean')['amount'])
average_cc_volume.columns = ['average_volume_cc']
In [36]:
df = pd.merge(left = df, right = average_volume,
left_on = 'account_id', right_index = True,
how ='left')
df = pd.merge(left = df, right = total_number,
left_on = 'account_id', right_index = True,
how ='left')
df = pd.merge(left = df, right = total_number_cash,
left_on = 'account_id', right_index = True,
how ='left')
df = pd.merge(left = df, right = average_cash_volume,
left_on = 'account_id', right_index = True,
how ='left')
df = pd.merge(left = df, right = total_number_cc,
left_on = 'account_id', right_index = True,
how ='left')
df = pd.merge(left = df, right = average_cc_volume,
left_on = 'account_id', right_index = True,
how ='left')
In [37]:
df = df.fillna(0)
In [38]:
demographic_info['avg_unemployment'] = demographic_info[['unemployment_rate_1995', 'unemployment_rate_1996']].mean(axis=1)
demographic_info['avg_committed_crimes'] = demographic_info[['committed_crimes_1995', 'comitted_crimes_1996']].mean(axis=1)
demographic_info.loc[:,'avg_committed_crimes'] = demographic_info.loc[:,'avg_committed_crimes']/demographic_info.loc[:,'inhabitants']
In [39]:
df = pd.merge(left = df,
right = owner_scope.loc[:,['account_id','district_id_client']],
how = 'left',
left_on = 'account_id',
right_on = 'account_id')
In [40]:
df = pd.merge(left = df,
right = demographic_info.loc[:,['district_id','inhabitants','urban_ratio',
'avg_unemployment','avg_committed_crimes']],
how = 'left',
left_on = 'district_id_client',
right_on = 'district_id')
In [41]:
card_scope = client_info.loc[:,['status','type_card']].dropna()
card_scope['status'].value_counts()
Out[41]:
In [42]:
df.to_csv('data/loan_prep.csv', index = False)