Credit Card Defaults

Data points: the data set contains 10,000 records of credit card accounts.

  • Default : Indicates whether the cardholder defaulted.
  • Student : Whether the cardholder is a student.
  • Balance : The cardholder's current outstanding balance.
  • Income : The cardholder's income.

In [1]:
%pylab inline
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np


Populating the interactive namespace from numpy and matplotlib

In [12]:
default=pd.read_csv("Data/Default.csv")
default.head()
default.describe()


Out[12]:
            default       balance        income
count  10000.000000  10000.000000  10000.000000
mean       0.033300    835.374886  33516.981876
std        0.179428    483.714985  13336.639563
min        0.000000      0.000000    771.967729
25%        0.000000    481.731105  21340.462905
50%        0.000000    823.636973  34552.644800
75%        0.000000   1166.308387  43807.729275
max        1.000000   2654.322576  73554.233500

  • Split the data into train and test sets

In [13]:
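# sklearn.cross_validation was removed in scikit-learn 0.20; the function now lives in model_selection
from sklearn.model_selection import train_test_split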

In [15]:
train, test = train_test_split(default, test_size=0.33, random_state=1)

In [21]:
# Make sure both splits are DataFrames (older scikit-learn versions could return plain arrays)
train = pd.DataFrame(data=train, columns=default.columns)
test = pd.DataFrame(data=test, columns=default.columns)

In [20]:
train.hist()


Out[20]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fd0bce8c160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fd0bce9c208>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fd0bc5ee5f8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fd0bcc4e6a0>]], dtype=object)

In [27]:
plt.scatter(train.income, train.balance)
plt.xlim(0, 75000)
plt.ylim(0, 3000)
plt.xlabel('Income')
plt.ylabel('Balance')
plt.title('Income vs Balance')


Out[27]:
<matplotlib.text.Text at 0x7fd0bb63a358>

In [60]:
train.plot(x='balance', y='income', kind='scatter', alpha=0.3)
plt.ylim([0,80000])
plt.xlim([0,3000])


Out[60]:
(0, 3000)

In [61]:
train_default=train[train.default==1]
train_no_default=train[train.default==0]

In [62]:
plt.figure()


Out[62]:
<matplotlib.figure.Figure at 0x7fd0bc9be978>

In [63]:
# Balance is on the x-axis and income on the y-axis
plt.scatter(train_no_default.balance, train_no_default.income, alpha=0.5, marker='+', c='g')
plt.scatter(train_default.balance, train_default.income, marker='o', edgecolors = 'r', facecolors = 'none')
plt.legend( ('No Default', 'Default'), loc='upper right')
plt.ylim([0,80000])
plt.xlim([0,3000])
plt.xlabel('Balance')
plt.ylabel('Income')


Out[63]:
<matplotlib.text.Text at 0x7fd0b97baa20>

Plot the same data again with the marker styles swapped


In [64]:
# Balance is on the x-axis and income on the y-axis
plt.scatter(train_default.balance, train_default.income, alpha=0.5, marker='o', c='g')
# '+' is an unfilled marker, so set its color with c rather than edgecolors/facecolors
plt.scatter(train_no_default.balance, train_no_default.income, marker='+', c='r')
plt.legend( ('Default', 'No Default'), loc='upper right')
plt.ylim([0,80000])
plt.xlim([0,3000])
plt.xlabel('Balance')
plt.ylabel('Income')


Out[64]:
<matplotlib.text.Text at 0x7fd0b9832c88>

Conclusion

  • Balance is more closely related to default than income is, so balance is used as the predictor below.

=============================================

        Logistic Regression

=============================================


In [68]:
# Logistic regression on the balance variable
bal = smf.logit(formula='default ~ balance', data=train).fit()
bal.summary()


Optimization terminated successfully.
         Current function value: 0.080930
         Iterations 10
Out[68]:
                      Logit Regression Results
Dep. Variable:           default    No. Observations:          6700
Model:                     Logit    Df Residuals:              6698
Method:                      MLE    Df Model:                     1
Date:           Tue, 22 Sep 2015    Pseudo R-squ.:           0.4639
Time:                   21:25:05    Log-Likelihood:         -542.23
converged:                  True    LL-Null:                -1011.5
                                    LLR p-value:         4.096e-206

                coef   std err         z    P>|z|   [95.0% Conf. Int.]
Intercept   -10.8119     0.444   -24.348    0.000   -11.682     -9.942
balance       0.0057     0.000    20.752    0.000     0.005      0.006
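
The headline numbers in the summary can be cross-checked by hand. The sketch below uses only values printed above. It assumes that the reported pseudo R-squared is McFadden's (1 - LL / LL-Null) and that the optimizer's "current function value" is the negative log-likelihood averaged over the observations; both are assumptions about the fitting routine, not something the notebook states.

```python
# Values copied from the Logit summary above.
n_obs = 6700
objective = 0.080930      # "Current function value" reported by the optimizer
ll_model = -542.23        # Log-Likelihood of the fitted model
ll_null = -1011.5         # LL-Null (intercept-only model)

# Assumption: the optimizer minimizes the average negative log-likelihood.
ll_from_objective = -n_obs * objective

# McFadden's pseudo R-squared.
pseudo_r2 = 1 - ll_model / ll_null

# Wald z statistic for the intercept: coefficient / standard error.
z_intercept = -10.8119 / 0.444

print(round(ll_from_objective, 2), round(pseudo_r2, 4), round(z_intercept, 2))
```

Small differences from the table (for example in the z statistic) come from using the rounded standard error.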

In [134]:
# Get confidence interval.
bal.conf_int()


Out[134]:
                   0         1
Intercept -11.682211 -9.941545
balance     0.005122  0.006190
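
The interval for the intercept can be reproduced approximately from the summary table. This is a minimal sketch that assumes `conf_int()` returns a Wald interval (coefficient ± critical value × standard error) based on the standard normal distribution; the coefficient and standard error are the rounded values printed in the summary.

```python
from statistics import NormalDist

# Coefficient and standard error for the intercept, from the summary table.
coef, std_err = -10.8119, 0.444

# Two-sided 95% critical value of the standard normal (about 1.96).
z_crit = NormalDist().inv_cdf(0.975)

lower = coef - z_crit * std_err
upper = coef + z_crit * std_err
print(lower, upper)
```

The same check is not possible for `balance` from the printed table alone, because its standard error is rounded to 0.000 there.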

In [122]:
# Odds ratio: exp(beta) is the factor by which the odds of default change per one-unit increase in balance
# http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm
np.exp(bal.params.balance)


Out[122]:
1.005672015611671
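
An odds ratio this close to 1 is easy to misread as "no effect", but it applies to a single unit of balance. The sketch below rescales it to a 100-unit increase, using only the value printed above; the variable names are illustrative and not part of the notebook.

```python
import math

odds_ratio_per_unit = 1.005672015611671   # np.exp(bal.params.balance), from above

# Recover the coefficient: beta = log(odds ratio).
beta_balance = math.log(odds_ratio_per_unit)

# Odds ratios multiply, so a 100-unit increase scales the odds by OR**100,
# which is the same as exp(100 * beta).
odds_ratio_per_100 = odds_ratio_per_unit ** 100

print(beta_balance, odds_ratio_per_100)
```

The result is roughly 1.76, meaning that each additional 100 units of balance raises the odds of default by about three quarters under this model.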

  • Predicted probability of default at various balances.

In [123]:
prob = bal.predict({'balance': [1200, 2000]})
print (prob)


[ 0.01755528  0.62248346]

In [124]:
bal.predict({'balance': 1900})


Out[124]:
array([ 0.48363176])
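
These predictions can be reproduced by hand from the fitted coefficients. The sketch below is a plain-Python check, not the statsmodels implementation: it takes the rounded intercept from the summary table and recovers the balance coefficient from the odds ratio printed earlier. `default_probability` is a hypothetical helper name.

```python
import math

intercept = -10.8119                    # from the summary table (rounded)
beta = math.log(1.005672015611671)      # log of the odds ratio printed earlier

def default_probability(balance):
    """Logistic model: p = 1 / (1 + exp(-(intercept + beta * balance)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + beta * balance)))

for balance in (1200, 1900, 2000):
    print(balance, round(default_probability(balance), 4))

# Balance at which the predicted probability crosses 0.5 (log-odds = 0).
balance_at_half = -intercept / beta
print(round(balance_at_half))
```

The crossover lands a little above 1,900, which agrees with the predicted probability of about 0.48 at a balance of 1900.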

In [125]:
x = np.linspace(test.balance.min(), test.balance.max(), 500)
betas = [bal.params.Intercept, bal.params.balance]

In [126]:
y = np.exp(betas[0] + betas[1]*x) / (1 + np.exp(betas[0] + betas[1]*x))
odds = np.exp(betas[0] + betas[1]*x)
log_odds = betas[0] + betas[1]*x
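
The three quantities computed above are the same model on three scales. Writing x for the balance, β₀ for `betas[0]` and β₁ for `betas[1]`, the relationships are:

```latex
\begin{aligned}
p(x) &= \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}, \\
\frac{p(x)}{1 - p(x)} &= e^{\beta_0 + \beta_1 x}, \\
\log\frac{p(x)}{1 - p(x)} &= \beta_0 + \beta_1 x .
\end{aligned}
```

Only the last form is linear in x, which is why the log-odds plot is a straight line while the probability plot is S-shaped.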

In [132]:
plt.plot(x, y, 'r', linewidth=2)
plt.xlabel('Balance')
plt.ylabel('Probability')
plt.text(500, 0.7, r'$\frac{e^{\beta_0 + \beta_1x}}{1+e^{\beta_0 + \beta_1x}}$', fontsize=25)


Out[132]:
<matplotlib.text.Text at 0x7fd0b87ea9e8>

In [129]:
plt.plot(x, odds, 'k', linewidth=2)
plt.xlabel('Balance')
plt.ylabel('Odds')
plt.text(500, 30, r'$e^{\beta_0 + \beta_1x}$', fontsize=20)


Out[129]:
<matplotlib.text.Text at 0x7fd0b8925748>

In [133]:
plt.plot(x, log_odds, 'c', linewidth=2)
plt.ylabel('Log(Odds)')
plt.xlabel('Balance')
plt.text(500, 2, r'$\beta_0 + \beta_1x$', fontsize=15)


Out[133]:
<matplotlib.text.Text at 0x7fd0b8823390>

In [90]:
x_data = pd.DataFrame({'balance': np.linspace(test.balance.min(), 
                                              test.balance.max(),
                                              500)})
y_data = bal.predict(x_data)

In [131]:
plt.figure()
plt.scatter(test.balance, test.default, alpha=0.5)
plt.plot(x_data.balance, y_data, 'r', linewidth=2)
plt.xlabel("Balance")
plt.ylabel("Probability of Default")
plt.ylim([-0.05,1.05])
plt.xlim([0, 2800])

# Connect the predicted probabilities at balances 1200 and 2000 (the values used to compute prob)
plt.plot([1200, 2000], prob, 'y')


Out[131]:
[<matplotlib.lines.Line2D at 0x7fd0b888b240>]
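
The plot compares the fitted curve with the test data visually. A natural next step is to count how often thresholded predictions agree with the observed labels. The sketch below is one way to do that in plain Python; `confusion_counts` and the 0.5 threshold are illustrative choices, and the toy values are made up for the demonstration rather than taken from the Default data.

```python
def confusion_counts(y_true, p_hat, threshold=0.5):
    """Count TP, FP, TN, FN for 0/1 labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for actual, prob_i in zip(y_true, p_hat):
        predicted = 1 if prob_i >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}

# Toy values for illustration only; these are NOT results from the Default data.
toy_labels = [0, 0, 1, 1, 0]
toy_probs = [0.02, 0.60, 0.75, 0.30, 0.10]
counts = confusion_counts(toy_labels, toy_probs)
accuracy = (counts["tp"] + counts["tn"]) / len(toy_labels)
print(counts, accuracy)
```

With the notebook's objects this would presumably be called as `confusion_counts(test.default, bal.predict(test))`. Because only about 3.3% of accounts default, accuracy alone is a weak yardstick: always predicting "no default" would already be right for roughly 96.7% of the full data set, so the false-negative count deserves separate attention.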
