```
In [28]:
```import pandas as pd
import numpy as np
import pylab as pl
import seaborn as sns
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pipeline.util as u
import pipeline.process as pr
import pipeline.read as r
import pipeline.explore as x
import pipeline.evaluate as ev
%matplotlib inline

```
In [3]:
```data = r.read_csv('data/credit-data.csv', parse_zipcodes=["zipcode"], dtype ={"SeriousDlqin2yrs": "category","PersonID":"category"})
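# Relabel the outcome categories (assigning to .cat.categories is not supported in recent pandas)
data["serious_dlqin2yrs"] = data.serious_dlqin2yrs.cat.rename_categories(["Nondelinquent", "Delinquent"])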

```
In [4]:
```summary = x.summary_by_outcome(data, "serious_dlqin2yrs")
# .ix was removed from pandas; .iloc selects the same columns by position
summary.iloc[:, [0,1,2,4,6,7,8,10]]

```
Out[4]:
```

```
In [5]:
```null_income = u.check_nulls(data, "monthly_income")
null_sum = x.summary_by_outcome(null_income, "serious_dlqin2yrs")
not_null_income = u.get_notnulls(data, "monthly_income")
not_null_sum = x.summary_by_outcome(not_null_income, "serious_dlqin2yrs")
# Ratio of summary statistics: rows with missing income vs. rows with income present
cols = [0,1,2,5,6,7,8,11]
ratio = round(null_sum.iloc[:, cols] / not_null_sum.iloc[:, cols], 2)
ratio.drop("monthly_income")

```
Out[5]:
```

```
In [47]:
```sns.set(style="white")
sns.pairplot(data, vars=["debt_ratio","revolving_utilization_of_unsecured_lines","number_real_estate_loans_or_lines"],\
hue="serious_dlqin2yrs", height=3, plot_kws={'alpha':0.2})  # `size` was renamed `height` in seaborn 0.9
pl.suptitle("Paired distribution of indicators by delinquency status (keeping NAs)")

```
Out[47]:
```

```
In [48]:
```sns.pairplot(u.get_notnulls(data,"monthly_income"), vars=["monthly_income","debt_ratio","revolving_utilization_of_unsecured_lines","number_real_estate_loans_or_lines"],\
hue="serious_dlqin2yrs", height=3, plot_kws={'alpha':0.2})  # `size` was renamed `height` in seaborn 0.9
pl.suptitle("Paired distribution of indicators by delinquency status (removing NAs)")

```
Out[48]:
```

The correlation plot shows very strong correlation among the different past-due categories. Age and the number of open credit lines and loans are weakly negatively correlated with the past-due categories and uncorrelated with debt ratio, income, and number of dependents. Debt ratio, income, and number of dependents are positively correlated with the number of open credit lines and real estate loans.
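The kind of reading described above can be reproduced on a small scale with plain pandas. The sketch below is illustrative only: the column names and synthetic data are made up, and it does not show how `x.correlation_plot` is implemented. It computes a Pearson correlation matrix and ranks each pair of columns by the strength of its correlation.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical columns, not the credit dataset)
rng = np.random.default_rng(0)
n = 500
late_30 = rng.poisson(0.5, n)
toy = pd.DataFrame({
    "late_30_59": late_30,
    # Built to track late_30_59 closely, mimicking related past-due counts
    "late_60_89": late_30 + rng.binomial(1, 0.1, n),
    # Drawn independently, so it should be nearly uncorrelated with the rest
    "age": rng.integers(21, 80, n),
})

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()

# Keep each pair once (upper triangle, excluding the diagonal),
# then sort by absolute correlation, strongest first
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs)
```

Listing the pairs in this way makes it easier to check statements such as "strongly correlated" or "uncorrelated" against actual coefficients rather than colors in a heatmap.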

```
In [49]:
```x.correlation_plot(data)
pl.suptitle("Correlation matrix")

```
Out[49]:
```

```
In [25]:
```print("Table: Average feature values by zip code")
x.summary_by_outcome(data, "zipcode").iloc[:,1::6]

```
Out[25]:
```

```
In [4]:
```data["debt_ratio_groups"] = pr.cut(data.debt_ratio, [0,0.25,.5,.75,1], labels="auto")
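# Capping debt_ratio at its 99th percentile with apply is too slow: the quantile is recomputed for every row.
# Computing the quantile once, or using a vectorized Series.clip(upper=...), would avoid that
# (assuming pr.cap_values simply caps a value at the given threshold).
#data.debt_ratio = data.debt_ratio.apply(lambda x: pr.cap_values(x,data.debt_ratio.quantile(.99)))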

```
In [5]:
```na_cols = ["monthly_income","number_of_dependents"]
data[na_cols] = pr.fill_with_mean(df=data,col=na_cols, group="serious_dlqin2yrs")

```
In [6]:
```data.number_of_dependents.value_counts()
data["number_of_dependents_cut"] = pr.cut(data.number_of_dependents, [0,.99,3.01,20.1], \
method=pd.cut, labels=["No dependents","1-3 dependents", "4+ dependents"],include_lowest=True)
data = pr.get_dummies(data.number_of_dependents_cut, data)

```
In [23]:
```#sample = u.get_subsample(data, 5000)
features = ["No dependents","1-3 dependents", "4+ dependents","debt_ratio_groups",\
"monthly_income","number_of_time30-59_days_past_due_not_worse"]
X = data[features]
y = data.serious_dlqin2yrs
# confusion_matrix orders labels by sorted value, so sort the class names to match
class_names = sorted(y.unique())
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

```
In [24]:
```clf = svm.SVC()
clf.fit(X_train,y_train)

```
Out[24]:
```

```
In [25]:
```y_hat = clf.predict(X_test)

```
In [29]:
```#import itertools
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_hat)
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
pl.figure()
ev.plot_confusion_matrix(cnf_matrix, classes=class_names,
title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
pl.figure()
ev.plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
title='Normalized confusion matrix')
pl.show()

```
```

```
In [30]:
```ev.get_accuracy(cnf_matrix)

```
Out[30]:
```

```
In [32]:
```ev.get_recall(cnf_matrix)

```
Out[32]:
```

```
In [34]:
```ev.get_precision(cnf_matrix)

```
Out[34]:
```

I ran my features through a support vector machine, a reasonable model for a classification problem like this. The model's accuracy was 94.3 percent. For perspective, simply guessing that every person was nondelinquent would give an accuracy of 93.3 percent. The precision was 95 percent, but the recall of 17.6 percent captures the model's shortcoming. Our goal was to identify potentially delinquent people and support them through some sort of policy initiative, and the model misses most of them.
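To make the relationship between these numbers concrete, here is a minimal sketch of how accuracy, precision, recall, and the "always guess the majority class" baseline can be derived from a 2x2 confusion matrix. The function name `metrics_from_confusion` and the counts are hypothetical; this is not the code behind `ev.get_accuracy`, `ev.get_recall`, or `ev.get_precision`, and the counts are not the notebook's results.

```python
import numpy as np

def metrics_from_confusion(cm, positive=1):
    """Accuracy, precision, recall, and majority-class baseline accuracy.

    Assumes scikit-learn's layout: rows are true classes, columns are
    predicted classes. `positive` is the index of the class of interest.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = cm[positive, positive]
    fp = cm[:, positive].sum() - tp   # predicted positive, actually negative
    fn = cm[positive, :].sum() - tp   # actually positive, predicted negative
    return {
        "accuracy": np.trace(cm) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        # Accuracy obtained by always predicting the most common true class
        "baseline_accuracy": cm.sum(axis=1).max() / total,
    }

# Illustrative counts only: 940 true negatives-class rows, 60 positive-class rows
toy_cm = [[935, 5],
          [45, 15]]
result = metrics_from_confusion(toy_cm, positive=1)
print(result)
```

With imbalanced classes like these, accuracy sits barely above the baseline even when most positive cases are missed, which is why recall is the more informative number for this goal.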