Model Evaluation


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

In [7]:
df = pd.read_csv('data/historical_loan.csv')

In [8]:
df.head()


Out[8]:
default amount grade years ownership income age
0 0 1000 B 2.0 RENT 19200.0 24
1 1 6500 A 2.0 MORTGAGE 66000.0 28
2 0 2400 A 2.0 RENT 60000.0 36
3 0 10000 C 3.0 RENT 62000.0 24
4 1 4000 C 2.0 RENT 20000.0 28

Preprocessing the Data


In [9]:
# Impute missing values in 'years' with the column mean
df.years = df.years.fillna(np.mean(df.years))

In [10]:
#Load the preprocessing module
from sklearn import preprocessing

In [11]:
categorical_variables = df.dtypes[df.dtypes=="object"].index.tolist()

In [12]:
categorical_variables


Out[12]:
['grade', 'ownership']

In [13]:
# Integer-encode each categorical column with a LabelEncoder
for i in categorical_variables:
    lbl = preprocessing.LabelEncoder()
    lbl.fit(list(df[i]))
    df[i] = lbl.transform(df[i])

In [14]:
df.head()


Out[14]:
default amount grade years ownership income age
0 0 1000 1 2.0 3 19200.0 24
1 1 6500 0 2.0 0 66000.0 28
2 0 2400 0 2.0 3 60000.0 36
3 0 10000 2 3.0 3 62000.0 24
4 1 4000 2 2.0 3 20000.0 28

In [15]:
X = df.iloc[:,1:8]

In [16]:
y = df.iloc[:,0]

Accuracy Metrics

  • Misclassification Rate
  • Confusion Matrix
  • Precision & Recall
  • ROC
  • AUC

Misclassification Rate

The most basic evaluation metric is the accuracy score (the misclassification rate is simply $1 - accuracy(y, \hat{y})$). If $\hat{y}_i$ is the predicted value of the $i$-th sample and $y_i$ is the corresponding true value, then the fraction of correct predictions over $n_\text{samples}$ is defined as

$$ accuracy(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i) $$
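
Scikit-learn implements this fraction directly as metrics.accuracy_score; a minimal sketch with made-up labels, purely for illustration:

from sklearn import metrics

# Hypothetical true labels and predictions, just to illustrate the formula
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# 4 of the 5 predictions match, so the accuracy is 0.8
print(metrics.accuracy_score(y_true, y_pred))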

Confusion Matrix

The confusion matrix evaluates the quality of the output of a classifier by tabulating predicted labels against true labels.

                   Predicted - Yes     Predicted - No
  Actual - Yes     True Positive       False Negative
  Actual - No      False Positive      True Negative

The diagonal elements represent the number of points for which the predicted label is equal to the true label, while off-diagonal elements are those that are mislabeled by the classifier. The higher the diagonal values of the confusion matrix the better, indicating many correct predictions.
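
A minimal sketch of how this table comes out of scikit-learn, again with made-up labels (1 = "yes", 0 = "no"):

from sklearn import metrics

# Hypothetical labels: 1 = "yes" (positive class), 0 = "no"
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes; note that sklearn
# sorts the classes as [0, 1], so row/column 0 is "no" and row/column 1 is "yes"
print(metrics.confusion_matrix(y_true, y_pred))
# [[2 1]
#  [1 2]]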

Precision

The precision is the ratio TP / (TP + FP) where TP is the number of true positives and FP the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative. The best value is 1 and the worst value is 0.

Recall

The recall is the ratio TP / (TP + FN) where TP is the number of true positives and FN the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.
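
Both quantities are available directly from sklearn.metrics; a minimal sketch, reusing the made-up labels from the confusion-matrix example above:

from sklearn import metrics

# Hypothetical labels, 1 = positive class
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(metrics.precision_score(y_true, y_pred))  # TP / (TP + FP) = 2 / 3
print(metrics.recall_score(y_true, y_pred))     # TP / (TP + FN) = 2 / 3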

Receiver Operating Characteristic (ROC) Curve

“A receiver operating characteristic (ROC), or simply ROC curve, is a graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (TPR = true positive rate) vs. the fraction of false positives out of the negatives (FPR = false positive rate), at various threshold settings.”

  • True Positive Rate (sensitivity) -> TPR = TP / (TP+FN)
  • False Positive Rate (1 - specificity) -> FPR = FP / (FP+TN)

Area Under Curve (AUC)

The AUC is the area under the receiver operating characteristic (ROC) curve, also written AUROC. By computing the area under the ROC curve, the information in the curve is summarized in a single number, which makes it convenient for comparing classifiers.
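
A minimal sketch, assuming we have predicted probabilities for the positive class (any classifier with predict_proba, such as the decision trees below, provides them): metrics.roc_curve returns one (FPR, TPR) pair per threshold, and metrics.auc integrates the resulting curve.

from sklearn import metrics

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
print(metrics.auc(fpr, tpr))                   # 0.75
print(metrics.roc_auc_score(y_true, y_score))  # same number, computed in one call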

How to Choose the Error Metric

  • What is the objective for the problem?
  • Which error metric would best help address that problem?

Build Models and Evaluate


In [17]:
from sklearn import tree
from sklearn import metrics

In [18]:
def model_evaluation(data, target, model, model_name):
    model_fit = model.fit(data, target)
    pred = model_fit.predict(data)
    proba = model_fit.predict_proba(data)
    
    fpr, tpr, thresholds = metrics.roc_curve(target, proba[:,1])
    roc_auc = metrics.auc(fpr, tpr)
    
    print("Model: %s" % model_name)

    # Scores for the model
    print("accuracy: %.3f" % metrics.accuracy_score(target, pred))
    print("recall: %.3f" % metrics.precision_score(target, pred))
    print("precision: %.3f" % metrics.recall_score(target, pred))
    print("confusion_matrix:")
    print(metrics.confusion_matrix(target, pred))
    print("auc: %.3f" % metrics.auc(fpr, tpr))
    
    # ROC Curve
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.1,1.2])
    plt.ylim([-0.1,1.2])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
    
    return roc_auc

Benchmark Model

Our benchmark is the simplest possible model: a decision tree of depth 1, i.e. a single split (a decision stump). To be useful, any model we build needs to improve on this baseline.


In [19]:
benchmark = tree.DecisionTreeClassifier(max_depth = 1)

In [20]:
benchmark


Out[20]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [21]:
model_evaluation(X, y, benchmark, "benchmark")


Model: benchmark
accuracy: 0.632
precision: 0.634
recall: 0.544
confusion_matrix:
[[2869 1161]
 [1686 2011]]
auc: 0.628
Out[21]:
0.62793261386235633

Decision Tree Model - Shallow


In [22]:
Shallow = tree.DecisionTreeClassifier(max_depth=10)

In [23]:
Shallow


Out[23]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [24]:
model_evaluation(X, y, Shallow, "Shallow")


Model: Shallow
accuracy: 0.749
precision: 0.744
recall: 0.725
confusion_matrix:
[[3107  923]
 [1018 2679]]
auc: 0.838
Out[24]:
0.83778122023691659

Decision Tree Model - Full


In [25]:
Full = tree.DecisionTreeClassifier()

In [26]:
Full


Out[26]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [27]:
model_evaluation(X, y, Full, "Full")


Model: Full
accuracy: 1.000
precision: 1.000
recall: 0.999
confusion_matrix:
[[4030    0]
 [   2 3695]]
auc: 1.000
Out[27]:
0.99999986576199196

K-Fold Cross Validation

So far we have been evaluating our metrics on the training data. This violates an important modelling principle: never evaluate a model on the same data it was fit to, because the resulting scores will be overly optimistic (the full tree scoring an AUC of 1.0 above is a symptom of exactly this). Instead, split the data, fit the model on one part, and evaluate it on the other. A popular technique for this is k-fold cross-validation: the data is divided into k folds, each fold is held out once for evaluation while the model is fit on the remaining k-1 folds, and the k scores are averaged to smooth out random variation.
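
For a quick version of this idea, cross_val_score runs the split/fit/score cycle automatically; a minimal sketch, reusing the Shallow tree defined above and scoring each fold by AUC (the hand-rolled loop below does the same splitting but also draws the per-fold ROC curves):

from sklearn.model_selection import cross_val_score

# Five-fold cross-validated AUC: each score comes from a model fit on four folds
# and evaluated on the held-out fifth fold
scores = cross_val_score(Shallow, X, y, cv=5, scoring='roc_auc')
print(scores)
print(scores.mean())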

Update the ROC using k-fold validation


In [28]:
from sklearn.model_selection import StratifiedKFold

In [29]:
def model_evaluation_crossval(data, target, model, model_name): 
    data = np.array(data)
    target = np.array(target)
    cv = StratifiedKFold(n_splits=5)

    # Create the color options  
    cmap = plt.get_cmap('viridis')
    indices = np.linspace(0, cmap.N, 5)
    colors = [cmap(int(i)) for i in indices]
    
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    
    # intiate plot
    plt.figure(figsize=(8, 8))
    
    i = 0
    for (train, test) in cv.split(data, target):
        print(train, test)
        probas_ = model.fit(data[train], target[train]).predict_proba(data[test])
        # Compute ROC curve and area the curve
        fpr, tpr, thresholds = metrics.roc_curve(target[test], probas_[:, 1])
        mean_tpr += np.interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = metrics.auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=2, color=colors[i],
                 label='ROC fold %d (area = %0.2f)' % (i, roc_auc))
        i = i + 1
    
    # ROC Curve
    mean_tpr /= cv.get_n_splits(data, target)
    mean_tpr[-1] = 1.0
    mean_auc = metrics.auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, color='g', linestyle='--',
             label='Mean ROC (area = %0.2f)' % mean_auc, lw=2)
    plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='k', label='random')

    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.xlim([-0.1,1.1])
    plt.ylim([-0.1,1.1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

In [30]:
model_evaluation_crossval(X, y, Shallow, "Shallow")


[1410 1412 1414 ..., 7724 7725 7726] [   0    1    2 ..., 1737 1738 1741]
[   0    1    2 ..., 7724 7725 7726] [1410 1412 1414 ..., 3522 3527 3528]
[   0    1    2 ..., 7724 7725 7726] [2801 2802 2803 ..., 5132 5134 5135]
[   0    1    2 ..., 7724 7725 7726] [4168 4171 4172 ..., 6416 6417 6418]
[   0    1    2 ..., 6416 6417 6418] [5883 5884 5885 ..., 7724 7725 7726]

Exercise 0: Update the model_evaluation function to work with k-fold cross-validation


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Exercise 1: Find the ROC curve for balancing the classes


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Exercise 2: Plot the AUC values for maximum depth (2 to 10)


In [ ]:


In [ ]:


In [ ]:

Exercise 3: Plot the AUC values for different minimum sample splits (2 to 8)


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: