How do we measure a model? How can we tell whether it is doing well or just producing useless predictions?
This job is done by metrics. The scikit-learn documentation on model evaluation describes a whole range of metrics, and we are going to look at some of them. Metrics are computed on data the model has not seen during training, which is why we first split the dataset into a training set and a test set.
In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
print('X.shape =', X.shape)
print('y.shape =', y.shape)
print()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
print('X_train.shape =', X_train.shape)
print('y_train.shape =', y_train.shape)
print('X_test.shape =', X_test.shape)
print('y_test.shape =', y_test.shape)
Accuracy is the fraction of correct predictions:
$$ accuracy(y, \hat{y}) = \frac{1}{m} \sum_{i=1}^m 1(y^{(i)} = \hat{y}^{(i)}) $$
where $ 1(x) $ is the indicator function and $ m $ is the number of samples.
In [2]:
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
Out[2]:
0.5
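To check that this matches the formula above, the same value can be computed by hand; a quick sketch in plain NumPy:
import numpy as np

y_pred = np.array([0, 2, 1, 3])
y_true = np.array([0, 1, 2, 3])

# average of the indicator 1(y == y_hat) over the m samples
print(np.mean(y_true == y_pred))  # 0.5, same as accuracy_score above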
The confusion matrix gives a more detailed picture of a classifier. For multiclass classification, the entry $ C_{i,j} $ counts the samples whose true class is $ i $ and whose predicted class is $ j $, and for binary classification it reduces to
$$ C = \begin{bmatrix} tn & fp \\ fn & tp \end{bmatrix} $$
with true negatives, false positives, false negatives and true positives.
In [3]:
import numpy as np
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
np.array([[tn, fp],
          [fn, tp]])
Out[3]:
array([[2, 1],
       [2, 3]])
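From these counts we get precision, recall and the F1 score (the standard definitions, which is what the scikit-learn functions below compute for the positive class):
$$ precision = \frac{tp}{tp + fp}, \qquad recall = \frac{tp}{tp + fn}, \qquad F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall} $$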
In [4]:
from sklearn.metrics import precision_score, recall_score, f1_score
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
print('[[tn fp]\n [fn tp]]')
print(confusion_matrix(y_true, y_pred))
print()
print('Recall =', recall_score(y_true, y_pred))
print('Precision =', precision_score(y_true, y_pred))
print('F1 =', f1_score(y_true, y_pred))
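With more than two classes these scores are computed per class and then averaged; scikit-learn exposes this through the average parameter. A minimal example with 'macro' averaging (the unweighted mean over classes):
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

# unweighted mean of the per-class precisions: (2/3 + 0 + 1) / 3
print(precision_score(y_true, y_pred, average='macro'))

classification_report prints the per-class scores together with these averages: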
In [5]:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]
target_names = ['class 0', 'class 1', 'class 2']
print(classification_report(y_true, y_pred, target_names=target_names))
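Log loss (cross-entropy) evaluates probabilistic predictions rather than hard labels. For binary labels, with $ p^{(i)} $ the predicted probability of the positive class, the standard definition is
$$ L_{\log}(y, p) = - \frac{1}{m} \sum_{i=1}^m \left( y^{(i)} \log p^{(i)} + (1 - y^{(i)}) \log (1 - p^{(i)}) \right) $$
Lower is better, and a perfect classifier has log loss 0. scikit-learn's log_loss accepts one column of probabilities per class, as in the example below.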
In [6]:
from sklearn.metrics import log_loss
y_true = [0, 0, 1, 1]
y_pred = [[.9, .1], [.8, .2], [.3, .7], [.01, .99]]
log_loss(y_true, y_pred)
Out[6]:
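Mean absolute error is a regression metric: the average absolute difference between predictions and targets,
$$ MAE(y, \hat{y}) = \frac{1}{m} \sum_{i=1}^m \left| y^{(i)} - \hat{y}^{(i)} \right| $$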
In [7]:
from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)
Out[7]:
0.5
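Mean squared error averages the squared differences instead, which penalizes large errors more heavily:
$$ MSE(y, \hat{y}) = \frac{1}{m} \sum_{i=1}^m \left( y^{(i)} - \hat{y}^{(i)} \right)^2 $$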
In [8]:
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_squared_error(y_true, y_pred)
Out[8]:
0.375
From scikit-learn documentation:
$$ R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^m (y^{(i)} - \hat{y}^{(i)})^2}{\sum_{i=1}^m (y^{(i)} - \bar{y})^2} $$
The $ R^2 $ score is the coefficient of determination. It provides a measure of how well future samples are likely to be predicted by the model. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of $ y $, disregarding the input features, would get an $ R^2 $ score of 0.0.
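A minimal sketch of computing it with r2_score, reusing the regression targets from the cells above:
from sklearn.metrics import r2_score

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

# 1 - SS_res / SS_tot = 1 - 1.5 / 29.1875, roughly 0.9486
print(r2_score(y_true, y_pred))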