*From the video series: Introduction to machine learning with scikit-learn*

- What is the purpose of **model evaluation**, and what are some common evaluation procedures?
- What is the usage of **classification accuracy**, and what are its limitations?
- How does a **confusion matrix** describe the performance of a classifier?
- What **metrics** can be computed from a confusion matrix?
- How can you adjust classifier performance by **changing the classification threshold**?
- What is the purpose of an **ROC curve**?
- How does **Area Under the Curve (AUC)** differ from classification accuracy?

**Training and testing on the same data**

- Rewards overly complex models that "overfit" the training data and won't necessarily generalize

**Train/test split**

- Split the dataset into two pieces, so that the model can be trained and tested on different data
- Better estimate of out-of-sample performance, but still a "high variance" estimate
- Useful due to its speed, simplicity, and flexibility

**K-fold cross-validation**

- Systematically create "K" train/test splits and average the results together
- Even better estimate of out-of-sample performance
- Runs "K" times slower than train/test split

Pima Indians Diabetes dataset from the UCI Machine Learning Repository

In [2]:
```
# read the data into a Pandas DataFrame
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)
```

In [3]:
```
# print the first 5 rows of data
pima.head()
```


**Question:** Can we predict the diabetes status of a patient given their health measurements?

In [4]:
```
# define X and y
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]
y = pima.label
```

In [5]:
```
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```

In [6]:
```
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
```


In [7]:
```
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
```

**Classification accuracy:** percentage of correct predictions

In [8]:
```
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
```

**Null accuracy:** accuracy that could be achieved by always predicting the most frequent class

In [9]:
```
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()
```


In [10]:
```
# calculate the percentage of ones
y_test.mean()
```


In [11]:
```
# calculate the percentage of zeros
1 - y_test.mean()
```


In [12]:
```
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())
```


In [13]:
```
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)
```


Comparing the **true** and **predicted** response values

In [14]:
```
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])
```

**Conclusion:**

- Classification accuracy is the **easiest classification metric to understand**
- But, it does not tell you the **underlying distribution** of response values
- And, it does not tell you what **"types" of errors** your classifier is making

In [15]:
```
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))
```

- Every observation in the testing set is represented in **exactly one box**
- It's a 2x2 matrix because there are **2 response classes**
- The format shown here is **not** universal

**Basic terminology**

- **True Positives (TP):** we *correctly* predicted that they *do* have diabetes
- **True Negatives (TN):** we *correctly* predicted that they *don't* have diabetes
- **False Positives (FP):** we *incorrectly* predicted that they *do* have diabetes (a "Type I error")
- **False Negatives (FN):** we *incorrectly* predicted that they *don't* have diabetes (a "Type II error")

In [16]:
```
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])
```

In [17]:
```
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
```

**Classification Accuracy:** Overall, how often is the classifier correct?

In [18]:
```
print((TP + TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))
```

**Classification Error:** Overall, how often is the classifier incorrect?

- Also known as "Misclassification Rate"

In [19]:
```
print((FP + FN) / float(TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))
```

**Sensitivity:** When the actual value is positive, how often is the prediction correct?

- How "sensitive" is the classifier to detecting positive instances?
- Also known as "True Positive Rate" or "Recall"

In [20]:
```
print(TP / float(TP + FN))
print(metrics.recall_score(y_test, y_pred_class))
```

**Specificity:** When the actual value is negative, how often is the prediction correct?

- How "specific" (or "selective") is the classifier in predicting positive instances?

In [21]:
```
print(TN / float(TN + FP))
```

**False Positive Rate:** When the actual value is negative, how often is the prediction incorrect?

In [22]:
```
print(FP / float(TN + FP))
```

**Precision:** When a positive value is predicted, how often is the prediction correct?

- How "precise" is the classifier when predicting positive instances?

In [23]:
```
print(TP / float(TP + FP))
print(metrics.precision_score(y_test, y_pred_class))
```

Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.
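As a quick illustration (not part of the lesson), both of those metrics can be computed with scikit-learn from the same true/predicted inputs; the labels below are made up:

```
# Made-up labels (not the diabetes data), used to illustrate two
# additional metrics scikit-learn can compute from true vs. predicted values
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# F1 score: harmonic mean of precision and recall
print('F1: ', f1_score(y_true, y_pred))           # 0.75
# Matthews correlation coefficient: uses all four confusion matrix cells
print('MCC:', matthews_corrcoef(y_true, y_pred))  # 0.5
```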

**Conclusion:**

- Confusion matrix gives you a **more complete picture** of how your classifier is performing
- Also allows you to compute various **classification metrics**, and these metrics can guide your model selection

**Which metrics should you focus on?**

- Choice of metric depends on your **business objective**
- **Spam filter** (positive class is "spam"): Optimize for **precision or specificity** because false negatives (spam goes to the inbox) are more acceptable than false positives (non-spam is caught by the spam filter)
- **Fraudulent transaction detector** (positive class is "fraud"): Optimize for **sensitivity** because false positives (normal transactions that are flagged as possible fraud) are more acceptable than false negatives (fraudulent transactions that are not detected)

In [24]:
```
# print the first 10 predicted responses
logreg.predict(X_test)[0:10]
```


In [25]:
```
# print the first 10 predicted probabilities of class membership
logreg.predict_proba(X_test)[0:10, :]
```


In [26]:
```
# print the first 10 predicted probabilities for class 1
logreg.predict_proba(X_test)[0:10, 1]
```


In [27]:
```
# store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
```

In [28]:
```
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14
```

In [29]:
```
# histogram of predicted probabilities
plt.hist(y_pred_prob, bins=8)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')
```


**Decrease the threshold** for predicting diabetes in order to **increase the sensitivity** of the classifier

In [30]:
```
# predict diabetes if the predicted probability is greater than 0.3
# (binarize expects a 2-D array and a keyword threshold in current scikit-learn)
from sklearn.preprocessing import binarize
y_pred_class = binarize(y_pred_prob.reshape(-1, 1), threshold=0.3).ravel()
```

In [31]:
```
# print the first 10 predicted probabilities
y_pred_prob[0:10]
```


In [32]:
```
# print the first 10 predicted classes with the lower threshold
y_pred_class[0:10]
```


In [33]:
```
# previous confusion matrix (default threshold of 0.5)
print(confusion)
```

In [34]:
```
# new confusion matrix (threshold of 0.3)
print(metrics.confusion_matrix(y_test, y_pred_class))
```

In [35]:
```
# sensitivity has increased (used to be 0.24)
print(46 / float(46 + 16))
```

In [36]:
```
# specificity has decreased (used to be 0.91)
print(80 / float(80 + 50))
```

**Conclusion:**

- **Threshold of 0.5** is used by default (for binary problems) to convert predicted probabilities into class predictions
- Threshold can be **adjusted** to increase sensitivity or specificity
- Sensitivity and specificity have an **inverse relationship**
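That inverse relationship can be seen in a small standalone sketch (made-up probabilities, not the diabetes model): as the threshold rises, sensitivity falls while specificity rises.

```
# Standalone illustration (made-up data, not the diabetes model):
# raising the threshold lowers sensitivity and raises specificity
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.3, 0.2, 0.9])

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    TP = np.sum((y_true == 1) & (y_pred == 1))
    FN = np.sum((y_true == 1) & (y_pred == 0))
    TN = np.sum((y_true == 0) & (y_pred == 0))
    FP = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    print(threshold, 'sensitivity:', sensitivity, 'specificity:', specificity)
```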

In [37]:
```
# IMPORTANT: first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
```


- ROC curve can help you to **choose a threshold** that balances sensitivity and specificity in a way that makes sense for your particular context
- You can't actually **see the thresholds** used to generate the curve on the ROC curve itself

In [38]:
```
# define a function that accepts a threshold and prints sensitivity and specificity
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])
```

In [39]:
```
evaluate_threshold(0.5)
```

In [40]:
```
evaluate_threshold(0.3)
```

AUC is the **percentage** of the ROC plot that is **underneath the curve**:

In [41]:
```
# IMPORTANT: first argument is true values, second argument is predicted probabilities
print(metrics.roc_auc_score(y_test, y_pred_prob))
```

- AUC is useful as a **single number summary** of classifier performance.
- If you randomly chose one positive and one negative observation, AUC represents the likelihood that your classifier will assign a **higher predicted probability** to the positive observation.
- AUC is useful even when there is **high class imbalance** (unlike classification accuracy).
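That probabilistic interpretation can be verified directly with a small standalone sketch (made-up labels and probabilities, not the diabetes model): the fraction of (positive, negative) pairs in which the positive observation gets the higher predicted probability matches `roc_auc_score`.

```
# Standalone check (made-up data): AUC equals the fraction of
# (positive, negative) pairs where the positive outranks the negative
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.6, 0.4, 0.8, 0.1, 0.7])

pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
# count each pair where the positive wins; ties count as half
pairwise = np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])
print('Pairwise estimate:', pairwise)
print('roc_auc_score:', roc_auc_score(y_true, y_prob))
```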

In [42]:
```
# calculate cross-validated AUC
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
```


**Confusion matrix advantages:**

- Allows you to calculate a **variety of metrics**
- Useful for **multi-class problems** (more than two response classes)

**ROC/AUC advantages:**

- Does not require you to **set a classification threshold**
- Still useful when there is **high class imbalance**

- Blog post: Simple guide to confusion matrix terminology by me
- Videos: Intuitive sensitivity and specificity (9 minutes) and The tradeoff between sensitivity and specificity (13 minutes) by Rahul Patwari
- Notebook: How to calculate "expected value" from a confusion matrix by treating it as a cost-benefit matrix (by Ed Podojil)
- Graphic: How classification threshold affects different evaluation metrics (from a blog post about Amazon Machine Learning)

- Lesson notes: ROC Curves (from the University of Georgia)
- Video: ROC Curves and Area Under the Curve (14 minutes) by me, including transcript and screenshots and a visualization
- Video: ROC Curves (12 minutes) by Rahul Patwari
- Paper: An introduction to ROC analysis by Tom Fawcett
- Usage examples: Comparing different feature sets for detecting fraudulent Skype users, and comparing different classifiers on a number of popular datasets

- scikit-learn documentation: Model evaluation
- Guide: Comparing model evaluation procedures and metrics by me
- Video: Counterfactual evaluation of machine learning models (45 minutes) about how Stripe evaluates its fraud detection model, including slides

- Email: kevin@dataschool.io
- Website: http://dataschool.io
- Twitter: @justmarkham
