From the video series: Introduction to machine learning with scikit-learn
Pima Indians Diabetes dataset from the UCI Machine Learning Repository
In [2]:
# read the data into a Pandas DataFrame
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(url, header=None, names=col_names)
In [3]:
# print the first 5 rows of data
pima.head()
Out[3]:
Question: Can we predict the diabetes status of a patient given their health measurements?
In [4]:
# define X and y
feature_cols = ['pregnant', 'insulin', 'bmi', 'age']
X = pima[feature_cols]
y = pima.label
In [5]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
In [6]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[6]:
In [7]:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
Classification accuracy: percentage of correct predictions
In [8]:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
Null accuracy: accuracy that could be achieved by always predicting the most frequent class
In [9]:
# examine the class distribution of the testing set (using a Pandas Series method)
y_test.value_counts()
Out[9]:
In [10]:
# calculate the percentage of ones
y_test.mean()
Out[10]:
In [11]:
# calculate the percentage of zeros
1 - y_test.mean()
Out[11]:
In [12]:
# calculate null accuracy (for binary classification problems coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())
Out[12]:
In [13]:
# calculate null accuracy (for multi-class classification problems)
y_test.value_counts().head(1) / len(y_test)
Out[13]:
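As a cross-check, the null accuracy can also be obtained from a baseline model that always predicts the most frequent class. A minimal sketch using scikit-learn's DummyClassifier and the train/test split defined above:

# sketch: null accuracy via a baseline that always predicts the most frequent training class
from sklearn.dummy import DummyClassifier
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print(dummy.score(X_test, y_test))  # matches the null accuracy above when the majority class is the same in both sets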
Comparing the true and predicted response values
In [14]:
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])
Conclusion:
In [15]:
# IMPORTANT: first argument is true values, second argument is predicted values
print(metrics.confusion_matrix(y_test, y_pred_class))
Basic terminology: True Positives (TP) and True Negatives (TN) are predictions the classifier got right; False Positives (FP) are negative cases incorrectly predicted as positive, and False Negatives (FN) are positive cases incorrectly predicted as negative.
In [16]:
# print the first 25 true and predicted responses
print('True:', y_test.values[0:25])
print('Pred:', y_pred_class[0:25])
In [17]:
# save confusion matrix and slice into four pieces
confusion = metrics.confusion_matrix(y_test, y_pred_class)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
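The same four counts can also be unpacked in one line, since scikit-learn's binary confusion matrix flattens to (TN, FP, FN, TP); a small sketch:

# sketch: equivalent one-line unpacking of the confusion matrix
TN, FP, FN, TP = metrics.confusion_matrix(y_test, y_pred_class).ravel()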
Classification Accuracy: Overall, how often is the classifier correct?
In [18]:
print((TP + TN) / float(TP + TN + FP + FN))
print(metrics.accuracy_score(y_test, y_pred_class))
Classification Error: Overall, how often is the classifier incorrect?
In [19]:
print((FP + FN) / float(TP + TN + FP + FN))
print(1 - metrics.accuracy_score(y_test, y_pred_class))
Sensitivity: When the actual value is positive, how often is the prediction correct?
In [20]:
print(TP / float(TP + FN))
print(metrics.recall_score(y_test, y_pred_class))
Specificity: When the actual value is negative, how often is the prediction correct?
In [21]:
print(TN / float(TN + FP))
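scikit-learn has no dedicated specificity function, but specificity is simply recall computed with the negative class treated as the positive label; a sketch:

# sketch: specificity as recall of the negative class (label 0)
print(metrics.recall_score(y_test, y_pred_class, pos_label=0))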
False Positive Rate: When the actual value is negative, how often is the prediction incorrect?
In [22]:
print(FP / float(TN + FP))
Precision: When a positive value is predicted, how often is the prediction correct?
In [23]:
print(TP / float(TP + FP))
print(metrics.precision_score(y_test, y_pred_class))
Many other metrics can be computed: F1 score, Matthews correlation coefficient, etc.
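For example, two of these are available directly in sklearn.metrics; a minimal sketch:

# sketch: F1 score and Matthews correlation coefficient for the same predictions
print(metrics.f1_score(y_test, y_pred_class))
print(metrics.matthews_corrcoef(y_test, y_pred_class))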
Conclusion:
Which metrics should you focus on?
In [24]:
# print the first 10 predicted responses
logreg.predict(X_test)[0:10]
Out[24]:
In [25]:
# print the first 10 predicted probabilities of class membership
logreg.predict_proba(X_test)[0:10, :]
Out[25]:
In [26]:
# print the first 10 predicted probabilities for class 1
logreg.predict_proba(X_test)[0:10, 1]
Out[26]:
In [27]:
# store the predicted probabilities for class 1
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
In [28]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14
In [29]:
# histogram of predicted probabilities
plt.hist(y_pred_prob, bins=8)
plt.xlim(0, 1)
plt.title('Histogram of predicted probabilities')
plt.xlabel('Predicted probability of diabetes')
plt.ylabel('Frequency')
Out[29]:
Decrease the threshold for predicting diabetes in order to increase the sensitivity of the classifier
In [30]:
# predict diabetes if the predicted probability is greater than 0.3
from sklearn.preprocessing import binarize
# binarize expects a 2D array and a keyword threshold, so reshape the probabilities first
y_pred_class = binarize(y_pred_prob.reshape(1, -1), threshold=0.3)[0]
In [31]:
# print the first 10 predicted probabilities
y_pred_prob[0:10]
Out[31]:
In [32]:
# print the first 10 predicted classes with the lower threshold
y_pred_class[0:10]
Out[32]:
In [33]:
# previous confusion matrix (default threshold of 0.5)
print(confusion)
In [34]:
# new confusion matrix (threshold of 0.3)
print(metrics.confusion_matrix(y_test, y_pred_class))
In [35]:
# sensitivity has increased (used to be 0.24)
print(46 / float(46 + 16))
In [36]:
# specificity has decreased (used to be 0.91)
print(80 / float(80 + 50))
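Rather than hard-coding the cell counts, both rates can be recomputed directly from the new confusion matrix; a sketch (TN2, FP2, FN2, TP2 are just illustrative names):

# sketch: recompute sensitivity and specificity from the threshold-0.3 predictions
TN2, FP2, FN2, TP2 = metrics.confusion_matrix(y_test, y_pred_class).ravel()
print('Sensitivity:', TP2 / float(TP2 + FN2))
print('Specificity:', TN2 / float(TN2 + FP2))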
Conclusion:
In [37]:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for diabetes classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
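Newer scikit-learn releases (1.0 and later) can draw the same curve in a single call; a sketch, assuming RocCurveDisplay is available in the installed version:

# sketch: one-call ROC curve from true labels and predicted probabilities
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_predictions(y_test, y_pred_prob)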
In [38]:
# define a function that accepts a threshold and prints sensitivity and specificity
def evaluate_threshold(threshold):
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])
In [39]:
evaluate_threshold(0.5)
In [40]:
evaluate_threshold(0.3)
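The same helper can be used to scan several candidate thresholds at once; a short sketch:

# sketch: compare sensitivity and specificity across a few candidate thresholds
for t in [0.1, 0.2, 0.3, 0.4, 0.5]:
    print('Threshold:', t)
    evaluate_threshold(t)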
AUC is the percentage of the ROC plot that is underneath the curve:
In [41]:
# IMPORTANT: first argument is true values, second argument is predicted probabilities
print(metrics.roc_auc_score(y_test, y_pred_prob))
In [42]:
# calculate cross-validated AUC
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
Out[42]:
Confusion matrix advantages:
ROC/AUC advantages: