Pima Indian Diabetes Prediction (with Model Reload)

Import some basic libraries.

  • Pandas - provided data frames
  • matplotlib.pyplot - plotting support

Use Magic %matplotlib to display graphics inline instead of in a popup window.


In [1]:
import pandas as pd                 # pandas is a dataframe library
import matplotlib.pyplot as plt      # matplotlib.pyplot plots data

%matplotlib inline

Loading and Reviewing the Data


In [2]:
df = pd.read_csv("./data/pima-data.csv")

In [3]:
df.shape


Out[3]:
(768, 10)

In [4]:
df.head(5)


Out[4]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age skin diabetes
0 6 148 72 35 0 33.6 0.627 50 1.3790 True
1 1 85 66 29 0 26.6 0.351 31 1.1426 False
2 8 183 64 0 0 23.3 0.672 32 0.0000 True
3 1 89 66 23 94 28.1 0.167 21 0.9062 False
4 0 137 40 35 168 43.1 2.288 33 1.3790 True

In [5]:
df.tail(5)


Out[5]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age skin diabetes
763 10 101 76 48 180 32.9 0.171 63 1.8912 False
764 2 122 70 27 0 36.8 0.340 27 1.0638 False
765 5 121 72 23 112 26.2 0.245 30 0.9062 False
766 1 126 60 0 0 30.1 0.349 47 0.0000 True
767 1 93 70 31 0 30.4 0.315 23 1.2214 False

Definition of features

From the metadata on the data source we have the following definition of the features.

Feature Description Comments
num_preg number of pregnancies
glucose_conc Plasma glucose concentration a 2 hours in an oral glucose tolerance test
diastolic_bp Diastolic blood pressure (mm Hg)
thickness Triceps skin fold thickness (mm)
insulin 2-Hour serum insulin (mu U/ml)
bmi Body mass index (weight in kg/(height in m)^2)
diab_pred Diabetes pedigree function
Age (years) Age (years)
skin ???? What is this?
diabetes Class variable (1=True, 0=False) Why is our data boolean (True/False)?

Check for null values


In [6]:
df.isnull().values.any()


Out[6]:
False

Correlated Feature Check

Helper function that displays correlation by color. Red is most correlated, Blue least.


In [7]:
def plot_corr(df, size=11):
    """
    Function plots a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot

    Displays:
        matrix of correlation between columns.  Blue-cyan-yellow-red-darkred => less to more correlated
                                                0 ------------------>  1
                                                Expect a darkred line running from top left to bottom right
    """

    corr = df.corr()    # data frame correlation function
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)   # color code the rectangles by correlation value
    plt.xticks(range(len(corr.columns)), corr.columns)  # draw x tick marks
    plt.yticks(range(len(corr.columns)), corr.columns)  # draw y tick marks

In [8]:
plot_corr(df)



In [9]:
df.corr()


Out[9]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age skin diabetes
num_preg 1.000000 0.129459 0.141282 -0.081672 -0.073535 0.017683 -0.033523 0.544341 -0.081672 0.221898
glucose_conc 0.129459 1.000000 0.152590 0.057328 0.331357 0.221071 0.137337 0.263514 0.057328 0.466581
diastolic_bp 0.141282 0.152590 1.000000 0.207371 0.088933 0.281805 0.041265 0.239528 0.207371 0.065068
thickness -0.081672 0.057328 0.207371 1.000000 0.436783 0.392573 0.183928 -0.113970 1.000000 0.074752
insulin -0.073535 0.331357 0.088933 0.436783 1.000000 0.197859 0.185071 -0.042163 0.436783 0.130548
bmi 0.017683 0.221071 0.281805 0.392573 0.197859 1.000000 0.140647 0.036242 0.392573 0.292695
diab_pred -0.033523 0.137337 0.041265 0.183928 0.185071 0.140647 1.000000 0.033561 0.183928 0.173844
age 0.544341 0.263514 0.239528 -0.113970 -0.042163 0.036242 0.033561 1.000000 -0.113970 0.238356
skin -0.081672 0.057328 0.207371 1.000000 0.436783 0.392573 0.183928 -0.113970 1.000000 0.074752
diabetes 0.221898 0.466581 0.065068 0.074752 0.130548 0.292695 0.173844 0.238356 0.074752 1.000000

In [10]:
df.head(5)


Out[10]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age skin diabetes
0 6 148 72 35 0 33.6 0.627 50 1.3790 True
1 1 85 66 29 0 26.6 0.351 31 1.1426 False
2 8 183 64 0 0 23.3 0.672 32 0.0000 True
3 1 89 66 23 94 28.1 0.167 21 0.9062 False
4 0 137 40 35 168 43.1 2.288 33 1.3790 True

The skin and thickness columns are correlated 1 to 1. Dropping the skin column


In [11]:
del df['skin']

In [12]:
df.head(5)


Out[12]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age diabetes
0 6 148 72 35 0 33.6 0.627 50 True
1 1 85 66 29 0 26.6 0.351 31 False
2 8 183 64 0 0 23.3 0.672 32 True
3 1 89 66 23 94 28.1 0.167 21 False
4 0 137 40 35 168 43.1 2.288 33 True

Check for additional correlations


In [13]:
plot_corr(df)


The correlations look good. There appear to be no coorelated columns.

Mold Data

Data Types

Inspect data types to see if there are any issues. Data should be numeric.


In [14]:
df.head(5)


Out[14]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age diabetes
0 6 148 72 35 0 33.6 0.627 50 True
1 1 85 66 29 0 26.6 0.351 31 False
2 8 183 64 0 0 23.3 0.672 32 True
3 1 89 66 23 94 28.1 0.167 21 False
4 0 137 40 35 168 43.1 2.288 33 True

Change diabetes from boolean to integer, True=1, False=0


In [15]:
diabetes_map = {True : 1, False : 0}
df['diabetes'] = df['diabetes'].map(diabetes_map)

Verify that the diabetes data type has been changed.


In [16]:
df.head(5)


Out[16]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age diabetes
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

Check for null values


In [17]:
df.isnull().values.any()


Out[17]:
False

No obvious null values.

Check class distribution

Rare events are hard to predict


In [18]:
num_obs = len(df)
num_true = len(df.loc[df['diabetes'] == 1])
num_false = len(df.loc[df['diabetes'] == 0])
print("Number of True cases:  {0} ({1:2.2f}%)".format(num_true, (num_true/num_obs) * 100))
print("Number of False cases: {0} ({1:2.2f}%)".format(num_false, (num_false/num_obs) * 100))


Number of True cases:  268 (34.90%)
Number of False cases: 500 (65.10%)

Good distribution of true and false cases. No special work needed.

Spliting the data

70% for training, 30% for testing


In [19]:
from sklearn.cross_validation import train_test_split

feature_col_names = ['num_preg', 'glucose_conc', 'diastolic_bp', 'thickness', 'insulin', 'bmi', 'diab_pred', 'age']
predicted_class_names = ['diabetes']

X = df[feature_col_names].values     # predictor feature columns (8 X m)
y = df[predicted_class_names].values # predicted class (1=true, 0=false) column (1 X m)
split_test_size = 0.30

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=split_test_size, random_state=42) 
                            # test_size = 0.3 is 30%, 42 is the answer to everything


C:\Users\jerry\Anaconda3\lib\site-packages\sklearn\cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

We check to ensure we have the the desired 70% train, 30% test split of the data


In [20]:
print("{0:0.2f}% in training set".format((len(X_train)/len(df.index)) * 100))
print("{0:0.2f}% in test set".format((len(X_test)/len(df.index)) * 100))


69.92% in training set
30.08% in test set

Verifying predicted value was split correctly


In [21]:
print("Original True  : {0} ({1:0.2f}%)".format(len(df.loc[df['diabetes'] == 1]), (len(df.loc[df['diabetes'] == 1])/len(df.index)) * 100.0))
print("Original False : {0} ({1:0.2f}%)".format(len(df.loc[df['diabetes'] == 0]), (len(df.loc[df['diabetes'] == 0])/len(df.index)) * 100.0))
print("")
print("Training True  : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train) * 100.0)))
print("Training False : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train) * 100.0)))
print("")
print("Test True      : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test) * 100.0)))
print("Test False     : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test) * 100.0)))


Original True  : 268 (34.90%)
Original False : 500 (65.10%)

Training True  : 188 (35.01%)
Training False : 349 (64.99%)

Test True      : 80 (34.63%)
Test False     : 151 (65.37%)

Post-split Data Preparation

Hidden Missing Values


In [22]:
df.head()


Out[22]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age diabetes
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1

Are these 0 values possible?

How many rows have have unexpected 0 values?


In [23]:
print("# rows in dataframe {0}".format(len(df)))
print("# rows missing glucose_conc: {0}".format(len(df.loc[df['glucose_conc'] == 0])))
print("# rows missing diastolic_bp: {0}".format(len(df.loc[df['diastolic_bp'] == 0])))
print("# rows missing thickness: {0}".format(len(df.loc[df['thickness'] == 0])))
print("# rows missing insulin: {0}".format(len(df.loc[df['insulin'] == 0])))
print("# rows missing bmi: {0}".format(len(df.loc[df['bmi'] == 0])))
print("# rows missing diab_pred: {0}".format(len(df.loc[df['diab_pred'] == 0])))
print("# rows missing age: {0}".format(len(df.loc[df['age'] == 0])))


# rows in dataframe 768
# rows missing glucose_conc: 5
# rows missing diastolic_bp: 35
# rows missing thickness: 227
# rows missing insulin: 374
# rows missing bmi: 11
# rows missing diab_pred: 0
# rows missing age: 0

Impute with the mean


In [24]:
from sklearn.preprocessing import Imputer

#Impute with mean all 0 readings
fill_0 = Imputer(missing_values=0, strategy="mean", axis=0)

X_train = fill_0.fit_transform(X_train)
X_test = fill_0.fit_transform(X_test)

Training Initial Algorithm - Naive Bayes


In [25]:
from sklearn.naive_bayes import GaussianNB

# create Gaussian Naive Bayes model object and train it with the data
nb_model = GaussianNB()

nb_model.fit(X_train, y_train.ravel())


Out[25]:
GaussianNB(priors=None)

Performance on Training Data


In [26]:
# predict values using the training data
nb_predict_train = nb_model.predict(X_train)

# import the performance metrics library
from sklearn import metrics

# Accuracy
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, nb_predict_train)))
print()


Accuracy: 0.7542

Performance on Testing Data


In [27]:
# predict values using the testing data
nb_predict_test = nb_model.predict(X_test)

from sklearn import metrics

# training metrics
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, nb_predict_test)))


Accuracy: 0.7359

Metrics


In [28]:
print("Confusion Matrix")
print("{0}".format(metrics.confusion_matrix(y_test, nb_predict_test)))
print("")

print("Classification Report")
print(metrics.classification_report(y_test, nb_predict_test))


Confusion Matrix
[[118  33]
 [ 28  52]]

Classification Report
             precision    recall  f1-score   support

          0       0.81      0.78      0.79       151
          1       0.61      0.65      0.63        80

avg / total       0.74      0.74      0.74       231

Random Forest


In [29]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(random_state=42)      # Create random forest object
rf_model.fit(X_train, y_train.ravel())


Out[29]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

Predict Training Data


In [30]:
rf_predict_train = rf_model.predict(X_train)
# training metrics
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_train, rf_predict_train)))


Accuracy: 0.9870

Predict Test Data


In [31]:
rf_predict_test = rf_model.predict(X_test)

# training metrics
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, rf_predict_test)))


Accuracy: 0.7100

In [32]:
print(metrics.confusion_matrix(y_test, rf_predict_test) )
print("")
print("Classification Report")
print(metrics.classification_report(y_test, rf_predict_test))


[[121  30]
 [ 37  43]]

Classification Report
             precision    recall  f1-score   support

          0       0.77      0.80      0.78       151
          1       0.59      0.54      0.56        80

avg / total       0.70      0.71      0.71       231

Logistic Regression


In [33]:
from sklearn.linear_model import LogisticRegression

lr_model =LogisticRegression(C=0.7, random_state=42)
lr_model.fit(X_train, y_train.ravel())
lr_predict_test = lr_model.predict(X_test)

# training metrics
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_predict_test)))
print(metrics.confusion_matrix(y_test, lr_predict_test) )
print("")
print("Classification Report")
print(metrics.classification_report(y_test, lr_predict_test))


Accuracy: 0.7446
[[128  23]
 [ 36  44]]

Classification Report
             precision    recall  f1-score   support

          0       0.78      0.85      0.81       151
          1       0.66      0.55      0.60        80

avg / total       0.74      0.74      0.74       231

Setting regularization parameter


In [34]:
C_start = 0.1
C_end = 5
C_inc = 0.1

C_values, recall_scores = [], []

C_val = C_start
best_recall_score = 0
while (C_val < C_end):
    C_values.append(C_val)
    lr_model_loop = LogisticRegression(C=C_val, random_state=42)
    lr_model_loop.fit(X_train, y_train.ravel())
    lr_predict_loop_test = lr_model_loop.predict(X_test)
    recall_score = metrics.recall_score(y_test, lr_predict_loop_test)
    recall_scores.append(recall_score)
    if (recall_score > best_recall_score):
        best_recall_score = recall_score
        best_lr_predict_test = lr_predict_loop_test
        
    C_val = C_val + C_inc

best_score_C_val = C_values[recall_scores.index(best_recall_score)]
print("1st max value of {0:.3f} occured at C={1:.3f}".format(best_recall_score, best_score_C_val))

%matplotlib inline 
plt.plot(C_values, recall_scores, "-")
plt.xlabel("C value")
plt.ylabel("recall score")


1st max value of 0.613 occured at C=1.400
Out[34]:
<matplotlib.text.Text at 0x28381f994a8>

Logisitic regression with class_weight='balanced'


In [35]:
C_start = 0.1
C_end = 5
C_inc = 0.1

C_values, recall_scores = [], []

C_val = C_start
best_recall_score = 0
while (C_val < C_end):
    C_values.append(C_val)
    lr_model_loop = LogisticRegression(C=C_val, class_weight="balanced", random_state=42)
    lr_model_loop.fit(X_train, y_train.ravel())
    lr_predict_loop_test = lr_model_loop.predict(X_test)
    recall_score = metrics.recall_score(y_test, lr_predict_loop_test)
    recall_scores.append(recall_score)
    if (recall_score > best_recall_score):
        best_recall_score = recall_score
        best_lr_predict_test = lr_predict_loop_test
        
    C_val = C_val + C_inc

best_score_C_val = C_values[recall_scores.index(best_recall_score)]
print("1st max value of {0:.3f} occured at C={1:.3f}".format(best_recall_score, best_score_C_val))

%matplotlib inline 
plt.plot(C_values, recall_scores, "-")
plt.xlabel("C value")
plt.ylabel("recall score")


1st max value of 0.738 occured at C=0.300
Out[35]:
<matplotlib.text.Text at 0x283820009e8>

In [36]:
from sklearn.linear_model import LogisticRegression
lr_model =LogisticRegression( class_weight="balanced", C=best_score_C_val, random_state=42)
lr_model.fit(X_train, y_train.ravel())
lr_predict_test = lr_model.predict(X_test)

# training metrics
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_predict_test)))
print(metrics.confusion_matrix(y_test, lr_predict_test) )
print("")
print("Classification Report")
print(metrics.classification_report(y_test, lr_predict_test))
print(metrics.recall_score(y_test, lr_predict_test))


Accuracy: 0.7143
[[106  45]
 [ 21  59]]

Classification Report
             precision    recall  f1-score   support

          0       0.83      0.70      0.76       151
          1       0.57      0.74      0.64        80

avg / total       0.74      0.71      0.72       231

0.7375

LogisticRegressionCV


In [37]:
from sklearn.linear_model import LogisticRegressionCV
lr_cv_model = LogisticRegressionCV(n_jobs=-1, random_state=42, Cs=3, cv=10, refit=False, class_weight="balanced")  # set number of jobs to -1 which uses all cores to parallelize
lr_cv_model.fit(X_train, y_train.ravel())


Out[37]:
LogisticRegressionCV(Cs=3, class_weight='balanced', cv=10, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=-1, penalty='l2', random_state=42,
           refit=False, scoring=None, solver='lbfgs', tol=0.0001,
           verbose=0)

Predict on Test data


In [38]:
lr_cv_predict_test = lr_cv_model.predict(X_test)

# training metrics
print("Accuracy: {0:.4f}".format(metrics.accuracy_score(y_test, lr_cv_predict_test)))
print(metrics.confusion_matrix(y_test, lr_cv_predict_test) )
print("")
print("Classification Report")
print(metrics.classification_report(y_test, lr_cv_predict_test))


Accuracy: 0.7143
[[110  41]
 [ 25  55]]

Classification Report
             precision    recall  f1-score   support

          0       0.81      0.73      0.77       151
          1       0.57      0.69      0.62        80

avg / total       0.73      0.71      0.72       231

Using your trained Model

Save trained model to file


In [39]:
from sklearn.externals import joblib  
joblib.dump(lr_cv_model, "./data/pima-trained-model.pkl")


Out[39]:
['./data/pima-trained-model.pkl']

Load trained model from file


In [40]:
lr_cv_model = joblib.load("./data/pima-trained-model.pkl")

Test Prediction on data

Once the model is loaded we can use it to predict on some data. In this case the data file contains a few rows from the original Pima CSV file.


In [41]:
# get data from truncated pima data file
df_predict = pd.read_csv("./data/pima-data-trunc.csv")
print(df_predict.shape)


(4, 10)

In [42]:
df_predict


Out[42]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age skin diabetes
0 1 89 66 23 94 28.1 0.167 21 0.9062 False
1 2 197 70 45 543 30.5 0.158 53 1.7730 True
2 7 100 0 0 0 30.0 0.484 32 0.0000 True
3 1 103 30 38 83 43.3 0.183 33 1.4972 False

The truncated file contained 4 rows from the original CSV.

Data is the same is in same format as the original CSV file's data. Therefore, just like the original data, we need to transform it before we can make predictions on the data.

Note: If the data had been previously "cleaned up" this would not be necessary.

We do this by executed the same transformations as we did to the original data

Start by dropping the "skin" which is the same as thickness, with different units.


In [43]:
del df_predict['skin']
df_predict


Out[43]:
num_preg glucose_conc diastolic_bp thickness insulin bmi diab_pred age diabetes
0 1 89 66 23 94 28.1 0.167 21 False
1 2 197 70 45 543 30.5 0.158 53 True
2 7 100 0 0 0 30.0 0.484 32 True
3 1 103 30 38 83 43.3 0.183 33 False

We need to drop the diabetes column since that is what we are predicting.
Store data without the column with the prefix X as we did with the X_train and X_test to indicate that it contains only the columns we are prediction.


In [44]:
X_predict = df_predict
del X_predict['diabetes']

Data has 0 in places it should not.

Just like test or test datasets we will use imputation to fix this.


In [45]:
#Impute with mean all 0 readings
fill_0 = Imputer(missing_values=0, strategy="mean", axis=0)
X_predict = fill_0.fit_transform(X_predict)

At this point our data is ready to be used for prediction.

Predict diabetes with the prediction data. Returns 1 if True, 0 if false


In [46]:
lr_cv_model.predict(X_predict)


Out[46]:
array([0, 1, 0, 0], dtype=int64)