TensorFlow Classification

Data

https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

  1. Title: Pima Indians Diabetes Database

  2. Sources:

    (a) Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
    

    (b) Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu)

                       Research Center, RMI Group Leader
                       Applied Physics Laboratory
                       The Johns Hopkins University
                       Johns Hopkins Road
                       Laurel, MD 20707
                       (301) 953-6231
    

    (c) Date received: 9 May 1990

  3. Past Usage:

    1. Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press.

      The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.

      Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were both 76% on the remaining 192 instances.

  4. Relevant Information: Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.

  5. Number of Instances: 768

  6. Number of Attributes: 8 plus class

    1. For Each Attribute: (all numeric-valued)
      1. Number of times pregnant
      2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
      3. Diastolic blood pressure (mm Hg)
      4. Triceps skin fold thickness (mm)
      5. 2-Hour serum insulin (mu U/ml)
      6. Body mass index (weight in kg/(height in m)^2)
      7. Diabetes pedigree function
      8. Age (years)
      9. Class variable (0 or 1)
  7. Missing Attribute Values: Yes

  8. Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")

    Class Value    Number of instances
    0              500
    1              268

  9. Brief statistical analysis:

    Attribute number:    Mean:   Standard Deviation:
    1.                     3.8     3.4
    2.                   120.9    32.0
    3.                    69.1    19.4
    4.                    20.5    16.0
    5.                    79.8   115.2
    6.                    32.0     7.9
    7.                     0.5     0.3
    8.                    33.2    11.8
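Item 3 of the description above mentions turning a real-valued prediction into a binary decision with a cutoff of 0.448 and then reporting sensitivity and specificity. The sketch below shows that computation on a handful of made-up scores. The scores and labels are invented for illustration and are not ADAP outputs, and treating a score equal to the cutoff as positive is an assumption.

```python
# Toy scores and labels, invented for illustration only (not ADAP outputs).
scores = [0.10, 0.30, 0.44, 0.45, 0.60, 0.80, 0.20, 0.90]
labels = [0, 0, 1, 0, 1, 1, 0, 1]

CUTOFF = 0.448  # cutoff quoted in the dataset description


def binarize(scores, cutoff):
    """Map real-valued predictions in [0, 1] to 0/1 decisions.

    Assumption: a score exactly at the cutoff counts as positive.
    """
    return [1 if s >= cutoff else 0 for s in scores]


def sensitivity_specificity(labels, decisions):
    """Return (true positive rate, true negative rate)."""
    pairs = list(zip(labels, decisions))
    tp = sum(1 for y, d in pairs if y == 1 and d == 1)
    fn = sum(1 for y, d in pairs if y == 1 and d == 0)
    tn = sum(1 for y, d in pairs if y == 0 and d == 0)
    fp = sum(1 for y, d in pairs if y == 0 and d == 1)
    return tp / (tp + fn), tn / (tn + fp)


decisions = binarize(scores, CUTOFF)
sensitivity, specificity = sensitivity_specificity(labels, decisions)
print(decisions, sensitivity, specificity)
```

Sensitivity is the fraction of actual positives that are flagged, and specificity is the fraction of actual negatives that are cleared, so moving the cutoff trades one against the other.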

In [1]:
import pandas as pd

In [2]:
diabetes = pd.read_csv('data/pima-indians-diabetes.csv')

In [3]:
diabetes.head()


Out[3]:
Number_pregnant Glucose_concentration Blood_pressure Triceps Insulin BMI Pedigree Age Class Group
0 6 0.743719 0.590164 0.353535 0.000000 0.500745 0.234415 50 1 B
1 1 0.427136 0.540984 0.292929 0.000000 0.396423 0.116567 31 0 C
2 8 0.919598 0.524590 0.000000 0.000000 0.347243 0.253629 32 1 B
3 1 0.447236 0.540984 0.232323 0.111111 0.418778 0.038002 21 0 B
4 0 0.688442 0.327869 0.353535 0.198582 0.642325 0.943638 33 1 C

In [4]:
diabetes.columns


Out[4]:
Index(['Number_pregnant', 'Glucose_concentration', 'Blood_pressure', 'Triceps',
       'Insulin', 'BMI', 'Pedigree', 'Age', 'Class', 'Group'],
      dtype='object')

Clean the Data


In [5]:
# Columns that will be normalized
cols_to_norm = ['Number_pregnant', 'Glucose_concentration', 'Blood_pressure', 'Triceps',
                'Insulin', 'BMI', 'Pedigree']

In [6]:
# Normalizing the columns
diabetes[cols_to_norm] = diabetes[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

In [7]:
diabetes.head()


Out[7]:
Number_pregnant Glucose_concentration Blood_pressure Triceps Insulin BMI Pedigree Age Class Group
0 0.352941 0.743719 0.590164 0.353535 0.000000 0.500745 0.234415 50 1 B
1 0.058824 0.427136 0.540984 0.292929 0.000000 0.396423 0.116567 31 0 C
2 0.470588 0.919598 0.524590 0.000000 0.000000 0.347243 0.253629 32 1 B
3 0.058824 0.447236 0.540984 0.232323 0.111111 0.418778 0.038002 21 0 B
4 0.000000 0.688442 0.327869 0.353535 0.198582 0.642325 0.943638 33 1 C
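As a rough check on the lambda used for normalization, the same min-max formula can be written out in plain Python. The sample below uses the first five raw 'Number_pregnant' values shown earlier plus 17, which is an assumed column maximum chosen because it reproduces the scaled values displayed above. The handling of a constant column is also a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column would divide by zero; returning zeros is a
        # choice made for this sketch, not something the notebook does.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# First five raw 'Number_pregnant' values from the notebook, plus 17 as an
# assumed column maximum (hypothetical, picked to match the output above).
sample = [6, 1, 8, 1, 0, 17]
scaled = [round(v, 6) for v in min_max_scale(sample)]
print(scaled)
```

Note that the scaling depends on the minimum and maximum of whatever data it is applied to, so statistics computed on the full table also reflect rows that later land in the test split.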

Feature Columns


In [8]:
diabetes.columns


Out[8]:
Index(['Number_pregnant', 'Glucose_concentration', 'Blood_pressure', 'Triceps',
       'Insulin', 'BMI', 'Pedigree', 'Age', 'Class', 'Group'],
      dtype='object')

In [9]:
import tensorflow as tf

Continuous Features

  • Number of times pregnant
  • Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • Diastolic blood pressure (mm Hg)
  • Triceps skin fold thickness (mm)
  • 2-Hour serum insulin (mu U/ml)
  • Body mass index (weight in kg/(height in m)^2)
  • Diabetes pedigree function
  • Age (years)

In [10]:
num_preg = tf.feature_column.numeric_column('Number_pregnant')
plasma_gluc = tf.feature_column.numeric_column('Glucose_concentration')
dias_press = tf.feature_column.numeric_column('Blood_pressure')
tricep = tf.feature_column.numeric_column('Triceps')
insulin = tf.feature_column.numeric_column('Insulin')
bmi = tf.feature_column.numeric_column('BMI')
diabetes_pedigree = tf.feature_column.numeric_column('Pedigree')
age = tf.feature_column.numeric_column('Age')

Categorical Features

If you know every possible value of a column and there are only a few of them, use categorical_column_with_vocabulary_list. If you don't know the set of possible values in advance, use categorical_column_with_hash_bucket instead.


In [11]:
assigned_group = tf.feature_column.categorical_column_with_vocabulary_list('Group',['A','B','C','D'])
# Alternative
# assigned_group = tf.feature_column.categorical_column_with_hash_bucket('Group', hash_bucket_size=10)
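Conceptually, both options turn a string into an integer id. The sketch below imitates the two strategies in plain Python. It is not TensorFlow's implementation: the function names are hypothetical, MD5 stands in for whatever hash the library uses, and returning -1 for an unknown value is an assumption of this sketch.

```python
import hashlib

VOCAB = ['A', 'B', 'C', 'D']


def vocabulary_id(value, vocab=VOCAB, default=-1):
    """Look the value up in a known vocabulary; unknown values get `default`."""
    try:
        return vocab.index(value)
    except ValueError:
        return default


def hash_bucket_id(value, num_buckets=10):
    """Hash the value into one of `num_buckets` ids (MD5 is a stand-in hash)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets


for group in ['A', 'B', 'Z']:
    print(group, vocabulary_id(group), hash_bucket_id(group))
```

The vocabulary approach gives every known value its own id but needs the full list up front. Hashing needs no list, but two different values can collide in the same bucket.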

Converting Continuous to Categorical


In [12]:
import matplotlib.pyplot as plt
%matplotlib inline

In [13]:
diabetes['Age'].hist(bins = 20)


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x29139a85898>

In [14]:
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[20, 30, 40, 50, 60, 70, 80])
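Seven boundaries produce eight buckets: below 20, 20-29, 30-39, and so on up to 80 and above. The sketch below shows the idea with the standard library. It assumes a value equal to a boundary falls into the upper bucket, and the helper names are hypothetical.

```python
from bisect import bisect_right

BOUNDARIES = [20, 30, 40, 50, 60, 70, 80]


def bucket_index(age, boundaries=BOUNDARIES):
    """Return the bucket number for an age.

    Assumption: buckets are half-open, [low, high), so an age equal to a
    boundary lands in the upper bucket.
    """
    return bisect_right(boundaries, age)


def one_hot(index, size):
    """Encode a bucket number as a one-hot list."""
    return [1 if i == index else 0 for i in range(size)]


# Ages of the first five rows shown in the notebook.
ages = [50, 31, 32, 21, 33]
n_buckets = len(BOUNDARIES) + 1
for age in ages:
    idx = bucket_index(age)
    print(age, idx, one_hot(idx, n_buckets))
```

Bucketizing lets a linear model learn a separate weight per age range instead of a single slope across all ages.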

Putting them together


In [15]:
feat_cols = [num_preg, plasma_gluc, dias_press, tricep, insulin, 
             bmi, diabetes_pedigree, assigned_group, age_buckets]

Train Test Split


In [16]:
diabetes.head()


Out[16]:
Number_pregnant Glucose_concentration Blood_pressure Triceps Insulin BMI Pedigree Age Class Group
0 0.352941 0.743719 0.590164 0.353535 0.000000 0.500745 0.234415 50 1 B
1 0.058824 0.427136 0.540984 0.292929 0.000000 0.396423 0.116567 31 0 C
2 0.470588 0.919598 0.524590 0.000000 0.000000 0.347243 0.253629 32 1 B
3 0.058824 0.447236 0.540984 0.232323 0.111111 0.418778 0.038002 21 0 B
4 0.000000 0.688442 0.327869 0.353535 0.198582 0.642325 0.943638 33 1 C

In [17]:
diabetes.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
Number_pregnant          768 non-null float64
Glucose_concentration    768 non-null float64
Blood_pressure           768 non-null float64
Triceps                  768 non-null float64
Insulin                  768 non-null float64
BMI                      768 non-null float64
Pedigree                 768 non-null float64
Age                      768 non-null int64
Class                    768 non-null int64
Group                    768 non-null object
dtypes: float64(7), int64(2), object(1)
memory usage: 60.1+ KB

In [18]:
# Features only: drop the 'Class' label column
x_data = diabetes.drop('Class',axis = 1)

In [19]:
labels = diabetes['Class']

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
# Hold out 33% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(x_data,
                                                    labels,
                                                    test_size = 0.33, 
                                                    random_state = 101)
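With 768 rows and test_size = 0.33, the split should leave 254 rows for testing and 514 for training, assuming the test count is rounded up. The sketch below reproduces only those sizes with a plain shuffled split. It will not select the same rows as train_test_split, and split_indices is a hypothetical helper.

```python
import math
import random


def split_indices(n_samples, test_size, seed):
    """Shuffle row indices and cut off a test portion.

    Assumption: the test count is ceil(test_size * n_samples). The shuffle
    order differs from scikit-learn's, so only the sizes are comparable.
    """
    n_test = math.ceil(test_size * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n_samples=768, test_size=0.33, seed=101)
print(len(train_idx), len(test_idx))
```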

Input Function


In [22]:
input_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                 y = y_train,
                                                 batch_size = 10, 
                                                 num_epochs = 1000,
                                                 shuffle = True)
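The num_epochs value only caps how much data the input function can supply. The train calls in this notebook pass steps = 1000, and a quick calculation suggests the step limit is reached long before the epoch limit. The sketch assumes 514 training rows, which follows from the 33% split of 768 rows.

```python
n_train = 514        # assumed training-set size after the 33% split
batch_size = 10
num_epochs = 1000    # upper bound set in the input function
steps = 1000         # upper bound passed to train() in this notebook

# Examples the input function could supply before running dry.
examples_available = n_train * num_epochs

# Examples actually consumed if training stops after `steps` batches.
examples_consumed = steps * batch_size

passes_over_data = examples_consumed / n_train
limiting_factor = 'steps' if examples_consumed < examples_available else 'num_epochs'

print(examples_available, examples_consumed, round(passes_over_data, 2), limiting_factor)
```

Under these assumptions the model sees each training row roughly 19 times, not 1000 times.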

Creating the Model


In [23]:
model = tf.estimator.LinearClassifier(feature_columns = feat_cols, 
                                      n_classes = 2)


INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpod8f4b08', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002913C0D5438>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

In [24]:
model.train(input_fn = input_func,
            steps = 1000)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08\model.ckpt.
INFO:tensorflow:loss = 6.931472, step = 0
INFO:tensorflow:global_step/sec: 71.5283
INFO:tensorflow:loss = 6.5315638, step = 100 (1.401 sec)
INFO:tensorflow:global_step/sec: 88.4107
INFO:tensorflow:loss = 5.0967355, step = 200 (1.132 sec)
INFO:tensorflow:global_step/sec: 89.5847
INFO:tensorflow:loss = 4.257456, step = 300 (1.116 sec)
INFO:tensorflow:global_step/sec: 88.1773
INFO:tensorflow:loss = 5.143954, step = 400 (1.135 sec)
INFO:tensorflow:global_step/sec: 89.3968
INFO:tensorflow:loss = 7.7543244, step = 500 (1.117 sec)
INFO:tensorflow:global_step/sec: 88.8022
INFO:tensorflow:loss = 3.931779, step = 600 (1.126 sec)
INFO:tensorflow:global_step/sec: 90.1316
INFO:tensorflow:loss = 4.8705893, step = 700 (1.109 sec)
INFO:tensorflow:global_step/sec: 88.105
INFO:tensorflow:loss = 6.636447, step = 800 (1.136 sec)
INFO:tensorflow:global_step/sec: 71.8223
INFO:tensorflow:loss = 4.5513725, step = 900 (1.394 sec)
INFO:tensorflow:Saving checkpoints for 1000 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08\model.ckpt.
INFO:tensorflow:Loss for final step: 4.565303.
Out[24]:
<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x2913c0c8f28>
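The first logged loss, 6.931472, can be reproduced by hand under two assumptions that the log itself does not state: the linear model starts with all weights at zero, so every example is predicted with probability 0.5, and the logged loss is the sum of the per-example cross-entropy over a batch of 10.

```python
import math

batch_size = 10

# Assumption: zero initial weights give logit 0, so sigmoid(0) = 0.5 for every
# example, and the cross-entropy per example is -ln(0.5) whatever the label.
per_example_loss = -math.log(0.5)

# Assumption: the logged loss is summed over the batch, not averaged.
initial_batch_loss = batch_size * per_example_loss

print(round(per_example_loss, 6), round(initial_batch_loss, 6))
```

Seen this way, the later values (roughly 4 to 8 per batch of 10) correspond to about 0.4 to 0.8 per example, and the step-to-step jumps are the noise expected from small batches.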

In [25]:
# Useful link for your own data
# https://stackoverflow.com/questions/44664285/what-are-the-contraints-for-tensorflow-scope-names

Evaluation


In [26]:
eval_input_func = tf.estimator.inputs.pandas_input_fn(
    x = X_test,
    y = y_test,
    batch_size = 10,
    num_epochs = 1,
    shuffle = False)

In [27]:
results = model.evaluate(eval_input_func)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-05-07-12:47:26
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-05-07-12:47:27
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.7322835, accuracy_baseline = 0.65748036, auc = 0.78271043, auc_precision_recall = 0.616137, average_loss = 0.53667676, global_step = 1000, label/mean = 0.34251967, loss = 5.2429194, prediction/mean = 0.3794611

In [28]:
results


Out[28]:
{'accuracy': 0.7322835,
 'accuracy_baseline': 0.65748036,
 'auc': 0.78271043,
 'auc_precision_recall': 0.616137,
 'average_loss': 0.53667676,
 'global_step': 1000,
 'label/mean': 0.34251967,
 'loss': 5.2429194,
 'prediction/mean': 0.3794611}
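Several of these numbers are tied together by simple arithmetic. The sketch below re-derives them, assuming the test set holds 254 rows (33% of 768, rounded up), that accuracy_baseline is the accuracy of always predicting the majority class, and that loss is the summed loss averaged over evaluation batches of 10.

```python
import math

# Values copied from the evaluation output above.
label_mean = 0.34251967
accuracy = 0.7322835
average_loss = 0.53667676
batch_size = 10

n_test = math.ceil(0.33 * 768)              # assumed test-set size

positives = round(label_mean * n_test)      # rows with Class == 1
negatives = n_test - positives

# Assumption: baseline = accuracy of always guessing the majority class.
baseline = max(positives, negatives) / n_test

correct = round(accuracy * n_test)          # correctly classified rows

# Assumption: 'loss' = total loss divided by the number of batches.
n_batches = math.ceil(n_test / batch_size)
loss_per_batch = average_loss * n_test / n_batches

print(n_test, positives, correct, round(baseline, 6), round(loss_per_batch, 5))
```

Under these assumptions the model gets 186 of 254 test rows right, against 167 for the majority-class guess.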

Predictions


In [29]:
pred_input_func = tf.estimator.inputs.pandas_input_fn(
      x = X_test,
      batch_size = 10,
      num_epochs = 1,
      shuffle = False)

In [30]:
# predict() returns a generator; nothing is computed until it is iterated
predictions = model.predict(pred_input_func)

In [31]:
list(predictions)[0:5]


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
Out[31]:
[{'class_ids': array([1], dtype=int64),
  'classes': array([b'1'], dtype=object),
  'logistic': array([0.5530699], dtype=float32),
  'logits': array([0.2130822], dtype=float32),
  'probabilities': array([0.4469301, 0.5530699], dtype=float32)},
 {'class_ids': array([1], dtype=int64),
  'classes': array([b'1'], dtype=object),
  'logistic': array([0.6266913], dtype=float32),
  'logits': array([0.5180483], dtype=float32),
  'probabilities': array([0.37330872, 0.6266913 ], dtype=float32)},
 {'class_ids': array([0], dtype=int64),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.40525192], dtype=float32),
  'logits': array([-0.3836289], dtype=float32),
  'probabilities': array([0.5947481 , 0.40525195], dtype=float32)},
 {'class_ids': array([0], dtype=int64),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.34092832], dtype=float32),
  'logits': array([-0.65916], dtype=float32),
  'probabilities': array([0.6590717 , 0.34092832], dtype=float32)},
 {'class_ids': array([0], dtype=int64),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.16000484], dtype=float32),
  'logits': array([-1.6581922], dtype=float32),
  'probabilities': array([0.83999515, 0.16000482], dtype=float32)}]
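The fields in each prediction are different views of the same number. A minimal sketch of the relationship, using the five logits printed above: logistic is the sigmoid of the logit, probabilities is the pair [1 - p, p], and class_ids is the index of the larger probability. The describe helper is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def describe(logit):
    """Rebuild the prediction fields from a single logit."""
    p = sigmoid(logit)
    probabilities = [1.0 - p, p]
    class_id = probabilities.index(max(probabilities))
    return {'logistic': p, 'probabilities': probabilities, 'class_id': class_id}


# Logits copied from the first five predictions above.
logits = [0.2130822, 0.5180483, -0.3836289, -0.65916, -1.6581922]
rebuilt = [describe(z) for z in logits]
for z, r in zip(logits, rebuilt):
    print(z, round(r['logistic'], 6), r['class_id'])
```

A positive logit therefore always means class 1 and a negative logit class 0, which corresponds to a decision cutoff of 0.5 on the probability.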

DNN Classifier


In [32]:
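# Note: feat_cols still contains the categorical column assigned_group here.
# A DNN needs dense inputs, so training this model would raise an error;
# the column is wrapped in an embedding column in the following cells.
dnn_model = tf.estimator.DNNClassifier(hidden_units=[10, 10, 10],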
                                       feature_columns = feat_cols,
                                       n_classes = 2)


INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgk7n0b5i
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpgk7n0b5i', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002913C4DDA20>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

In [33]:
# Wrap the categorical 'Group' column (A, B, C, D) in an embedding column
embedded_group_column = tf.feature_column.embedding_column(assigned_group, 
                                                           dimension = 4)
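An embedding column can be pictured as a lookup table with one trainable row per category. The sketch below is an illustration, not TensorFlow's code: the table is filled with random numbers (standard deviation 0.5 is an arbitrary choice) and the helper names are hypothetical. It also shows that a lookup gives the same result as multiplying a one-hot vector by the table.

```python
import random

VOCAB = ['A', 'B', 'C', 'D']
DIMENSION = 4

# One row of DIMENSION numbers per category. Random values stand in for
# weights that would be learned during training.
rng = random.Random(0)
embedding_table = [[rng.gauss(0.0, 0.5) for _ in range(DIMENSION)] for _ in VOCAB]


def one_hot(group):
    """Sparse-style encoding: a 1 in the position of the category."""
    return [1.0 if g == group else 0.0 for g in VOCAB]


def embed(group):
    """Dense encoding: pick the table row that belongs to the category."""
    return embedding_table[VOCAB.index(group)]


def one_hot_times_table(group):
    """The same result written as a vector-matrix product."""
    vec = one_hot(group)
    return [sum(vec[i] * embedding_table[i][j] for i in range(len(VOCAB)))
            for j in range(DIMENSION)]


print(one_hot('B'))
print(embed('B'))
```

With only four categories, dimension = 4 gives no compression compared with a one-hot encoding; the point here is to hand the DNN a dense input.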

In [34]:
feat_cols = [num_preg, plasma_gluc, dias_press, tricep, insulin, 
             bmi, diabetes_pedigree, embedded_group_column, age_buckets]

In [35]:
input_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                 y = y_train,
                                                 batch_size = 10,
                                                 num_epochs = 1000,
                                                 shuffle = True)

In [36]:
dnn_model = tf.estimator.DNNClassifier(hidden_units=[10, 10, 10],
                                       feature_columns = feat_cols,
                                       n_classes = 2)


INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgfrl367r
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpgfrl367r', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002913C4DD6A0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

In [37]:
dnn_model.train(input_fn = input_func,
                steps = 1000)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgfrl367r\model.ckpt.
INFO:tensorflow:loss = 6.810172, step = 0
INFO:tensorflow:global_step/sec: 95.2824
INFO:tensorflow:loss = 7.896927, step = 100 (1.052 sec)
INFO:tensorflow:global_step/sec: 121.17
INFO:tensorflow:loss = 6.4979115, step = 200 (0.826 sec)
INFO:tensorflow:global_step/sec: 119.37
INFO:tensorflow:loss = 7.4552755, step = 300 (0.837 sec)
INFO:tensorflow:global_step/sec: 118.46
INFO:tensorflow:loss = 5.6559567, step = 400 (0.844 sec)
INFO:tensorflow:global_step/sec: 117.635
INFO:tensorflow:loss = 6.534105, step = 500 (0.850 sec)
INFO:tensorflow:global_step/sec: 117.07
INFO:tensorflow:loss = 2.6847029, step = 600 (0.854 sec)
INFO:tensorflow:global_step/sec: 117.327
INFO:tensorflow:loss = 3.3632967, step = 700 (0.852 sec)
INFO:tensorflow:global_step/sec: 120.02
INFO:tensorflow:loss = 2.3451052, step = 800 (0.833 sec)
INFO:tensorflow:global_step/sec: 120.157
INFO:tensorflow:loss = 3.9158137, step = 900 (0.832 sec)
INFO:tensorflow:Saving checkpoints for 1000 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgfrl367r\model.ckpt.
INFO:tensorflow:Loss for final step: 3.1989439.
Out[37]:
<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x2913c4dd240>

In [38]:
eval_input_func = tf.estimator.inputs.pandas_input_fn(
      x = X_test,
      y = y_test,
      batch_size = 10,
      num_epochs = 1,
      shuffle = False)

In [39]:
dnn_model.evaluate(eval_input_func)


INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-05-07-12:47:41
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgfrl367r\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-05-07-12:47:43
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.7322835, accuracy_baseline = 0.65748036, auc = 0.80122507, auc_precision_recall = 0.6041116, average_loss = 0.5160169, global_step = 1000, label/mean = 0.34251967, loss = 5.0410886, prediction/mean = 0.3570464
Out[39]:
{'accuracy': 0.7322835,
 'accuracy_baseline': 0.65748036,
 'auc': 0.80122507,
 'auc_precision_recall': 0.6041116,
 'average_loss': 0.5160169,
 'global_step': 1000,
 'label/mean': 0.34251967,
 'loss': 5.0410886,
 'prediction/mean': 0.3570464}
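To compare the two models, the shared metrics can be lined up side by side. The sketch below copies the values from the two evaluation outputs in this notebook and computes DNN minus linear for each.

```python
# Metrics copied from the two evaluate() outputs in this notebook.
linear_results = {
    'accuracy': 0.7322835,
    'auc': 0.78271043,
    'auc_precision_recall': 0.616137,
    'average_loss': 0.53667676,
}
dnn_results = {
    'accuracy': 0.7322835,
    'auc': 0.80122507,
    'auc_precision_recall': 0.6041116,
    'average_loss': 0.5160169,
}

# Positive delta means the DNN value is higher than the linear one.
deltas = {name: dnn_results[name] - linear_results[name] for name in linear_results}

for name, delta in deltas.items():
    print(f'{name:22s} {delta:+.4f}')
```

On this single split, accuracy is identical, the DNN has a slightly higher AUC and lower average loss, and the linear model has a slightly higher precision-recall AUC. Differences this small on 254 test rows may not hold for a different split or random seed.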

Great Job!