TensorFlow Classification

Data

https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes

Title: Pima Indians Diabetes Database

Sources:
(a) Original owners: National Institute of Diabetes and Digestive and Kidney Diseases
(b) Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu)
    Research Center, RMI Group Leader
    Applied Physics Laboratory
    The Johns Hopkins University
    Johns Hopkins Road
    Laurel, MD 20707
    (301) 953-6231

Past Usage:
1. Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., & Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press.
  
  The diagnostic, binary-valued variable investigated is whether the patient shows signs of diabetes according to World Health Organization criteria (i.e., if the 2 hour post-load plasma glucose was at least 200 mg/dl at any survey examination or if found during routine medical care). The population lives near Phoenix, Arizona, USA.
  
  Results: Their ADAP algorithm makes a real-valued prediction between 0 and 1. This was transformed into a binary decision using a cutoff of 0.448. Using 576 training instances, the sensitivity and specificity of their algorithm were both 76% on the remaining 192 instances.
Relevant Information: Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. ADAP is an adaptive learning routine that generates and executes digital analogs of perceptron-like devices. It is a unique algorithm; see the paper for details.
Number of Instances: 768
Number of Attributes: 8 plus class
For Each Attribute: (all numeric-valued)
  1. Number of times pregnant
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1)
Missing Attribute Values: Yes
Class Distribution: (class value 1 is interpreted as "tested positive for diabetes")

Class Value    Number of instances
0              500
1              268

Brief statistical analysis:

Attribute number:    Mean:   Standard Deviation:
1.                     3.8     3.4
2.                   120.9    32.0
3.                    69.1    19.4
4.                    20.5    16.0
5.                    79.8   115.2
6.                    32.0     7.9
7.                     0.5     0.3
8.                    33.2    11.8
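
The Results entry above describes turning a real-valued score into a binary decision with a cutoff of 0.448 and then reporting sensitivity and specificity. The following is a minimal sketch of that procedure in plain Python, not the ADAP algorithm itself. The scores and labels are made-up illustrative values, `apply_cutoff` and `sensitivity_specificity` are hypothetical helper names, and treating a score exactly at the cutoff as positive is an assumption.

```python
def apply_cutoff(scores, cutoff=0.448):
    """Map real-valued scores in [0, 1] to binary decisions."""
    # Assumption: a score exactly equal to the cutoff counts as positive.
    return [1 if s >= cutoff else 0 for s in scores]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels and decisions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Illustrative values only (not taken from the dataset or the paper).
scores = [0.10, 0.30, 0.46, 0.50, 0.90, 0.20, 0.70, 0.40]
labels = [0, 0, 0, 1, 1, 0, 1, 1]

decisions = apply_cutoff(scores)
sens, spec = sensitivity_specificity(labels, decisions)
print(decisions, sens, spec)
```

Sensitivity is the share of true positives that are detected, and specificity is the share of true negatives that are correctly rejected; moving the cutoff trades one against the other.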



In [1]:

    
import pandas as pd



In [2]:

    
diabetes = pd.read_csv('data/pima-indians-diabetes.csv')



In [3]:

    
diabetes.head()









    Out[3]:







  
    
      
   Number_pregnant  Glucose_concentration  Blood_pressure   Triceps   Insulin       BMI  Pedigree  Age  Class  Group
0                6               0.743719        0.590164  0.353535  0.000000  0.500745  0.234415   50      1      B
1                1               0.427136        0.540984  0.292929  0.000000  0.396423  0.116567   31      0      C
2                8               0.919598        0.524590  0.000000  0.000000  0.347243  0.253629   32      1      B
3                1               0.447236        0.540984  0.232323  0.111111  0.418778  0.038002   21      0      B
4                0               0.688442        0.327869  0.353535  0.198582  0.642325  0.943638   33      1      C



In [4]:

    
diabetes.columns









    Out[4]:





Index(['Number_pregnant', 'Glucose_concentration', 'Blood_pressure', 'Triceps',
       'Insulin', 'BMI', 'Pedigree', 'Age', 'Class', 'Group'],
      dtype='object')

Clean the Data



In [5]:

    
# Columns that will be normalized
cols_to_norm = ['Number_pregnant', 'Glucose_concentration', 'Blood_pressure', 'Triceps',
                'Insulin', 'BMI', 'Pedigree']



In [6]:

    
# Normalizing the columns
diabetes[cols_to_norm] = diabetes[cols_to_norm].apply(lambda x: (x - x.min()) / (x.max() - x.min()))
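
The lambda above applies min-max scaling, (x - min) / (max - min), to each listed column. A small sketch of the same formula in plain Python is shown below. `min_max_scale` is a hypothetical helper, the input list is illustrative, and the values 0 and 17 are included only because the normalized outputs in this notebook (for example 6 becoming 0.352941) are consistent with a column minimum of 0 and maximum of 17; that is an inference, not something stated in the source.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: map a constant column to 0.0 rather than divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative raw pregnancy counts; 0 and 17 act as the assumed min and max.
pregnancies = [6, 1, 8, 1, 0, 17]
scaled = min_max_scale(pregnancies)
print([round(s, 6) for s in scaled])
```

Note that the pandas expression in the notebook has no guard for a constant column; the guard here is an addition for illustration.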



In [7]:

    
diabetes.head()









    Out[7]:







  
    
      
   Number_pregnant  Glucose_concentration  Blood_pressure   Triceps   Insulin       BMI  Pedigree  Age  Class  Group
0         0.352941               0.743719        0.590164  0.353535  0.000000  0.500745  0.234415   50      1      B
1         0.058824               0.427136        0.540984  0.292929  0.000000  0.396423  0.116567   31      0      C
2         0.470588               0.919598        0.524590  0.000000  0.000000  0.347243  0.253629   32      1      B
3         0.058824               0.447236        0.540984  0.232323  0.111111  0.418778  0.038002   21      0      B
4         0.000000               0.688442        0.327869  0.353535  0.198582  0.642325  0.943638   33      1      C

Feature Columns



In [8]:

    
diabetes.columns









    Out[8]:





Index(['Number_pregnant', 'Glucose_concentration', 'Blood_pressure', 'Triceps',
       'Insulin', 'BMI', 'Pedigree', 'Age', 'Class', 'Group'],
      dtype='object')



In [9]:

    
import tensorflow as tf

Continuous Features

Number of times pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skin fold thickness (mm)
2-Hour serum insulin (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function



In [10]:

    
num_preg = tf.feature_column.numeric_column('Number_pregnant')
plasma_gluc = tf.feature_column.numeric_column('Glucose_concentration')
dias_press = tf.feature_column.numeric_column('Blood_pressure')
tricep = tf.feature_column.numeric_column('Triceps')
insulin = tf.feature_column.numeric_column('Insulin')
bmi = tf.feature_column.numeric_column('BMI')
diabetes_pedigree = tf.feature_column.numeric_column('Pedigree')
age = tf.feature_column.numeric_column('Age')

Categorical Features

If you know the full set of possible values for a column and there are only a few of them, use categorical_column_with_vocabulary_list. If you don't know the set of possible values in advance, use categorical_column_with_hash_bucket, which hashes each value into one of a fixed number of buckets.



In [11]:

    
assigned_group = tf.feature_column.categorical_column_with_vocabulary_list('Group',['A','B','C','D'])
# Alternative
# assigned_group = tf.feature_column.categorical_column_with_hash_bucket('Group', hash_bucket_size=10)
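
To make the difference between the two options concrete, here is a rough sketch in plain Python of what each one does conceptually. It does not call TensorFlow: `vocabulary_id` and `hash_bucket_id` are hypothetical helpers, MD5 is used only as a stand-in for TensorFlow's internal hash function, and returning -1 for an unseen value mirrors the documented default of the vocabulary-list column.

```python
import hashlib

VOCAB = ['A', 'B', 'C', 'D']


def vocabulary_id(value, vocab=VOCAB):
    """Index of the value in a known vocabulary, or -1 if it is unseen."""
    return vocab.index(value) if value in vocab else -1


def hash_bucket_id(value, num_buckets=10):
    """Deterministic bucket in [0, num_buckets); distinct values may collide."""
    # Assumption: MD5 stands in for the hash TensorFlow actually uses.
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_buckets


for group in ['A', 'B', 'C', 'D', 'E']:
    print(group, vocabulary_id(group), hash_bucket_id(group))
```

The vocabulary lookup gives every known value its own id but cannot place a new value such as 'E', whereas hashing always yields a bucket at the cost of possible collisions between different values.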

Converting Continuous to Categorical



In [12]:

    
import matplotlib.pyplot as plt
%matplotlib inline



In [13]:

    
diabetes['Age'].hist(bins = 20)









    Out[13]:





<matplotlib.axes._subplots.AxesSubplot at 0x29139a85898>



In [14]:

    
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[20, 30, 40, 50, 60, 70, 80])
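
The bucketized column above replaces each age with the index of the interval it falls into. A minimal sketch of that mapping using only the standard library follows; `bucket_index` is a hypothetical helper, and the convention that each boundary is the inclusive lower edge of its bucket is assumed to match TensorFlow's behaviour.

```python
from bisect import bisect_right

BOUNDARIES = [20, 30, 40, 50, 60, 70, 80]


def bucket_index(age, boundaries=BOUNDARIES):
    """Bucket 0 is below the first boundary; bucket i covers
    [boundaries[i-1], boundaries[i]); the last bucket is open-ended."""
    return bisect_right(boundaries, age)


ages = [50, 31, 32, 21, 33]  # the Age values shown by diabetes.head()
print([bucket_index(a) for a in ages])
print(len(BOUNDARIES) + 1)  # number of buckets
```

Seven boundaries produce eight buckets, so the model learns one weight per age range instead of a single weight for age as a number.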

Putting them together



In [15]:

    
feat_cols = [num_preg, plasma_gluc, dias_press, tricep, insulin, 
             bmi, diabetes_pedigree, assigned_group, age_buckets]

Train Test Split



In [16]:

    
diabetes.head()









    Out[16]:







  
    
      
   Number_pregnant  Glucose_concentration  Blood_pressure   Triceps   Insulin       BMI  Pedigree  Age  Class  Group
0         0.352941               0.743719        0.590164  0.353535  0.000000  0.500745  0.234415   50      1      B
1         0.058824               0.427136        0.540984  0.292929  0.000000  0.396423  0.116567   31      0      C
2         0.470588               0.919598        0.524590  0.000000  0.000000  0.347243  0.253629   32      1      B
3         0.058824               0.447236        0.540984  0.232323  0.111111  0.418778  0.038002   21      0      B
4         0.000000               0.688442        0.327869  0.353535  0.198582  0.642325  0.943638   33      1      C



In [17]:

    
diabetes.info()









    



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 10 columns):
Number_pregnant          768 non-null float64
Glucose_concentration    768 non-null float64
Blood_pressure           768 non-null float64
Triceps                  768 non-null float64
Insulin                  768 non-null float64
BMI                      768 non-null float64
Pedigree                 768 non-null float64
Age                      768 non-null int64
Class                    768 non-null int64
Group                    768 non-null object
dtypes: float64(7), int64(2), object(1)
memory usage: 60.1+ KB



In [18]:

    
# Drop the label column 'Class' so that x_data holds only the features
x_data = diabetes.drop('Class',axis = 1)



In [19]:

    
labels = diabetes['Class']



In [20]:

    
from sklearn.model_selection import train_test_split



In [21]:

    
# Train/test split: hold out 33% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(x_data,
                                                    labels,
                                                    test_size = 0.33, 
                                                    random_state = 101)
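
As a quick sanity check on the split sizes, the arithmetic can be sketched as follows. It assumes the test share is rounded up to a whole number of rows; that assumption is consistent with the evaluation output later in this notebook, where label/mean = 0.34251967 is approximately 87/254.

```python
import math

n_rows = 768       # rows in the dataset, per diabetes.info()
test_size = 0.33

# Assumption: the test share is rounded up to a whole row count.
n_test = math.ceil(test_size * n_rows)
n_train = n_rows - n_test
print(n_train, n_test)
```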

Input Function



In [22]:

    
input_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                 y = y_train,
                                                 batch_size = 10, 
                                                 num_epochs = 1000,
                                                 shuffle = True)
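
The input function sets batch_size = 10 and num_epochs = 1000, while the training call in this notebook uses steps = 1000. Training ends at whichever limit is reached first, so it can help to check which one binds. The sketch below does that arithmetic; the figure of 514 training rows is an assumption derived from a 67/33 split of 768 rows.

```python
n_train = 514      # assumption: rows in X_train after the split
batch_size = 10
num_epochs = 1000
steps = 1000       # value passed to model.train in this notebook

examples_available = n_train * num_epochs   # what the input function can supply
examples_requested = steps * batch_size     # what 1000 training steps consume
passes_over_data = examples_requested / n_train

print(examples_requested <= examples_available)
print(round(passes_over_data, 2))
```

Under these assumptions the step limit is reached long before the input runs out, and the model sees each training row roughly 19 times.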

Creating the Model



In [23]:

    
model = tf.estimator.LinearClassifier(feature_columns = feat_cols, 
                                      n_classes = 2)









    



INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpod8f4b08', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002913C0D5438>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}



In [24]:

    
model.train(input_fn = input_func,
            steps = 1000)









    



INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08\model.ckpt.
INFO:tensorflow:loss = 6.931472, step = 0
INFO:tensorflow:global_step/sec: 71.5283
INFO:tensorflow:loss = 6.5315638, step = 100 (1.401 sec)
INFO:tensorflow:global_step/sec: 88.4107
INFO:tensorflow:loss = 5.0967355, step = 200 (1.132 sec)
INFO:tensorflow:global_step/sec: 89.5847
INFO:tensorflow:loss = 4.257456, step = 300 (1.116 sec)
INFO:tensorflow:global_step/sec: 88.1773
INFO:tensorflow:loss = 5.143954, step = 400 (1.135 sec)
INFO:tensorflow:global_step/sec: 89.3968
INFO:tensorflow:loss = 7.7543244, step = 500 (1.117 sec)
INFO:tensorflow:global_step/sec: 88.8022
INFO:tensorflow:loss = 3.931779, step = 600 (1.126 sec)
INFO:tensorflow:global_step/sec: 90.1316
INFO:tensorflow:loss = 4.8705893, step = 700 (1.109 sec)
INFO:tensorflow:global_step/sec: 88.105
INFO:tensorflow:loss = 6.636447, step = 800 (1.136 sec)
INFO:tensorflow:global_step/sec: 71.8223
INFO:tensorflow:loss = 4.5513725, step = 900 (1.394 sec)
INFO:tensorflow:Saving checkpoints for 1000 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08\model.ckpt.
INFO:tensorflow:Loss for final step: 4.565303.






    Out[24]:





<tensorflow.python.estimator.canned.linear.LinearClassifier at 0x2913c0c8f28>



In [25]:

    
# Useful link when adapting this to your own data (constraints on TensorFlow scope names)
# https://stackoverflow.com/questions/44664285/what-are-the-contraints-for-tensorflow-scope-names
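
The first line of the training log above reports loss = 6.931472 at step 0. That value can be reproduced with a short calculation, sketched below under two assumptions: the linear model starts with all weights at zero, and the logged loss is the sum of the per-example cross-entropy over a batch of 10 rather than the mean.

```python
import math

batch_size = 10

# Assumption: with all weights at zero the logit is 0, so the predicted
# probability of the positive class is 0.5.
p = 1.0 / (1.0 + math.exp(-0.0))

# Binary cross-entropy of one example is -log(0.5) = log(2) for either label.
per_example_loss = -math.log(p)

# Assumption: the logged loss is summed over the batch.
batch_loss = batch_size * per_example_loss
print(round(batch_loss, 6))
```

This also explains why the logged training loss is roughly ten times larger than the average_loss reported during evaluation.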

Evaluation



In [26]:

    
eval_input_func = tf.estimator.inputs.pandas_input_fn(
    x = X_test,
    y = y_test,
    batch_size = 10,
    num_epochs = 1,
    shuffle = False)



In [27]:

    
results = model.evaluate(eval_input_func)









    



INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-05-07-12:47:26
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-05-07-12:47:27
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.7322835, accuracy_baseline = 0.65748036, auc = 0.78271043, auc_precision_recall = 0.616137, average_loss = 0.53667676, global_step = 1000, label/mean = 0.34251967, loss = 5.2429194, prediction/mean = 0.3794611



In [28]:

    
results









    Out[28]:





{'accuracy': 0.7322835,
 'accuracy_baseline': 0.65748036,
 'auc': 0.78271043,
 'auc_precision_recall': 0.616137,
 'average_loss': 0.53667676,
 'global_step': 1000,
 'label/mean': 0.34251967,
 'loss': 5.2429194,
 'prediction/mean': 0.3794611}
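
The entries of the results dictionary above appear to be related by simple arithmetic. The sketch below checks this, assuming 254 test rows (33% of 768, rounded up) evaluated in batches of 10, and assuming that 'loss' is the summed loss averaged over batches while 'average_loss' is the mean per example.

```python
import math

n_test = 254          # assumption: rows in the test split
batch_size = 10
n_batches = math.ceil(n_test / batch_size)   # the last batch is partial

average_loss = 0.53667676                    # from the results dict
loss_per_batch = average_loss * n_test / n_batches
print(n_batches, round(loss_per_batch, 4))

# label/mean is the share of positive labels in the test split.
positives = round(0.34251967 * n_test)
# accuracy_baseline is what always predicting the majority class would score.
baseline = 1 - positives / n_test
# accuracy times the row count gives the number of correct predictions.
correct = round(0.7322835 * n_test)
print(positives, round(baseline, 6), correct)
```

Read this way, the linear model classifies 186 of 254 test rows correctly, against 167 for the majority-class baseline.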

Predictions



In [29]:

    
pred_input_func = tf.estimator.inputs.pandas_input_fn(
      x = X_test,
      batch_size = 10,
      num_epochs = 1,
      shuffle = False)



In [30]:

    
# predict() returns a generator, so wrap it in list() to materialize the results
predictions = model.predict(pred_input_func)



In [31]:

    
list(predictions)[0:5]









    



INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpod8f4b08\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.






    Out[31]:





[{'class_ids': array([1], dtype=int64),
  'classes': array([b'1'], dtype=object),
  'logistic': array([0.5530699], dtype=float32),
  'logits': array([0.2130822], dtype=float32),
  'probabilities': array([0.4469301, 0.5530699], dtype=float32)},
 {'class_ids': array([1], dtype=int64),
  'classes': array([b'1'], dtype=object),
  'logistic': array([0.6266913], dtype=float32),
  'logits': array([0.5180483], dtype=float32),
  'probabilities': array([0.37330872, 0.6266913 ], dtype=float32)},
 {'class_ids': array([0], dtype=int64),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.40525192], dtype=float32),
  'logits': array([-0.3836289], dtype=float32),
  'probabilities': array([0.5947481 , 0.40525195], dtype=float32)},
 {'class_ids': array([0], dtype=int64),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.34092832], dtype=float32),
  'logits': array([-0.65916], dtype=float32),
  'probabilities': array([0.6590717 , 0.34092832], dtype=float32)},
 {'class_ids': array([0], dtype=int64),
  'classes': array([b'0'], dtype=object),
  'logistic': array([0.16000484], dtype=float32),
  'logits': array([-1.6581922], dtype=float32),
  'probabilities': array([0.83999515, 0.16000482], dtype=float32)}]
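
The fields of each prediction above are related to one another: 'logistic' is the sigmoid of 'logits', 'probabilities' is the pair (1 - p, p), and 'class_ids' is 1 when p exceeds 0.5. The following sketch reproduces the first and last predictions from their logits. The 0.5 decision threshold is an assumption that matches the values shown.

```python
import math


def sigmoid(z):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def describe(logit):
    """Return (probabilities, class_id) for a single binary logit."""
    p_positive = sigmoid(logit)
    # Assumption: class 1 is chosen when its probability exceeds 0.5.
    class_id = 1 if p_positive > 0.5 else 0
    return [1.0 - p_positive, p_positive], class_id


first = describe(0.2130822)    # 'logits' of the first prediction
last = describe(-1.6581922)    # 'logits' of the fifth prediction
print(first)
print(last)
```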

DNN Classifier



In [32]:

    
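# Note: feat_cols still holds the raw categorical column assigned_group at this point.
# A DNN needs dense inputs, so the cells below rebuild feat_cols with an embedding column
# and create the model again.
dnn_model = tf.estimator.DNNClassifier(hidden_units=[10, 10, 10],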
                                       feature_columns = feat_cols,
                                       n_classes = 2)









    



INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgk7n0b5i
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpgk7n0b5i', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002913C4DDA20>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}



In [33]:

    
# Wrap the categorical 'Group' column (A, B, C, D) in a 4-dimensional embedding column
# so that the DNN receives a dense input
embedded_group_column = tf.feature_column.embedding_column(assigned_group, 
                                                           dimension = 4)
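
Conceptually, an embedding column maps each category to a row of a small table of numbers that is learned during training. The sketch below illustrates the lookup in plain Python and shows that it is equivalent to multiplying a one-hot vector by the table. The table values are random placeholders, and `one_hot`, `embed`, and `one_hot_times_table` are hypothetical helpers, not TensorFlow functions.

```python
import random

VOCAB = ['A', 'B', 'C', 'D']
DIMENSION = 4

# Placeholder values; in a real model these entries are learned.
random.seed(0)
embedding_table = [[random.uniform(-0.5, 0.5) for _ in range(DIMENSION)]
                   for _ in VOCAB]


def one_hot(value):
    """Sparse representation: 1.0 at the category's position, 0.0 elsewhere."""
    return [1.0 if v == value else 0.0 for v in VOCAB]


def embed(value):
    """Dense representation: the table row for this category."""
    return embedding_table[VOCAB.index(value)]


def one_hot_times_table(value):
    """Same result as embed(), computed as a vector-matrix product."""
    weights = one_hot(value)
    return [sum(w * row[j] for w, row in zip(weights, embedding_table))
            for j in range(DIMENSION)]


print(one_hot('B'))
print(embed('B'))
```

With only four groups an embedding of dimension 4 is no smaller than the one-hot encoding; the approach pays off mainly when the vocabulary is large.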



In [34]:

    
feat_cols = [num_preg, plasma_gluc, dias_press, tricep, insulin, 
             bmi, diabetes_pedigree, embedded_group_column, age_buckets]



In [35]:

    
input_func = tf.estimator.inputs.pandas_input_fn(x = X_train,
                                                 y = y_train,
                                                 batch_size = 10,
                                                 num_epochs = 1000,
                                                 shuffle = True)



In [36]:

    
dnn_model = tf.estimator.DNNClassifier(hidden_units=[10, 10, 10],
                                       feature_columns = feat_cols,
                                       n_classes = 2)









    



INFO:tensorflow:Using default config.
WARNING:tensorflow:Using temporary folder as model directory: C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgfrl367r
INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\ARCYFE~1\\AppData\\Local\\Temp\\tmpgfrl367r', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x000002913C4DD6A0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}



In [37]:

    
dnn_model.train(input_fn = input_func,
                steps = 1000)









    



INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgfrl367r\model.ckpt.
INFO:tensorflow:loss = 6.810172, step = 0
INFO:tensorflow:global_step/sec: 95.2824
INFO:tensorflow:loss = 7.896927, step = 100 (1.052 sec)
INFO:tensorflow:global_step/sec: 121.17
INFO:tensorflow:loss = 6.4979115, step = 200 (0.826 sec)
INFO:tensorflow:global_step/sec: 119.37
INFO:tensorflow:loss = 7.4552755, step = 300 (0.837 sec)
INFO:tensorflow:global_step/sec: 118.46
INFO:tensorflow:loss = 5.6559567, step = 400 (0.844 sec)
INFO:tensorflow:global_step/sec: 117.635
INFO:tensorflow:loss = 6.534105, step = 500 (0.850 sec)
INFO:tensorflow:global_step/sec: 117.07
INFO:tensorflow:loss = 2.6847029, step = 600 (0.854 sec)
INFO:tensorflow:global_step/sec: 117.327
INFO:tensorflow:loss = 3.3632967, step = 700 (0.852 sec)
INFO:tensorflow:global_step/sec: 120.02
INFO:tensorflow:loss = 2.3451052, step = 800 (0.833 sec)
INFO:tensorflow:global_step/sec: 120.157
INFO:tensorflow:loss = 3.9158137, step = 900 (0.832 sec)
INFO:tensorflow:Saving checkpoints for 1000 into C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgfrl367r\model.ckpt.
INFO:tensorflow:Loss for final step: 3.1989439.






    Out[37]:





<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x2913c4dd240>



In [38]:

    
eval_input_func = tf.estimator.inputs.pandas_input_fn(
      x = X_test,
      y = y_test,
      batch_size = 10,
      num_epochs = 1,
      shuffle = False)



In [39]:

    
dnn_model.evaluate(eval_input_func)









    



INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-05-07-12:47:41
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from C:\Users\ARCYFE~1\AppData\Local\Temp\tmpgfrl367r\model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-05-07-12:47:43
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.7322835, accuracy_baseline = 0.65748036, auc = 0.80122507, auc_precision_recall = 0.6041116, average_loss = 0.5160169, global_step = 1000, label/mean = 0.34251967, loss = 5.0410886, prediction/mean = 0.3570464






    Out[39]:





{'accuracy': 0.7322835,
 'accuracy_baseline': 0.65748036,
 'auc': 0.80122507,
 'auc_precision_recall': 0.6041116,
 'average_loss': 0.5160169,
 'global_step': 1000,
 'label/mean': 0.34251967,
 'loss': 5.0410886,
 'prediction/mean': 0.3570464}
