In this notebook, we illustrate the basics of logistic regression using TensorFlow, on the Pima Indians Diabetes dataset from the UCI Machine Learning Repository.
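For a quick refresher: logistic regression models the probability of the positive class as the logistic (sigmoid) function of a linear score in the features, and the weights are fit by minimizing the average cross-entropy between the predicted probabilities and the observed labels:

\hat{p}(x) = 1 / (1 + exp(-w . x))

loss(w) = -mean[ y ln \hat{p}(x) + (1 - y) ln(1 - \hat{p}(x)) ]

(No intercept term is used in this notebook.) This is what the TensorFlow graph built below computes, with plain gradient descent as the minimizer.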
In [1]:
import numpy as np
import pandas as pd
%pylab inline
pylab.style.use('ggplot')
In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
pima_df = pd.read_csv(url, header=None)
We have now read the data from the UCI archive into a DataFrame; all attributes are numeric-valued, and the last column is the class label. Next, we give the columns descriptive names.
In [3]:
pima_df.columns = ['n_pregnant', 'glucose_conc', 'bp',
                   'skin_fold_thickness', 'serum_insulin', 'bmi',
                   'diabetes_ped_func', 'age', 'has_diabetes']
In [4]:
pima_df.head()
Out[4]:
In [5]:
# Map the values in the target column: 1 -> 'yes', 0 -> 'no'
pima_df = pima_df.assign(has_diabetes=pima_df.has_diabetes.map(lambda v: 'yes' if v == 1 else 'no'))
In [6]:
pima_df.head()
Out[6]:
As always, check for class imbalance.
In [7]:
pima_df.has_diabetes.value_counts().plot(kind='bar')
Out[7]:
In [8]:
len(pima_df)
Out[8]:
Next, we set aside 20 positive and 20 negative cases as our test set. We'll use the rest of the data as our training set.
In [9]:
test_set = pima_df.groupby(pima_df.has_diabetes).apply(lambda g: g.sample(20))
# Groupby creates a multi-index with the label name as the first level
test_set.index = test_set.index.droplevel(0)
train_set = pima_df.loc[pima_df.index.difference(test_set.index)]
print(len(test_set), len(train_set))
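Note that DataFrame.sample draws a fresh random sample on every run, so the exact test set (and the final accuracy) will vary between runs. If a reproducible split is wanted, one possible variant is to pass a fixed seed (the value 42 below is arbitrary):

# Variant: fix the random seed so the 20/20 test split is reproducible.
test_set = pima_df.groupby(pima_df.has_diabetes).apply(lambda g: g.sample(20, random_state=42))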
Next, we scale the features so that each has zero mean and unit variance. The helper below performs the scaling and builds a feed_dict of features and labels for TensorFlow from a DataFrame.
In [10]:
from sklearn.preprocessing import StandardScaler
def build_input(f_key, l_key, df):
    """Return a `feed_dict` suitable for TensorFlow consumption."""
    features = df.drop('has_diabetes', axis=1).astype(np.float64).values
    scaled_features = StandardScaler().fit_transform(features)
    labels = df.has_diabetes.map(lambda v: 1 if v == 'yes' else 0).astype(np.int64).values
    return {f_key: scaled_features, l_key: labels}
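This helper fits a fresh StandardScaler on whichever DataFrame it is given, so the training and test sets are each scaled with their own statistics. A common alternative, sketched below with illustrative variable names, is to fit the scaler on the training features only and reuse it for the test features:

# Sketch: derive scaling statistics from the training split only,
# then apply the same transformation to the test split.
train_features = train_set.drop('has_diabetes', axis=1).astype(np.float64).values
test_features = test_set.drop('has_diabetes', axis=1).astype(np.float64).values
scaler = StandardScaler().fit(train_features)
train_scaled = scaler.transform(train_features)
test_scaled = scaler.transform(test_features)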
In [11]:
import tensorflow as tf
from IPython.display import display
import ipywidgets
In [13]:
n = 2000
pg = ipywidgets.FloatProgress(min=1, max=n, description='training...')
display(pg)
tf.reset_default_graph()
weights = tf.get_variable(dtype=np.float64,
                          name='weights',
                          shape=(8, 1),
                          initializer=tf.truncated_normal_initializer(mean=0.0, stddev=1.0))
features = tf.placeholder(shape=(None, 8), dtype=np.float64, name='features')
labels = tf.placeholder(shape=None, dtype=np.int64, name='labels')
# dot(w, X)
w_times_f = tf.matmul(features, weights, name='w_dot_x')
# \hat{p}(x) = 1 / (1 + exp(-dot(w, x))) -- the logistic (sigmoid) function
probs = tf.squeeze(1.0 / (1.0 + tf.exp(-w_times_f)), name='logistic_function')
# 1 - \hat{p}(x) and 1 - y, needed for the cross-entropy loss below
one = tf.constant(1.0, dtype=np.float64)
one_minus_probs = tf.subtract(one, probs, name='one_minus_probs')
one_minus_labels = tf.subtract(one, tf.cast(labels, np.float64), name='one_minus_y')
# Cross-entropy loss = -[ y * ln(\hat{p}(x)) + (1 - y) * ln(1 - \hat{p}(x)) ]
cross_entropy_loss = -tf.add(tf.multiply(tf.cast(labels, np.float64), tf.log(probs)),
                             tf.multiply(one_minus_labels, tf.log(one_minus_probs)),
                             name='cross_entropy_loss')
loss_function = tf.reduce_mean(cross_entropy_loss, name='loss_function')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.02)
train_op = optimizer.minimize(loss_function, name='loss_minimizer')
losses = []
with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    train_input = build_input(features, labels, train_set)
    for i in range(1, n + 1):
        _, current_loss = s.run([train_op, loss_function], feed_dict=train_input)
        pg.value += 1
        losses.append(current_loss)
    pg.bar_style = 'success'
    pg.description = 'done.'
    # Evaluate our trained model on the test set
    test_input = build_input(features, labels, test_set)
    test_probs = probs.eval(session=s, feed_dict=test_input)
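As an aside, the trained coefficients can be read back out while the session is still open, which gives a rough sense of how each (scaled) feature pushes the predicted probability. A minimal sketch, to be placed inside the same with tf.Session() as s: block (variable names below are illustrative):

# Sketch: inspect the learned weights by feature name.
learned_w = np.squeeze(weights.eval(session=s))
feature_names = train_set.drop('has_diabetes', axis=1).columns
print(pd.Series(learned_w, index=feature_names).sort_values())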
Once the training loop is complete, we plot the loss as a function of the number of training steps, smoothed with a 10-step rolling mean.
In [14]:
loss_vals = pd.Series(losses)
loss_vals.rolling(10).mean().plot()
Out[14]:
Finally, we calculate the accuracy of our predictions on the test set.
In [15]:
from sklearn.metrics import accuracy_score
test_preds = np.where(test_probs > 0.5, 1.0, 0.0).astype(np.int64)
test_labels = test_input[labels]
accuracy_score(test_labels, test_preds)
Out[15]:
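Raw accuracy hides which kinds of errors the model makes; a confusion matrix and per-class precision/recall give a fuller picture. One possible follow-up, using scikit-learn:

# Sketch: per-class breakdown of the test-set predictions.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(test_labels, test_preds))
print(classification_report(test_labels, test_preds, target_names=['no', 'yes']))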