Logistic Regression in TensorFlow

In this notebook, we illustrate the basics of logistic regression with TensorFlow, using the Pima Indians Diabetes dataset from the UCI Machine Learning Repository.


In [1]:
import numpy as np
import pandas as pd
%pylab inline
pylab.style.use('ggplot')


Populating the interactive namespace from numpy and matplotlib

In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
pima_df = pd.read_csv(url, header=None)

Feature Information

For each attribute (all numeric-valued):

  1. Number of times pregnant
  2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. Diastolic blood pressure (mm Hg)
  4. Triceps skin fold thickness (mm)
  5. 2-Hour serum insulin (mu U/ml)
  6. Body mass index (weight in kg/(height in m)^2)
  7. Diabetes pedigree function
  8. Age (years)
  9. Class variable (0 or 1)

Having read the data from the UCI archive into a DataFrame above, we give the columns descriptive names.


In [3]:
pima_df.columns = ['n_pregnant', 'glucose_conc', 'bp', 
                   'skin_fold_thickness', 'serum_insulin', 'bmi', 'diabetes_ped_func', 
                   'age', 'has_diabetes']

In [4]:
pima_df.head()


Out[4]:
   n_pregnant  glucose_conc  bp  skin_fold_thickness  serum_insulin   bmi  diabetes_ped_func  age  has_diabetes
0           6           148  72                   35              0  33.6              0.627   50             1
1           1            85  66                   29              0  26.6              0.351   31             0
2           8           183  64                    0              0  23.3              0.672   32             1
3           1            89  66                   23             94  28.1              0.167   21             0
4           0           137  40                   35            168  43.1              2.288   33             1

In [5]:
# Map the values in the target column: 1 -> 'yes', 0 -> 'no'
pima_df = pima_df.assign(has_diabetes=pima_df.has_diabetes.map(lambda v: 'yes' if v == 1 else 'no'))

In [6]:
pima_df.head()


Out[6]:
   n_pregnant  glucose_conc  bp  skin_fold_thickness  serum_insulin   bmi  diabetes_ped_func  age  has_diabetes
0           6           148  72                   35              0  33.6              0.627   50           yes
1           1            85  66                   29              0  26.6              0.351   31            no
2           8           183  64                    0              0  23.3              0.672   32           yes
3           1            89  66                   23             94  28.1              0.167   21            no
4           0           137  40                   35            168  43.1              2.288   33           yes

As always, we check for class imbalance in the target variable.


In [7]:
pima_df.has_diabetes.value_counts().plot(kind='bar')


Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x22359442320>

In [8]:
len(pima_df)


Out[8]:
768

Next, we set aside 20 positive and 20 negative cases as our test set. We'll use the rest of the data as our training set.


In [9]:
test_set = pima_df.groupby(pima_df.has_diabetes).apply(lambda g: g.sample(20))

# Groupby creates a multi-index with the label name as the first level
test_set.index = test_set.index.droplevel(0)
train_set = pima_df.loc[pima_df.index.difference(test_set.index)]

print(len(test_set), len(train_set))


40 728
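
Note that g.sample(20) draws a different test set every time the notebook is run, so the accuracy reported at the end will vary from run to run. If reproducibility matters, one option (an assumption on our part, not something the cell above does) is to pass a seed:

test_set = pima_df.groupby(pima_df.has_diabetes).apply(lambda g: g.sample(20, random_state=0))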

Next, we define a helper that standardizes the features (zero mean, unit variance) and packages them, together with the numeric labels, into a feed_dict for TensorFlow.


In [10]:
from sklearn.preprocessing import StandardScaler

def build_input(f_key, l_key, df):
    """Return a `feed_dict` suitable for tensorflow consumption."""
    
    features = df.drop('has_diabetes', axis=1).astype(np.float64).values
    scaled_features = StandardScaler().fit_transform(features)
    labels = df.has_diabetes.map(lambda v: 1 if v == 'yes' else 0).astype(np.int64).values    
    return {f_key: scaled_features, l_key: labels}
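
Note that build_input fits a fresh StandardScaler on whichever DataFrame it receives, so the test set is later standardized with its own statistics rather than the training set's. A minimal alternative sketch (a hypothetical helper, not what the cells below use) would fit the scaler once on the training features and reuse it for the test features:

scaler = StandardScaler()

def build_input_shared_scaler(f_key, l_key, df, fit=False):
    """Like `build_input`, but reuses a single scaler fitted on the training data."""
    feats = df.drop('has_diabetes', axis=1).astype(np.float64).values
    # Fit only when asked (i.e. on the training frame); otherwise apply the stored mean/variance.
    scaled = scaler.fit_transform(feats) if fit else scaler.transform(feats)
    lbls = df.has_diabetes.map(lambda v: 1 if v == 'yes' else 0).astype(np.int64).values
    return {f_key: scaled, l_key: lbls}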

In [11]:
import tensorflow as tf
from IPython.display import display
import ipywidgets
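
For reference, the next cell hand-codes the two pieces of logistic regression. Written to match the code below (the exponent uses $+w^\top x$ rather than the more common $-w^\top x$; the difference is just a sign flip absorbed by the learned weights), the model and the loss being minimized are

$$\hat{p}(x) = \frac{1}{1 + e^{w^\top x}}, \qquad
\text{loss}(w) = -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \ln \hat{p}(x_i) + (1 - y_i)\,\ln\big(1 - \hat{p}(x_i)\big) \Big],$$

where $y_i \in \{0, 1\}$ is the has_diabetes label and $N$ is the number of training rows.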

In [13]:
n = 2000

pg = ipywidgets.FloatProgress(min=1, max=n, description='training...')
display(pg)

tf.reset_default_graph()

weights = tf.get_variable(dtype=np.float64,
                          name='weights',
                          shape=(8, 1),
                          initializer=tf.truncated_normal_initializer(mean=0.0, stddev=1.0))

features = tf.placeholder(shape=(None, 8), dtype=np.float64, name='features')
labels = tf.placeholder(shape=None, dtype=np.int64, name='labels')

# dot(w, X)
w_times_f = tf.matmul(features, weights, name='w_dot_x')

# \hat{P}(X) = 1 / (1 + exp(dot(w, X)))
probs = tf.squeeze(1.0 / (1.0 + tf.exp(w_times_f)), name='logistic_function')

# 1 - \hat{P}(X) and 1 - y
one = tf.constant(1.0, dtype=np.float64)
one_minus_probs = tf.subtract(one, probs, name='one_minus_probs')
one_minus_labels = tf.subtract(one, tf.cast(labels, np.float64), name='one_minus_y')

# Cross-entropy loss = - [ y * ln \hat{P}(X) + (1 - y) * ln (1 - \hat{P}(X)) ]
cross_entropy_loss = -tf.add(tf.multiply(tf.cast(labels, np.float64), tf.log(probs)),
                             tf.multiply(one_minus_labels, tf.log(one_minus_probs)),
                             name='cross_entropy_loss')
loss_function = tf.reduce_mean(cross_entropy_loss, name='loss_function')

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.02)
train_op = optimizer.minimize(loss_function, name='loss_minimizer')

losses = []

with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    
    train_input = build_input(features, labels, train_set)
    
    for i in range(1, n+1):
        _, current_loss = s.run([train_op, loss_function], feed_dict=train_input)
        pg.value += 1
        losses.append(current_loss)
    
    pg.bar_style = 'success'
    pg.description = 'done.'
    
    # Evaluate our trained model on the test set
    test_input = build_input(features, labels, test_set)
    test_probs = probs.eval(session=s, feed_dict=test_input)


Once the training loop is complete, we plot a 10-step rolling mean of the loss against the training step.


In [14]:
loss_vals = pd.Series(losses)
loss_vals.rolling(10).mean().plot()


Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x2235c1ce278>

Finally, we calculate the accuracy of the predictions on the test set.


In [15]:
from sklearn.metrics import accuracy_score

test_preds = np.where(test_probs > 0.5, 1.0, 0.0).astype(np.int64)
test_labels = test_input[labels]
accuracy_score(test_labels, test_preds)


Out[15]:
0.80000000000000004
