In this notebook, we illustrate the basics of logistic regression using TensorFlow, on the Pima Indians Diabetes dataset from the UCI Machine Learning Repository.
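For a quick refresher: logistic regression models the probability of the positive class as the logistic (sigmoid) function of a linear score in the features, and the weights are fit by minimizing the average cross-entropy between the predicted probabilities and the observed labels:

\hat{p}(x) = 1 / (1 + exp(-w . x))

loss(w) = -mean[ y ln \hat{p}(x) + (1 - y) ln(1 - \hat{p}(x)) ]

(No intercept term is used in this notebook.) This is what the TensorFlow graph built below computes, with plain gradient descent as the minimizer.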
In [1]:
import numpy as np
import pandas as pd
%pylab inline
pylab.style.use('ggplot')
In [2]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
pima_df = pd.read_csv(url, header=None)
We have now read the data from the UCI archive into a DataFrame; all attributes are numeric-valued, and the last column is the class label. Next, we give the columns descriptive names.
In [3]:
pima_df.columns = ['n_pregnant', 'glucose_conc', 'bp',
                   'skin_fold_thickness', 'serum_insulin', 'bmi',
                   'diabetes_ped_func', 'age', 'has_diabetes']
In [4]:
pima_df.head()
Out[4]:
In [5]:
# Map the values in the target column: 1 -> 'yes', 0 -> 'no'
pima_df = pima_df.assign(has_diabetes=pima_df.has_diabetes.map(lambda v: 'yes' if v == 1 else 'no'))
In [6]:
pima_df.head()
Out[6]:
As always, check for class imbalance.
In [7]:
pima_df.has_diabetes.value_counts().plot(kind='bar')
Out[7]:
In [8]:
len(pima_df)
Out[8]:
Next, we set aside 20 positive and 20 negative cases as our test set. We'll use the rest of the data as our training set.
In [9]:
test_set = pima_df.groupby(pima_df.has_diabetes).apply(lambda g: g.sample(20))
# Groupby creates a multi-index with the label name as the first level
test_set.index = test_set.index.droplevel(0)
train_set = pima_df.loc[pima_df.index.difference(test_set.index)]
print(len(test_set), len(train_set))
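Note that DataFrame.sample draws a fresh random sample on every run, so the exact test set (and the final accuracy) will vary between runs. If a reproducible split is wanted, one possible variant is to pass a fixed seed (the value 42 below is arbitrary):

# Variant: fix the random seed so the 20/20 test split is reproducible.
test_set = pima_df.groupby(pima_df.has_diabetes).apply(lambda g: g.sample(20, random_state=42))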
Next, we scale the features so that each has zero mean and unit variance. The helper below performs the scaling and builds a feed_dict of features and labels for TensorFlow from a DataFrame.
In [10]:
from sklearn.preprocessing import StandardScaler
def build_input(f_key, l_key, df):
    """Return a `feed_dict` suitable for TensorFlow consumption."""
    features = df.drop('has_diabetes', axis=1).astype(np.float64).values
    scaled_features = StandardScaler().fit_transform(features)
    labels = df.has_diabetes.map(lambda v: 1 if v == 'yes' else 0).astype(np.int64).values
    return {f_key: scaled_features, l_key: labels}
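This helper fits a fresh StandardScaler on whichever DataFrame it is given, so the training and test sets are each scaled with their own statistics. A common alternative, sketched below with illustrative variable names, is to fit the scaler on the training features only and reuse it for the test features:

# Sketch: derive scaling statistics from the training split only,
# then apply the same transformation to the test split.
train_features = train_set.drop('has_diabetes', axis=1).astype(np.float64).values
test_features = test_set.drop('has_diabetes', axis=1).astype(np.float64).values
scaler = StandardScaler().fit(train_features)
train_scaled = scaler.transform(train_features)
test_scaled = scaler.transform(test_features)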
In [11]:
import tensorflow as tf
from IPython.display import display
import ipywidgets
In [13]:
n = 2000
pg = ipywidgets.FloatProgress(min=1, max=n, description='training...')
display(pg)
tf.reset_default_graph()
weights = tf.get_variable(dtype=np.float64,
                          name='weights',
                          shape=(8, 1),
                          initializer=tf.truncated_normal_initializer(mean=0.0, stddev=1.0))
features = tf.placeholder(shape=(None, 8), dtype=np.float64, name='features')
labels = tf.placeholder(shape=None, dtype=np.int64, name='labels')
# dot(w, X)
w_times_f = tf.matmul(features, weights, name='w_dot_x')
# \hat{p}(x) = 1 / (1 + exp(-dot(w, x))) -- the logistic (sigmoid) function
probs = tf.squeeze(1.0 / (1.0 + tf.exp(-w_times_f)), name='logistic_function')
# 1 - \hat{p}(x) and 1 - y, needed for the cross-entropy loss below
one = tf.constant(1.0, dtype=np.float64)
one_minus_probs = tf.subtract(one, probs, name='one_minus_probs')
one_minus_labels = tf.subtract(one, tf.cast(labels, np.float64), name='one_minus_y')
# Cross-entropy loss = -[ y * ln(\hat{p}(x)) + (1 - y) * ln(1 - \hat{p}(x)) ]
cross_entropy_loss = -tf.add(tf.multiply(tf.cast(labels, np.float64), tf.log(probs)),
                             tf.multiply(one_minus_labels, tf.log(one_minus_probs)),
                             name='cross_entropy_loss')
loss_function = tf.reduce_mean(cross_entropy_loss, name='loss_function')
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.02)
train_op = optimizer.minimize(loss_function, name='loss_minimizer')
losses = []
with tf.Session() as s:
    s.run(tf.global_variables_initializer())
    train_input = build_input(features, labels, train_set)
    for i in range(1, n + 1):
        _, current_loss = s.run([train_op, loss_function], feed_dict=train_input)
        pg.value += 1
        losses.append(current_loss)
    pg.bar_style = 'success'
    pg.description = 'done.'
    # Evaluate our trained model on the test set
    test_input = build_input(features, labels, test_set)
    test_probs = probs.eval(session=s, feed_dict=test_input)
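As an aside, the trained coefficients can be read back out while the session is still open, which gives a rough sense of how each (scaled) feature pushes the predicted probability. A minimal sketch, to be placed inside the same with tf.Session() as s: block (variable names below are illustrative):

# Sketch: inspect the learned weights by feature name.
learned_w = np.squeeze(weights.eval(session=s))
feature_names = train_set.drop('has_diabetes', axis=1).columns
print(pd.Series(learned_w, index=feature_names).sort_values())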
Once the training loop is complete, we plot the loss as a function of the number of training steps, smoothed with a 10-step rolling mean.
In [14]:
loss_vals = pd.Series(losses)
loss_vals.rolling(10).mean().plot()
Out[14]:
Finally, we calculate the accuracy of our predictions on the test set.
In [15]:
from sklearn.metrics import accuracy_score
test_preds = np.where(test_probs > 0.5, 1.0, 0.0).astype(np.int64)
test_labels = test_input[labels]
accuracy_score(test_labels, test_preds)
Out[15]:
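Raw accuracy hides which kinds of errors the model makes; a confusion matrix and per-class precision/recall give a fuller picture. One possible follow-up, using scikit-learn:

# Sketch: per-class breakdown of the test-set predictions.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(test_labels, test_preds))
print(classification_report(test_labels, test_preds, target_names=['no', 'yes']))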