This post will be mostly Python code with implementations and examples of the Logistic Regression theory we have been discussing in the last few posts. The examples include fitting 2-feature data with an arbitrary-order multinomial model and a simple 2-class image classification problem using the MNIST digits data.
I've done three earlier posts on Logistic Regression: Logistic Regression Theory, Logistic and Linear Regression Regularization, and Logistic Regression Implementation. Those posts discuss logistic regression theory and derive the equations with the goal of giving insight into how the method works. The last post included a simple Python implementation of the equations and a simple example.
This will be a "calculator" style implementation using Python in this Jupyter notebook. This will hopefully lift some of the veil off of black-box implementations in packages like scikit-learn. The code should be fairly easy to "tinker" with.
I'll also exercise a "best practice" in data science by dividing the MNIST dataset into training-validation-test subsets for analysis.
This post, along with all of the others in this series, was converted to HTML from a Jupyter notebook. The notebooks are available at https://github.com/dbkinghorn/blog-jupyter-notebooks
In [1]:
import pandas as pd # data handling
import numpy as np # numerical computing
from scipy.optimize import minimize # optimization code
import matplotlib.pyplot as plt # plotting
import seaborn as sns
%matplotlib inline
sns.set()
import itertools # combinatorics functions for multinomial code
In [2]:
#
# Main Logistic Regression Equations
#
def g(z) : # sigmoid function
    return 1.0/(1.0 + np.exp(-z))

def h_logistic(X,a) : # Model function
    return g(np.dot(X,a))

def J(X,a,y) : # Cost Function
    m = y.size
    return -(np.sum(np.log(h_logistic(X,a))) + np.dot((y-1).T,(np.dot(X,a))))/m

def J_reg(X,a,y,reg_lambda) : # Cost Function with Regularization
    m = y.size
    return J(X,a,y) + reg_lambda/(2.0*m) * np.dot(a[1:],a[1:])

def gradJ(X,a,y) : # Gradient of Cost Function
    m = y.size
    return (np.dot(X.T,(h_logistic(X,a) - y)))/m
def gradJ_reg(X,a,y,reg_lambda) : # Gradient of Cost Function with Regularization
    m = y.size
    # the gradient of the reg_lambda/(2m) * a'a penalty term is (reg_lambda/m) * a (bias term excluded)
    return gradJ(X,a,y) + reg_lambda/m * np.concatenate(([0], a[1:]))
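For reference, written out, the cost and gradient that these functions compute are (the bias term $a_0$ is left out of the regularization sum):

$$ J(a) = -\frac{1}{m}\left[\sum_{i=1}^{m}\log g\big(x^{(i)}\cdot a\big) + (y-1)^{T}Xa\right] + \frac{\lambda}{2m}\sum_{j\ge 1}a_j^{2} $$

$$ \nabla_a J = \frac{1}{m}X^{T}\big(g(Xa)-y\big) + \frac{\lambda}{m}\,(0,\,a_1,\,a_2,\,\ldots)^{T} $$

The compact form in the first bracket follows from $\log(1-g(z)) = \log g(z) - z$, which collapses the usual two cross-entropy terms; see the earlier theory posts for the full derivation.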
In [3]:
#
# Some model checking functions
#
def to_0_1(h_prob) : # convert probabilities to true (1) or false (0) at cut-off 0.5
    return np.where(h_prob >= 0.5, 1, 0)

def model_accuracy(h,y) : # Overall accuracy of model
    return np.sum(h==y)/y.size * 100

def model_accuracy_pos(h,y) : # Accuracy on positive cases
    return np.sum(y[h==1] == 1)/y[y==1].size * 100

def model_accuracy_neg(h,y) : # Accuracy on negative cases
    return np.sum(y[h==0] == 0)/y[y==0].size * 100

def false_pos(h,y) : # Number of false positives
    return np.sum((y==0) & (h==1))

def false_neg(h,y) : # Number of false negatives
    return np.sum((y==1) & (h==0))

def true_pos(h,y) : # Number of true positives
    return np.sum((y==1) & (h==1))

def true_neg(h,y) : # Number of true negatives
    return np.sum((y==0) & (h==0))

def model_precision(h,y) : # Precision = TP/(TP+FP)
    return true_pos(h,y)/(true_pos(h,y) + false_pos(h,y))

def model_recall(h,y) : # Recall = TP/(TP+FN)
    return true_pos(h,y)/(true_pos(h,y) + false_neg(h,y))

def print_model_quality(title, h, y) : # Print the results of the functions above
    print( '\n# \n# {} \n#'.format(title) )
    print( 'Total number of data points = {}'.format(y.size))
    print( 'Number of Positive values(1s) = {}'.format(y[y==1].size))
    print( 'Number of Negative values(0s) = {}'.format(y[y==0].size))
    print( '\nNumber of True Positives = {}'.format(true_pos(h,y)) )
    print( 'Number of False Positives = {}'.format(false_pos(h,y)) )
    print( '\nNumber of True Negatives = {}'.format(true_neg(h,y)) )
    print( 'Number of False Negatives = {}'.format(false_neg(h,y)) )
    print( '\nModel Accuracy = {:.2f}%'.format( model_accuracy(h,y) ) )
    print( 'Model Accuracy Positive Cases = {:.2f}%'.format( model_accuracy_pos(h,y) ) )
    print( 'Model Accuracy Negative Cases = {:.2f}%'.format( model_accuracy_neg(h,y) ) )
    print( '\nModel Precision = {}'.format(model_precision(h,y)) )
    print( '\nModel Recall = {}'.format(model_recall(h,y)) )
In [4]:
def multinomial_partitions(n, k):
    """returns an array of length k sequences of integer partitions of n"""
    nparts = itertools.combinations(range(1, n+k), k-1)
    tmp = [(0,) + p + (n+k,) for p in nparts]
    sequences = np.diff(tmp) - 1
    return sequences[::-1] # reverse the order

def make_multinomial_features(fvecs,order=[1,2]) :
    '''Make multinomial feature matrix
    fvecs is a matrix of feature vectors (columns)
    "order" is a set of multinomial degrees to create
    default is [1,2] meaning for example: given f1, f2 in fvecs
    return a matrix made up of a [1's column, f1,f2,f1**2,f1*f2,f2**2] '''
    Xtmp = np.ones(fvecs.shape[0]) # float column of 1's (keeps the matrix float even for integer input)
    for ord in order :
        if ord==1 :
            fstmp = fvecs
        else :
            pwrs = multinomial_partitions(ord,fvecs.shape[1])
            fstmp = np.column_stack( [ np.prod(fvecs**pwrs[i,:], axis=1) for i in range(pwrs.shape[0]) ] )
        Xtmp = np.column_stack((Xtmp,fstmp))
    return Xtmp
def mean_normalize(X):
    '''apply mean normalization to each column of the matrix X'''
    X_mean=X.mean(axis=0)
    X_std=X.std(axis=0)
    return (X-X_mean)/X_std

def apply_normalizer(X,X_mean,X_std) :
    return (X-X_mean)/X_std
This first example is a variation of the data I used when describing "decision boundaries". I have made it a little more complicated and will use multinomial "feature expansion" on the two feature variables. This will also include some over-fitting and regularization cases.
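Before generating the data, here is a tiny sanity check (hypothetical, not in the original notebook) of what make_multinomial_features returns for a 3-row, 2-feature input with order=[1,2]; the column layout matches the docstring above:

# quick check of the multinomial feature expansion
fv = np.array([[1.0, 2.0],
               [3.0, 4.0],
               [5.0, 6.0]])
print(make_multinomial_features(fv, order=np.array([1,2])))
# each row is [1, f1, f2, f1**2, f1*f2, f2**2], e.g. the first row is
# [ 1.  1.  2.  1.  2.  4.]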
In [5]:
# Generate some "interesting" random 2-D data
np.random.seed(42)
px1, px2 = np.random.multivariate_normal([0,0], [[1,0.2],[0.8,1]] , 100).T
t = np.linspace(0,2*np.pi,100)
nx1, nx2 = (3+px1)*np.sin(t), (3+px2)*np.cos(t)
In [6]:
# Plot the data
fig, ax = plt.subplots()
ax.plot(nx1,nx2, "o", label='neg data' )
ax.plot(px1,px2, "P", label='pos data')
ax.axis('equal')
plt.title("2-D dataset", fontsize=24)
ax.legend();
In [7]:
f1 = np.concatenate((nx1,px1))
f2 = np.concatenate((nx2,px2))
fs = np.column_stack((f1,f2))
y = np.concatenate( (np.zeros(100), (np.ones(100))) )
order = np.array([1,2])
X = make_multinomial_features(fs,order=order)
X[:,1:] = mean_normalize(X[:,1:]) # normalize but leave the bias term alone
print('Number of multinomial features = {}'.format(X.shape[1]))
In [8]:
def opt_J_reg(a) :
    return J_reg(X,a,y,reg)

def opt_gradJ_reg(a) :
    return gradJ_reg(X,a,y,reg)
In [9]:
reg = 0.0 # no regularization on this run
aguess = np.ones(X.shape[1]) # starting guess of all ones
res = minimize(opt_J_reg, aguess, method='BFGS', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
In [10]:
a_opt = res.x
print(a_opt)
In [11]:
h_prob = h_logistic(X,a_opt) # get the probabilities from the model using a_opt
h_predict = to_0_1(h_prob) # convert the probabilities to 0,1 predictions at cutoff 0.5
print_model_quality('Training-data fit', h_predict, y)
Let's take a look at a plot of the fit,
In [12]:
fig, ax = plt.subplots()
ax.plot(X[0:100,1],X[0:100,2], "o", label='neg data' )
ax.plot(X[100:,1],X[100:,2], "P", label='pos data')
xb1 = np.linspace(-1.5, 1.5, 100)
xb2 = np.linspace(-1.5, 1.5, 100)
#Xb1, Xb2 = np.meshgrid(mean_normalize(xb1), mean_normalize(xb2))
Xb1, Xb2 = np.meshgrid(xb1, xb2)
BX = make_multinomial_features(np.column_stack((Xb1.flatten(),Xb2.flatten())),order=order)
BX[:,1:] = mean_normalize(BX[:,1:])
B = np.sum(a_opt*BX, axis=1).reshape((100,100))
ax.contour(Xb1,Xb2,B, colors='r', levels=[0])
ax.axis('equal')
plt.title("Decision Boundary From Fit", fontsize=24)
plt.legend();
In [13]:
order = np.array([1,2,3,4,5,6])
X = make_multinomial_features(fs,order=order)
X[:,1:] = mean_normalize(X[:,1:]) # normalize but leave the bias term alone
print('Number of multinomial features = {}'.format(X.shape[1]))
In [14]:
reg = 0.0 # no regularization on this run
aguess = np.ones(X.shape[1]) # starting guess of all ones
res = minimize(opt_J_reg, aguess, method='BFGS', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
In [15]:
a_opt = res.x
print(a_opt)
In [16]:
h_prob = h_logistic(X,a_opt) # get the probabilities from the model using a_opt
h_predict = to_0_1(h_prob) # convert the probabilities to 0,1 predictions at cutoff 0.5
print_model_quality('Training-data fit', h_predict, y)
In [17]:
fig, ax = plt.subplots()
ax.plot(X[0:100,1],X[0:100,2], "o", label='neg data' )
ax.plot(X[100:,1],X[100:,2], "P", label='pos data')
xb1 = np.linspace(-1.5, 1.5, 100)
xb2 = np.linspace(-1.5, 1.5, 100)
#Xb1, Xb2 = np.meshgrid(mean_normalize(xb1), mean_normalize(xb2))
Xb1, Xb2 = np.meshgrid(xb1, xb2)
BX = make_multinomial_features(np.column_stack((Xb1.flatten(),Xb2.flatten())),order=order)
BX[:,1:] = mean_normalize(BX[:,1:])
B = np.sum(a_opt*BX, axis=1).reshape((100,100))
ax.contour(Xb1,Xb2,B, colors='r', levels=[0])
ax.axis('equal')
plt.title("Decision Boundary From Fit", fontsize=24)
plt.legend();
In [18]:
reg = 1.0 # turn on regularization for this run
aguess = np.ones(X.shape[1]) # starting guess of all ones
res = minimize(opt_J_reg, aguess, method='BFGS', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
In [19]:
a_opt = res.x
print(a_opt)
In [20]:
h_prob = h_logistic(X,a_opt) # get the probabilities from the model using a_opt
h_predict = to_0_1(h_prob) # convert the probabilities to 0,1 predictions at cutoff 0.5
print_model_quality('Training-data fit', h_predict, y)
In [21]:
fig, ax = plt.subplots()
ax.plot(X[0:100,1],X[0:100,2], "o", label='neg data' )
ax.plot(X[100:,1],X[100:,2], "P", label='pos data')
xb1 = np.linspace(-1.5, 1.5, 100)
xb2 = np.linspace(-1.5, 1.5, 100)
#Xb1, Xb2 = np.meshgrid(mean_normalize(xb1), mean_normalize(xb2))
Xb1, Xb2 = np.meshgrid(xb1, xb2)
BX = make_multinomial_features(np.column_stack((Xb1.flatten(),Xb2.flatten())),order=order)
BX[:,1:] = mean_normalize(BX[:,1:])
B = np.sum(a_opt*BX, axis=1).reshape((100,100))
ax.contour(Xb1,Xb2,B, colors='r', levels=[0])
ax.axis('equal')
plt.title("Decision Boundary From Fit", fontsize=24)
plt.legend();
This is really kind of fun to play around with! If you want to try the code and experiment yourself, feel free to grab the Jupyter notebook from the GitHub link I gave near the top of the post.
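As an aside (not part of the original notebook), if you want to sanity-check these fits against a black-box implementation, a rough scikit-learn equivalent would look something like the sketch below. Note that scikit-learn's LogisticRegression parameterizes the L2 penalty as C ≈ 1/λ and normally adds its own intercept, so coefficients and accuracy won't match exactly.

# hypothetical comparison with scikit-learn (assumes scikit-learn is installed)
from sklearn.linear_model import LogisticRegression

# X already contains a bias column, so turn off sklearn's intercept
clf = LogisticRegression(fit_intercept=False, C=1.0)  # C is roughly 1/reg_lambda
clf.fit(X, y)
print('sklearn training accuracy = {:.2f}%'.format(100*clf.score(X, y)))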
In the next section we'll look at a simple image classification example.
For this example I pulled the MNIST training set from Kaggle. For information on the dataset itself see Yann LeCun's site http://yann.lecun.com/exdb/mnist/index.html. I used the data from Kaggle since it is in an easy-to-use format. (Kaggle is a great site for Machine Learning and there are many useful datasets and example problems there.)
The reason I only pulled the training set from Kaggle is that I want to explicitly show splitting a dataset into separate sets for training, validation, and testing. The training set should be the largest set since that is where the model fitting takes place. The validation set is used for evaluating the effect of adjustments to the "hyper-parameters" of the method used to construct the model. The test set is the final judge of the "generalization" or usefulness of the model for making predictions.
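The split itself is done a few cells below with simple array slicing. As a hypothetical aside (not in the original notebook), a more general way is to shuffle the row indices first and then slice off the three pieces, for example:

# hypothetical helper: shuffled train/validation/test split
# (the cells below just slice the arrays directly)
def train_val_test_split(X, y, frac_train=0.7, frac_val=0.15, seed=42):
    rng = np.random.RandomState(seed)      # reproducible shuffle
    idx = rng.permutation(y.size)          # shuffled row indices
    n_train = int(frac_train * y.size)
    n_val = int(frac_val * y.size)
    i_train = idx[:n_train]
    i_val = idx[n_train:n_train + n_val]
    i_test = idx[n_train + n_val:]
    return (X[i_train], y[i_train]), (X[i_val], y[i_val]), (X[i_test], y[i_test])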
The first thing to do is load up the data and have a look at it,
In [22]:
data_full = pd.read_csv("./data/kg-mnist/train.csv")
data_full.head(10)
Out[22]:
I'll convert the dataframe to a matrix and show the image at row index 6 (a 7). There are 42000 images in this dataset.
In [23]:
data_full_matrix=data_full.to_numpy() # as_matrix() has been removed in newer pandas
print(data_full_matrix.shape)
In [24]:
plt.imshow(data_full_matrix[6,1:].reshape((28,28)))
Out[24]:
The image is shown fairly large on the screen. You can clearly see the pixels and their relative gray-scale values.
In [25]:
idx = np.logical_or(data_full_matrix[:,0]==0, data_full_matrix[:,0]==1)
y_01, data_01 = data_full_matrix[idx,0], data_full_matrix[idx,1:]
print(data_01.shape)
There are now 8816 images of 0's and 1's in data_01. Some of the feature columns in data_01 have all 0 gray-scale values i.e. they are empty borders. Let's remove those "features". We will have 588 columns in data_01 after their removal.
In [26]:
data_01 = data_01[:,data_01.sum(axis=0)!=0]
print(data_01.shape)
In [27]:
data_train_01,y_train_01 = data_01[0:6170,:], y_01[0:6170]
data_val_01, y_val_01 = data_01[6170:7493,:], y_01[6170:7493]
data_test_01, y_test_01 = data_01[7493:,:], y_01[7493:]
print(data_train_01.shape,y_train_01.shape)
print(data_val_01.shape, y_val_01.shape)
print(data_test_01.shape, y_test_01.shape)
At the same time I'll set up the augmented matrix, with ones in the first column for the bias term, using the make_multinomial_features function with "order" set to [1]. [I could add multinomial features but that gets to be VERY large ... I tried it and am still experimenting with subsets :-)]
In [28]:
X_mean = data_01.mean(axis=0)
X_std = data_01.std(axis=0)
X_std[X_std==0]=1.0 # if there are any 0 values in X_std set them to 1
order = np.array([1])
X_train = make_multinomial_features(data_train_01, order=order)
X_train[:,1:] = apply_normalizer(X_train[:,1:],X_mean,X_std)
y_train = y_train_01
X_val = make_multinomial_features(data_val_01, order=order)
X_val[:,1:] = apply_normalizer(X_val[:,1:],X_mean,X_std)
y_val = y_val_01
X_test = make_multinomial_features(data_test_01, order=order)
X_test[:,1:] = apply_normalizer(X_test[:,1:],X_mean,X_std)
y_test = y_test_01
In [29]:
def opt_J_reg(a) :
    return J_reg(X_train,a,y_train,reg) # regularized cost, consistent with the regularized gradient below

def opt_gradJ_reg(a) :
    return gradJ_reg(X_train,a,y_train,reg)
In [30]:
reg = 1.0
aguess = np.random.randn(X_train.shape[1])
%time res = minimize(opt_J_reg, aguess, method='CG', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
In [31]:
a_opt = res.x
h_prob = h_logistic(X_train,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Training-data fit', h_predict, y_train)
In [32]:
a_opt = res.x
h_prob = h_logistic(X_val,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Validation-data fit', h_predict, y_val)
Maybe our "perfect" fit on the training set was too good. I can try adjusting the regularization term to see if we can get a model that is still good on the training set but does a better job on the validation set. (this is what the validation set is for!)
In [33]:
reg = 15.0
np.random.seed(42) # just make sure I'm using the same random guess every time I run
aguess = np.random.randn(X_train.shape[1])
%time res = minimize(opt_J_reg, aguess, method='CG', jac=opt_gradJ_reg, tol=1e-4, options={'disp': True})
a_opt = res.x
h_prob = h_logistic(X_train,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Training-data fit', h_predict, y_train)
a_opt = res.x
h_prob = h_logistic(X_val,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Validation-data fit', h_predict, y_val)
Now let's see how the test set looks,
In [34]:
a_opt = res.x
h_prob = h_logistic(X_test,a_opt)
h_predict = to_0_1(h_prob)
print_model_quality('Test-data fit', h_predict, y_test)
Next time I'll extend this to multi-class Logistic Regression using a method called one-vs-all. With that we can do the whole set of digits 0-9.
Happy computing! --dbk