Logistic Regression (Classification)

Some basic theory

$\textbf{x}$ is a vector of $n$ input features/variables and $y$ is the binary label. Here $\textbf{x} \in \mathbb{R}^n$ and $y \in \{0,1\}$. There are $m$ training examples of the form $(\textbf{x}^{(i)},y^{(i)})$.

Hypothesis function $h_{\theta}(\textbf{x}) = g(\theta^T \mathbf{\hat{x}})$, where $\theta \in \mathbb{R}^{n+1}$ are the parameters and $\mathbf{\hat{x}} \in \mathbb{R}^{n+1}$ is obtained by prepending $1$ to $\textbf{x}$. $g(z)$ is the logistic or sigmoid function given by $$g(z) = \frac{1}{1+e^{-z}}$$
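
As a quick illustration (a minimal sketch, separate from the sigmoid helper defined in the code below), $g$ maps any real number into $(0,1)$, with $g(0)=0.5$:

import numpy as np

def g(z):
    # logistic function, evaluated element-wise
    return 1./(1. + np.exp(-z))

print(g(0.))                        # 0.5, the classification threshold
print(g(np.array([-5., 0., 5.])))   # approx. [0.0067 0.5 0.9933]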

Cost function $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log\left(h_\theta(\textbf{x}^{(i)})\right) + (1-y^{(i)})\log\left(1-h_\theta(\textbf{x}^{(i)})\right)\right)$.

Minimizing this cost function gives us the parameters $\theta$ that best fit the given data.
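
For example, with $\theta = \mathbf{0}$ every prediction is $h_\theta(\textbf{x}^{(i)}) = g(0) = 0.5$, so each training example contributes $-\log(0.5)$ and $$J(\mathbf{0}) = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log 0.5 + (1-y^{(i)})\log 0.5\right) = \log 2 \approx 0.693,$$ which is the value the test cells below reproduce on the exercise data.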

Gradient descent in vector form

$\theta := \theta - \frac{\alpha}{m}X^T\left(g(X\theta) - \textbf{y}\right)$, where $X$ has the vectors $\mathbf{\hat{x}}^{(i)}$ as its rows, so that $g(X\theta)$ gives the values of $h_\theta$ as a column vector of dimension $m$. $\alpha \in \mathbb{R}$ is called the learning rate. $g(z)$ is evaluated component-wise.

This update is run for a fixed number of iterations rather than until the cost function drops below some target value. The reason can be seen in the multivariate linear regression example below, where the cost remains fairly high yet the fit cannot be improved, since the result matches the one obtained by the normal equation.
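
A minimal NumPy sketch of this update loop (illustrative only; the function name, variable names, and default values here are assumptions, and the gradientDescent function actually used in the exercise is defined further below):

import numpy as np

def gradient_descent_sketch(X, y, alpha=0.01, num_iter=1000):
    # X: (m, n+1) design matrix with a leading column of 1s
    # y: (m,) vector of 0/1 labels
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iter):
        h = 1./(1. + np.exp(-X.dot(theta)))        # g(X theta), element-wise
        theta = theta - (alpha/m)*X.T.dot(h - y)   # theta := theta - (alpha/m) X^T (g(X theta) - y)
    return theta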

Dimensions of all the variables

$X \in \mathbb{R}^{m\times (n+1)}$, is the matrix of training inputs with $1$s prepended.

$\textbf{y} \in \mathbb{R}^m$, is the vector of training outputs.

$\textbf{x} \in \mathbb{R}^n$, is the feature vector of a single training example.
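
A quick shape check with made-up sizes ($m = 100$ examples, $n = 2$ features, purely illustrative) confirms that these dimensions are consistent:

import numpy as np
m, n = 100, 2
X = np.ones([m, n+1])                 # training inputs with 1s prepended
theta = np.zeros(n+1)                 # parameters
y = np.zeros(m)                       # labels
h = 1./(1. + np.exp(-X.dot(theta)))   # hypothesis values, shape (m,)
grad = X.T.dot(h - y)/m               # gradient, shape (n+1,)
print(h.shape, grad.shape)            # prints (100,) (3,)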

Ex2.1 Logistic Regression

In this part of the exercise, you will build a logistic regression model to predict whether a student gets admitted into a university.

Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant’s scores on two exams and the admissions decision.

Your task is to build a classification model that estimates an applicant's probability of admission based on the scores from those two exams.


In [36]:
# Import necessary modules
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
np.set_printoptions(precision=5)

In [37]:
##################################
# Sigmoid or Logistic function
##################################
# Input
# z        - Input scalar/vector/matrix
# Output
# element-wise evaluation
def sigmoid(z):
    return 1./(1. + np.exp(-z))

##################################
# Define the cost function
# Cross entropy
##################################
# Input
# theta    - The parameters for logistic regression
# X        - Input features with 1s prepended
# y        - Binary labels
# reg      - To regularize the cost function or not
# lamda    - regularization parameter
# Output
# J        - The value of the cost function
def computeCostFunc(theta, X, y, reg=False, lamda=1.0):
    m = len(X)
    h = sigmoid(X.dot(theta))
    J = -(y.dot(np.log(h)) + (1.-y).dot(np.log(1.-h)))/m
    if reg:
        J = J + lamda*theta[1:].T.dot(theta[1:])/2./m
    return J

##################################
# Define the gradient of cost function
# Cross entropy
##################################
# Input
# theta    - The parameters for logistic regression
# X        - Input features with 1s prepended
# y        - Binary labels
# reg      - To regularize the cost function or not
# lamda    - regularization parameter
# Output
# grad     - Gradient of the cost function
def computeGrad(theta, X, y, reg=False, lamda=1.0):
    m = len(X)
    h = sigmoid(X.dot(theta))
    grad = X.T.dot(h-y)/m
    if reg:
        grad[1:] = grad[1:] + lamda*theta[1:]/m
    return grad

##################################
# Gradient descent algorithm
# for Logistic regression
##################################
# Input
# X        - Input features with 1s prepended
# y        - Binary labels
# theta    - The parameters to be fit
# alpha    - Learning rate
# num_iter - Number of iterations to run
# reg      - To regularize the cost function or not
# lamda    - regularization parameter
# Output
# theta    - The fitted parameters
# J_hist   - Value of cost function for each iteration
def gradientDescent(X, y, theta, alpha, num_iter, reg=False, lamda=1.0):
    J_hist = np.zeros(num_iter)
    for i in range(num_iter):
        theta = theta - alpha*computeGrad(theta,X,y,reg,lamda)
        J_hist[i] = computeCostFunc(theta,X,y,reg,lamda)
    return theta, J_hist

##################################
# Feature normalization using 
# mean and standard deviation 
##################################
# Input
# X        - Input features without 1s prepended
# Output
# X        - Normalized features
# mu       - Vector of means of all features
# sigma    - Vector of standard deviations of all features
def normalizeFeatures(X):
    mu = np.mean(X,axis=0)
    sigma = np.std(X,axis=0)
    for i,(mean_i,std_i) in enumerate(zip(mu,sigma)):
        X[:,i] = (X[:,i]-mean_i)/std_i
    return X, mu, sigma

##################################
# Map features to higher order
##################################
# Input
# x1,x2    - The two input features
# order    - Till what order to generate new features from X
# Output
# out      - New features
def mapFeatures(x1, x2, order):
    m = 1
    if type(x1) == np.ndarray:
        m = len(x1)
    n = np.sum(np.arange(2,order+2))
    out = np.zeros([m,n])
    c = 0
    for i in range(1,order+1):
        for j in range(0,i+1):
            out[:,c] = (x1**(i-j))*(x2**j)
            c = c + 1
    return out,n

##################################
# Plot given data
##################################
# Input
# X        - Input features with 1s prepended
# y        - Output or target variables
# labels   - Data labels
# axLabels - Axis labels
# show     - True = plt.show() else nothing
# Output
# plot
def plotData(X,y,labels,axLabels,show=False):
    m,n = X.shape
    ix = np.nonzero(y)[0]
    iy = np.setdiff1d(np.arange(m),ix)
    if show:
        plt.figure(dpi=120)
    plt.plot(X[ix,0],X[ix,1],'r+',label=labels[0])
    plt.plot(X[iy,0],X[iy,1],'bo',label=labels[1])
    plt.title('Given data')
    plt.xlabel(axLabels[0])
    plt.ylabel(axLabels[1])
    plt.legend()
    if show:
        plt.show()

##################################
# Plot decision boundary
# obtained by setting h_theta = 0.5, thus X*theta = 0
##################################
# Input
# X        - Input features with 1s prepended
# y        - Output or target variables
# theta    - The fitted parameters
# labels   - Data labels
# axLabels - Axis labels
# Output
# plot
def plotDecisionBoundary(theta,X,y,labels,axLabels):
    plt.figure(dpi=120)
    plotData(X[:,1:3],y,labels,axLabels)
    m,n = X.shape
    if n <= 3:
        plot_x = np.array([np.min(X[:,1])-2, np.max(X[:,1])+2])
        plot_y = (-1./theta[2])*(theta[1]*plot_x + theta[0])
        plt.plot(plot_x, plot_y, 'k',label='Decision boundary')
    else:
        u = np.linspace(-1.,1.5,50)
        v = np.linspace(-1.,1.5,50)
        z = np.zeros([len(u), len(v)])
        # Evaluate z = X*theta over the grid
        for i in range(len(u)):
            for j in range(len(v)):
                tmp, n = mapFeatures(u[i], v[j], 6)
                z[i,j] = tmp.dot(theta[1:]) + theta[0]
        z = z.T
        c = plt.contour(u,v,z,levels=[0],colors='k')
        c.collections[0].set_label('Decision boundary')
    plt.legend()
    plt.show()

##################################
# Predict the class 0 or 1
##################################
# Input
# X        - Input features with 1s prepended
# theta    - The fitted parameters
# Output
# p        - array of binary classification
def predict(theta,X):
    return (sigmoid(X.dot(theta)) >= 0.5).astype(int)

In [38]:
# Load the data
data = np.loadtxt('ex2data1.txt', delimiter=',')
X = data[:,0:2] # Results of 2 exams
y = data[:,2] # Labels
m,n = X.shape
del data

featureLabels = ['Exam 1 score', 'Exam 2 score']
labels = ['Admitted', 'Not admitted']

# Plot the data
plotData(X,y,labels,featureLabels,True)

# Prepend 1s to X
tmp = np.ones([m,n+1])
tmp[:,1:n+1] = X
X = tmp
del tmp



In [39]:
# Test the cost function
test_theta = np.zeros(n+1)
J = computeCostFunc(test_theta,X,y)
grad = computeGrad(test_theta,X,y)
print('With theta = ',test_theta,', Cost computed = %.5f'%J)
print('Expected cost value (approx) 0.693');
print('Gradient computed = ', grad)
print('Expected gradient (approx) = ',np.array([-0.1000, -12.0092, -11.2628]),'\n')

# Test some more
test_theta = [-24., 0.2, 0.2]
J = computeCostFunc(test_theta,X,y)
grad = computeGrad(test_theta,X,y)
print('With theta = ',test_theta,', Cost computed = %.5f'%J)
print('Expected cost value (approx) 0.218');
print('Gradient computed = ', grad)
print('Expected gradient (approx) = ',np.array([0.04, 2.566, 2.647]))


With theta =  [ 0.  0.  0.] , Cost computed = 0.69315
Expected cost value (approx) 0.693
Gradient computed =  [ -0.1     -12.00922 -11.26284]
Expected gradient (approx) =  [ -0.1    -12.0092 -11.2628] 

With theta =  [-24.0, 0.2, 0.2] , Cost computed = 0.21833
Expected cost value (approx) 0.218
Gradient computed =  [ 0.0429   2.56623  2.6468 ]
Expected gradient (approx) =  [ 0.04   2.566  2.647]

In [40]:
# Optimize the cost function using BFGS
initial_theta = np.zeros(n+1)
with np.errstate(divide='ignore'): # suppress divide-by-zero warnings from log(0) when h saturates at 0 or 1
    res = opt.fmin_bfgs(computeCostFunc,initial_theta,computeGrad,(X,y),maxiter=50,full_output=True)

# Test the output
print('\nCost at optimal theta found by fmin_bfgs: %.5f'%res[1])
print('Expected cost (approx): 0.203')
print('Optimal theta computed: ', res[0])
print('Expected theta (approx): [-25.161   0.206   0.201]')


Optimization terminated successfully.
         Current function value: 0.203498
         Iterations: 23
         Function evaluations: 31
         Gradient evaluations: 31

Cost at optimal theta found by fmin_bfgs: 0.20350
Expected cost (approx): 0.203
Optimal theta computed:  [-25.16133   0.20623   0.20147]
Expected theta (approx): [-25.161   0.206   0.201]

In [41]:
# Plot the decision boundary
plotDecisionBoundary(res[0],X,y,labels,featureLabels)



In [42]:
# Predict probability for a student with score 45 on exam 1 and score 85 on exam 2 
prob = sigmoid(np.array([1., 45., 85.]).dot(res[0]));
print('For a student with scores 45 and 85, we predict an admission ',\
         'probability of %.5f'%prob);
print('Expected value: 0.775 +/- 0.002\n')

# Compute accuracy on our training set
p = predict(res[0], X)

print('Train Accuracy: %.5f'% (np.mean((p == y).astype(int)) * 100.));
print('Expected accuracy (approx): 89.0')


For a student with scores 45 and 85, we predict an admission  probability of 0.77629
Expected value: 0.775 +/- 0.002

Train Accuracy: 89.00000
Expected accuracy (approx): 89.0

Ex2.2 Regularized Logistic Regression

In this part of the exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant pass quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly.

Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.
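
Because the higher-order feature mapping used below can overfit, the cost function and gradient for this part are the regularized versions implemented above in computeCostFunc and computeGrad (reg=True); the L2 penalty excludes the bias term $\theta_0$: $$J_{reg}(\theta) = J(\theta) + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2, \qquad \frac{\partial J_{reg}}{\partial \theta_j} = \frac{\partial J}{\partial \theta_j} + \frac{\lambda}{m}\theta_j \quad \text{for } j \ge 1.$$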


In [43]:
# Load the data
data = np.loadtxt('ex2data2.txt', delimiter=',')
X = data[:,0:2] # Results of 2 microchip tests
y = data[:,2] # Labels
m,n = X.shape
del data

labels = ['QA passed', 'QA failed']
featureLabels = ['Microchip Test 1', 'Microchip Test 2']

# Plot the data
plotData(X,y,labels,featureLabels,True)

# generate new features
X,n = mapFeatures(X[:,0],X[:,1],6)

# Prepend 1s to X
tmp = np.ones([m,n+1])
tmp[:,1:n+1] = X
X = tmp
del tmp



In [44]:
# Test the cost function
print('Showing first 5 values of %d\n'%(n+1))

test_theta = np.zeros(n+1)
J = computeCostFunc(test_theta,X,y,True)
grad = computeGrad(test_theta,X,y,True)
print('With theta all zeros, Cost computed = %.5f'%J)
print('Expected cost value (approx) 0.693');
print('Gradient computed = ', grad[0:5])
print('Expected gradient (approx) = ',\
      np.array([0.0085, 0.0188, 0.0001, 0.0503, 0.0115]),'\n')

# Test some more
test_theta = np.ones(n+1)
J = computeCostFunc(test_theta,X,y,True,10.)
grad = computeGrad(test_theta,X,y,True,10.)
print('With theta = all ones and lambda = 10, Cost computed = %.5f'%J)
print('Expected cost value (approx) 3.16');
print('Gradient computed = ', grad[0:5])
print('Expected gradient (approx) = ',\
      np.array([0.3460, 0.1614, 0.1948, 0.2269, 0.0922]))


Showing first 5 values of 28

With theta all zeros, Cost computed = 0.69315
Expected cost value (approx) 0.693
Gradient computed =  [  8.47458e-03   1.87881e-02   7.77712e-05   5.03446e-02   1.15013e-02]
Expected gradient (approx) =  [ 0.0085  0.0188  0.0001  0.0503  0.0115] 

With theta = all ones and lambda = 10, Cost computed = 3.16451
Expected cost value (approx) 3.16
Gradient computed =  [ 0.34605  0.16135  0.1948   0.22686  0.09219]
Expected gradient (approx) =  [ 0.346   0.1614  0.1948  0.2269  0.0922]

In [45]:
# Optimize the cost function using BFGS
initial_theta = np.zeros(n+1)
with np.errstate(divide='ignore'): # suppress divide-by-zero warnings from log(0) when h saturates at 0 or 1
    res = opt.fmin_bfgs(computeCostFunc,initial_theta,computeGrad,(X,y,True),maxiter=400,full_output=True)

# Test the output
print('\nCost at optimal theta found by fmin_bfgs: %.5f'%res[1])


Optimization terminated successfully.
         Current function value: 0.529003
         Iterations: 47
         Function evaluations: 48
         Gradient evaluations: 48

Cost at optimal theta found by fmin_bfgs: 0.52900

In [46]:
plotDecisionBoundary(res[0],X,y,labels,featureLabels)



In [47]:
# Compute accuracy on our training set
p = predict(res[0], X)

print('Train Accuracy: %.5f'% (np.mean((p == y).astype(int)) * 100.));
print('Expected accuracy (approx): 83.1')


Train Accuracy: 83.05085
Expected accuracy (approx): 83.1