$\textbf{x}$ is a vector of $n$ input features/variables and $y$ is the binary label. Here $\textbf{x} \in \mathbb{R}^n$ and $y \in \{0,1\}$. There are $m$ training examples of type $(\textbf{x}^{(i)},y^{(i)})$.
Hypothesis function $h_{\theta}(\textbf{x}) = g(\theta^T \mathbf{\hat{x}})$, where $\theta \in \mathbb{R}^{n+1}$ are the parameters and $\mathbf{\hat{x}} \in \mathbb{R}^{n+1}$ is obtained by preappending $1$ to $\textbf{x}$. $g(z)$ is the logistic or sigmoid function given by $$g(z) = \frac{1}{1+e^{-z}}$$
Cost function $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}log( h_\theta(\textbf{x}^{(i)}) + (1-y^{(i)})log(1-h_\theta(\textbf{x}^{(i)}))\right)$.
Minimizing this cost function gives us the parameters $\theta$ that best fit the given data.
$\theta := \theta - \frac{\alpha}{m}X^T\left(g(X\theta) - \textbf{y}\right)$, where $X$ has $\mathbf{\hat{x}}$ as rows so that $X\theta$ gives the value of $h_\theta$ as a column vector of dimension $m$. $\alpha \in \mathbb{R}$ is called the learning rate. $g(z)$ is evaluated component wise.
This update is continued for a fixed number of iterations rather than until the cost function is reduced to a certain value. The reason for this can be seen in the multivariate linear regression example below where the cost is pretty high but still the fitting cannot be improved since the result is the same as that obtained by the normal equation.
$X \in \mathbb{R}^{m\times (n+1)}$, is a matrix of training inputs preappended with $1$s.
$\textbf{y} \in \mathbb{R}^m$, is the vector of training outputs.
$\textbf{x} \in \mathbb{R}^n$, is a vector of a single input feature.
In this part of the exercise, you will build a logistic regression model to predict whether a student gets admitted into a university.
Suppose that you are the administrator of a university department and you want to determine each applicant’s chance of admission based on their results on two exams. You have historical data from previous applicants that you can use as a training set for logistic regression. For each training example, you have the applicant’s scores on two exams and the admissions decision.
Your task is to build a classification model that estimates an applicant’s probability of admission based the scores from those two exams.
In [36]:
# Import necessary modules
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt
np.set_printoptions(precision=5)
In [37]:
##################################
# Sigmoid or Logistic function
##################################
# Input
# z - Input scalar/vector/matrix
# Output
# element-wise evaluation
def sigmoid(z):
return 1./(1. + np.exp(-z))
##################################
# Define the cost function
# Cross entropy
##################################
# Input
# theta - The parameters fitted for linear regression
# X - Input feature with 1s preappended
# y - Binary labels
# reg - To regularize the cost function or not
# lamda - regularization parameter
# Output
# J - The value of the cost function
def computeCostFunc(theta, X, y, reg=False, lamda=1.0):
m = len(X)
h = sigmoid(X.dot(theta))
J = -(y.dot(np.log(h)) + (1.-y).dot(np.log(1.-h)))/m
if reg:
J = J + lamda*theta[1:].T.dot(theta[1:])/2./m
return J
##################################
# Define the gradient of cost function
# Cross entropy
##################################
# Input
# theta - The parameters fitted for linear regression
# X - Input feature with 1s preappended
# y - Binary labels
# reg - To regularize the cost function or not
# lamda - regularization parameter
# Output
# grad - Gradient of the cost function
def computeGrad(theta, X, y, reg=False, lamda=1.0):
m = len(X)
h = sigmoid(X.dot(theta))
grad = X.T.dot(h-y)/m
if reg:
grad[1:] = grad[1:] + lamda*theta[1:]/m
return grad
##################################
# Gradient descent algorithm
# for Logistic regression
##################################
# Input
# X - Input features with 1s preappended
# y - Binary labels
# theta - The parameters to be fit
# alpha - Learning rate
# num_iter - Number of iterations to run
# Output
# theta - The fitted parameters
# J_hist - Value of cost function for each iteration
def gradientDescent(X, y, theta, alpha, num_iter, res=False, lamda=1.0):
m = len(X)
J_hist = np.zeros(num_iter)
for i in range(num_iter):
h = sigmoid(X.dot(theta)) - y
theta = theta - alpha*computeGrad(theta,X,y,reg,lamda)
J_hist[i] = computeCostFunc(theta,X,y,reg,lamda)
return theta, J_hist
##################################
# Feature normalization using
# mean and standard deviation
##################################
# Input
# X - Input features without 1s preappended
# Output
# X - Normalized features
# mu - Vector of means of all features
# sigma - Vector of standard deviations of all features
def normalizeFeatures(X):
m = X.shape[0]
mu = np.mean(X,axis=0)
sigma = np.std(X,axis=0)
for i,(m,s) in enumerate(zip(mu,sigma)):
X[:,i] = (X[:,i]-m)/s
return X, mu, sigma
##################################
# Map features to higher order
##################################
# Input
# x1,x2 - The two input features
# order - Till what order to generate new features from X
# Output
# out - New features
def mapFeatures(x1, x2, order):
m = 1
if type(x1) == np.ndarray:
m = len(x1)
n = np.sum(np.arange(2,order+2))
out = np.zeros([m,n])
c = 0
for i in range(1,order+1):
for j in range(0,i+1):
out[:,c] = (x1**(i-j))*(x2**j)
c = c + 1
return out,n
##################################
# Plot given data
##################################
# Input
# X - Input features with 1s preappended
# y - Output or target variables
# labels - Data labels
# axLabels - Axis labels
# show - True = plt.show() else nothing
# Output
# plot
def plotData(X,y,labels,axLabels,show=False):
m,n = X.shape
ix = np.nonzero(y)[0]
iy = np.setdiff1d(np.arange(m),ix)
if show:
plt.figure(dpi=120)
plt.plot(X[ix,0],X[ix,1],'r+',label=labels[0])
plt.plot(X[iy,0],X[iy,1],'bo',label=labels[1])
plt.title('Given data')
plt.xlabel(axLabels[0])
plt.ylabel(axLabels[1])
plt.legend()
if show:
plt.show()
##################################
# Plot decision boundary
# obtained by setting h_theta = 0.5, thus X*theta = 0
##################################
# Input
# X - Input features with 1s preappended
# y - Output or target variables
# theta - The fitted parameters
# labels - Data labels
# axLabels - Axis labels
# Output
# plot
def plotDecisionBoundary(theta,X,y,labels,axLabels):
plt.figure(dpi=120)
plotData(X[:,1:3],y,labels,axLabels)
m,n = X.shape
if n <= 3:
plot_x = np.array([np.min(X[:,1])-2, np.max(X[:,1])+2])
plot_y = (-1./theta[2])*(theta[1]*plot_x + theta[0])
plt.plot(plot_x, plot_y, 'k',label='Decision boundary')
else:
u = np.linspace(-1.,1.5,50)
v = np.linspace(-1.,1.5,50)
z = np.zeros([len(u), len(v)])
# Evaluate z = X*theta over the grid
for i in range(len(u)):
for j in range(len(v)):
tmp, n = mapFeatures(u[i], v[j], 6)
z[i,j] = tmp.dot(theta[1:]) + theta[0]
z = z.T
c = plt.contour(u,v,z,levels=[0],colors='k')
c.collections[0].set_label('Decision boundary')
plt.legend()
plt.show()
##################################
# Predict the class 0 or 1
##################################
# Input
# X - Input features with 1s preappended
# theta - The fitted parameters
# Output
# p - array of binary classification
def predict(theta,X):
return (sigmoid(X.dot(theta)) >= 0.5).astype(np.int)
In [38]:
# Load the data
data = np.loadtxt('ex2data1.txt', delimiter=',')
X = data[:,0:2] # Results of 2 exams
y = data[:,2] # Labels
m,n = X.shape
del data
featureLabels = ['Exam 1 score', 'Exam 2 score']
labels = ['Admitted', 'Not admitted']
# Plot the data
plotData(X,y,labels,featureLabels,True)
# Preappend 1s to X
tmp = np.ones([m,n+1])
tmp[:,1:n+1] = X
X = tmp
del tmp
In [39]:
# Test the cost function
test_theta = np.zeros(n+1)
J = computeCostFunc(test_theta,X,y)
grad = computeGrad(test_theta,X,y)
print('With theta = ',test_theta,', Cost computed = %.5f'%J)
print('Expected cost value (approx) 0.693');
print('Gradient computed = ', grad)
print('Expected gradient (approx) = ',np.array([-0.1000, -12.0092, -11.2628]),'\n')
# Test some more
test_theta = [-24., 0.2, 0.2]
J = computeCostFunc(test_theta,X,y)
grad = computeGrad(test_theta,X,y)
print('With theta = ',test_theta,', Cost computed = %.5f'%J)
print('Expected cost value (approx) 0.218');
print('Gradient computed = ', grad)
print('Expected gradient (approx) = ',np.array([0.04, 2.566, 2.647]))
In [40]:
# Optimize the cost function using BFGS
initial_theta = np.zeros(n+1)
with np.errstate(divide='ignore'): # There is some divide by zero error
res = opt.fmin_bfgs(computeCostFunc,initial_theta,computeGrad,(X,y),maxiter=50,full_output=True)
# Test the output
print('\nCost at optimal theta found by fmin_bfgs: %.5f'%res[1])
print('Expected cost (approx): 0.203')
print('Optimal theta computed: ', res[0])
print('Expected theta (approx): [-25.161 0.206 0.201]')
In [41]:
# Plot the decision boundary
plotDecisionBoundary(res[0],X,y,labels,featureLabels)
In [42]:
# Predict probability for a student with score 45 on exam 1 and score 85 on exam 2
prob = sigmoid(np.array([1., 45., 85.]).dot(res[0]));
print('For a student with scores 45 and 85, we predict an admission ',\
'probability of %.5f'%prob);
print('Expected value: 0.775 +/- 0.002\n')
# Compute accuracy on our training set
p = predict(res[0], X)
print('Train Accuracy: %.5f'% (np.mean((p == y).astype(np.int)) * 100.));
print('Expected accuracy (approx): 89.0')
In this part of the exercise, you will implement regularized logistic regression to predict whether microchips from a fabrication plant passes quality assurance (QA). During QA, each microchip goes through various tests to ensure it is functioning correctly.
Suppose you are the product manager of the factory and you have the test results for some microchips on two different tests. From these two tests, you would like to determine whether the microchips should be accepted or rejected. To help you make the decision, you have a dataset of test results on past microchips, from which you can build a logistic regression model.
In [43]:
# Load the data
data = np.loadtxt('ex2data2.txt', delimiter=',')
X = data[:,0:2] # Results of 2 exams
y = data[:,2] # Labels
m,n = X.shape
del data
labels = ['QA passed', 'QA failed']
featureLabels = ['Microchip Test 1', 'Microchip Test 2']
# Plot the data
plotData(X,y,labels,featureLabels,True)
# generate new features
X,n = mapFeatures(X[:,0],X[:,1],6)
# Preappend 1s to X
tmp = np.ones([m,n+1])
tmp[:,1:n+1] = X
X = tmp
del tmp
In [44]:
# Test the cost function
print('Showing first 5 values of %d\n'%(n+1))
test_theta = np.zeros(n+1)
J = computeCostFunc(test_theta,X,y,True)
grad = computeGrad(test_theta,X,y,True)
print('With theta all zeros, Cost computed = %.5f'%J)
print('Expected cost value (approx) 0.693');
print('Gradient computed = ', grad[0:5])
print('Expected gradient (approx) = ',\
np.array([0.0085, 0.0188, 0.0001, 0.0503, 0.0115]),'\n')
# Test some more
test_theta = np.ones(n+1)
J = computeCostFunc(test_theta,X,y,True,10.)
grad = computeGrad(test_theta,X,y,True,10.)
print('With theta = all ones and lambda = 10, Cost computed = %.5f'%J)
print('Expected cost value (approx) 3.16');
print('Gradient computed = ', grad[0:5])
print('Expected gradient (approx) = ',\
np.array([0.3460, 0.1614, 0.1948, 0.2269, 0.0922]))
In [45]:
# Optimize the cost function using BFGS
initial_theta = np.zeros(n+1)
with np.errstate(divide='ignore'): # There is some divide by zero error
res = opt.fmin_bfgs(computeCostFunc,initial_theta,computeGrad,(X,y,True),maxiter=400,full_output=True)
# Test the output
print('\nCost at optimal theta found by fmin_bfgs: %.5f'%res[1])
#print('Expected cost (approx): 0.203')
#print('Optimal theta computed: ', res[0])
#print('Expected theta (approx): [-25.161 0.206 0.201]')
In [46]:
plotDecisionBoundary(res[0],X,y,labels,featureLabels)
In [47]:
# Compute accuracy on our training set
p = predict(res[0], X)
print('Train Accuracy: %.5f'% (np.mean((p == y).astype(np.int)) * 100.));
print('Expected accuracy (approx): 83.1')