In this note, I am going to train a logistic regression model with gradient descent.
A logistic regression model can be thought of as a neural network without hidden layers, and is therefore a good entry point for learning about deep learning models.
Here we generate 20 data points per class from two distributions: blue $(t=1)$ and red $(t=0)$.
In [1]:
import numpy as np # Matrix and vector computation package
np.seterr(all='ignore') # ignore numpy warning like multiplication of inf
import matplotlib.pyplot as plt # Plotting library
from matplotlib.colors import colorConverter, ListedColormap # some plotting functions
from matplotlib import cm # Colormaps
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
np.random.seed(seed=1)
# Define and generate the samples
nb_of_samples_per_class = 20 # The number of samples in each class
red_mean = [-1,0] # The mean of the red class
blue_mean = [1,0] # The mean of the blue class
std_dev = 1.2 # standard deviation of both classes
# Generate samples from both classes
x_red = np.random.randn(nb_of_samples_per_class, 2) * std_dev + red_mean
x_blue = np.random.randn(nb_of_samples_per_class, 2) * std_dev + blue_mean
# Merge samples in set of input variables x, and corresponding set of output variables t
X = np.vstack((x_red, x_blue)) # 40x2
t = np.vstack((np.zeros((nb_of_samples_per_class,1)), np.ones((nb_of_samples_per_class,1)))) # 40x1
In [2]:
# Plot both classes on the x1, x2 plane
plt.plot(x_red[:,0], x_red[:,1], 'ro', label='class red')
plt.plot(x_blue[:,0], x_blue[:,1], 'bo', label='class blue')
plt.grid()
plt.legend(loc=2)
plt.xlabel('$x_1$', fontsize=15)
plt.ylabel('$x_2$', fontsize=15)
plt.axis([-4, 4, -4, 4])
plt.title('red vs. blue classes in the input space')
plt.show()
The model can be described as: $$ y = \sigma(\mathbf{x} * \mathbf{w}^T) $$ $$\sigma(z) = \frac{1}{1+e^{-z}}$$
The parameter set $w$ can be optimized by maximizing the likelihood: $$\underset{\theta}{\text{argmax}}\; \mathcal{L}(\theta|t,z) = \underset{\theta}{\text{argmax}} \prod_{i=1}^{n} \mathcal{L}(\theta|t_i,z_i)$$
The likelihood can be described as the joint distribution of $t$ and $z$ given $\theta$: $$P(t,z|\theta) = P(t|z,\theta)P(z|\theta)$$ We don't care about the probability of $z$, so $$\mathcal{L}(\theta|t,z) = P(t|z,\theta) = \prod_{i=1}^{n} P(t_i|z_i,\theta)$$ and $t_i$ is a Bernoulli variable, so $$\begin{split} P(t|z) & = \prod_{i=1}^{n} P(t_i=1|z_i)^{t_i} * (1 - P(t_i=1|z_i))^{1-t_i} \\ & = \prod_{i=1}^{n} y_i^{t_i} * (1 - y_i)^{1-t_i} \end{split}$$ The cross entropy cost function can be defined as (by taking the negative $log$):
$$\begin{split} \xi(t,y) & = - log \mathcal{L}(\theta|t,z) \\ & = - \sum_{i=1}^{n} \left[ t_i log(y_i) + (1-t_i)log(1-y_i) \right] \\ & = - \sum_{i=1}^{n} \left[ t_i log(\sigma(z_i)) + (1-t_i)log(1-\sigma(z_i)) \right] \end{split}$$ Since $t$ can only be 0 or 1, for a single sample this can be written as: $$\xi(t,y) = -t * log(y) - (1-t) * log(1-y)$$
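To build some intuition, here is a small sketch I added (the probability values are made up, purely for illustration): the per-sample cross entropy is low when the predicted probability $y$ agrees with the target $t$ and grows quickly as they disagree.
# Sketch: per-sample cross entropy -t*log(y) - (1-t)*log(1-y), illustrative values only
def cross_entropy(y, t):
    return -t * np.log(y) - (1 - t) * np.log(1 - y)
print(cross_entropy(0.9, 1))  # ~0.105: confident and correct -> low cost
print(cross_entropy(0.1, 1))  # ~2.303: confident but wrong   -> high cost
print(cross_entropy(0.1, 0))  # ~0.105: confident and correct -> low cost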
Gradient descent
can be defined as:
$$w(k+1) = w(k) - \Delta w(k)$$
$$\Delta w(k) = \mu\frac{\partial \xi}{\partial w}, \;\;\; \text{where } \mu \text{ is the learning rate}$$
Simply apply the chain rule here:
$$\frac{\partial \xi_i}{\partial \mathbf{w}} = \frac{\partial z_i}{\partial \mathbf{w}} \frac{\partial y_i}{\partial z_i} \frac{\partial \xi_i}{\partial y_i}$$
(1) $$\begin{split} \frac{\partial \xi}{\partial y} & = \frac{\partial (-t * log(y) - (1-t)* log(1-y))}{\partial y} = \frac{\partial (-t * log(y))}{\partial y} + \frac{\partial (- (1-t)*log(1-y))}{\partial y} \\ & = -\frac{t}{y} + \frac{1-t}{1-y} = \frac{y-t}{y(1-y)} \end{split}$$
(2) $$\frac{\partial y}{\partial z} = \frac{\partial \sigma(z)}{\partial z} = \frac{\partial \frac{1}{1+e^{-z}}}{\partial z} = \frac{-1}{(1+e^{-z})^2} *e^{-z}*-1 = \frac{1}{1+e^{-z}} \frac{e^{-z}}{1+e^{-z}} = \sigma(z) * (1- \sigma(z)) = y (1-y)$$
(3) $$\frac{\partial z}{\partial \mathbf{w}} = \frac{\partial (\mathbf{x} * \mathbf{w})}{\partial \mathbf{w}} = \mathbf{x}$$
So combine (1) - (3): $$\frac{\partial \xi_i}{\partial \mathbf{w}} = \frac{\partial z_i}{\partial \mathbf{w}} \frac{\partial y_i}{\partial z_i} \frac{\partial \xi_i}{\partial y_i} = \mathbf{x} * y_i (1 - y_i) * \frac{y_i - t_i}{y_i (1-y_i)} = \mathbf{x} * (y_i-t_i)$$
Finally, we get: $$\Delta w_j = \mu * \sum_{i=1}^{N} x_{ij} (y_i - t_i)$$
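As a quick numerical check of step (2) above (my own sketch, not part of the original tutorial), the analytic derivative $\sigma(z)(1-\sigma(z))$ agrees with a central-difference approximation; the sigma helper below is just the logistic function.
# Sketch: verify d(sigma)/dz = sigma(z) * (1 - sigma(z)) by central differences
sigma = lambda z: 1 / (1 + np.exp(-z))   # same as the logistic function defined below
z, eps = 0.5, 1e-6
numeric = (sigma(z + eps) - sigma(z - eps)) / (2 * eps)
analytic = sigma(z) * (1 - sigma(z))
print(numeric, analytic)  # both approximately 0.2350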
First of all, define the logistic function logistic and the model nn. The cost function is the sum of the cross entropy over all training samples.
In [3]:
# Define the logistic function
def logistic(z):
    return 1 / (1 + np.exp(-z))

# Define the neural network function y = 1 / (1 + numpy.exp(-x*w))
# x: 40x2 and w: 1x2, so use w.T here
def nn(x, w):
    return logistic(x.dot(w.T))  # 40x1 -> this is y

# Define the neural network prediction function that only returns
# 1 or 0 depending on the predicted class
def nn_predict(x, w):
    return np.around(nn(x, w))

# Define the cost function
def cost(y, t):
    return -np.sum(np.multiply(t, np.log(y)) + np.multiply((1-t), np.log(1-y)))  # y and t are both 40x1
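As a quick usage check of these helpers (my own addition; the weight vector below is arbitrary and only for illustration):
# Sketch: evaluate the untrained model with an arbitrary weight vector
w_test = np.asmatrix([1.0, 1.0])                           # illustrative weights, not trained
print(nn(X, w_test).shape)                                 # (40, 1) predicted probabilities
print(cost(nn(X, w_test), t))                              # scalar cross-entropy cost
print(np.mean(np.asarray(nn_predict(X, w_test)) == t))     # fraction of correctly classified samples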
Plot the cost function over the weight space; as you can see, it is convex and has a single global minimum.
In [4]:
# Plot the cost in function of the weights
# Define a vector of weights for which we want to plot the cost
nb_of_ws = 100 # compute the cost nb_of_ws times in each dimension
ws1 = np.linspace(-5, 5, num=nb_of_ws) # weight 1
ws2 = np.linspace(-5, 5, num=nb_of_ws) # weight 2
ws_x, ws_y = np.meshgrid(ws1, ws2) # generate grid
cost_ws = np.zeros((nb_of_ws, nb_of_ws)) # initialize cost matrix
# Fill the cost matrix for each combination of weights
for i in range(nb_of_ws):
    for j in range(nb_of_ws):
        cost_ws[i,j] = cost(nn(X, np.asmatrix([ws_x[i,j], ws_y[i,j]])), t)
# Plot the cost function surface
plt.contourf(ws_x, ws_y, cost_ws, 20, cmap=cm.pink)
cbar = plt.colorbar()
cbar.ax.set_ylabel('$\\xi$', fontsize=15)
plt.xlabel('$w_1$', fontsize=15)
plt.ylabel('$w_2$', fontsize=15)
plt.title('Cost function surface')
plt.grid()
plt.show()
The gradient and delta_w functions simply implement the equations we derived above.
In [5]:
# Define the gradient function
def gradient(w, x, t):
    return (nn(x, w) - t).T * x

# Define the update function delta_w, which returns the
# delta w for each weight in a vector
def delta_w(w_k, x, t, learning_rate):
    return learning_rate * gradient(w_k, x, t)
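Before training, it is worth checking that the analytic gradient we derived matches a numerical gradient of the cost function. This is my own sketch, reusing the cost, nn and gradient functions defined above; the weight values are arbitrary.
# Sketch: compare the analytic gradient with a finite-difference gradient
w_check = np.asmatrix([0.5, -1.0])  # arbitrary weights, for illustration only
eps = 1e-5
num_grad = np.zeros((1, 2))
for j in range(2):
    w_plus, w_minus = w_check.copy(), w_check.copy()
    w_plus[0, j] += eps
    w_minus[0, j] -= eps
    num_grad[0, j] = (cost(nn(X, w_plus), t) - cost(nn(X, w_minus), t)) / (2 * eps)
print(gradient(w_check, X, t))  # analytic gradient
print(num_grad)                 # should be nearly identical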
Start training and iterate for 10 steps. w = w - dw
is the key step: it updates w during each iteration.
In [6]:
# Set the initial weight parameter
w = np.asmatrix([-4, -2])
# Set the learning rate
learning_rate = 0.05
# Start the gradient descent updates and plot the iterations
nb_of_iterations = 10 # Number of gradient descent updates
w_iter = [w] # List to store the weight values over the iterations
for i in range(nb_of_iterations):
    dw = delta_w(w, X, t, learning_rate)  # Get the delta w update
    w = w - dw  # Update the weights
    w_iter.append(w)  # Store the weights for plotting
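After the ten updates we can check how well the fitted weights classify the training data (a small check I added, reusing the functions defined above):
# Sketch: evaluate the trained model on the training data
print(w)                                           # final weights after 10 updates
print(cost(nn(X, w), t))                           # final cross-entropy cost
print(np.mean(np.asarray(nn_predict(X, w)) == t))  # training accuracy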
Plot just the first 4 iterations and you can see the weights moving toward the global minimum.
In [7]:
# Plot the first weight updates on the error surface
# Plot the error surface
plt.contourf(ws_x, ws_y, cost_ws, 20, alpha=0.9, cmap=cm.pink)
cbar = plt.colorbar()
cbar.ax.set_ylabel('cost')
# Plot the updates
for i in range(1, 4):
    w1 = w_iter[i-1]
    w2 = w_iter[i]
    # Plot the weight-cost value and the line that represents the update
    plt.plot(w1[0,0], w1[0,1], 'bo')  # Plot the weight cost value
    plt.plot([w1[0,0], w2[0,0]], [w1[0,1], w2[0,1]], 'b-')
    plt.text(w1[0,0]-0.2, w1[0,1]+0.4, '$w({})$'.format(i), color='b')
w1 = w_iter[3]
# Plot the last weight
plt.plot(w1[0,0], w1[0,1], 'bo')
plt.text(w1[0,0]-0.2, w1[0,1]+0.4, '$w({})$'.format(4), color='b')
# Show figure
plt.xlabel('$w_1$', fontsize=15)
plt.ylabel('$w_2$', fontsize=15)
plt.title('Gradient descent updates on cost surface')
plt.grid()
plt.show()
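Since the model is linear in the inputs, the decision boundary is the line where $\mathbf{x} * \mathbf{w}^T = 0$. The sketch below is my own addition (not from the original tutorial): it classifies a grid of points with the trained weights w and overlays the generated samples.
# Sketch: visualize the decision boundary of the trained model in the input space
nb_of_xs = 200
xs1 = np.linspace(-4, 4, num=nb_of_xs)
xs2 = np.linspace(-4, 4, num=nb_of_xs)
xx, yy = np.meshgrid(xs1, xs2)  # grid over the input plane
# Predict the class of every grid point with the trained weights w
grid_pred = np.asarray(nn_predict(np.c_[xx.ravel(), yy.ravel()], w)).reshape(xx.shape)
plt.contourf(xx, yy, grid_pred, levels=[-0.5, 0.5, 1.5], cmap=ListedColormap(['r', 'b']), alpha=0.2)
plt.plot(x_red[:,0], x_red[:,1], 'ro', label='class red')
plt.plot(x_blue[:,0], x_blue[:,1], 'bo', label='class blue')
plt.legend(loc=2)
plt.xlabel('$x_1$', fontsize=15)
plt.ylabel('$x_2$', fontsize=15)
plt.title('Decision boundary of the trained model')
plt.show()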
The sample code in this note comes from peterroelants.github.io, which provides more details on neural networks and deep learning. It is very informative and highly recommended. This note is more like my personal memo.