In this assignment, we will walk you through the process of implementing
and training your own word vectors with stochastic gradient descent (SGD) for a sentiment analysis task.
In [10]:
using PyPlot;
If you want the outputs of a network to be interpretable as posterior probabilities for a categorical target variable, it is highly desirable for those outputs to lie between zero and one and to sum to one. The purpose of the softmax activation function is to enforce these constraints on the outputs.
http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html
$$softmax(x) = softmax(x + c)$$
where $x + c$ means adding the constant $c$ to every dimension of $x$.
Note: In practice, we make use of this property and choose $c = -\max_i x_i$ when computing softmax probabilities for numerical stability (i.e. subtracting its maximum element from all elements of $x$).
Hence you can always pick one of the output units, and add an appropriate constant to each net input to produce any desired net input for the selected output unit, which you can choose to be zero or whatever is convenient. You can use the same trick to make sure that none of the exponentials overflows.
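Concretely, the invariance holds because the common factor $e^{c}$ cancels between numerator and denominator:
$$softmax(x + c)_i = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}} = \frac{e^{c}\, e^{x_i}}{e^{c}\sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}} = softmax(x)_i.$$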
Given an input matrix of N rows and d columns, compute the softmax prediction for each row. That is, when the input is
[[1,2],
[3,4]]
the output of your functions should be
[[0.2689, 0.7311],
[0.2689, 0.7311]]
In [11]:
function softmax(x)
    # Softmax function
    ###################################################################
    # Compute the softmax function for the input here.
    # It is crucial that this function is optimized for speed because
    # it will be used frequently in later code.
    # You might find exp., sum, maximum, and Julia broadcasting useful
    # for this task (the original NumPy hints were np.exp, np.sum,
    # np.max, and numpy broadcasting:
    # http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
    # You should also make sure that your code works for one-
    # dimensional inputs (treat the vector as a row); you might find
    # it helpful for your later problems.
    #
    # http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
    ###################################################################
    ### YOUR CODE HERE
    # Treat a one-dimensional input as a single row.
    if ndims(x) == 1
        e = exp.(x .- maximum(x))
        return e ./ sum(e)
    end
    # For matrices: subtract each row's maximum for numerical stability,
    # exponentiate, and normalize so that every row sums to one.
    rows = size(x, 1)
    out = zeros(size(x))
    for r = 1:rows
        e = exp.(x[r, :] .- maximum(x[r, :]))
        out[r, :] = e ./ sum(e)
    end
    x = out
    ### END YOUR CODE
    return x
end
Out[11]:
In [12]:
# Verify your softmax implementation
println("=== For autograder ===");
println(softmax([[1 2];[3 4]]));
println(softmax([[1001 1002];[3 4]]));
println(softmax([-1001 -1002]));
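As a design note, the same row-wise computation can also be written without the explicit loop. The sketch below (the name softmax_vectorized is ours, and it assumes a Julia version that supports the dims keyword) relies on broadcasting against per-row maxima and sums; on the 2x2 example above it produces the same [0.2689 0.7311] rows as the loop version.

# Vectorized softmax sketch: shift each row by its maximum, exponentiate,
# and normalize by the row sums, all through broadcasting.
function softmax_vectorized(x::AbstractMatrix)
    shifted = x .- maximum(x, dims=2)   # numerical-stability shift per row
    e = exp.(shifted)
    return e ./ sum(e, dims=2)
end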
In [4]:
function sigmoid(x)
    # Sigmoid function
    ###################################################################
    # Compute the sigmoid function for the input here.
    ###################################################################
    ### YOUR CODE HERE
    # Elementwise logistic function 1 / (1 + e^(-x)).
    x = 1.0 ./ (1.0 .+ exp.(-x))
    ### END YOUR CODE
    return x
end
Out[4]:
In [5]:
function sigmoid_grad(f)
    # Sigmoid gradient function
    ###################################################################
    # Compute the gradient for the sigmoid function here. Note that
    # for this implementation, the input f should be the sigmoid
    # function value of your original input x.
    ###################################################################
    ### YOUR CODE HERE
    # Since f = sigmoid(x) already, the gradient is simply f * (1 - f)
    # elementwise; applying sigmoid to f again would be incorrect.
    f = f .* (1.0 .- f)
    ### END YOUR CODE
    return f
end
Out[5]:
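The reason sigmoid_grad takes the already-computed sigmoid value $f$ rather than the original input $x$: with $\sigma(x) = \frac{1}{1 + e^{-x}}$,
$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \sigma(x)\,\big(1 - \sigma(x)\big) = f\,(1 - f),$$
so the gradient can be formed directly from $f$ without re-evaluating the exponential.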
In [6]:
# Check your sigmoid implementation
x = [[1 2]; [-1 -2]];
f = sigmoid(x);
g = sigmoid_grad(f);
println("=== For autograder ===");
println(f);
println(g);
Now use the functions implemented above to implement a neural network with one sigmoid hidden layer.
To calculate the numerical gradient we'll use the central difference approximation: $$\frac{df(x)}{dx} \approx \frac{f(x + h) - f(x - h)}{2h}$$
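As a quick standalone illustration of this formula (the names cube, hstep, and numgrad below are only for this example, not part of the assignment code), the central difference recovers the derivative of a simple polynomial to roughly $O(h^2)$ accuracy:

# Central difference check on cube(z) = z^3 at z = 2; the exact derivative is 12.
cube(z) = z^3
hstep = 1e-4
numgrad = (cube(2 + hstep) - cube(2 - hstep)) / (2 * hstep)
println(abs(numgrad - 12.0))   # about 1e-8, shrinking as O(h^2)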
In [7]:
# First implement a gradient checker by filling in the following functions
function gradcheck_naive(f, x)
    ###
    # Gradient check for a function f
    # - f should be a function that takes a single argument and outputs the cost and its gradients
    # - x is the point (a Julia array) to check the gradient at
    ###
    fx, grad = f(x) # Evaluate function value at original point
    h = 1e-4
    # Iterate over all indexes in x
    for ix in eachindex(grad)
        ### YOUR CODE HERE: try modifying x[ix] with h defined above to compute numerical gradients
        ### if f uses randomness internally, reset the RNG to the same state before each call so
        ### that cost functions with built-in randomness can be tested later
        # Perturb x at index ix in both directions and apply the central difference.
        oldval = x[ix]
        x[ix] = oldval + h
        fxAfter = f(x)[1]
        x[ix] = oldval - h
        fxBefore = f(x)[1]
        x[ix] = oldval               # restore the original value
        numgrad = (fxAfter - fxBefore) / (2 * h)
        ### END YOUR CODE
        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if reldiff > 1e-5
            println("Gradient check failed.")
            println("First gradient error found at index ", ix)
            println("Your gradient: ", grad[ix], " \t Numerical gradient: ", numgrad)
            return
        end
    end
    println("Gradient check passed!")
end
Out[7]:
In [8]:
function quad(x)
return sum(x.^2),x*2
end
Out[8]:
In [9]:
# Sanity check for the gradient checker
println("=== For autograder ===")
gradcheck_naive(quad, collect(123.456)) # scalar test
gradcheck_naive(quad, randn(3,)) # 1-D test
gradcheck_naive(quad, randn(4,5)) # 2-D test
In [37]:
# Set up fake data and parameters for the neural network
N = 20
dimensions = [10, 5, 10]
data = randn(N, dimensions[1]) # each row will be a datum
labels = zeros((N, dimensions[3]))
for i=1:N
labels[i, rand(1:dimensions[3])] = 1
end
params = randn((dimensions[1] + 1) * dimensions[2] + (dimensions[2] + 1) * dimensions[3], )
Out[37]:
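For reference on sizes: with dimensions = [10, 5, 10], the parameter vector packs W1, b1, W2, and b2 in that order, so it holds $(10 + 1)\cdot 5 + (5 + 1)\cdot 10 = 55 + 60 = 115$ entries; forward_backward_prop below slices it back apart with the same layout.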
In [97]:
function forward_backward_prop(data, labels, params)
    # Forward and backward propagation for a two-layer sigmoidal network
    ###################################################################
    # Compute the forward propagation and the cross entropy cost,
    # and the backward propagation for the gradients of all parameters.
    ###################################################################
    ### Unpack network parameters (do not modify)
    t = 1
    W1 = reshape(params[t:dimensions[1]*dimensions[2]], (dimensions[1], dimensions[2]))
    t += dimensions[1]*dimensions[2]
    b1 = reshape(params[t:t-1+dimensions[2]], (1, dimensions[2]))
    t += dimensions[2]
    W2 = reshape(params[t:t-1+dimensions[2]*dimensions[3]], (dimensions[2], dimensions[3]))
    t += dimensions[2]*dimensions[3]
    b2 = reshape(params[t:t-1+dimensions[3]], (1, dimensions[3]))

    ### YOUR CODE HERE: forward propagation
    N, D = size(data)
    z1 = data * W1 .+ b1                        # hidden pre-activation, N x H
    activation1 = sigmoid(z1)                   # hidden layer output, N x H
    scores = softmax(activation1 * W2 .+ b2)    # output probabilities, N x C
    cost = -sum(log.(scores[labels .== 1])) / N # mean cross entropy over the batch
    ### END YOUR CODE

    ### YOUR CODE HERE: backward propagation
    dscores = (scores .- labels) ./ N           # gradient w.r.t. output pre-activations
    gradb2 = sum(dscores, dims=1)
    gradW2 = activation1' * dscores
    grad_h = dscores * W2'                      # backpropagate into the hidden layer
    grad_h = sigmoid_grad(activation1) .* grad_h
    gradb1 = sum(grad_h, dims=1)
    gradW1 = data' * grad_h
    ### END YOUR CODE

    ### Stack gradients (do not modify)
    # vec() flattens column-major, matching the reshape() calls used to unpack params above.
    grad = [vec(gradW1); vec(gradb1); vec(gradW2); vec(gradb2)]
    return cost, grad
end
Out[97]:
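The cost computed in the forward pass is the mean cross entropy between the one-hot labels $y$ and the predicted row distributions $\hat{y}$:
$$CE(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j} y_{ij}\,\log \hat{y}_{ij},$$
and because each label row is one-hot, only the log-probability of the correct class contributes, which is exactly what scores[labels .== 1] selects.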
In [98]:
# Perform gradcheck on your neural network
println("=== For autograder ===")
forward_backward_prop(data, labels, params)
gradcheck_naive(params -> forward_backward_prop(data, labels, params), params)
Out[98]: