In this assignment, we will walk you through the process of implementing

  • A softmax function
  • A simple neural network
  • Back propagation
  • Word2vec models

and training your own word vectors with stochastic gradient descent (SGD) for a sentiment analysis task.


In [10]:
using PyPlot;

1. Softmax

If you want the outputs of a network to be interpretable as posterior probabilities for a categorical target variable, it is highly desirable for those outputs to lie between zero and one and to sum to one. The purpose of the softmax activation function is to enforce these constraints on the outputs.

http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-12.html

Softmax is invariant to constant offsets in its input:

$$softmax(x) = softmax(x + c)$$

where $x + c$ means adding the constant $c$ to every dimension of $x$.
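To see why this holds, multiply the numerator and denominator of each softmax component by $e^c$:

$$softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} = \frac{e^{c}\,e^{x_i}}{e^{c}\sum_j e^{x_j}} = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}} = softmax(x + c)_i$$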

Note: In practice, we make use of this property and choose $c = -\max_i x_i$ when computing softmax probabilities for numerical stability (i.e. subtracting the maximum element of $x$ from every element of $x$). For example, $softmax([1001, 1002])$ is then computed as $softmax([-1, 0])$, so $e^{1001}$ never has to be evaluated.

Hence you can always pick one of the output units, and add an appropriate constant to each net input to produce any desired net input for the selected output unit, which you can choose to be zero or whatever is convenient. You can use the same trick to make sure that none of the exponentials overflows.

Given an input matrix of N rows and d columns, compute the softmax prediction for each row. That is, when the input is

[[1,2],
[3,4]]

the output of your function should be

[[0.2689, 0.7311],
[0.2689, 0.7311]]

In [11]:
function softmax(x)
    # Softmax function #
    ###################################################################
    # Compute the softmax function for the input here.                #
    # It is crucial that this function is optimized for speed because #
    # it will be used frequently in later code.                       #
    # You might find the functions exp, sum, maximum and elementwise  #
    # (broadcasting) operations useful for this task.                 #
    # You should also make sure that your code works for one          #
    # dimensional inputs (treat the vector as a row); you might find  #
    # it helpful for your later problems.                             #
    #                                                                 #
    # http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression     #
    ###################################################################
    ### YOUR CODE HERE
    # Treat a one-dimensional input as a single row.
    if ndims(x) == 1
        x = reshape(x, 1, length(x));
    end
    # Shift each row by its maximum for numerical stability, exponentiate,
    # and normalize so that each row sums to one.
    out = zeros(size(x));
    for r = 1:size(x,1)
        shifted = exp(x[r,:] - maximum(x[r,:]));
        out[r,:] = shifted / sum(shifted);
    end
    x = out;
    ### END YOUR CODE
    return x;
end


Out[11]:
softmax (generic function with 1 method)

In [12]:
# Verify your softmax implementation

println("=== For autograder ===");
println(softmax([[1 2];[3 4]]));
println(softmax([[1001 1002];[3 4]]));
println(softmax([[-1001 -1002]]));


=== For autograder ===
[0.2689414213699951 0.7310585786300049
 0.2689414213699951 0.7310585786300049]
[0.2689414213699951 0.7310585786300049
 0.2689414213699951 0.7310585786300049]
[0.7310585786300049 0.2689414213699951]
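The row loop above works, but the same computation can also be written with broadcasting alone. A minimal sketch of a vectorized variant (the name softmax_rows is just for illustration, and it uses current Julia syntax with exp. and dims=, which differs slightly from the older syntax used elsewhere in this notebook):

# Vectorized row-wise softmax: subtract each row's maximum for numerical
# stability, exponentiate, and normalize each row so that it sums to one.
function softmax_rows(x::AbstractMatrix)
    e = exp.(x .- maximum(x, dims=2))
    return e ./ sum(e, dims=2)
end

For the matrix [1 2; 3 4] this gives the same rows 0.2689, 0.7311 as the loop version above.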

2. Neural network basics

In this part, we're going to implement

  • A sigmoid activation function and its gradient
  • Forward propagation for a simple neural network with cross-entropy cost
  • Backward propagation to compute gradients for the parameters
  • A gradient / derivative check

In [4]:
function sigmoid(x)
    # Sigmoid function #
    ###################################################################
    # Compute the sigmoid function for the input here.                #
    ###################################################################
    
    ### YOUR CODE HERE
    x = 1.0./(1.0+exp(-x));
    ### END YOUR CODE
    
    return x;
end


Out[4]:
sigmoid (generic function with 1 method)
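The gradient cell below relies on the sigmoid derivative identity: with $f = \sigma(x)$,

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) = f\,(1 - f)$$

which is why sigmoid_grad takes the already-computed sigmoid value $f$ rather than the original input $x$.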

In [5]:
function sigmoid_grad(f)
    # Sigmoid gradient function #
    ###################################################################
    # Compute the gradient for the sigmoid function here. Note that   #
    # for this implementation, the input f should be the sigmoid      #
    # function value of your original input x.                        #
    ###################################################################
    
    ### YOUR CODE HERE
    # Since f = sigmoid(x), the gradient is sigmoid(x)*(1 - sigmoid(x)) = f*(1 - f).
    f = f.*(1.0-f);
    ### END YOUR CODE
    
    return f;
end


Out[5]:
sigmoid_grad (generic function with 1 method)

In [6]:
# Check your sigmoid implementation
x = [[1 2]; [-1 -2]];
f = sigmoid(x);
g = sigmoid_grad(f);
println("=== For autograder ===");
println(f);
println(g);


=== For autograder ===
[0.7310585786300049 0.8807970779778823
 0.2689414213699951 0.11920292202211755]
[0.19661193324148185 0.10499358540350662
 0.19661193324148185 0.10499358540350662]

Now use the functions implemented above to implement a neural network with one sigmoid hidden layer.

To calculate the numerical gradient we'll use the centered difference approximation: $$\frac{df(x)}{dx} \approx \frac{f(x + h) - f(x - h)}{2h}$$ for a small $h$.
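Inside the gradient checker this approximation is applied one coordinate at a time: temporarily perturb x[ix] by +h and -h, evaluate the cost each time, and restore the old value before moving on. A minimal sketch of that single step (the helper numgrad_at is purely illustrative and not part of the assignment code):

# Hypothetical helper: centered-difference estimate of the gradient of f
# with respect to the single coordinate x[ix]; f returns (cost, grad).
function numgrad_at(f, x, ix; h=1e-4)
    old = x[ix]
    x[ix] = old + h; fplus  = f(x)[1]    # cost with x[ix] increased by h
    x[ix] = old - h; fminus = f(x)[1]    # cost with x[ix] decreased by h
    x[ix] = old                          # restore the original value
    return (fplus - fminus) / (2h)
end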


In [7]:
# First implement a gradient checker by filling in the following function
function gradcheck_naive(f, x)
    ###
    # Gradient check for a function f 
    # - f should be a function that takes a single argument and outputs the cost and its gradients
    # - x is the point (an array) to check the gradient at
    ### 

    fx, grad = f(x) # Evaluate function value at original point
    h = 1e-4

    # Iterate over all indexes in x
    for ix = 1:length(grad)
        ### YOUR CODE HERE: try modifying x[ix] with h defined above to compute numerical gradients
        ### if f has built-in randomness, reset the random number generator to the same state
        ### before each call to f(x) so that the two cost evaluations are comparable
        oldval = x[ix];
        x[ix] = oldval - h;
        fxBefore = f(x)[1];
        x[ix] = oldval + h;
        fxAfter = f(x)[1];
        x[ix] = oldval;                  # restore the original value
        numgrad = (fxAfter - fxBefore)/(2*h);
        ### END YOUR CODE

        # Compare gradients
        reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
        if(reldiff > 1e-5)
            println("Gradient check failed.")
            println("First gradient error found at index ",ix)
            println("Your gradient: ", grad[ix] ," \t Numerical gradient:", numgrad)
            return
        end
    end
    println("Gradient check passed!");
end


Out[7]:
gradcheck_naive (generic function with 1 method)

In [8]:
function quad(x)
    return sum(x.^2),x*2
end


Out[8]:
quad (generic function with 1 method)

In [9]:
# Sanity check for the gradient checker

println("=== For autograder ===")
gradcheck_naive(quad, collect(123.456))      # scalar test
gradcheck_naive(quad, randn(3,))    # 1-D test
gradcheck_naive(quad, randn(4,5))   # 2-D test


=== For autograder ===
Gradient check passed!
Gradient check passed!
Gradient check passed!
In [37]:
# Set up fake data and parameters for the neural network
N = 20
dimensions = [10, 5, 10]
data = randn(N, dimensions[1])   # each row will be a datum
labels = zeros((N, dimensions[3]))
for i=1:N
    labels[i, rand(1:dimensions[3])] = 1
end
params = randn((dimensions[1] + 1) * dimensions[2] + (dimensions[2] + 1) * dimensions[3], )


Out[37]:
115-element Array{Float64,1}:
 -2.72695   
  0.612086  
 -0.0590889 
 -1.36927   
 -0.00648123
  1.34737   
  1.92513   
  0.0629283 
  0.46731   
 -1.29867   
  1.0087    
 -0.0265673 
 -0.0486206 
  ⋮         
 -0.360809  
  0.277802  
 -0.521234  
  0.759818  
  0.325637  
 -0.477799  
  0.514196  
 -1.6942    
 -0.319777  
 -1.26717   
  0.443363  
 -0.585058  
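The parameter vector has $(10 + 1) \cdot 5 + (5 + 1) \cdot 10 = 55 + 60 = 115$ entries: W1 ($10 \times 5$), b1 ($1 \times 5$), W2 ($5 \times 10$) and b2 ($1 \times 10$), stacked in that order, which is exactly how forward_backward_prop below unpacks them.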

In [97]:
function forward_backward_prop(data, labels, params)
    # Forward and backward propagation for a two-layer sigmoidal network
    ###################################################################
    # Compute the forward propagation and the cross entropy cost,     #
    # and the backward propagation for the gradients of all the       #
    # parameters.                                                     #
    ###################################################################
    
    ### Unpack network parameters (do not modify)
    t = 1
    W1 = reshape(params[t:dimensions[1]*dimensions[2]], (dimensions[1], dimensions[2]))
    t += dimensions[1]*dimensions[2]
    b1 = reshape(params[t:t-1+dimensions[2]], (1, dimensions[2]))
    t += dimensions[2]
    W2 = reshape(params[t:t-1+dimensions[2]*dimensions[3]], (dimensions[2], dimensions[3]))
    t += dimensions[2]*dimensions[3]
    b2 = reshape(params[t:t-1+dimensions[3]], (1, dimensions[3]))
    
    ### YOUR CODE HERE: forward propagation
    N, D = size(data)

    # Hidden layer: affine transformation followed by the sigmoid
    z1 = (data*W1).+b1
    h = sigmoid(z1)
    # Output layer: affine transformation followed by row-wise softmax
    scores = softmax((h*W2).+b2)
    # Average cross-entropy cost over the N examples (labels are one-hot)
    cost = -sum(log(scores[labels .== 1])) / N
    ### END YOUR CODE
    
    ### YOUR CODE HERE: backward propagation
    # Gradient of the cost with respect to the softmax inputs
    dscores = (scores - labels) / N

    gradb2 = sum(dscores, 1)
    gradW2 = h' * dscores

    # Back-propagate through the sigmoid hidden layer
    grad_h = dscores * W2'
    grad_h = sigmoid_grad(h) .* grad_h

    gradb1 = sum(grad_h, 1)
    gradW1 = data' * grad_h
    ### END YOUR CODE
    
    ### Stack gradients in the same order as the parameter vector
    grad = vcat(gradW1[:], gradb1[:], gradW2[:], gradb2[:])
    
    return cost, grad
end


Out[97]:
forward_backward_prop (generic function with 1 method)
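The backward pass above uses the standard gradient of softmax combined with the cross-entropy cost: writing $z_2 = h W_2 + b_2$, $\hat{y} = softmax(z_2)$ and one-hot labels $y$,

$$\frac{\partial J}{\partial z_2} = \frac{1}{N}(\hat{y} - y), \qquad \frac{\partial J}{\partial W_2} = h^\top \frac{\partial J}{\partial z_2}, \qquad \frac{\partial J}{\partial h} = \frac{\partial J}{\partial z_2} W_2^\top,$$

and the sigmoid hidden layer contributes the elementwise factor $h(1 - h)$ before the same pattern yields the gradients for $W_1$ and $b_1$.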

In [98]:
# Perform gradcheck on your neural network
println("=== For autograder ===")
gradcheck_naive(p -> forward_backward_prop(data, labels, p), params)


=== For autograder ===
Gradient check passed!
