In [11]:
using PyPlot
In [12]:
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = zeros(N*K,D) # data matrix (each row = single example)
y = zeros(Int64, N*K) # class labels
for j in 1:K
    idx = (1 + N*(j-1)):(N*j)   # row indices for class j in X and y
    r = linspace(0.0, 1, N)     # radius
    t = linspace((j-1)*4, j*4, N) + randn(N)*0.2   # theta
    X[idx,:] = [r.*sin(t) r.*cos(t)]
    y[idx] = j
end
In [13]:
# let's visualize the data:
scatter(X[:, 1], X[:, 2], s=40, c=y, alpha=0.5)
Out[13]:
In [14]:
# initialize parameters randomly
h = 100 # size of hidden layer
W1 = 0.01 * randn(D,h)
b1 = zeros(1,h)
W2 = 0.01 * randn(h,K)
b2 = zeros(1,K)
Out[14]:
In [15]:
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
Out[15]:
we have to backpropagate the ReLU non-linearity. This turns out to be easy because ReLU during the backward pass is effectively a switch. Since $r = max(0, x)$, we have that $\frac{dr}{dx} = 1(x > 0)$. Combined with the chain rule, we see that the ReLU unit lets the gradient pass through unchanged if its input was greater than 0, but kills it if its input was less than zero during the forward pass.
The derivative of the hard ReLU is piecewise constant: $g'(x) = 1$ for $x > 0$ and $g'(x) = 0$ for $x < 0$ (at $x = 0$ itself the function is not differentiable; the gradient there is conventionally taken to be 0).
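As a minimal sketch of this gating behavior (the helper name `relu_backward` is illustrative, not part of the notebook's code), the backward pass simply copies the upstream gradient and zeroes it wherever the forward input was not positive:
In [ ]:
# Illustrative sketch: gradient of the ReLU with respect to its input x,
# given the upstream gradient dout. Where x <= 0 the unit was inactive,
# so no gradient is allowed through.
function relu_backward(dout, x)
    dx = copy(dout)
    dx[x .<= 0] = 0
    return dx
end
The training loop below does the same thing in one line, `dhidden[hidden_layer .<= 0] = 0`, masking on the ReLU output, which is zero exactly where the input was non-positive.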
In [ ]:
function relu(x)
    # element-wise ReLU: returns max(0, x)
    return max(0, x);
end
In [18]:
# gradient descent loop
num_examples = size(X,1)
numIterations = 10000
J = zeros(numIterations,1);
for i in 1:numIterations
    # evaluate class scores, [N x K]
    hidden_layer = relu(X*W1 .+ b1)   # note, ReLU activation
    scores = hidden_layer*W2 .+ b2

    # compute the class probabilities (softmax)
    exp_scores = exp(scores)
    probs = exp_scores ./ sum(exp_scores, 2)   # [N x K]

    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = zeros(num_examples)
    for j in 1:num_examples
        correct_logprobs[j] = -log(probs[j, y[j]])
    end
    data_loss = sum(correct_logprobs)/num_examples
    reg_loss = 0.5*reg*sum(W1.^2) + 0.5*reg*sum(W2.^2)
    J[i] = data_loss + reg_loss
    if i == 1 || i % 1000 == 0
        println("iteration: ", i, " loss: ", J[i])
    end

    # compute the gradient on scores: softmax probabilities minus 1 at the correct class
    dscores = probs
    for j in 1:size(dscores,1)
        dscores[j, y[j]] -= 1
    end
    dscores /= num_examples

    # backpropagate the gradient to the parameters
    # first backprop into parameters W2 and b2
    dW2 = hidden_layer'*dscores
    db2 = sum(dscores, 1)
    # next backprop into hidden layer
    dhidden = dscores*W2'
    #println(size(hidden_layer))
    #println(size(dhidden))
    # backprop the ReLU non-linearity: no gradient flows where the unit was inactive
    dhidden[hidden_layer .<= 0] = 0
    # finally into W1, b1
    dW1 = X'*dhidden
    db1 = sum(dhidden, 1)

    # add regularization gradient contribution
    dW2 += reg * W2
    dW1 += reg * W1

    # perform a parameter update
    W1 += -step_size * dW1
    b1 += -step_size * db1
    W2 += -step_size * dW2
    b2 += -step_size * db2
end
Kinks in the objective. One source of inaccuracy to be aware of during gradient checking is the problem of kinks. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU ($max(0,x)$), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at $x=−1e−6$. Since $x<0$, the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because $f(x+h)$ might cross over the kink (e.g. if $h>1e−6$) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 $max(0,x)$ terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.
Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be done by keeping track of the identities of all "winners" in a function of the form $max(x,y)$; that is, whether x or y was higher during the forward pass. If the identity of at least one winner changes when evaluating $f(x+h)$ and then $f(x−h)$, then a kink was crossed and the numerical gradient will not be exact.
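As a small standalone illustration (separate from the training code above), a centered-difference estimate of the gradient of $max(0,x)$ just below the kink disagrees with the analytic gradient whenever the step $h$ carries $x+h$ across zero; the "winner" of the max also flips between the two evaluations, which is exactly the signal described above.
In [ ]:
# Numerical vs. analytic gradient of max(0, x) just below the kink at x = 0.
f(x) = max(0, x)
x0 = -1e-6                                   # a tiny negative input
h  = 1e-5                                    # large enough that x0 + h > 0

numerical = (f(x0 + h) - f(x0 - h)) / (2h)   # about 0.45, contaminated by the kink
analytic  = x0 > 0 ? 1.0 : 0.0               # exactly 0, since x0 < 0
println("numerical: ", numerical, "   analytic: ", analytic)

# The "winner" of max(0, x) is 0 at f(x0 - h) but x at f(x0 + h):
# a kink was crossed, so the numerical gradient is not exact.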
In [19]:
# plot the cost per iteration
plot(1:length(J), J)
xlabel("Iterations")
ylabel("Cost")
grid("on")
This loss curve looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it is hard to say). A noticeably noisy cost curve would usually suggest that the batch size is too low; note, though, that this loop computes the gradient over the full training set at every iteration rather than over mini-batches.
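As a hedged sketch of the mini-batch alternative mentioned above (the names `batch_size`, `batch_idx`, `X_batch`, and `y_batch` are illustrative, not part of this notebook), each iteration would draw a random subset of rows and run the forward and backward passes on it:
In [ ]:
# Illustrative mini-batch sampling (this notebook's loop trains on the full batch).
batch_size = 32
batch_idx  = rand(1:num_examples, batch_size)   # indices sampled with replacement
X_batch = X[batch_idx, :]
y_batch = y[batch_idx]
# the forward pass, loss, gradients, and parameter update would then use
# X_batch and y_batch in place of X and y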
In [20]:
# evaluate training set accuracy
hidden_layer = max(0, X*W1 .+ b1)
scores = hidden_layer*W2 .+ b2
predicted_class = zeros(size(scores,1))
for i in 1:size(scores,1)
    predicted_class[i] = indmax(scores[i,:])   # class with the highest score
end
#println(predicted_class)
correct = 0;
for i in 1:length(y)
    if y[i] == predicted_class[i]
        correct = correct + 1;
    end
end
println("training accuracy: ", correct/length(y))
In [21]:
# plot the resulting classifier
h = 0.02;   # mesh step size for the plotting grid (reuses the name h; the hidden-layer size is no longer needed)
x_min = minimum(X[:, 1]) - 1;
x_max = maximum(X[:, 1]) + 1;
y_min = minimum(X[:, 2]) - 1;
y_max = maximum(X[:, 2]) + 1;
numX = convert(Int, floor((x_max - x_min)/h));
xx = zeros(numX);
xx[1] = x_min;
yy = zeros(numX);
yy[1] = y_min;
for i in 2:numX
    xx[i] = xx[i-1] + h;
    yy[i] = yy[i-1] + h;
end
grid_x = [i for i in xx, j in yy];
grid_y = [j for i in xx, j in yy];
xy = [grid_x[:] grid_y[:]];   # every grid point as a row [x y]
# forward pass of the trained network over the grid
z0 = xy*W1 .+ b1
z0[z0 .< 0] = 0               # ReLU
z = z0*W2 .+ b2
zz = zeros(size(z,1));
for i in 1:size(z,1)
    zz[i] = indmax(z[i,:])    # predicted class for each grid point
end
zz = reshape(zz, size(grid_x));
In [22]:
contourf(grid_x, grid_y, zz, cmap=get_cmap("Spectral"), alpha=0.8)
scatter(X[:, 1], X[:, 2], c=y, s=40)
Out[22]:
In [ ]: