Interest in neural nets, and in particular those with more than one hidden layer, has been surging in recent years. In this notebook we will revisit the problem of digit classification on the MNIST data. We will introduce a new Python library, Theano, for working with neural nets. Theano is a popular choice because the same code can be run on either CPUs or GPUs. GPUs greatly speed up training and prediction and are readily available (Amazon even offers GPU machines on EC2).
In part 1, we will review the basics of neural nets and introduce Theano. In part 2, we will investigate more advanced topics in neural nets, including deep learning. I'd encourage you to read the Theano paper as a supplementary explanation of the library (http://www.iro.umontreal.ca/~lisa/pointeurs/theano_scipy2010.pdf).
In [1]:
%matplotlib inline
# Familiar libraries.
import numpy as np
from sklearn.datasets import fetch_mldata
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
import time
# Take a moment to install Theano. We will use it for building neural networks.
import theano
from theano import tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
print theano.config.device # We're using CPUs (for now)
print theano.config.floatX # Should be 64 bit for CPUs
np.random.seed(0)
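# A quick sanity check that Theano is working (a minimal sketch, not part of the original project):
# build a tiny symbolic expression and compile it into a callable function. The same compiled graph
# would run on a GPU if theano.config.device were configured accordingly.
a = T.dscalar('a')
b = T.dscalar('b')
add = theano.function(inputs=[a, b], outputs=a + b)
print add(2., 3.)  # prints 5.0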
In [3]:
# Repeating steps from Project 1 to prepare the MNIST dataset.
mnist = fetch_mldata('MNIST original', data_home='~/datasets/mnist')
X, Y = mnist.data, mnist.target
X = X / 255.0
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]
numExamples = 2000
test_data, test_labels = X[70000-numExamples:], Y[70000-numExamples:]
train_data, train_labels = X[:numExamples], Y[:numExamples]
numFeatures = train_data[1].size
numTrainExamples = train_data.shape[0]
numTestExamples = test_data.shape[0]
print 'Features = %d' %(numFeatures)
print 'Train set = %d' %(numTrainExamples)
print 'Test set = %d' %(numTestExamples)
In [4]:
# Convert labels into a set of binary indicator variables, one for each class (a one-hot, or 1-of-n, encoding).
# This makes working with NNs easier: there will be one output node for each class.
def binarizeY(data):
binarized_data = np.zeros((data.size,10))
for j in range(0,data.size):
feature = data[j:j+1]
i = feature.astype(np.int64)
binarized_data[j,i]=1
return binarized_data
train_labels_b = binarizeY(train_labels)
test_labels_b = binarizeY(test_labels)
numClasses = train_labels_b[1].size
print 'Classes = %d' %(numClasses)
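# For example, a label of 3 is encoded as the row [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].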
In [5]:
# Let's start with a simple KNN model to establish a baseline accuracy.
# Question: You've seen a number of different machine learning algos. What's your intuition about KNN scaling and
# accuracy characteristics vs. other algos?
neighbors = 1
# we'll be waiting quite a while if we use 60K
knn = KNeighborsClassifier(neighbors)
mini_train_data, mini_train_labels = X[:numExamples], Y[:numExamples]
start_time = time.time()
knn.fit(mini_train_data, mini_train_labels)
print 'Train time = %.2f' %(time.time() - start_time)
start_time = time.time()
accuracy = knn.score(test_data, test_labels)
print 'Accuracy = %.4f' %(accuracy)
print 'Prediction time = %.2f' %(time.time() - start_time)
In [6]:
# We'll start in Theano by implementing logistic regression.
# Recall the four key components: (1) parms, (2) model, (3) cost, and (4) objective.
## (1) Parms
# Init weights to small, but non-zero, values.
w = theano.shared(np.asarray((np.random.randn(*(numFeatures, numClasses))*.01)))
## (2) Model
# Theano objects accessed with standard Python variables
X = T.matrix()
Y = T.matrix()
# Two things to note here.
# First, logistic regression can be thought of as a neural net with no hidden layers. So the output values are
# just the dot product of the inputs and the edge weights.
# Second, we have 10 classes. So we can either train separate one-vs-all classifiers using sigmoid activation,
# which would be a hassle, or use the softmax activation, which is essentially a multi-class version of the sigmoid.
def model(X, w):
return T.nnet.softmax(T.dot(X, w))
y_hat = model(X, w)
## (3) Cost
# Cross entropy only penalizes the probability assigned to the true class, not how the remaining probability is
# spread across the false classes. This tends to help the network converge faster.
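# For a single example with a one-hot target, categorical cross entropy reduces to -log(y_hat[true_class]),
# so the cost below is simply the mean negative log-probability assigned to the correct digit.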
cost = T.mean(T.nnet.categorical_crossentropy(y_hat, Y))
## (4) Objective
# Minimization using gradient descent.
alpha = 0.01
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * alpha]]
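# 'update' is a list of (shared variable, new expression) pairs; Theano applies them after each call to train(),
# i.e. w := w - alpha * dCost/dw, computed over whatever batch of examples is passed in.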
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True) # computes cost, then runs update
y_pred = T.argmax(y_hat, axis=1) # select largest probability as prediction
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
def gradientDescent(epochs):
trainTime = 0.0
predictTime = 0.0
for i in range(epochs):
start_time = time.time()
cost = train(train_data[0:len(train_data)], train_labels_b[0:len(train_data)])
trainTime = trainTime + (time.time() - start_time)
print '%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data)))
print 'train time = %.2f' %(trainTime)
gradientDescent(50)
start_time = time.time()
predict(test_data)
print 'predict time = %.2f' %(time.time() - start_time)
In [20]:
## Let's switch to SGD and observe the impact.
## (1) Parms
w = theano.shared(np.asarray((np.random.randn(*(numFeatures, numClasses))*.01)))
## (2) Model
X = T.matrix()
Y = T.matrix()
def model(X, w):
return T.nnet.softmax(T.dot(X, w))
y_hat = model(X, w)
## (3) Cost
cost = T.mean(T.nnet.categorical_crossentropy(y_hat, Y))
## (4) Objective
alpha = 0.01
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * alpha]]
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
y_pred = T.argmax(y_hat, axis=1)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
## Play with this value and notice the impact.
miniBatchSize = 1
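# With miniBatchSize = 1 this is pure SGD (one example per update); larger values (e.g. 50 or 100) give
# smoother gradient estimates and much faster epochs because more of the work is vectorized.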
def gradientDescentStochastic(epochs):
trainTime = 0.0
predictTime = 0.0
start_time = time.time()
for i in range(epochs):
for start, end in zip(range(0, len(train_data), miniBatchSize), range(miniBatchSize, len(train_data), miniBatchSize)):
cost = train(train_data[start:end], train_labels_b[start:end])
trainTime = trainTime + (time.time() - start_time)
print '%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data)))
print 'train time = %.2f' %(trainTime)
gradientDescentStochastic(10)
start_time = time.time()
predict(test_data)
print 'predict time = %.2f' %(time.time() - start_time)
In [21]:
## Now let's add a hidden layer (two layer neural net).
## (1) Parms
# Try playing with this value.
numHiddenNodes = 600
w_1 = theano.shared(np.asarray((np.random.randn(*(numFeatures, numHiddenNodes))*.01)))
w_2 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodes, numClasses))*.01)))
params = [w_1, w_2]
## (2) Model
X = T.matrix()
Y = T.matrix()
# Two notes:
# First, the feed-forward computation is a composition of layers (a dot product followed by an activation function).
# Second, the hidden layer here still uses sigmoid activation; the output layer uses softmax.
def model(X, w_1, w_2):
return T.nnet.softmax(T.dot(T.nnet.sigmoid(T.dot(X, w_1)), w_2))
y_hat = model(X, w_1, w_2)
## (3) Cost...same as logistic regression
cost = T.mean(T.nnet.categorical_crossentropy(y_hat, Y))
## (4) Minimization. Update rule changes to backpropagation.
alpha = 0.01
def backprop(cost, w):
grads = T.grad(cost=cost, wrt=w)
updates = []
for w1, grad in zip(w, grads):
updates.append([w1, w1 - grad * alpha])
return updates
update = backprop(cost, params)
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
y_pred = T.argmax(y_hat, axis=1)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
miniBatchSize = 1
def gradientDescentStochastic(epochs):
trainTime = 0.0
predictTime = 0.0
start_time = time.time()
for i in range(epochs):
for start, end in zip(range(0, len(train_data), miniBatchSize), range(miniBatchSize, len(train_data), miniBatchSize)):
cost = train(train_data[start:end], train_labels_b[start:end])
trainTime = trainTime + (time.time() - start_time)
print '%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data)))
print 'train time = %.2f' %(trainTime)
gradientDescentStochastic(10)
start_time = time.time()
predict(test_data)
print 'predict time = %.2f' %(time.time() - start_time)
As interest in bigger and deeper networks has increased, a couple of tricks have emerged and become standard practice. Let's look at two of them, rectifier activation and dropout noise, which we'll use with deep networks.
For a more in-depth examination of the topic, check out this 1-day tutorial from KDD2014:
Part 1: http://videolectures.net/kdd2014_bengio_deep_learning/
Part 2: http://videolectures.net/tcmm2014_taylor_deep_learning/
In [22]:
## A curiosity: what happens if we simply add a third layer?
## (1) Parms
numHiddenNodes = 600
w_1 = theano.shared(np.asarray((np.random.randn(*(numFeatures, numHiddenNodes))*.01)))
w_2 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodes, numHiddenNodes))*.01)))
w_3 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodes, numClasses))*.01)))
params = [w_1, w_2, w_3]
## (2) Model
X = T.matrix()
Y = T.matrix()
def model(X, w_1, w_2, w_3):
return T.nnet.softmax(T.dot(T.nnet.sigmoid(T.dot(T.nnet.sigmoid(T.dot(X, w_1)), w_2)), w_3))
y_hat = model(X, w_1, w_2, w_3)
## (3) Cost...same as logistic regression
cost = T.mean(T.nnet.categorical_crossentropy(y_hat, Y))
## (4) Minimization. Update rule changes to backpropagation.
alpha = 0.01
def backprop(cost, w):
grads = T.grad(cost=cost, wrt=w)
updates = []
for w1, grad in zip(w, grads):
updates.append([w1, w1 - grad * alpha])
return updates
update = backprop(cost, params)
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
y_pred = T.argmax(y_hat, axis=1)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
miniBatchSize = 1
def gradientDescentStochastic(epochs):
trainTime = 0.0
predictTime = 0.0
start_time = time.time()
for i in range(epochs):
for start, end in zip(range(0, len(train_data), miniBatchSize), range(miniBatchSize, len(train_data), miniBatchSize)):
cost = train(train_data[start:end], train_labels_b[start:end])
trainTime = trainTime + (time.time() - start_time)
print '%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data)))
print 'train time = %.2f' %(trainTime)
gradientDescentStochastic(10)
start_time = time.time()
predict(test_data)
print 'predict time = %.2f' %(time.time() - start_time)
Before we revisit adding layers, let's look at a recent idea about activation functions that is closely associated with deep learning. In 2010, in a paper presented at a NIPS workshop (https://www.utc.fr/~bordesan/dokuwiki/_media/en/glorot10nipsworkshop.pdf), Xavier Glorot, Antoine Bordes, and Yoshua Bengio showed that rectifier activation works better empirically than sigmoid activation when used in the hidden layers.
The rectifier activation is simple: f(x) = max(0, x). Intuitively, the difference is that as a sigmoid-activated node approaches 1 it stops learning even if error continues to be propagated to it, whereas a rectifier-activated node continues to learn (at least in the positive direction). It is not completely understood (per Yoshua Bengio) why this helps, but there are theories being explored, including ones related to the benefits of sparse representations in networks (http://www.iro.umontreal.ca/~bengioy/talks/KDD2014-tutorial.pdf). Rectifiers also speed up training.
Although the paper was published in 2010, the technique didn't gain widespread adoption until 2012, when members of Hinton's group spread the word, including through this Kaggle-winning entry: http://blog.kaggle.com/2012/11/01/deep-learning-how-i-did-it-merck-1st-place-interview/
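To make the saturation argument concrete, here is a small numeric sketch (not part of the original notebook) comparing the two gradients: the sigmoid's derivative shrinks toward zero as its input grows, while the rectifier's derivative stays at 1 for any positive input.
def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)
def relu_grad(x):
    return 1.0 if x > 0 else 0.0
for x in [0.5, 2.0, 5.0, 10.0]:
    print 'x = %4.1f   sigmoid grad = %.5f   rectifier grad = %.1f' % (x, sigmoid_grad(x), relu_grad(x))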
In [23]:
## 2-layer NN with rectifier activation on the hidden layer.
## (1) Parms
numHiddenNodes = 600
w_1 = theano.shared(np.asarray((np.random.randn(*(numFeatures, numHiddenNodes))*.01)))
w_2 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodes, numClasses))*.01)))
params = [w_1, w_2]
## (2) Model
X = T.matrix()
Y = T.matrix()
def model(X, w_1, w_2):
return T.nnet.softmax(T.dot(T.maximum(T.dot(X, w_1), 0.), w_2))
y_hat = model(X, w_1, w_2)
## (3) Cost...same as logistic regression
cost = T.mean(T.nnet.categorical_crossentropy(y_hat, Y))
## (4) Minimization. Update rule changes to backpropagation.
alpha = 0.01
def backprop(cost, w):
grads = T.grad(cost=cost, wrt=w)
updates = []
for w1, grad in zip(w, grads):
updates.append([w1, w1 - grad * alpha])
return updates
update = backprop(cost, params)
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
y_pred = T.argmax(y_hat, axis=1)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
miniBatchSize = 1
def gradientDescentStochastic(epochs):
trainTime = 0.0
predictTime = 0.0
start_time = time.time()
for i in range(epochs):
for start, end in zip(range(0, len(train_data), miniBatchSize), range(miniBatchSize, len(train_data), miniBatchSize)):
cost = train(train_data[start:end], train_labels_b[start:end])
trainTime = trainTime + (time.time() - start_time)
print '%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data)))
print 'train time = %.2f' %(trainTime)
gradientDescentStochastic(10)
start_time = time.time()
predict(test_data)
print 'predict time = %.2f' %(time.time() - start_time)
As an exercise, switch to maxout (or max pooling) activation. Maxout activation simply selects the largest of its inputs as the output. Maxout is a form of pooling, a technique that performs particularly well on vision problems. (http://jmlr.org/proceedings/papers/v28/goodfellow13.pdf, http://www.quora.com/What-is-impact-of-different-pooling-methods-in-convolutional-neural-networks)
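If you try the exercise, one possible way to express maxout in Theano is sketched below (my own sketch, not the notebook's code): group the hidden units into blocks of size k and keep the maximum of each block. It assumes the number of hidden units is divisible by k.
def maxout(z, k):
    # z has shape (batch, numHiddenNodes); reshape to (batch, numHiddenNodes // k, k)
    # and take the max over each group of k linear units.
    return T.max(z.reshape((z.shape[0], z.shape[1] // k, k)), axis=2)
# Usage: replace T.maximum(T.dot(X, w_1), 0.) with maxout(T.dot(X, w_1), k=5) in model(),
# remembering that w_2 must then have numHiddenNodes // 5 input rows.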
A second trick closely associated with deep learning, and now commonplace, is called 'dropout'. The idea is that instead of (or in addition to) adding noise to our inputs, we add noise by having each node output 0 with a certain probability during training. This trick markedly improves generalization in large networks.
Hinton and colleagues introduced the idea in 2012 and explained why it behaves like a form of bagging (http://arxiv.org/pdf/1207.0580v1.pdf).
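To see why the rescaling in the dropout code below works: with p = 0.5 each hidden node is zeroed half the time during training, and the surviving activations are divided by 1 - p (i.e. doubled), so the expected input to the next layer is unchanged and no rescaling is needed at prediction time.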
In [24]:
# Dropout
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
## (1) Parms
numHiddenNodes = 600
w_1 = theano.shared(np.asarray((np.random.randn(*(numFeatures, numHiddenNodes))*.01)))
w_2 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodes, numClasses))*.01)))
params = [w_1, w_2]
## (2) Model
X = T.matrix()
Y = T.matrix()
srng = RandomStreams()
def dropout(X, p=0.):
if p > 0:
X *= srng.binomial(X.shape, p=1 - p)
X /= 1 - p
return X
def model(X, w_1, w_2, p_1, p_2):
return T.nnet.softmax(T.dot(dropout(T.maximum(T.dot(dropout(X, p_1), w_1),0.), p_2), w_2))
y_hat_train = model(X, w_1, w_2, 0.2, 0.5)
y_hat_predict = model(X, w_1, w_2, 0., 0.)
## (3) Cost...same as logistic regression
cost = T.mean(T.nnet.categorical_crossentropy(y_hat_train, Y))
## (4) Minimization. Update rule changes to backpropagation.
alpha = 0.01
def backprop(cost, w):
grads = T.grad(cost=cost, wrt=w)
updates = []
for w1, grad in zip(w, grads):
updates.append([w1, w1 - grad * alpha])
return updates
update = backprop(cost, params)
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
y_pred = T.argmax(y_hat_predict, axis=1)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
miniBatchSize = 1
def gradientDescentStochastic(epochs):
trainTime = 0.0
predictTime = 0.0
start_time = time.time()
for i in range(epochs):
for start, end in zip(range(0, len(train_data), miniBatchSize), range(miniBatchSize, len(train_data), miniBatchSize)):
cost = train(train_data[start:end], train_labels_b[start:end])
trainTime = trainTime + (time.time() - start_time)
print '%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data)))
print 'train time = %.2f' %(trainTime)
gradientDescentStochastic(10)
start_time = time.time()
predict(test_data)
print 'predict time = %.2f' %(time.time() - start_time)
In [25]:
# Let's add back in that third layer
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
## (1) Parms
numHiddenNodes = 600
w_1 = theano.shared(np.asarray((np.random.randn(*(numFeatures, numHiddenNodes))*.01)))
w_2 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodes, numHiddenNodes))*.01)))
w_3 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodes, numClasses))*.01)))
params = [w_1, w_2, w_3]
## (2) Model
X = T.matrix()
Y = T.matrix()
srng = RandomStreams()
def dropout(X, p=0.):
if p > 0:
X *= srng.binomial(X.shape, p=1 - p)
X /= 1 - p
return X
def model(X, w_1, w_2, w_3, p_1, p_2, p_3):
return T.nnet.softmax(T.dot(dropout(T.maximum(T.dot(dropout(T.maximum(T.dot(dropout(X, p_1), w_1),0.), p_2), w_2),0.), p_3), w_3))
y_hat_train = model(X, w_1, w_2, w_3, 0.2, 0.5,0.5)
y_hat_predict = model(X, w_1, w_2, w_3, 0., 0.,0.)
## (3) Cost...same as logistic regression
cost = T.mean(T.nnet.categorical_crossentropy(y_hat_train, Y))
## (4) Minimization. Update rule changes to backpropagation.
alpha = 0.01
def backprop(cost, w):
grads = T.grad(cost=cost, wrt=w)
updates = []
for w1, grad in zip(w, grads):
updates.append([w1, w1 - grad * alpha])
return updates
update = backprop(cost, params)
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
y_pred = T.argmax(y_hat_predict, axis=1)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
miniBatchSize = 1
def gradientDescentStochastic(epochs):
trainTime = 0.0
predictTime = 0.0
start_time = time.time()
for i in range(epochs):
for start, end in zip(range(0, len(train_data), miniBatchSize), range(miniBatchSize, len(train_data), miniBatchSize)):
cost = train(train_data[start:end], train_labels_b[start:end])
trainTime = trainTime + (time.time() - start_time)
print '%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data)))
print 'train time = %.2f' %(trainTime)
gradientDescentStochastic(10)
start_time = time.time()
predict(test_data)
print 'predict time = %.2f' %(time.time() - start_time)
Today, when the phrase 'deep learning' is used to describe a system, that system is most likely a convolutional net (or convnet). The convnet architecture was largely developed in the late 1990s at Bell Labs, but has only recently been popularized. It was developed for image recognition, and is described and implemented with 2d representations in mind.
Geoffrey Hinton has an excellent two-part lecture on the topic:
https://www.youtube.com/watch?v=6oD3t6u5EPs
https://www.youtube.com/watch?v=fueIAeAsGzA
Also, this code was partly taken from these tutorials, which are worth referring back to:
http://deeplearning.net/tutorial/lenet.html
http://ufldl.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/
In [ ]:
# Now let's build a convolutional net (convnet).
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
from theano.tensor.nnet.conv import conv2d
from theano.tensor.signal.downsample import max_pool_2d
## (1) Parms
numHiddenNodes = 600
patchWidth = 3
patchHeight = 3
featureMapsLayer1 = 32
featureMapsLayer2 = 64
featureMapsLayer3 = 128
# For convnets, we will work in 2d rather than 1d. The MNIST images are 28x28 in 2d.
imageWidth = 28
train_data = train_data.reshape(-1, 1, imageWidth, imageWidth)
test_data = test_data.reshape(-1, 1, imageWidth, imageWidth)
# Convolution layers.
w_1 = theano.shared(np.asarray((np.random.randn(*(featureMapsLayer1, 1, patchWidth, patchHeight))*.01)))
w_2 = theano.shared(np.asarray((np.random.randn(*(featureMapsLayer2, featureMapsLayer1, patchWidth, patchHeight))*.01)))
w_3 = theano.shared(np.asarray((np.random.randn(*(featureMapsLayer3, featureMapsLayer2, patchWidth, patchHeight))*.01)))
# Fully connected NN.
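# Why featureMapsLayer3 * 3 * 3 below? Tracing shapes (assuming the default max_pool_2d behavior that
# keeps partial borders, ignore_border=False): 28x28 -> conv 3x3 'full' -> 30x30 -> pool 2x2 -> 15x15
# -> conv 3x3 -> 13x13 -> pool -> 7x7 -> conv 3x3 -> 5x5 -> pool -> 3x3,
# leaving featureMapsLayer3 feature maps of size 3x3 to flatten into the fully connected layer.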
w_4 = theano.shared(np.asarray((np.random.randn(*(featureMapsLayer3 * 3 * 3, numHiddenNodes))*.01)))
w_5 = theano.shared(np.asarray((np.random.randn(*(numHiddenNodes, numClasses))*.01)))
params = [w_1, w_2, w_3, w_4, w_5]
## (2) Model
X = T.tensor4() # conv2d works with tensor4 type
Y = T.matrix()
srng = RandomStreams()
def dropout(X, p=0.):
if p > 0:
X *= srng.binomial(X.shape, p=1 - p)
X /= 1 - p
return X
# Theano provides built-in support for convolutional layers (conv2d) and pooling (max_pool_2d).
def model(X, w_1, w_2, w_3, w_4, w_5, p_1, p_2):
l1 = dropout(max_pool_2d(T.maximum(conv2d(X, w_1, border_mode='full'),0.), (2, 2)), p_1)
l2 = dropout(max_pool_2d(T.maximum(conv2d(l1, w_2), 0.), (2, 2)), p_1)
l3 = dropout(T.flatten(max_pool_2d(T.maximum(conv2d(l2, w_3), 0.), (2, 2)), outdim=2), p_1) # flatten to switch back to 1d layers
l4 = dropout(T.maximum(T.dot(l3, w_4), 0.), p_2)
return T.nnet.softmax(T.dot(l4, w_5))
y_hat_train = model(X, w_1, w_2, w_3, w_4, w_5, 0.2, 0.5)
y_hat_predict = model(X, w_1, w_2, w_3, w_4, w_5, 0., 0.)
## (3) Cost
cost = T.mean(T.nnet.categorical_crossentropy(y_hat_train, Y))
## (4) Minimization.
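# The backprop below uses an RMSprop-style update: it keeps a running average of squared gradients (acc)
# and divides each gradient by sqrt(acc + epsilon), so the effective step size adapts per parameter.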
def backprop(cost, w, alpha=0.001, rho=0.9, epsilon=1e-6):
grads = T.grad(cost=cost, wrt=w)
updates = []
for w1, grad in zip(w, grads):
# adding gradient scaling
acc = theano.shared(w1.get_value() * 0.)
acc_new = rho * acc + (1 - rho) * grad ** 2
gradient_scaling = T.sqrt(acc_new + epsilon)
grad = grad / gradient_scaling
updates.append((acc, acc_new))
updates.append((w1, w1 - grad * alpha))
return updates
update = backprop(cost, params)
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
y_pred = T.argmax(y_hat_predict, axis=1)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
miniBatchSize = 1
def gradientDescentStochastic(epochs):
trainTime = 0.0
predictTime = 0.0
start_time = time.time()
for i in range(epochs):
for start, end in zip(range(0, len(train_data), miniBatchSize), range(miniBatchSize, len(train_data), miniBatchSize)):
cost = train(train_data[start:end], train_labels_b[start:end])
trainTime = trainTime + (time.time() - start_time)
print '%d) accuracy = %.4f' %(i+1, np.mean(np.argmax(test_labels_b, axis=1) == predict(test_data)))
print 'train time = %.2f' %(trainTime)
gradientDescentStochastic(10)
start_time = time.time()
predict(test_data)
print 'predict time = %.2f' %(time.time() - start_time)
The architecture of the convnet was inspired by the visual cortex in the human brain. If you are interested in learning more, check out: http://www-psych.stanford.edu/~ashas/Cognition%20Textbook/chapter2.pdf
CES 2015 parts 6,7,9
https://www.youtube.com/watch?v=-vKGkxeflGw
https://www.youtube.com/watch?v=zsVsUvx8ieo
https://www.youtube.com/watch?v=RvQVyGOynFY
GTC 2015 parts 4,5,7,9
https://www.youtube.com/watch?v=pqvdZ2jp1NA
https://www.youtube.com/watch?v=GGxdP_JWhwI
In [ ]: