Artificial neural networks are computational systems that can learn to perform tasks by considering examples, generally without being programmed with task-specific rules. They are loosely inspired by biological systems, in which neurons interact by sending signals, in the form of mathematical functions, between layers. Each layer can contain an arbitrary number of neurons, and each connection is represented by a weight variable.
In this tutorial we will build a feed-forward neural network, where information moves in only one direction: forward through the layers. Each neuron or node is represented by a circle, while arrows display the connections between the nodes and indicate the direction of information flow. Each node in a layer is connected to all nodes in the subsequent layer, which makes this a so-called fully-connected feed-forward neural network.
Via Wikipedia
To follow this tutorial we require an installation of Python with the numerical package NumPy, either:
1) Python 2.7.x
2) Python 3.5.x or greater
with NumPy version 1.0.x or greater.
We will also use the packages matplotlib, scikit-learn, TensorFlow and Keras, though these are not strictly necessary.
To open and run this notebook you also need an installation of IPython and Jupyter Notebook.
Anaconda is a free and open-source Python and R distribution that aims to simplify package management and deployment. Anaconda comes with more than 1000 data packages, as well as the Conda package and virtual environment manager. Anaconda is available on Linux, OS X and Windows systems and contains nearly all prerequisite software, so it comes highly recommended.
If Anaconda is installed you can install TensorFlow and Keras using:
conda install tensorflow
conda install keras
(You may run into minor problems with conflicting package versions).
If you do not wish to install Anaconda you may download Python from here,
or you can use package managers like brew, apt, pacman,...
Python distributions come with their own package manager, pip, and once you have Python installed
you can run the following command:
pip install numpy matplotlib scikit-learn ipython jupyter
To install TensorFlow follow the instructions here.
After you have installed TensorFlow you can install Keras:
pip install keras
Via xkcd
One can identify a set of key steps when using neural networks to solve supervised learning problems:
1) Collect and pre-process data
2) Define model and architecture
3) Choose cost function and optimizer
4) Train the model
5) Evaluate model performance on test data
6) Adjust hyperparameters (if necessary, network architecture)
In this tutorial we will be using the MNIST dataset, a smaller version of which is readily available through the scikit-learn
package. You may also find the full dataset for example here.
The MNIST (Modified National Institute of Standards and Technology) database is a large database
of handwritten digits that is commonly used for training various image processing systems.
The full MNIST dataset consists of 70 000 images of size 28x28 pixels, each labeled from 0 to 9. The version bundled with scikit-learn is smaller, consisting of 1797 images of 8x8 pixels.
To feed data into a feed-forward neural network we need to represent the inputs as a feature matrix $X$ of dimensions $(n_{inputs}, n_{features})$.
Each row represents an input, in this case a handwritten digit, and each column represents a feature, in this case a pixel.
The correct answers, also known as labels or targets, are represented as a 1D array of integers $Y = [n_{inputs}] = [5, 3, 1, 8,...]$.
Say I wanted to build a neural network using supervised learning to predict Body-Mass Index (BMI) from
measurements of height (in m)
and weight (in kg). If I had measurements of 5 people, the feature matrix would have one row per person and one column per measurement, for example:
$$ X = \begin{pmatrix} x_{1,\mathrm{height}} & x_{1,\mathrm{weight}} \\ x_{2,\mathrm{height}} & x_{2,\mathrm{weight}} \\ \vdots & \vdots \\ x_{5,\mathrm{height}} & x_{5,\mathrm{weight}} \end{pmatrix} ,$$
and the targets would be:
$$ Y = (23.7, 22.2, 27.1, 17.5, 21.1) .$$
Since each input image is a 2D matrix, we need to flatten the image (i.e. "unravel" the 2D matrix into a 1D array)
to turn the data into a feature matrix. This means we lose all spatial information in the image, such as locality and translational invariance (explanation).
More complicated architectures such as Convolutional Neural Networks can take advantage
of such information, and are most commonly applied when analyzing images.
In [1]:
# import necessary packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
#ensure the same random numbers appear every time
np.random.seed(0)
# display images in notebook
%matplotlib inline
plt.rcParams['figure.figsize'] = (10,10)
# download MNIST dataset
digits = datasets.load_digits()
# define inputs and labels
inputs = digits.images
labels = digits.target
print("inputs = (n_inputs, pixel_width, pixel_height) = " + str(inputs.shape))
print("labels = (n_inputs) = " + str(labels.shape))
# flatten the image
# the value -1 means dimension is inferred from the remaining dimensions: 8x8 = 64
n_inputs = len(inputs)
inputs = inputs.reshape(n_inputs, -1)
print("X = (n_inputs, n_features) = " + str(inputs.shape))
# choose some random images to display
indices = np.arange(n_inputs)
random_indices = np.random.choice(indices, size=5)
for i, image in enumerate(digits.images[random_indices]):
plt.subplot(1, 5, i+1)
plt.axis('off')
plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
plt.title("Label: %d" % digits.target[random_indices[i]])
plt.show()
Performing analysis before partitioning the dataset is a major error that can lead to incorrect conclusions
(see "Bias-Variance Tradeoff", for example here).
We will reserve $80 \%$ of our dataset for training and $20 \%$ for testing.
It is important that the train and test datasets are drawn randomly from our dataset, to ensure
no bias in the sampling.
Say you are taking measurements of weather data to predict the weather in the coming 5 days.
You don't want to train your model on measurements taken from the hours 00.00 to 12.00, and then test it on data
collected from 12.00 to 24.00.
In [2]:
from sklearn.model_selection import train_test_split
# one-liner from scikit-learn library
train_size = 0.8
test_size = 1 - train_size
X_train, X_test, Y_train, Y_test = train_test_split(inputs, labels, train_size=train_size,
test_size=test_size)
# equivalently in numpy
def train_test_split_numpy(inputs, labels, train_size, test_size):
    n_inputs = len(inputs)
    # shuffle a single set of indices, so that inputs and labels stay paired
    shuffled_indices = np.random.permutation(n_inputs)
    inputs_shuffled = inputs[shuffled_indices]
    labels_shuffled = labels[shuffled_indices]
    train_end = int(n_inputs*train_size)
    X_train, X_test = inputs_shuffled[:train_end], inputs_shuffled[train_end:]
    Y_train, Y_test = labels_shuffled[:train_end], labels_shuffled[train_end:]
    return X_train, X_test, Y_train, Y_test
#X_train, X_test, Y_train, Y_test = train_test_split_numpy(inputs, labels, train_size, test_size)
Our simple feed-forward neural network will consist of an input layer, a single hidden layer and an output layer. The activation $y$ of each neuron is a weighted sum of inputs, passed through an activation function:
$$ z = \sum_{i=1}^n w_i a_i ,$$$$ y = f(z) ,$$where $f$ is the activation function, $a_i$ represents input from neuron $i$ in the preceding layer
and $w_i$ is the weight to neuron $i$.
The activation of the neurons in the input layer is just the features (e.g. a pixel value).
The simplest activation function for a binary classifier (e.g. two classes, 0 or 1, cat or not cat) is the Heaviside function:
$$ f(z) = \begin{cases} 1, & z > 0\\ 0, & \text{otherwise} \end{cases} $$A feed-forward neural network with this activation is known as a perceptron.
This activation can be generalized to $k$ classes (using e.g. the one-against-all strategy),
and we call these architectures multiclass perceptrons.
However, it is now common to use the terms Single Layer Perceptron (SLP) (1 hidden layer) and
Multilayer Perceptron (MLP) (2 or more hidden layers) to refer to feed-forward neural networks with any activation function.
Typical choices for activation functions include the sigmoid function, hyperbolic tangent, and Rectified Linear Unit (ReLU).
We will be using the sigmoid function $\sigma(x)$:
$$ \sigma(x) = \frac{1}{1 + e^{-x}} ,$$
which is inspired by probability theory (see logistic regression) and was most commonly used until about 2011.
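As a small illustration (separate from the network we build below), here is a minimal NumPy sketch of these activation functions and of a single neuron's output $y = f(z)$; the input activations and weights are made-up values used only for illustration.
import numpy as np
# common activation functions
def heaviside(z):
    return np.where(z > 0, 1.0, 0.0)
def sigmoid(z):
    return 1.0/(1.0 + np.exp(-z))
def relu(z):
    return np.maximum(0.0, z)   # the hyperbolic tangent is simply np.tanh
# a single neuron: weighted sum of inputs passed through an activation
a = np.array([0.5, -1.2, 3.0])   # activations from the preceding layer (illustrative values)
w = np.array([0.1, 0.4, -0.2])   # weights to this neuron (illustrative values)
z = np.sum(w*a)                  # z = sum_i w_i a_i
print("Heaviside output: ", heaviside(z))
print("sigmoid output: ", sigmoid(z))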
Input:
Since each input image has 8x8 = 64 pixels or features, we have an input layer of 64 neurons.
Hidden layer:
We will use 50 neurons in the hidden layer receiving input from the neurons in the input layer.
Since each neuron in the hidden layer is connected to the 64 inputs we have 64x50 = 3200 weights to the hidden layer.
Output:
If we were building a binary classifier, a single neuron in the output layer would suffice,
outputting 0 or 1 according to the Heaviside function. This would be an example of a hard classifier, meaning it outputs the class of the input directly. However, if we are dealing with noisy data it is often beneficial to use a soft classifier, which outputs the probability of being in class 0 or 1.
For a soft binary classifier, we could use a single neuron and interpret the output as either being the probability of being in class 0 or the probability of being in class 1. Alternatively we could use 2 neurons, and interpret each neuron as the probability of being in each class.
Since we are doing multiclass classification, with 10 categories, it is natural to use 10 neurons in the output layer. We number the neurons $j = 0,1,...,9$. The activation of each output neuron $j$ will be according to the softmax function:
$$ P(\text{class $j$} \mid \text{input $\boldsymbol{a}$}) = \frac{e^{\boldsymbol{a}^T \boldsymbol{w}_j}} {\sum_{k=0}^{9} e^{\boldsymbol{a}^T \boldsymbol{w}_k}} ,$$i.e. each neuron $j$ outputs the probability of being in class $j$ given an input from the hidden layer $\boldsymbol{a}$, with $\boldsymbol{w}_j$ the weights of neuron $j$ to the inputs.
The denominator is a normalization factor to ensure the outputs sum up to 1.
The exponent is just the weighted sum of inputs as before:
$$ z_j = \sum_{i=1}^{n_{hidden}} w_{ij} a_i = \boldsymbol{a}^T \boldsymbol{w}_j .$$
Since each neuron in the output layer is connected to the 50 inputs from the hidden layer we have 50x10 = 500 weights to the output layer.
Typically weights are initialized with small values distributed around zero, drawn from a uniform or normal distribution. Setting all weights to zero means all neurons give the same output, making the network useless.
Adding a bias value to the weighted sum of inputs allows the neural network to represent a greater range of values. Without it, an input of all zeros would always be mapped to zero (before being passed through the activation), regardless of the weights. The bias unit has an output of 1, and a weight to each neuron $j$, $b_j$:
$$ z_j = \sum_{i=1}^n w_ {ij} a_i + 1\cdot b_j = \boldsymbol{a}^T \boldsymbol{w}_j + b_j .$$The bias weights $\boldsymbol{b}$ are often initialized to zero, but a small value like $0.01$ ensures all neurons have some output which can be backpropagated in the first training cycle.
Via Stanford UFLDL
In [3]:
# building our neural network
n_inputs, n_features = X_train.shape
n_hidden_neurons = 50
n_categories = 10
# we make the weights normally distributed using numpy.random.randn
# weights and bias in the hidden layer
hidden_weights = np.random.randn(n_features, n_hidden_neurons)
hidden_bias = np.zeros(n_hidden_neurons) + 0.01
# weights and bias in the output layer
output_weights = np.random.randn(n_hidden_neurons, n_categories)
output_bias = np.zeros(n_categories) + 0.01
For each input image we calculate a weighted sum of input features (pixel values) to each neuron $j$ in the hidden layer:
$$ z_{j}^{hidden} = \sum_{i=1}^{n_{features}} w_{ij}^{hidden} x_i + b_{j}^{hidden} = \boldsymbol{x}^T \boldsymbol{w}_{j}^{hidden} + b_{j}^{hidden} ,$$this is then passed through our activation function
$$ a_{j}^{hidden} = f(z_{j}^{hidden}) .$$We calculate a weighted sum of inputs (activations in the hidden layer) to each neuron $j$ in the output layer:
$$ z_{j}^{output} = \sum_{i=1}^{n_{hidden}} w_{ij}^{output} a_{i}^{hidden} + b_{j}^{output} = (\boldsymbol{a}^{hidden})^T \boldsymbol{w}_{j}^{output} + b_{j}^{output} .$$Finally we calculate the output of neuron $j$ in the output layer using the softmax function:
$$ a_{j}^{output} = \frac{\exp{(z_j^{output})}} {\sum_{k=1}^{n_{categories}} \exp{(z_k^{output})}} .$$Since our data has the dimensions $X = (n_{inputs}, n_{features})$ and our weights to the hidden
layer have the dimensions
$W_{hidden} = (n_{features}, n_{hidden})$,
we can easily feed the network all our training data in one go by taking the matrix product
$$ X W^{hidden} = (n_{inputs}, n_{hidden}) ,$$
and obtain a matrix that holds the weighted sum of inputs to the hidden layer
for each input image.
We also add the bias to obtain a matrix of weighted sums $Z^{hidden}$:
$$ Z^{hidden} = X W^{hidden} + B^{hidden} ,$$
meaning the same bias (1D array) is added to each input image (by broadcasting).
This is then passed through the activation:
$$ A^{hidden} = f(Z^{hidden}) .$$
This is fed to the output layer:
$$ Z^{output} = A^{hidden} W^{output} + B^{output} .$$Finally we receive our output values for each image and each category by passing it through the softmax function:
$$ output = softmax (Z^{output}) = (n_{inputs}, n_{categories}) .$$
In [4]:
# setup the feed-forward pass
def sigmoid(x):
return 1/(1 + np.exp(-x))
def feed_forward(X):
# weighted sum of inputs to the hidden layer
z1 = np.matmul(X, hidden_weights) + hidden_bias
# activation in the hidden layer
a1 = sigmoid(z1)
# weighted sum of inputs to the output layer
z2 = np.matmul(a1, output_weights) + output_bias
# softmax output
# axis 0 holds each input and axis 1 the probabilities of each category
exp_term = np.exp(z2)
probabilities = exp_term / np.sum(exp_term, axis=1, keepdims=True)
return probabilities
probabilities = feed_forward(X_train)
print("probabilities = (n_inputs, n_categories) = " + str(probabilities.shape))
print("probability that image 0 is in category 0,1,2,...,9 = \n" + str(probabilities[0]))
print("probabilities sum up to: " + str(probabilities[0].sum()))
print()
# we obtain a prediction by taking the class with the highest likelihood
def predict(X):
probabilities = feed_forward(X)
return np.argmax(probabilities, axis=1)
predictions = predict(X_train)
print("predictions = (n_inputs) = " + str(predictions.shape))
print("prediction for image 0: " + str(predictions[0]))
print("correct label for image 0: " + str(Y_train[0]))
To measure how well our neural network is doing we need to introduce a cost function.
We will call the function that gives the error of a single sample output the loss function, and the function
that gives the total error of our network across all samples the cost function.
A typical choice for multiclass classification is the cross-entropy loss, also known as the negative log likelihood.
In multiclass classification it is common to treat each integer label as a so-called one-hot vector:
$$ y = 5 \quad \rightarrow \quad \boldsymbol{y} = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0) ,$$
i.e. a binary vector of length $K$, where $K = 10$ is the number of classes.
If $\boldsymbol{x}_i$ is the $i$-th input (image), $y_{ik}$ refers to the $k$-th component of the $i$-th
output vector $\boldsymbol{y}_i$. The probability of $\boldsymbol{x}_i$ being in class $k$ is given by the softmax function:
$$ P(y_{ik} = 1 \mid \boldsymbol{x}_i, \boldsymbol{\theta}) = \frac{e^{(\boldsymbol{a}_i^{hidden})^T \boldsymbol{w}_k}} {\sum_{k'=0}^{K-1} e^{(\boldsymbol{a}_i^{hidden})^T \boldsymbol{w}_{k'}}} ,$$
where $\boldsymbol{a}_i^{hidden}$ is the activation in the hidden layer from input $\boldsymbol{x}_i$.
The vector $\boldsymbol{\theta}$ represents the weights and biases of our network.
The probability of not being in class $k$ is just $1 - P(y_{ik} = 1 \mid \boldsymbol{x}_i)$.
For Maximum Likelihood Estimation (MLE) we choose the label with the largest probability.
Denote the output label $\hat{y}$ and the correct label $y$, for example $\hat{y} = 5$ and $y = 8$. The likelihood that input $\boldsymbol{x}$
gives an output $\hat{y} = k'$ is then
$$ P(\hat{y} = k' \mid \boldsymbol{x}, \boldsymbol{\theta}) = \prod_{k=0}^{K-1} P(y_{k} = 1 \mid \boldsymbol{x}, \boldsymbol{\theta})^{y_{k}} \, \big( 1 - P(y_{k} = 1 \mid \boldsymbol{x}, \boldsymbol{\theta}) \big)^{1 - y_{k}} ,$$
where $y_k$ is the $k$-th component of the one-hot vector of (correct) labels.
A perfect classifier would assign a $100 \%$ probability to the correct label, in which case every factor in the product
equals 1 and the likelihood is 1. If the network is not a perfect classifier, the likelihood is a number between 0 and 1.
If we take the log of this we can turn the product into a sum, which is often simpler to compute:
$$ \log P(\hat{y} = k' \mid \boldsymbol{x}, \boldsymbol{\theta}) = \sum_{k=0}^{K-1} y_{k} \log P(y_{k} = 1 \mid \boldsymbol{x}, \boldsymbol{\theta}) + (1-y_{k})\log (1 - P(y_{k} = 1 \mid \boldsymbol{x}, \boldsymbol{\theta}))$$For a perfect classifier this should just be $\log 1 = 0$. Otherwise we get a negative number.
Since it is easier to think in terms of minimizing a positive number, we take our loss function
to be the negative log-likelihood:
$$ \mathcal{L}(\boldsymbol{\theta}) = - \log P(\hat{y} = k' \mid \boldsymbol{x}, \boldsymbol{\theta}) = - \sum_{k=0}^{K-1} \big[ y_{k} \log P(y_{k} = 1 \mid \boldsymbol{x}, \boldsymbol{\theta}) + (1-y_{k})\log (1 - P(y_{k} = 1 \mid \boldsymbol{x}, \boldsymbol{\theta})) \big] .$$
We then take the average of the loss function over all input samples to define the cost function:
$$ \begin{split} \mathcal{C}(\boldsymbol{\theta}) &= \frac{1}{N} \sum_{i=1}^N \mathcal{L}_i(\boldsymbol{\theta}) \\ &= -\frac{1}{N}\sum_{i=1}^N \sum_{k=0}^{K-1} \big[ y_{ik} \log P(y_{ik} = 1 \mid \boldsymbol{x}_i, \boldsymbol{\theta}) + (1-y_{ik})\log (1 - P(y_{ik} = 1 \mid \boldsymbol{x}_i, \boldsymbol{\theta})) \big] . \end{split} $$
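As a sketch of how this cost could be evaluated in practice, the function below computes the average cross-entropy of our (so far untrained) network. It assumes the feed_forward function defined above and one-hot labels such as Y_train_onehot, which is created in a later code cell; a small clipping constant avoids taking the logarithm of zero.
# sketch: average cross-entropy cost of the network, assuming feed_forward from above
# and one-hot labels Y_onehot; eps avoids log(0)
def cross_entropy_cost(X, Y_onehot, eps=1e-12):
    probabilities = feed_forward(X)
    p = np.clip(probabilities, eps, 1 - eps)
    # sum over the K classes, then average over the N input samples
    return -np.mean(np.sum(Y_onehot*np.log(p) + (1 - Y_onehot)*np.log(1 - p), axis=1))
# example usage (once Y_train_onehot has been defined below):
# print("cost of untrained network: ", cross_entropy_cost(X_train, Y_train_onehot))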
The network is trained by finding the weights and biases that minimize the cost function. One of the most widely used classes of methods is gradient descent and its generalizations. The idea behind gradient descent
is simply to adjust the weights in the direction of the negative gradient of the cost function, so that each step decreases the cost. This moves us towards a local minimum of the cost function.
Each parameter $\theta$ is iteratively adjusted according to the rule
$$ \theta_{new} = \theta - \eta \frac{\partial \mathcal{C}(\boldsymbol{\theta})}{\partial \theta} ,$$
where $\eta$ is known as the learning rate, which controls how big a step we take towards the minimum.
This update can be repeated for any number of iterations, or until we are satisfied with the result.
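To make the update rule concrete, here is a minimal sketch of plain gradient descent on a toy one-dimensional cost $\mathcal{C}(\theta) = (\theta - 3)^2$; the cost, starting value and learning rate are made up for illustration.
# toy example: gradient descent on C(theta) = (theta - 3)^2, whose minimum is at theta = 3
def toy_cost_gradient(theta):
    return 2*(theta - 3)
theta = 0.0     # arbitrary starting value
eta = 0.1       # learning rate
for iteration in range(100):
    theta -= eta*toy_cost_gradient(theta)
print("theta after gradient descent: ", theta)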
A simple and effective improvement is a variant called Stochastic Gradient Descent (SGD).
Instead of calculating the gradient on the whole dataset, we calculate an approximation of the gradient
on a subset of the data called a minibatch.
If there are $N$ data points and we have a minibatch size of $M$, the total number of batches
is $N/M$.
We denote each minibatch $B_k$, with $k = 1, 2,...,N/M$. The gradient then becomes:
$$ \nabla_{\theta} \mathcal{C}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N \nabla_{\theta} \mathcal{L}_i(\boldsymbol{\theta}) \quad \rightarrow \quad \frac{1}{M} \sum_{i \in B_k} \nabla_{\theta} \mathcal{L}_i(\boldsymbol{\theta}) ,$$
i.e. instead of averaging the loss over the entire dataset, we average over a minibatch.
This has two important benefits:
1) Introducing stochasticity decreases the chance that the algorithm becomes stuck in a local minimum.
2) It significantly speeds up the calculation, since we do not have to use the entire dataset to calculate the gradient.
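As a sketch of how the minibatches can be constructed (the full minibatch training loop is implemented in the NeuralNetwork class further below), one can shuffle the indices of the training data and split them into batches of size $M$; the batch size here is an illustrative choice.
# sketch: split the training data indices into minibatches of size M
N = X_train.shape[0]              # number of training samples
M = 100                           # minibatch size (illustrative choice)
n_batches = N // M
shuffled_indices = np.random.permutation(N)
minibatches = [shuffled_indices[k*M:(k+1)*M] for k in range(n_batches)]
# a single SGD step would then use only one minibatch, e.g. X_train[minibatches[0]]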
It is common to add an extra term to the cost function, proportional to the size of the weights.
This is equivalent to constraining the size of the weights, so that they do not grow out of control.
Constraining the size of the weights means that the weights cannot grow arbitrarily large to fit the training data, and in this way reduces overfitting.
We will measure the size of the weights using the so-called L2-norm, meaning our cost function becomes:
$$ \mathcal{C}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N \mathcal{L}_i(\boldsymbol{\theta}) \quad \rightarrow \quad \frac{1}{N} \sum_{i=1}^N \mathcal{L}_i(\boldsymbol{\theta}) + \lambda \lvert \lvert \boldsymbol{w} \rvert \rvert_2^2 = \frac{1}{N} \sum_{i=1}^N \mathcal{L}_i(\boldsymbol{\theta}) + \lambda \sum_{ij} w_{ij}^2 ,$$i.e. we sum up all the weights squared. The factor $\lambda$ is known as a regularization parameter.
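In code, this penalty simply adds $\lambda \sum_{ij} w_{ij}^2$ to the cost, and (as shown in the backpropagation section below) $2 \lambda w_{ij}$ to each weight gradient; a minimal sketch using the weight matrices initialized earlier:
# sketch: L2 penalty term and its contribution to the weight gradients,
# using the hidden_weights and output_weights arrays defined earlier
lmbd = 0.01   # regularization parameter (illustrative value)
l2_penalty = lmbd*(np.sum(hidden_weights**2) + np.sum(output_weights**2))
print("L2 penalty added to the cost: ", l2_penalty)
# contributions to the gradients (the factor 2 is often absorbed into lambda)
hidden_weights_penalty_gradient = 2*lmbd*hidden_weights
output_weights_penalty_gradient = 2*lmbd*output_weights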
In order to train the model, we need to calculate the derivative of the cost function with respect
to every bias and weight in the network. Using an approximation to the derivative (e.g. using the finite difference method) is much too costly.
In total our network has $(64 + 1) \times 50 = 3250$ weights in the hidden layer and $(50 + 1) \times 10 = 510$ weights to the output layer ($ + 1$ for the bias), and the gradient must be calculated for every parameter.
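As a quick check, we can count these parameters directly from the weight and bias arrays initialized above:
# quick check of the parameter count, using the arrays from the previous code cell
print("parameters to the hidden layer: ", hidden_weights.size + hidden_bias.size)   # 64*50 + 50 = 3250
print("parameters to the output layer: ", output_weights.size + output_bias.size)   # 50*10 + 10 = 510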
The backpropagation algorithm is a clever use of the chain rule that allows us to calculate the gradient efficiently. Here we will simply state the backpropagation equations that we will use for our network; a derivation is given at the end of this tutorial.
The error $\delta_i^o$ at each output neuron $i$ is just the difference between the output probability $\hat{y}_i$ and the correct label $y_i$ (0 or 1 using one-hot vectors):
$$ \delta_i^o = \hat{y}_i - y_i .$$The gradient of the cost function with respect to each output weight $w_{i,j}^o$ is then
$$ \frac{\partial \mathcal{C}}{\partial w_{i,j}^o} = \delta_i^o a_j^h ,$$where $a_j^h$ is the activation at the $j$-th neuron in the hidden layer.
The gradient with respect to each output bias $b_i^o$ is
$$ \frac{\partial \mathcal{C}}{\partial b_{i}^o} = \delta_i^o .$$
The error at each hidden layer neuron $\delta_i^h$ is given as
$$ \delta_i^h = \sum_{k=0}^{K-1} \delta_k^o w_{ki}^o f'(z_i^h) ,$$where $K$ is the number of output neurons or categories and $f'(z_i^h)$ is the derivative of the activation function:
$$ f'(z_i^h) = \sigma '(z_i^h) = \sigma(z_i^h)(1 - \sigma(z_i^h)) = a_i^h (1 - a_i^h) ,$$since our activation function is the sigmoid/logistic function.
The gradient with respect to each hidden layer weight $w_{i,j}^h$ is:
$$ \frac{\partial \mathcal{C}}{\partial w_{i,j}^h} = \delta_i^h x_j ,$$where $x_j$ is the $j$-th input feature,
and the gradient with respect to the hidden bias
$$ \frac{\partial \mathcal{C}}{\partial b_i^h} = \delta_i^h .$$The regularization terms using the L2-norm are just
$$ \frac{\partial }{\partial w_{ij}} (\lambda \sum_{ij} w_{ij}^2) = 2 \lambda w_{ij} = \hat{\lambda} w_{ij} ,$$for the weights in both the output and hidden layers.
In [5]:
# to categorical turns our integer vector into a onehot representation
from keras.utils import to_categorical
# calculate the accuracy score of our model
from sklearn.metrics import accuracy_score
Y_train_onehot, Y_test_onehot = to_categorical(Y_train), to_categorical(Y_test)
# equivalently in numpy
def to_categorical_numpy(integer_vector):
n_inputs = len(integer_vector)
n_categories = np.max(integer_vector) + 1
onehot_vector = np.zeros((n_inputs, n_categories))
onehot_vector[range(n_inputs), integer_vector] = 1
return onehot_vector
#Y_train_onehot, Y_test_onehot = to_categorical_numpy(Y_train), to_categorical_numpy(Y_test)
def feed_forward_train(X):
# weighted sum of inputs to the hidden layer
z1 = np.matmul(X, hidden_weights) + hidden_bias
# activation in the hidden layer
a1 = sigmoid(z1)
# weighted sum of inputs to the output layer
z2 = np.matmul(a1, output_weights) + output_bias
# softmax output
# axis 0 holds each input and axis 1 the probabilities of each category
exp_term = np.exp(z2)
probabilities = exp_term / np.sum(exp_term, axis=1, keepdims=True)
return a1, probabilities
def backpropagation(X, Y):
a1, probabilities = feed_forward_train(X)
# error in the output layer
error_output = probabilities - Y
# error in the hidden layer
error_hidden = np.matmul(error_output, output_weights.T) * a1 * (1 - a1)
# gradients for the output layer
output_weights_gradient = np.matmul(a1.T, error_output)
output_bias_gradient = np.sum(error_output, axis=0)
# gradient for the hidden layer
hidden_weights_gradient = np.matmul(X.T, error_hidden)
hidden_bias_gradient = np.sum(error_hidden, axis=0)
return output_weights_gradient, output_bias_gradient, hidden_weights_gradient, hidden_bias_gradient
print("Old accuracy on training data: " + str(accuracy_score(predict(X_train), Y_train)))
eta = 0.01
lmbd = 0.01
for i in range(1000):
dWo, dBo, dWh, dBh = backpropagation(X_train, Y_train_onehot)
dWo += lmbd * output_weights
dWh += lmbd * hidden_weights
output_weights -= eta * dWo
output_bias -= eta * dBo
hidden_weights -= eta * dWh
hidden_bias -= eta * dBh
print("New accuracy on training data: " + str(accuracy_score(predict(X_train), Y_train)))
In [6]:
class NeuralNetwork:
def __init__(
self,
X_data,
Y_data,
n_hidden_neurons=50,
n_categories=10,
epochs=10,
batch_size=100,
eta=0.1,
lmbd=0.0,
):
self.X_data_full = X_data
self.Y_data_full = Y_data
self.n_inputs = X_data.shape[0]
self.n_features = X_data.shape[1]
self.n_hidden_neurons = n_hidden_neurons
self.n_categories = n_categories
self.epochs = epochs
self.batch_size = batch_size
self.iterations = self.n_inputs // self.batch_size
self.eta = eta
self.lmbd = lmbd
self.create_biases_and_weights()
def create_biases_and_weights(self):
self.hidden_weights = np.random.randn(self.n_features, self.n_hidden_neurons)
self.hidden_bias = np.zeros(self.n_hidden_neurons) + 0.01
self.output_weights = np.random.randn(self.n_hidden_neurons, self.n_categories)
self.output_bias = np.zeros(self.n_categories) + 0.01
def feed_forward(self):
self.z1 = np.matmul(self.X_data, self.hidden_weights) + self.hidden_bias
self.a1 = sigmoid(self.z1)
self.z2 = np.matmul(self.a1, self.output_weights) + self.output_bias
exp_term = np.exp(self.z2)
self.probabilities = exp_term / np.sum(exp_term, axis=1, keepdims=True)
def feed_forward_out(self, X):
z1 = np.matmul(X, self.hidden_weights) + self.hidden_bias
a1 = sigmoid(z1)
z2 = np.matmul(a1, self.output_weights) + self.output_bias
exp_term = np.exp(z2)
probabilities = exp_term / np.sum(exp_term, axis=1, keepdims=True)
return probabilities
def backpropagation(self):
error_output = self.probabilities - self.Y_data
error_hidden = np.matmul(error_output, self.output_weights.T) * self.a1 * (1 - self.a1)
self.output_weights_gradient = np.matmul(self.a1.T, error_output)
self.output_bias_gradient = np.sum(error_output, axis=0)
self.hidden_weights_gradient = np.matmul(self.X_data.T, error_hidden)
self.hidden_bias_gradient = np.sum(error_hidden, axis=0)
if self.lmbd > 0.0:
self.output_weights_gradient += self.lmbd * self.output_weights
self.hidden_weights_gradient += self.lmbd * self.hidden_weights
self.output_weights -= self.eta * self.output_weights_gradient
self.output_bias -= self.eta * self.output_bias_gradient
self.hidden_weights -= self.eta * self.hidden_weights_gradient
self.hidden_bias -= self.eta * self.hidden_bias_gradient
def predict(self, X):
probabilities = self.feed_forward_out(X)
return np.argmax(probabilities, axis=1)
def predict_probabilities(self, X):
probabilities = self.feed_forward_out(X)
return probabilities
def train(self):
data_indices = np.arange(self.n_inputs)
for i in range(self.epochs):
for j in range(self.iterations):
chosen_datapoints = np.random.choice(
data_indices, size=self.batch_size, replace=False
)
self.X_data = self.X_data_full[chosen_datapoints]
self.Y_data = self.Y_data_full[chosen_datapoints]
self.feed_forward()
self.backpropagation()
In [7]:
eta = 0.1
lmbd = 0.1
epochs = 10
batch_size = 100
dnn = NeuralNetwork(X_train, Y_train_onehot, eta=eta, lmbd=lmbd, epochs=epochs, batch_size=batch_size,
n_hidden_neurons=n_hidden_neurons, n_categories=n_categories)
dnn.train()
test_predict = dnn.predict(X_test)
print("Accuracy score on test set: ", accuracy_score(Y_test, test_predict))
In [8]:
eta_vals = np.logspace(-5, 0, 6)
lmbd_vals = np.logspace(-5, 0, 6)
In [9]:
DNN_numpy = np.zeros((len(eta_vals), len(lmbd_vals)), dtype=object)
for i, eta in enumerate(eta_vals):
for j, lmbd in enumerate(lmbd_vals):
dnn = NeuralNetwork(X_train, Y_train_onehot, eta=eta, lmbd=lmbd, epochs=epochs, batch_size=batch_size,
n_hidden_neurons=n_hidden_neurons, n_categories=n_categories)
dnn.train()
DNN_numpy[i][j] = dnn
test_predict = dnn.predict(X_test)
print("Learning rate = ", eta)
print("Lambda = ", lmbd)
print("Accuracy score on test set: ", accuracy_score(Y_test, test_predict))
print()
In [10]:
from sklearn.neural_network import MLPClassifier
DNN_scikit = np.zeros((len(eta_vals), len(lmbd_vals)), dtype=object)
for i, eta in enumerate(eta_vals):
for j, lmbd in enumerate(lmbd_vals):
dnn = MLPClassifier(hidden_layer_sizes=(n_hidden_neurons), activation='logistic',
alpha=lmbd, learning_rate_init=eta, max_iter=100)
dnn.fit(X_train, Y_train)
DNN_scikit[i][j] = dnn
print("Learning rate = ", eta)
print("Lambda = ", lmbd)
print("Accuracy score on test set: ", dnn.score(X_test, Y_test))
print()
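To get an overview of how the test accuracy depends on the learning rate and the regularization parameter, the grid-search results can be visualized as a heat map; here is a minimal matplotlib sketch using the scikit-learn models stored in DNN_scikit (the same can be done for DNN_numpy with dnn.predict and accuracy_score).
# visualize the grid search as a heat map of test accuracy
test_accuracy = np.zeros((len(eta_vals), len(lmbd_vals)))
for i in range(len(eta_vals)):
    for j in range(len(lmbd_vals)):
        test_accuracy[i, j] = DNN_scikit[i][j].score(X_test, Y_test)
fig, ax = plt.subplots()
im = ax.imshow(test_accuracy, cmap='viridis')
fig.colorbar(im, ax=ax, label='test accuracy')
ax.set_xticks(np.arange(len(lmbd_vals)))
ax.set_xticklabels(lmbd_vals)
ax.set_yticks(np.arange(len(eta_vals)))
ax.set_yticklabels(eta_vals)
ax.set_xlabel(r'$\lambda$')
ax.set_ylabel(r'$\eta$')
ax.set_title('Test accuracy')
plt.show()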