In [1]:
# numpy is the main library used by the book
import numpy as np 

# cause graphs
%matplotlib inline
import matplotlib.pyplot as plt

This jupyter notebook contains my notes while reading Grokking Deep Learning by Andrew Trask. The book is sold as "a very gentle introduction to Deep Learning" and covers the intuition more than the theory.

Chapter 1: Introducing Deep Learning

Deep Learning (DL) is is an intersection of Machine Learning (ML) & Artificial Intelligence (AI). This book covers the science under the hood of the major DL frameworks so you can understand whats going on when you use popular DL frameworks like Torch, Tensorflow, Keras, etc.

The book covers everything past high school maths needed to grok DL.

To really grok this, find a personal problem I'm interested in which to apply DL to. This could be anything where there is a dataset to predict another. Trask (the author) used Twiter to predict the stock market, which led him from barely knowing programming to a job at a hedge fund in 18 months.

Some projects:

  • PAQI is monitoring air pollution in Pakistan. Grab a bunch of data and see if a NN can predict air pollution. Data sources: weather, factory output, what else is relevant to pollution? Target: Air pollution numbers for all the cities with pollution data.

Chapter 2: Fundamental Concepts

DL uses a subset of ML methods, primarily Artificial Neural Networks. ML is a subset of CS where machines learn tasks for which they weren't explicitly progeammed for.

Supervised vs Unsupervised ml

Two main types of ML:

  • Direct imitation, or formally supervised ml is basically a computer looking at a dataset A which predicts B, say weather sensor data predicting probability of rain, and trying to figure out the pattern b/w an input (sensor data) and a output set (actual weather), so when given a new input set it can apply the earlier learned pattern and come up with a prediction.
  • Indirect imitation, or fomally unsupervised ml looks at a not previously understoond dataset A and tries to find patterns in it. For example, it sorts data into a bunch of clusters. Clustering is the essensce of unsupervised ml. The computer doesn't know what the clusters mean but thats where the human comes in.

Parametric vs Non-Parametric Learning

A parametric model has a fixed number of parameters to change, while a non-parametric model has infinite parameters.

Supervised parametric dl models take in input data, process them based on a fixed number of adjustable parameters and makes a prediction. The ml model learns the optimium parameters by comparing its predictions to the actual truth, then going back and tinkering with the parameters.

Unsupervised parametric dl models are similar to the supervised since they also use parameters, but they cluster the data into groups and come up with as many parameters as needed.

DL algos can be either supervised or unsupervised, and either parametric or non-parametric.

Chp 3: Introduction to Neural Prediction

When using data to predict something, you need as many datapoints as you think the neural net needs to be accurate. For example, when trying to predict if something is in an image, you probabbly need to feed the neural net the entire image.

Always present enough information to the network, where "enough information" is definned loosely as how much a human might need to make the same prediction.

simplest possible neural net


In [2]:
# the network
weight = 0.1
def neural_network(input, weight):
    prediction = input * weight
    return prediction

# using the network to predict something
number_of_toes = [8.5, 9.5, 10, 9]
input = number_of_toes[0]
pred = neural_network(input,weight)
pred


Out[2]:
0.8500000000000001

this is super simple - the input is being multiplied by a weight and returned. The power of NN's lies in the weights and how we update them.

Now the above NN just takes in one input, but practically every thing has multiple inputs.


In [16]:
weights = [0.1, 0.2, 0]
def neural_network(input, weights):
    pred = w_sum(input, weights)
    return pred

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

def w_sum(a, b):
    """takes in 2 vectors of the same length,
    multiples ith item of each list with each other
    and adds them all up
    OR - we pair up each number in the two vectors, multiply
    every pair, then add up the results"""
    assert(len(a) == len(b))
    return sum([a[i]*b[i] for i in range(len(a))])

for input in zip(toes, wlrec, nfans):
    print(neural_network(input, weights))


0.9800000000000001
1.11
1.1500000000000001
1.08

Before we move on, being able to manipulate vectors is a cornerstone technique for Deep Learning. Some functions to do vector math:


In [30]:
def ele_mul(vec_a, vec_b):
    assert(len(vec_a) == len(vec_b))
    return [vec_a[i] * vec_b[i] for i in range(len(vec_a))]

def vector_sum(vec_a):
    return sum(vec_a)

def ele_add(vec_a, vec_b):
    assert(len(vec_a) == len(vec_b))
    return [vec_a[i] + vec_b[i] for i in range(len(vec_a))]
    
def vector_average(vec_a):
    return sum(vec_a) / len(vec_a)

a = [2,2,4]
b = [3,3,9]

# to get the dot product of a and b
vector_sum(ele_mul(a,b))


Out[30]:
48

In [22]:
a = [1, 3, -5]
b = [4, -2, -1]
print('my func:', vector_sum(elementwise_multiplication(a,b)))
print('numpy dot func:', np.dot(a,b))


my func: 3
numpy dot func: 3

So we've just made a function to do a dot product.

Weights, weights and more weights

Looking at the 3 input NN above, the most imp thing is the weights: weights = [0.1, 0.2, 0] Crudely speaking, the weights tell us the importance of each of the 3 inputs. We can see that the third weight is zero, thus all the third inputs are essentially meningaless, since the dot product of input,weights will always make the third input zero.

The position of weights is important - as NN's get more complex be careful the weights don't get moved around.

numpy version of the above NN

Numpy has fast implementations of all the vector and matrix math we'll need to do. So the below is a rewrite of the NN from above using numpy:


In [24]:
weights = np.array([0.1, 0.2, 0])
def neural_network(input, weights):
    pred = input.dot(weights)
    return pred

toes = np.array([8.5, 9.5, 9.9, 9.0])
wlrec = np.array([0.65, 0.8, 0.8, 0.9])
nfans = np.array([1.2, 1.3, 0.5, 1.0])

# input corresponds to every entry for the  first game of the season

# to go through all the inputs
for input in zip(toes,wlrec,nfans):
    print(neural_network(np.array(input), weights))


0.98
1.11
1.15
1.08

Weights are a superpower. We can make many predictions from just a single input:


In [43]:
weights = [0.3, 0.2, 0.9]
def neural_network(input, weights):
    pred = input * np.array(weights)
    return pred

wlrec = [0.65, 0.9, 0.8, 0.8]

for input in wlrec:
    print(neural_network(input, weights))


[ 0.195  0.13   0.585]
[ 0.27  0.18  0.81]
[ 0.24  0.16  0.72]
[ 0.24  0.16  0.72]

Note that the 3 predictions are completely seperate.

neural net with 3 inputs and 3 outputs

Now finally we get a big more complex.


In [45]:
#toes #win #fans
weights = [[0.1, 0.1, -0.3],
           [0.1, 0.2, 0.0],
           [0.0, 1.3, 0.1]]

def neural_network(input, weights):
    pred = vect_mat_mul(input,weights)
    return pred

def vect_mat_mul(vect,matrix):
    out = []
    for m in matrix:
        out.append(np.dot(vect,m))
    return out
        
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

for input in zip(toes,wlrec,nfans):
    print(neural_network(input, weights))


[0.55500000000000005, 0.98000000000000009, 0.96500000000000008]
[0.64000000000000001, 1.1100000000000001, 1.1699999999999999]
[0.92000000000000004, 1.1500000000000001, 1.0900000000000001]
[0.68999999999999995, 1.0800000000000001, 1.2700000000000002]

In the above, each output has its own set of weights which it uses to take the dot product of with the input, calculating an output.

Each output node takes its own weighted sum of the input and makes a prediction.

The weights used here is a list of lists of weights. This is a matrix.

We can take the output of a neural net and use it as the input for o another net. This just means doing 2 back to back vector matrix multiplications. This is useful when the input data has patterns too conplex for a single weight matrix.

a stacked neural network

3 inputs and 3 outputs, with 2 layers


In [49]:
w_hid = [[0.1,0.2,-0.1],
         [-0.1,0.1,0.9],
         [0.1,0.4,0.1]]
w_out = [[0.3,1.1,-0.3],
         [-0.1,0.2,0.0],
         [0.0,1.3,0.1]]
weights = [w_hid, w_out]

def neural_net(input, weights):
    hid = np.dot(input, weights[0])
    # now we feed the output of the first layer into the next
    pred = np.dot(hid, weights[1])
    return pred

toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65,0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

input = list(zip(toes, wlrec, nfans))[0]
neural_net(input,weights)


Out[49]:
array([ 0.047,  1.256, -0.286])

numpy

numpy automatically does the common math ops to numbers, vectors and matrixes. It can be a bit confusing, so try to keep the shapes of the inputs in yout head.


In [57]:
a = np.array([0,1,2,3])
b = np.array([4,5,6,7])
c = np.array([[0,1,2,3],
              [4,5,6,7]])
d = np.zeros((2,4))#(2x4 matrix of zeros)
e = np.random.rand(2,5) # random 2x5

a* c


Out[57]:
array([[ 0,  1,  4,  9],
       [ 0,  5, 12, 21]])

There is a lot more about numpy - see TK

to sum up chp 3, neural networks perform weighted sums of the input.

Chapter 4 - Gradient Descent

Above, we made simple NN's which used weights to make predictions. We need to set the weight values better so the NN makes better predictions. We do this by:

  • predict, then compare versus the target and see how much it 'missed' by. Then update the weights so the network predicts better next time.

First, we need to measure error. A popular technique is the mean squared error: (prediction - actual) ** 2. This amplifies big errors and makes small errors smaller. Which kinda makes sense as we want to make the NN pay more attention to big errors. We square the errors as we only care how much we missed vs negative/positive, and we avoid getting negative/postivie errors cancelling each other out.

The simplest possible way of learning weights:


In [93]:
weight = 0.5
input = 0.5
target = 0.8
step_amount = 0.001

for iteration in range(1101):
    pred = input * weight
    err = (pred - target) ** 2
    #print(f"error {err} Preds {pred}")
    
    up_pred = input * (weight + step_amount)
    up_error = (up_pred - target) ** 2
    
    down_pred = input * (weight - step_amount)
    down_error = (down_pred - target) ** 2
    
    if up_error < down_error:
        weight += step_amount
    else:
        weight -= step_amount

print(f'final prediction: {input * weight} | weight: {weight}')


final prediction: 0.8004999999999671 | weight: 1.6009999999999343

Now this way is inefficient and it can be impossible to predict the goal, if say we had a large step value.

Another way is to calculate both the direction and amount to update the weigths from the error.


In [178]:
weight = 0.5
input = 0.5
target = 0.8
alpha = .01

predictions, errors = [], []
for i in range(1500):
    pred = input * weight
    error = (pred - target) ** 2
    
    # delta measures by how much the target was missed
    delta = pred - target
    weight_delta = delta * input
    weight = weight - weight_delta * alpha
    predictions.append(pred)
    errors.append(error)
    
    #if i % 220 == 0:
    #    print(f'Step {i:2d} Error {error:f} Prediction {pred:f}')

plt.xlabel('iterations')
plt.grid()
plt.plot(errors, label="Errors")
plt.plot(predictions, label="Predictions")
plt.legend();


a neural net with descent


In [59]:
inputs = np.array([0.25,1.0,0.1])
target = 70 # my weight
weights = np.array([1,3,1])
learning_rate = 0.1

for epoch in range(40):
    pred = np.dot(inputs, weights)
    error = (pred - target)**2
    delta_error = input * (pred - target)
    weights = weights - (delta_error * learning_rate)
print('pred:', pred)
print('weights:', weights)


pred: 70.0000007766
weights: [ 103.47539901   43.9901596     1.4099016 ]

Chapter 5: Learning Multiple Weights at a Time

Gradient Descent with Multiple Inputs & Outputs


In [612]:
# step 1: empty network
weights = [ [0.1, 0.1, -0.3], #hurt
           [0.1, 0.2, 0.0], #win
           [0.0, 1.3, 0.1] ]#sad?

def neural_network(input, weights):
    pred = np.dot(input, weights)
    return pred

#Predict
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65,0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

hurt = [0.1, 0.0, 0.0, 0.1]
win =[1, 1, 0, 1]
sad = [0.1, 0.0, 0.1, 0.2]

alpha = 0.01
input = [toes[0],wlrec[0],nfans[0]]
true = [hurt[0], win[0], sad[0]]

pred = neural_network(input, weights)

error = [0,0,0]
delta = [0,0,0]

for i in range(len(true)):
    error[i] = (pred[i] - true[i]) ** 2
    delta[i] = pred[i] - true[i]

def outer_prod(vec_a, vec_b):
    out = np.zeros((len(vec_a), len(vec_b)))
        
    for i in range(len(vec_a)):
        for j in range(len(vec_b)):
            out[i][j] = vec_a[i]*vec_b[j]
    return out

#updating the weights
weight_deltas = outer_prod(input,delta) * alpha
weights = np.array(weights) - weight_deltas

neural_network([toes[1],wlrec[1],nfans[1]], weights), [hurt[1], win[1], sad[1]]


Out[612]:
(array([ 0.3549355,  1.524418 , -0.624401 ]), [0.0, 1, 0.0])

the below is a shitfight

The digits dataset contains 1,797 8x8 images.


In [423]:
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

# The digits dataset
digits = datasets.load_digits()
print(digits.data.shape)
digits.data # the actual data


(1797, 64)
Out[423]:
array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,  10.,   0.,   0.],
       [  0.,   0.,   0., ...,  16.,   9.,   0.],
       ..., 
       [  0.,   0.,   1., ...,   6.,   0.,   0.],
       [  0.,   0.,   2., ...,  12.,   0.,   0.],
       [  0.,   0.,  10., ...,  12.,   1.,   0.]])

The dataset contains a visual representation of the image data:


In [428]:
plt.matshow(digits.images[0])


Out[428]:
<matplotlib.image.AxesImage at 0x11277d908>

digits.target contains what each datapoint represents:


In [427]:
print(digits.target.shape)
digits.target


(1797,)
Out[427]:
array([0, 1, 2, ..., 8, 9, 8])

So the image above, which looks like a zero, so digits.data[0] should be equal to digits.target[0], is equal to 0.

So now to build out a nerual net to classify this.


In [474]:
inputs = digits.data
targets = digits.target
#one weight per input pixel
hidden_weights = np.random.random([64,10])
output_weights = np.random.random(10)
learning_rate = 0.01
epochs = 10
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

So the dot product of each input vector and the weights gives us 10 outputs:


In [473]:
hidden_input = np.dot(inputs[0], hidden_weights)
hidden_output = sigmoid(hidden_input)
hidden_output.shape


Out[473]:
(10,)

And the dot product of the hi


In [471]:
np.dot(a, output_weights)


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-471-4176378f2761> in <module>()
----> 1 np.dot(a, output_weights)

ValueError: shapes (64,) and (10,) not aligned: 64 (dim 0) != 10 (dim 0)

Now, the output numbers seem pretty high, so we might want to normalize the incoming data


In [422]:
#inputs = np.random.random([len(digits.data),64]) # a blank input
#for i, item in enumerate(digits.data):
#    inputs[i] = ((item - digits.data.mean()) / digits.data.std())

In [386]:
print(inputs.shape)
inputs


(1797, 64)
Out[386]:
array([[-0.8117562 , -0.8117562 ,  0.01925204, ..., -0.8117562 ,
        -0.8117562 , -0.8117562 ],
       [-0.8117562 , -0.8117562 , -0.8117562 , ...,  0.85026027,
        -0.8117562 , -0.8117562 ],
       [-0.8117562 , -0.8117562 , -0.8117562 , ...,  1.84747015,
         0.68405863, -0.8117562 ],
       ..., 
       [-0.8117562 , -0.8117562 , -0.64555455, ...,  0.18545368,
        -0.8117562 , -0.8117562 ],
       [-0.8117562 , -0.8117562 , -0.4793529 , ...,  1.18266357,
        -0.8117562 , -0.8117562 ],
       [-0.8117562 , -0.8117562 ,  0.85026027, ...,  1.18266357,
        -0.64555455, -0.8117562 ]])

Now calculating the hidden layer


In [387]:
hidden_inputs = np.dot(inputs, hidden_weights)
print(hidden_inputs.shape)
hidden_inputs


(1797, 10)
Out[387]:
array([[ -3.06830336,  -4.76420192,  -4.85003619, ...,  -2.76972724,
          2.21539197,   0.93605365],
       [ -3.52793678,   0.08871104,  -3.77219696, ...,  -3.08459096,
          2.10678359,   0.32016976],
       [  1.51288988,   2.80620158,  -2.84952504, ...,  -0.77463449,
          6.00694037,   4.89467518],
       ..., 
       [  0.80117623,   3.1278505 ,  -1.25573866, ...,   3.34908273,
          8.75502707,   7.78887479],
       [ -1.41450206,   0.37686184,  -0.51405872, ...,   0.62209819,
          6.90003339,   4.66896045],
       [  2.31815972,   5.54513861,   1.42283582, ...,   4.83149732,
         10.66877677,   9.76844909]])

Using the sigmoid function to calculate the output of the hidden layer:


In [388]:
hidden_outputs = sigmoid(hidden_inputs)
print(hidden_outputs.shape)
hidden_outputs


(1797, 10)
Out[388]:
array([[ 0.04443381,  0.00845755,  0.00776729, ...,  0.05898215,
         0.90162322,  0.71830182],
       [ 0.02852771,  0.52216323,  0.0224843 , ...,  0.04374736,
         0.89156076,  0.57936562],
       [ 0.81948909,  0.94301003,  0.05470587, ...,  0.31547743,
         0.99754444,  0.99256929],
       ..., 
       [ 0.69022603,  0.95802704,  0.22170833, ...,  0.96607479,
         0.99984236,  0.99958585],
       [ 0.19552493,  0.593116  ,  0.37424255, ...,  0.6506956 ,
         0.99899326,  0.99070519],
       [ 0.91036989,  0.9961088 ,  0.8057826 , ...,  0.99208852,
         0.99997674,  0.99994277]])

The hidden layer gives 10 outputs for each of the 1797 data points.

Now to predict the output neural layer:


In [446]:
output_pred = np.dot(hidden_outputs, output_weights) 
final_outputs = output_pred.round(0)
print(final_outputs.shape)
final_outputs


(1797,)
Out[446]:
array([ 1.,  2.,  3., ...,  4.,  3.,  5.])

What are the output errors?


In [451]:
output_errors = (final_outputs - digits.target)**2
print(output_errors.shape)
output_errors


(1797,)
Out[451]:
array([  1.,   1.,   1., ...,  16.,  36.,   9.])

In [454]:
output_delta = final_outputs - digits.target
output_delta[:10]


Out[454]:
array([ 1.,  1.,  1., -3., -4., -2., -4., -6., -4., -7.])

Now how to adust the weights now that we have the output errors?

Now, to multiply each of the 1797 final errors with the output weights. So we need to iterate through each of the data arrays and corresponding targets:


In [560]:
#weights
ih_weight = np.random.random([64,10])
ho_weight = np.random.random(10)
learning_rate = 0.01
epochs = 20

for epoch in range(epochs):
    
    for input, target in zip(digits.data, digits.target):
        hidden_layer = np.dot(input, ih_weight)
        hidden_outputs = sigmoid(hidden_inputs)

        output_pred = np.dot(hidden_outputs, ho_weight) 
        output = output_pred.round(0)

        # so now we have a prediction. time to back propagate!

        #starting with the output errors:
        error = output - target

        output_error = error * output * (1 - output)

        output_weight_delta = np.dot(ho_weight, output_delta) * learning_rate

        ho_weight -= output_weight_delta * learning_rate
    
def neural_net(input)

In [561]:
output_weight_delta


Out[561]:
array([ -528.315657  ,  -498.88232425,  -965.34318155, -1118.71371346,
        -913.29027258,  -461.29240453,  -609.00620295, -1384.59160212,
        -428.01193233,  -584.50716073])

In [562]:
ho_weight


Out[562]:
array([ 17615.80505652,  16634.39963162,  32187.75948365,  37301.64425246,
        30452.14198882,  15381.02640839,  20306.2968269 ,  46166.89932013,
        14271.3445303 ,  19489.41709594])

In [563]:
ih_weight_delta.shape


Out[563]:
(64,)

In [ ]: