Welcome to Practical Deep Learning BootCamp!

Practical, hands-on and realtime - For beginners and the non-scientists

Rahul Atlury

Fabian David Tschopp

MODULE 1

Introduction to python

  1. Data
  2. Storing of Data
  3. Types of Data
  4. Collections of Data
  5. Accessing Data
  6. Decisions on Data
  7. Error handling and Functions
  8. Classes

Foundations of mathematics

What is differentiation?

Our study of deep learning begins with understanding differentiation. How the various quantities in the world change with respect to one another.

Differentiation is all about finding rates of change of one quantity compared to another.

One important application of differentiation is in the area of optimisation, which means finding the condition for a maximum (or minimum) to occur. This is important in business (cost reduction, profit increase) and engineering (maximum strength, minimum cost/loss.)

Rate of change

In geometry slope represents rate of change or the steepness of a line.

Mathematically it answers the question: how much does one quantity $y$ or $f(x)$ change given a specific change in another $x$.

Applications include:

  • Temperature change at a particular time
  • Velocity of a falling object at a particular time
  • Current through a circuit at a particular time
  • Variation in stockmarket prices at a particular time
  • Population growth at a particular time
  • Temperature increase as density increases in a gas

Slope

We can write:

change in $y$ as $\nabla y$ and change in $x$ as $\nabla x$

By definition, the slope is given by: $$ slope = \frac{change\; in\; y}{change\; in\; x} = \frac{\nabla y}{\nabla x} = \frac{y_2 - y_1}{x_2-x_1} $$

Constant rate of change

Consider an example of a car travelling at a constant 50 km/h. The distance-time graph would look like this:


In [0]:
%env EXPERIMENTAL_TQDM=1 
import numpy as np 
import matplotlib.pyplot as plt
y = [0,50,100,150,200,250,300] 
x = [0, 1, 2, 3, 4, 5,6] 
plt.title("Constant rate of change") 
plt.xlabel("Time in hours")
plt.ylabel("Distance in kilometers")
plt.text(0,260,'Notice that slope is always constant \n 300/6 = 50 for the whole graph.',fontsize=14, color='green')
plt.text(2.5,60,'There is a constant rate of change \n of the distance compared to the time.',fontsize=14, color='green')
plt.plot(x, y) 
plt.show()


env: EXPERIMENTAL_TQDM=1

Derivatives - Instantaneous rate of change

Calculus is applied where the rate of change is not constant.

A secant is a straight line that cuts a curve. It is the average rate of change, or simply the slope between two points.

Consider the graph below, where $ f(x) = x^2 + 3$

The slope between (1,4) and (3,12) would be: $$\frac{y2-y1}{x2-x1} = \frac{12-4}{3-1} = 4$$

But to calculate the slope at point (1,4) to reveal the change in slope at that specific point, we have three options:

Option 1: Find the two nearest points, calculate their slopes relative to x and take the average.

Option 2: Calculate the slope from x to another point an infinitesimally small distance away from x

Option 3: Calculate the derivative

Option 2 and 3 are essentially the same.

We introduce here the tangent which is a straight line that just touches a curve. The tangent line represents the instantaneous rate of change of the function at that one specific point. The slope of the tangent line at a point on the function is equal to the derivative of the function at the same point.

In other words derivative is the method for finding the slope of the tangent line.

The derivative is a scalar value.

What does a derivative tell us?

Imagine you have a powerful microscope and as you zoom into the graph of a function, the derivative of a function tells us specifically how steep the tangent line to the graph would be at the point we are zooming in on.

The steepness, or slope varies from point to point: some parts of the graph are flatter, and some parts slope up more steeply.

We also know that that steepness, or slope varies from point to point: some parts of the graph are flatter, and some parts slope up more steeply.

On the graph below, we have plotted a graph of a $y=f(x)=x^3-x$,

We have decorated the graph of this polynomial with a few of its tangent lines. You can see by looking at these lines that their slope changes from point to point.

The function starts out with positive slope which tells us that it must have grown for some time. Notice how the tangent lines are pointing towards increasing y values when their slope is positive.

Eventually, it flattened out, and precisely at the point where the slope of its tangent was zero, the function had a little "high". (We will call that place a local maximum.)

After that, the function had a negative slope. This tells us it was decreasing. After some time, the slope flattened out to zero and the function had a local minimum. After that it just grew and grew. (In fact, its growth got faster and faster, since the slope of its tangents got more and more steep)

We can summarize the findings as follows:

  1. A positive derivative means that the function is increasing
  2. A negative derivative means that the function is decreasing
  3. A zero derivative means that the function has some special behaviour at the given point. It may have a local maximum, a local minimum, (or in some cases, a "turning" point)

There are two most commonly used methods of derivatives for finding slopes.

Numerical method of finding slope

The slope of the curve $y = x^2$ at the point $(2,4)$ can be found using a numerical method.

Step1: We start with a point $Q(1,1)$ which is somewhere near $P(2,4)$ The slope of $PQ$ is given by:

${m}=\frac{{{y}_{{2}}-{y}_{{1}}}}{{{x}_{{2}}-{x}_{{1}}}} = \frac{{{4}-{1}}}{{{2}-{1}}}={3}$

Step2: Now we move $Q$ further around the curve so it is closer to $P$. Let's use ${Q}{\left({1.5},{2.25}\right)}$ which is closer to ${P}{\left({2},{4}\right)}$

The slope of $PQ$ is now given by:

${m}=\frac{{{y}_{{2}}-{y}_{{1}}}}{{{x}_{{2}}-{x}_{{1}}}}=\frac{{{4}-{2.25}}}{{{2}-{1.5}}}=3.5$

We see that this is already a pretty good approximation to the tangent at $P$, but not good enough.

Step3: Now we move $Q$ even closer to $P$, say ${Q}{\left({1.9},{3.61}\right)}$.

${m}=\frac{{{y}_{{2}}-{y}_{{1}}}}{{{x}_{{2}}-{x}_{{1}}}}=\frac{{{4}-{3.61}}}{{{2}-{1.9}}}={3.9}$

We can see that we are very close to the required slope.

Now if $Q$ is moved to ${\left({1.99},{3.9601}\right)}$, then slope $PQ$ is ${3.99}$.

If $Q$ is ${\left({1.999},{3.996001}\right)}$, then the slope is ${3.999}$.

Clearly, as ${x}\rightarrow{2}$, the slope of ${P}{Q}\rightarrow{4}$. But notice that we cannot actually let ${x}={2}$, since the fraction for $m$ would have ${0}$ on the bottom, and so it would be undefined.

We have found that the rate of change of y with respect to x at the point ${x}={2}$ is ${4}$ units .

Algebraic method of finding slope

Algebraic method of finding slope saves doing the numerical substitutions.

As we consider shorter and shorter sequence of secants, i.e, the sequence of slopes, the tangent then at any point is said to be the limit of that sequence of slopes. That slope, that limit, will be the value of what we will call the derivative.

A limit is defined as the output value; a function approaches as the input value approaches another value. In our case the target value is the specific point at which we want to calculate slope

Calculating the derivative is the same as calculating normal slope, however in this case we calculate the slope between our point and a point infinitesimally close to it. We use the variable h to represent this infinitesimally distance. Here are the steps:

  1. Given the function $$f(x) = x^2$$

  2. Increment $x$ by a very small value $h$ where $h = \nabla x$

$$ f(x + h) = (x + h)^2$$
  1. Apply the slope formula
$$\frac {\nabla y}{\nabla x} =\frac{f(x + h) - f(x)}{h}$$
  1. Simplify the equation
$$\frac{x^2+2xh+h^2-x^2}{h} = \frac{2xh+h^2}{h} = 2x+h$$
  1. Set $h$ to $0$ (the limit as $h$ heads toward $0$) $${2x + 0} = {2x}$$

So what does this mean? It means for $f(x) = x^2$, the slope at any point can be calculated using the function $2x$.

The formula is defined as:

$$\frac{df}{dx} = \lim_{h\to0}\frac{f(x+h) - f(x)}{h}$$

In [0]:
def xSquared(x):
    return x**2

def getDeriv(func, x):
  """Compute the derivative of `func` at the location `x`."""
  h = 0.0001 #step size
  return (func(x+h) - func(x)) / h

x = 3 # the location of interest
derivative = getDeriv(xSquared, x)
actual = 2*x

print("The derivative is: ")
print(derivative)
print("\nThe actual is: ")
print(actual)


The derivative is: 
6.000100000012054

The actual is: 
6

Scalar derivative rules

The following are rules for applying derivatives to functions

Rule $f(x)$ Scalar Derivative Example of Rule
Constant $c$ $0$ $ \frac{d}{dx} 99 = 0$
Multiplication $cf$ $c\frac{df}{dx}$ $ \frac{d}{dx} 3x = 3$
Power Rule $x^n$ $n x^{n-1}$ $ \frac{d}{dx} x^3 = 3x^2$
Sum Rule $f+g$ $\frac{df}{dx} + \frac{dg}{dx}$ $\frac{d}{dx} (x^2+3x) = 2x+3$
Difference Rule $f-g$ $\frac{df}{dx} - \frac{dg}{dx}$ $\frac{d}{dx} (x^2-3x) = 2x-3$
Product Rule $fg$ $f\frac{dg}{dx}+\frac{df}{dx}g$ $\frac{d}{dx} x^2x = x^2+x2x = 3x^2$
Chain Rule $f(g(x))$ $g(x)=u \;\frac{df(u)}{du}\frac{du}{dx}$ $\frac{d}{dx}ln(x^2)=\frac{1}{x^2} 2x=\frac2x$

Solving for single-variable scalar value

The Problem

Study time in hours (x) Marks scored (y)
1 hour 35
2 hours 39
3 hours 41
4 hours 44
5 hours 49
6 hours ?

We need to look at the existing data and predict the possible marks ($y$) that can be scored given 6 hours of study ($x$).

Input and output are ground truth hence cannot be changed.

Instead we multiply random values called $weights$ to the inputs. The reason for this is that, instead of adjusting the values of inputs directly, we adjust and tunes the weights.

This preserves the sanity of the inputs and lets us adjust only the weights during training. Conceptually multiplication helps us to control the value without directly changing it.

Mathematically weight association is represented as follows:
$$y_p = w \; x $$ Where $y_p$ is output, $w$ is the weight and $x$ is in the input

The above equation interestingly represents a line. Hence to solve the problem we need to find:

  • A line that fits the data
  • We then use the line to make the above prediction.

That line is nothing but a model (a function/equation), used to make predictions. This is called called linear regression.

The below is plot of all data points:


In [0]:
import numpy as np 
import matplotlib.pyplot as plt
x_study_time = np.array([1,2,3,4,5])  
y_marks = np.array([15,30,40,55,65])
plt.title("Report Card") 
plt.xlabel("Study time in hours")
plt.ylabel("Marks scored")
plt.scatter(x_study_time, y_marks)
plt.show()


The Cost Function

The process of finding the line starts with finding the cost function.

The weight is initially assigned some random value. After initializing the weight $w$ to random value and multiplying it with input, the output may not match the desired output. Hence an error or loss is observed.

$$Error = (Actual\;output - Predicted\;output) $$$$ Loss = (y-y_p) = {(y-y_p)^2}$$

Positive values are desired most times and to mitigate effects when a negative value is obtained, loss functions are squared.

$$y_p = w \; x$$$$ Loss = {(y - y_p)^2} = (y-(wx))^2$$
$x$ $y$ $y_p$ (w=2) Loss (w=2) $y_p$ (w=6) Loss (w=6) $y_p$ (w=10) Loss (w=10)
1 15 2 $13^2$ 6 $9^2$ 10 $5^2$
2 30 4 $26^2$ 12 $18^2$ 20 $10^2$
3 40 6 $34^2$ 18 $22^2$ 30 $10^2$
4 55 8 $47^2$ 24 $31^2$ 40 $15^2$
5 65 10 $55^2$ 30 $35^2$ 50 $15^2$
$Mean\; Loss$ $1447$ $615$ $135$

We can consider mean (as there are many data points per weight) in our loss function and call it the mean squared loss/error function. Popular loss functions include: Mean Squared Error, Root Mean Squared Error, and Log Loss. Thus MSE is the sum of all mean square loss.

$$MSE = \frac{1}{T} \sum_{i=1}^{T} (y - y_p)^2 $$

where $T$ is the total number of observations (data points)

where $ \sum_{i=1}^{T}$ is the mean

The cost function then is nothing but mean squared loss expressed as a function of input and weight variables.

$$MSE = \frac{1}{T} \sum_{i=1}^{T} (y - wx)^2 $$


In [0]:
def cost_function(x, y, w):
    N = len(x)
    #print("The given length is: ",N)
    total_error = 0.0
    for i in range(N):
        temp= (y[i] - (w*x[i]))**2 
        total_error+=temp
        #print("The current value of error is: ", temp)
    return total_error / N
print(cost_function([1,2,3,4,5],[15,30,40,55,65],2))  
print(cost_function([1,2,3,4,5],[15,30,40,55,65],8))
print(cost_function([1,2,3,4,5],[15,30,40,55,65],9))  
print(cost_function([1,2,3,4,5],[15,30,40,55,65],10))
print(cost_function([1,2,3,4,5],[15,30,40,55,65],11))  
print(cost_function([1,2,3,4,5],[15,30,40,55,65],12))  
print(cost_function([1,2,3,4,5],[15,30,40,55,65],14))  
print(cost_function([1,2,3,4,5],[15,30,40,55,65],15))  
print(cost_function([1,2,3,4,5],[15,30,40,55,65],16))
print(cost_function([1,2,3,4,5],[15,30,40,55,65],17))
print(cost_function([1,2,3,4,5],[15,30,40,55,65],18))
print(cost_function([1,2,3,4,5],[15,30,40,55,65],19))


1447.0
331.0
222.0
135.0
70.0
27.0
7.0
30.0
75.0
142.0
231.0
342.0

Reducing Cost

The next step involves solving the cost which includes iteratively finding the correct weight where for each iteration the initial weight is adjusted so that the error or loss tends to zero.

The equation $y = w \; x$ is similar to the equation of the straight line $y = m\; x+b$ where $m$ is the slope (to change the steepness), $b$ is the bias (to move the line up and down the graph), $x$ is the explanatory variable, and $y$ is the output.

Thus, the change in the value of $w$, changes the slope of the line.


In [0]:
import numpy as np 
import matplotlib.pyplot as plt
x_study_time = np.array([0,1,2,3,4,5])  
y_marks = np.array([0,15,30,40,55,65])
plt.title("Report Card") 
plt.xlabel("Study time in hours")
plt.ylabel("Marks scored")
plt.scatter(x_study_time, y_marks, color='g')
plt.plot(x_study_time*2, color='k')
plt.plot(x_study_time*6, color='m')
plt.plot(x_study_time*10, color='b')
#plt.plot(x_study_time*13.5, color='y')

plt.text(4,10,'w = 2',fontsize=14, color='black')
plt.text(4,5,'Loss = 1447',fontsize=14, color='black')
plt.text(4,27,'w = 6',fontsize=14, color='purple')
plt.text(4,21,'Loss = 615',fontsize=14, color='purple')
plt.text(4,46,'w = 10',fontsize=14, color='blue')
plt.text(4.2,39,'Loss = 135',fontsize=14, color='blue')
plt.show()


To reduce the overall cost, we first consider the cost function graph w.r.t weights so that we can arrive at the correct weights that make the cost zero.

Note that you could at start at zero from the right most side and then move to left most side on the graph below.


In [0]:
import numpy as np 
import matplotlib.pyplot as plt
w = np.array([8,9,10,11,12,14,15,16,17,18,19])  
mse = np.array([331,222,135,70,27,7,30,75,142,231,342])
plt.title("Derivative Descent") 
plt.xlabel("Weight")
plt.ylabel("Cost")
plt.text(9,250,'Down right',fontsize=14, color='red')
plt.text(15.8,200,'Down left',fontsize=14, color='orange')
plt.plot(w, mse)
plt.show()


As per stats.stackexchange.com/questions/332951/online-algorithm-for-the-mean-square-error where the above code can be mathematically correct, but leads to numerical problem. The numerical problem originates from the fact that we are calculating the difference of two larger number to get a small number, which is limited by the numerical precision. For better numerical accuracy consider solution in the above post.

Thus for 6 hours of study the marks obtained will be

$$y_p = wx = 14 \times 6 = 84$$

We consider the value 14 here for the weight as we deduced that the cost =7 is the lowest at this weight before it again starts increase.

These hand calculations are find for small values but if the dataset is large then we need to consider mathematical means to solve the same problem.

Calculate the derivative

To find the correct value of $w$ at which point the cost is the minimum. We require two things:

  1. First is the direction to move the weight $w$ to reduce it - When we look at the plot of a function, a negative slope means the function goes downward towards the right, so we want to move right to find the minimum, similarly a positive slope means the function goes upward as you move right, so we want to move left in order to find the minimum.

  2. Second is the amount of the step to take for $w$ - If the value of the slope is large we want to take a large step because we’re far from the minimum. If the slope is small we want to take a smaller step. Hence the algorithm can take increasingly smaller steps towards the minimum with each iteration if slope is small.

The derivative of the cost function gives us a scalar value (slope) with sign. Hence, to minimize the cost function, we move in the direction opposite to the derivative.

We consider the derivative of cost function since the slope of the cost function at current $w$ value tells the above required two things.

That is the instantaneous rate of change of cost at a certain point $w$.

The derivative with respect to the weight is calculated as follows:

$$\frac{dL}{dw} = \frac{1}{T} \sum_{i=1}^{T} (y - wx)^2 $$

$$\frac{dL}{dw} = \frac{1}{T} \sum_{i=1}^{T} \; \frac{d}{dw}(y - wx)^2 $$

Note: We use chain rule for composite functions where we differentiate outlayer square layer first, leaving $(y - wx)$ unchanged. Then differentiate $(y - wx)$.

$$\frac{dL}{dw} = \frac{1}{T} \sum_{i=1}^{T} \; 2(y - wx) \cdot \frac{d}{dw}[y - wx] $$

$$\frac{dL}{dw} = \frac{1}{T} \sum_{i=1}^{T} \; 2(y - wx)\; (\frac{d}{dw}[y] - x\cdot\frac{d}{dw}[w])$$ $$\frac{dL}{dw} = \frac{1}{T} \sum_{i=1}^{T} \; 2(y - wx)\; (0 - x \cdot 1)$$ $$\frac{dL}{dw} = \frac{1}{T} \sum_{i=1}^{T} \; -2x\left(y-wx\right)$$

The sum sign vanishes in the program below as it turns into the number of iterations.

The delta rule

Once we calculate the derivative we update the weight by an amount proportional to derivative $\frac{ dL}{ dw}$, i.e. $w=w−η⋅\frac{ dL}{dw} $

This is called the delta rule. With the delta-rule we increment or decrement the weight by the input scaled by the residual error and the learning rate.

The variable $η$ is the learning rate which determines the size of the steps we take to reach a minimum. The learning rate gives the engineer some additional control over how large of steps we make. Selecting the right learning rate is critical. If the learning rate is too large, you can overstep the minimum and even diverge. The only concern with using too small of a learning rate is that you will need to run more iterations of gradient descent, increasing your training time.


In [0]:
def calculate_derivative_update_weights(X, Y, w, learning_rate):
    m_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate derivatives
        # -2x(y - (mx))
        m_deriv = m_deriv - 2*X[i] * (Y[i] - (w*X[i])) 
    #We subtract because the derivatives point in direction of steepest ascent
    w = w - (m_deriv / float(N)) * learning_rate
    return w

Derivative Descent Algorithm

Step 1: Initialize model weight with a random number

Step 2: For each epoch/iteration

Step 2a: For each data point in training data:

Calculate the derivative: To calculate the derivative, we compute the prediction, then the cost, and then the derivative.

Calculate the prediction (forward): $$y_p=w\;x $$

Calculate the cost: $$MSE = \frac{1}{T} \sum_{i=1}^{T} (y - y_p)^2 $$

Calculate the derivative (backward): $$\frac{ dL}{dw} = \frac{1}{T} \sum_{i=1}^{T} - 2x \; (y - wx) $$

Update the weight (derivative descent) using delta rule: $$w=w + (−η⋅\frac{dL}{dw}) =w −η⋅\frac{dL}{dw} $$

Calculate cost or check if termination criteria is met (see if mean-squared-error on entire validation data reduced).

Step 4: Repeat above for epochs

You could repeat till weight stops reducing or other pre-defined termination criteria is met

Methods for deciding when to stop derivative descent are many, but most times, derivative descent is run for a large, fixed number of iterations (for example, 100 iterations), with no test for convergence.


In [0]:
def cost_function(x, y, w):
    N = len(x)
    #print("The given length is: ",N)
    total_error = 0.0
    for i in range(N):
        #consider changing it to w*x[i] - y[i] 
        temp= ((y[i] - w*x[i]))**2 
        total_error+=temp
        #print("The current value of error is: ", temp)
    #print("Total error: ",total_error/N)
    return total_error / N

    
def calculate_derivative_update_weights(X, Y, w, learning_rate):
    m_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate derivatives
        # -2x(y - (wx))
         m_deriv = m_deriv - 2*X[i] * (Y[i] - (w*X[i])) 
    # We subtract because the derivatives point in direction of steepest ascent
    w = w - (m_deriv / float(N)) * learning_rate
    return w
    
  
n_weight = 1 #some random weight
for epoch in range(10):
    #print("Iteration: ", epoch)
    error = cost_function([1,2,3,4,5],[15,30,40,55,65],n_weight)
    print("Error: ",error)
    n_weight = calculate_derivative_update_weights([1,2,3,4,5],[15,30,40,55,65],n_weight,0.06)
    print("Weight:", n_weight)
    #print("\n")


Error:  1710.0
Weight: 17.439999999999998
Error:  178.4495999999998
Weight: 12.179200000000002
Error:  21.618839039999955
Weight: 13.862656
Error:  5.559369117695985
Weight: 13.323950080000001
Error:  3.9148793976520637
Weight: 13.4963359744
Error:  3.7464836503195698
Weight: 13.441172488192
Error:  3.729239925792726
Weight: 13.45882480377856
Error:  3.7274741684011765
Weight: 13.453176062790861
Error:  3.727293354844285
Weight: 13.454983659906924
Error:  3.727274839536052
Weight: 13.454405228829785

Advanced mathematics

Partial derivatives

For functions with more than one changing parameters, we need to consider derivatives of the function with respect to all variables. This requires the use of partial derivatives.

Partial derivatives are like ordinary derivatives, except that we keep all other variables fixed when we consider the derivative of the function. If there are N variables, then there are partial derivatives with respect to all N variables.

For a multivariable function, like $f(x,y)=x^2 \cdot y$, computing partial derivatives looks something like this:

$$\frac {\partial f}{\partial x} = \frac {\partial}{\partial x}x^2 y = 2 x y$$

$$\frac {\partial f}{\partial y} = \frac {\partial}{\partial y}x^2 y = x^2\cdot 1$$

In the first case we treat $y$ as constant and take derivative. In second case we treat $x$ as constant and take derivative.

The Gradient

  • The derivative is what will get you a speeding ticket, not your average speed.

  • The temperature can rise gradually from 10C to 20C in 5 minutes and the same change can happen more rapidly in 1 minute, then the rate of change, "gradient" of temperature in the latter is more.

In mathematics, the gradient is a multi-variable generalization of the derivative.

While a derivative can be defined on functions of a single variable, for functions of several variables, the gradient takes its place. The gradient is a vector-valued function, as opposed to a derivative, which is scalar-valued.

Like the derivative, the gradient represents the slope of the tangent of the graph of the multi-variable function. More precisely, the gradient (vector) points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction.

The gradient is a vector.

Thus, if a function takes 3 variables then it will have a gradient with 3 components:

  • F(x) has one variable and a single derivative: dF/dx
  • F(x,y,z) has three variables and three derivatives: ($\partial$ F/$\partial$x, $\partial$F/$\partial$y, $\partial$F/$\partial$z)

The gradient of a multi-variable function has a component for each direction.

Thus the gradient of a function is the collection of all its partial derivatives into a vector.

When we move from derivatives of one function to derivatives of many functions, we move from the world of vector calculus to matrix calculus. We can organize partial derivatives into a horizontal vectors.

We call this vector the gradient of $f(x,y)$ and write it as follows:

Consider a multivariable function $$f(x, y) = x^2y$$

$$gradient \; f(x,y) = \nabla f(x,y) = \begin {bmatrix} \frac{\partial \mathbf{f(x,y)}} {\partial \mathbf{x}} \;\; \frac{\partial \mathbf{f(x,y)}}{\partial \mathbf{y}} \end {bmatrix} = \begin {bmatrix} \frac{\partial } {\partial \mathbf{x}} x^2y \;\; \frac{\partial }{\partial \mathbf{y}} x^2y \end {bmatrix} = \begin {bmatrix} 2xy \;\; x^2 \cdot 1 \end{bmatrix}$$

In a microwave with varying temperatures in different coordinates, the coordinates are the current location, measured on the x-y-z axis.

The gradient is a direction to move from the current location, that is, it points in the direction of greatest increase of a function. In this case, the greatest increase in temperature.

So, the gradient tells us which direction to move the bread to get it to a location with a higher temperature, to cook it even faster. Thus it gives us the direction to move to increase our temperature.

Note that the gradient does not give us the coordinates of where to go; it gives us the direction to move to increase our temperature.

Local minimums and maximums

Assume we start at a random point and check the gradient, follow this trajectory for a tiny bit, and then check the gradient again. We’d keep repeating this process: move a bit in the gradient direction, check the gradient, and move a bit in the new gradient direction. Every time we nudged along and follow the gradient, we’d get to a warmer and warmer location.

Eventually, we’d get to the hottest part of the oven and that’s where we’d stay, about to enjoy our fresh baked bread. Once we are at the maximum location, there is no direction of greatest increase. Any direction we follow will lead to a decrease in temperature.

A zero gradient tells us that we are at the max of the function, and can’t do better.

What if there are two nearby maximums ?

Finding the maximum in regular single variable functions means we find all the places where the derivative is zero: after which there is no further direction of greatest increase.

Similarily we must find multiple locations where the gradient is zero and test these points to see which one is the global maximum.

Directional Derivative

The gradient can be generalized to calculate the partial derivatives along any given direction.

The direction is specified by a unit vector with length, or magnitude, of 1 that points in the direction in which we want to compute the slope.

The directional derivative is then the dot product of the gradient and the unit vector $v$

Hence computing the directional derivative involves a dot product between the gradient and the unit vector.

$$\nabla f \cdot v⃗ \; = \begin{bmatrix} \frac{\partial f } {\partial \mathbf{x}} \\ \frac{\partial f }{\partial \mathbf{y}} \\ \frac{\partial f }{\partial \mathbf{z}} \end{bmatrix} \cdot \begin{bmatrix} v_1\\v_2\\v_3 \end{bmatrix} = v_1 \frac{\partial f } {\partial \mathbf{x}} (x,y,z) + v_2 \frac{\partial f } {\partial \mathbf{y}} (x,y,z) + v_3 \frac{\partial f } {\partial \mathbf{z}} (x,y,z) \; $$

Here, $v_1$, $v_2$ and $v_3$ are the components of unit vector $v$.

Example

What is the directional derivative of $f$ at the point $(2, -3)$ along the vector $v⃗ = 0.6\hat i + 0.8\hat j$ given $f$ as below.

$$f(x, y) = x^2 - xy$$$$Gradient \;{\nabla}f = \begin{bmatrix} \frac{\partial f } {\partial \mathbf{x}} \\ \frac{\partial f }{\partial \mathbf{y}} \\ \end{bmatrix} = \begin{bmatrix} \frac{\partial } {\partial \mathbf{x}} (x^2 - xy) \\ \frac{\partial }{\partial \mathbf{y}} (x^2 - xy) \end{bmatrix} = \begin{bmatrix} 2x - y \\ - x \end{bmatrix}$$

$$Gradient \; {\nabla}f (2,-3) = \begin{bmatrix} 2(2)−(−3) \\ -(2) \end{bmatrix} = \begin{bmatrix} 7 \\ -2 \end{bmatrix}$$

$$Directional \; derivative \; = {\nabla}f (2,-3) \cdot v⃗ = {\nabla}f (2,-3) \cdot (0.6\hat i + 0.8\hat j) = \begin{bmatrix} 7 \\ -2 \end{bmatrix} \cdot \begin{bmatrix} 0.6 \\ 0.8 \end{bmatrix} =7 \; (0.6) + (-2) \; (0.8) = 2.6$$

The directional derivative is computed by taking the dot product of the gradient of $f$ and a unit vector $\vec{v}$ of "tiny nudges" representing the direction. The unit vector describes the proportions we want to move in each direction. The output of this calculation is a scalar number representing how much $f$ will change if the current input moves with vector $\vec{v}$.

Solving for multiple-variable scalar values

The Model

We can extend the previous example to include two unknown parameters $w$ and $b$.

The second number tells us how high the line should be at the start.

The first number tells us the angle of the line.

Our model here can be described as

$$y=wx+b$$

where $w$ is the slope (to change the steepness), $b$ is the bias (to move the line up and down the graph), $x$ is the explanatory variable, and $y$ is the output.

The Problem

We will consider the following as our model. $$y_p=wx+b$$

The Cost function

$$MSE = \frac{1}{2\;T} \sum_{i=1}^{T} (y - y_p)^2 = \frac{1}{2\;T} \sum_{i=1}^{T} (y - (wx+b))^2$$

The factor of $\frac {1}{2}$ is included to cancel the exponent when differentiating. Later, the expression will be multiplied with an arbitrary learning rate, so that it doesn't matter if a constant coefficient is introduced now.


In [0]:
def cost_function(x, y, w, b):
    N = len(x)
    #print("The given length is: ",N)
    total_error = 0.0
    for i in range(N):
        temp= (y[i] - (w*x[i]+b))**2 
        total_error+=temp/2
        #print("The current value of error is: ", temp)
    return total_error / N

The Gradient of cost function

There are two parameters in our cost function that we can control: w and b. Since we need to consider the impact, each one has on the final prediction, we need to use partial derivatives. We calculate the partial derivatives of the cost function with respect to each parameter and store the results in a gradient.

$$\frac{\partial L}{\partial w} = \frac{1}{2\;T} \sum_{i=1}^{T} \; \frac{d}{dw}(y - (wx+b))^2 $$$$= \frac{1}{2\;T} \sum_{i=1}^{T} \; 2(y - (wx+b))\frac{d}{dw}(y - (wx+b)) $$

$$= \frac{1}{2\;T} \sum_{i=1}^{T} \; 2(y - (wx+b))\;(0- 1. x+0) $$ $$= \frac{1}{2\;T} \sum_{i=1}^{T} \; -2x(y - (wx+b)) =\frac{1}{T} \sum_{i=1}^{T} -x(y - (wx+b)) $$

$$\frac{\partial L}{\partial b} = \frac{1}{2\;T} \sum_{i=1}^{T} \; \frac{d}{db}(y - (wx+b))^2 $$$$= \frac{1}{2\;T} \sum_{i=1}^{T} \; 2(y - (wx+b))\frac{d}{db}(y - (wx+b)) $$

$$= \frac{1}{2\;T} \sum_{i=1}^{T} \; 2(y - (wx+b))\;(0-0+1) $$ $$= \frac{1}{2\;T} \sum_{i=1}^{T} \; -2(y - (wx+b)) =\frac{1}{T} \sum_{i=1}^{T} -(y - (wx+b)) $$

The gradient thus becomes:

$$ {\nabla}f = \begin{bmatrix} \frac{\partial L } {d \mathbf{w}} \\ \frac{\partial L }{d \mathbf{b}} \\ \end{bmatrix} = \begin{bmatrix} \frac{1}{T} \sum_{i=1}^{T} -x(y - (wx+b)) \\ \frac{1}{T} \sum_{i=1}^{T} -(y - (wx+b)) \end{bmatrix} $$

In [0]:
def update_weights(w, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -x(y - (wx + b))
        m_deriv += -X[i] * (Y[i] - (w*X[i] + b))

        # -(y - (wx + b))
        b_deriv += -(Y[i] - (w*X[i] + b))

    # We subtract because the gradients point in direction of steepest ascent
    w -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return w, b

The Gradient descent algorithm

Thus the gradient descent algorithm is as follows:

Step 1: Initialize model weight and bias with a random number

Step 2: For each epoch/iteration

Step 2a: For each data point in training data:

Calculate the gradient:

Calculate the prediction (forward)

Calculate the cost

Calculate the partial derivatives (backward)

Update the weight (gradient descent): Simultaneous update both $w$ and $b$

$$w=w + (−η⋅\frac{\partial L}{\partial w}) =w −η⋅\frac{\partial L}{\partial w} $$$$b=b + (−η⋅\frac{\partial L}{\partial b}) =b −η⋅\frac{\partial L}{\partial b} $$

Calculate cost or check if termination criteria is met (see if mean-squared-error on entire validation data reduced).

Step 4: Repeat above for epochs


In [0]:
def cost_function(x, y, w, b):
    N = len(x)
    #print("The given length is: ",N)
    total_error = 0.0
    for i in range(N):
        temp= (y[i] - (w*x[i]+b))**2 
        total_error+=temp/2
        #print("The current value of error is: ", temp)
    return total_error / N
  
def update_weights(w, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -x(y - (wx + b))
        m_deriv += -X[i] * (Y[i] - (w*X[i] + b))

        # -(y - (wx + b))
        b_deriv += -(Y[i] - (w*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    w -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return w, b
  
n_weight = 1 #some random weight
n_bias = 0 #some random bias
for epoch in range(10):
    #print("Iteration: ", epoch)
    error = cost_function([1,2,3,4,5],[15,30,40,55,65],n_weight,n_bias)
    print("Error: ",error)
    n_weight,n_bias = update_weights(n_weight,n_bias,[1,2,3,4,5],[15,30,40,55,65],0.06)
    print("Weight:", n_weight)
    print("Bias:", n_bias)
    #print("\n")

#import matplotlib.pyplot as plt
#plt.style.use('fivethirtyeight')

#plt.scatter(X, y, color='black')
#plt.plot(X, clf.predict(X))
#plt.gca().set_title("Gradient Descent Linear Regressor")


Error:  855.0
Weight: 9.219999999999999
Bias: 2.28
Error:  72.67020000000004
Weight: 11.604399999999998
Bias: 2.9436
Error:  6.811272480000011
Weight: 12.295647999999998
Bias: 3.138192
Error:  1.2669396491520082
Weight: 12.49564576
Bias: 3.1966838400000004
Error:  0.8000667572610037
Weight: 12.5531164672
Bias: 3.2156665728000005
Error:  0.760631862415398
Weight: 12.569239615744
Bias: 3.223165614336
Error:  0.757182603214749
Weight: 12.57337165877248
Bias: 3.22731254664192
Error:  0.7567653329821199
Weight: 12.574030105587099
Bias: 3.2304668952643585
Error:  0.7566058655986267
Weight: 12.57368619475203
Bias: 3.2333134625428195
Error:  0.7564706103536937
Weight: 12.573056882957983
Bias: 3.236051139734885

Homework

Build a model for celsius = np.arange(0, 100, dtype='float64') and fahrenheit = celsius * 9.0/5.0 + 32.0 with $y = wx$ and $y = wx+b$

Data Representation for Vectors and Matrices

Data is represented as numbers in computing. Three mathematical ways of representing the data is via scalars, vectors and matrices which are collectively called tensors.

Tensors are arrays of data which contain those numbers. They can come in any form or dimension.

For understanding sake, we define:

  1. Scalars as single numbers made up of integers, real numbers, rational numbers
  2. A vector as a 1-D array of numbers accessible by a single index
  3. A matrix as a 2-D array of numbers accessible by two indices
  4. Tensors as arrays with more than two axis

Most common form of data organization for machine vector organization for machine learning is a 2D array.

Natural to think of each sample as a vector of attributes, and whole array as a matrix. A vector can be regarded as special case of a matrix, where one of matrix dimensions = 1.

In the tensor algebra, a scalar is a 0-tensor, a vector is a 1-tensor, a matrix is a 2-tensor, and higher order tensors don't generally have names. But in order for the linear algebra operation we just did, we had to promote a 0-tensor to a 1-tensor.

Tensors are in a sense compositional mathematical objects. Scalars are "made of" nothing. Vectors are "made of" scalars, matrices are made of vectors, 3-tensors are made of 2-tensors.

Vectors

Vectors are 1-dimensional arrays of numbers or terms. In geometry, vectors store the magnitude and direction of a potential change to a point. The vector [3, -2] says go right 3 and down 2. A vector with more than one dimension is called a matrix.

There are a variety of ways to represent vectors. Here are a few you might come across in your reading. $$v = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix}= \begin{pmatrix} 1 \\ 2 \\ 3 \\ \end{pmatrix}= \begin{bmatrix} 1 & 2 & 3\\ \end{bmatrix}$$

Vectors typically represent movement from a point. They store both the magnitude and direction of potential changes to a point. The vector [-2,5] says move left 2 units and up 5 units. A vector can be applied to any point in space. The vector’s direction equals the slope of the hypotenuse created moving up 5 and left 2. Its magnitude equals the length of the hypotenuse.

Scalar operations on Vectors

Scalar operations involve a vector and a number. You modify the vector in-place by adding, subtracting, or multiplying the number from all the values in the vector. $$\begin{bmatrix} 2 \\ 2 \\ 2 \\ \end{bmatrix}+1= \begin{bmatrix} 3 \\ 3 \\ 3 \\ \end{bmatrix}$$

Elementwise operations on Vectors

In elementwise operations like addition, subtraction, and division, values that correspond positionally are combined to produce a new vector. The 1st value in vector A is paired with the 1st value in vector B. The 2nd value is paired with the 2nd, and so on. This means the vectors must have equal dimensions to complete the operation. $$\begin{bmatrix} a_1 \\ a_2 \\ \end{bmatrix}+ \begin{bmatrix} b_1 \\ b_2 \\ \end{bmatrix}= \begin{bmatrix} a_1+b_1 \\ a_2+b_2 \\ \end{bmatrix}$$

Dot Product of Vectors

$$a =\begin{bmatrix} x_1\\y_1\\z_1 \end{bmatrix} \; b = \begin{bmatrix} x_2\\y_2\\z_2 \end{bmatrix}$$

$$a \cdot b = x_1 y_1 + x_2 y_2 + x_3 y_3 $$

Dot product of two vectors gives us a scalar.

Dot product can also be defined as:

$$a \cdot b = ||a||\; ||b|| \; \cosθ $$

The dot product measures how similar two vectors are, or, how well they travel together. In other words, if they are parallel (i.e. traveling in the same direction), the dot product will be as big as possible (either negatively big or positively big), and if the vectors are perpendicular (and so don’t travel well together at all), the dot product will be zero.

The problem with the dot product, though, is that it spits out a number. Sometimes we want a way to measure how well vectors travel together while still preserving some information about direction. In other words, we want a dot-product-like measurement that returns the same information as a vector rather than a scalar.

We could use $a$ as our “reference direction” and can also say that $a$ $\cdot$ $b$ measures how well $b$ travels in the direction of $a$.

If we want to preserve information about the direction we’re travelling in, we can just multiply $a$ $\cdot$ $b$ by the vector $a$.

$$(a\cdot b) \; a $$

The only issue here is that the length of $a$ is going to mess up the measurement, so to be safe, we should instead multiply the dot product by a unit vector in the $a$ direction.

Hadamard product of Vectors

Hadamard Product is elementwise multiplication and it outputs a vector. $$\begin{bmatrix} a_1 \\ a_2 \\ \end{bmatrix} \odot \begin{bmatrix} b_1 \\ b_2 \\ \end{bmatrix}= \begin{bmatrix} a_1 \cdot b_1 \\ a_2 \cdot b_2 \\ \end{bmatrix}$$

Dimensions of Matrices

We describe the dimensions of a matrix in terms of rows by columns. $$\begin{bmatrix} 2 & 4 \\ 5 & -7 \\ 12 & 5 \\ \end{bmatrix} \begin{bmatrix} a^2 & 2a & 8\\ 18 & 7a-4 & 10\\ \end{bmatrix}$$

The first has dimensions (3,2). The second (2,3).

Scalar Operations on Matrices

Scalar operations with matrices work the same way as they do for vectors. Simply apply the scalar to every element in the matrix — add, subtract, divide, multiply, etc.

$$\begin{bmatrix} 2 & 3 \\ 2 & 3 \\ 2 & 3 \\ \end{bmatrix}+1= \begin{bmatrix} 3 & 4 \\ 3 & 4 \\ 3 & 4 \\ \end{bmatrix}$$

Elementwise operations on Matrices

In order to add, subtract, or divide two matrices they must have equal dimensions. We combine corresponding values in an elementwise fashion to produce a new matrix. $$\begin{bmatrix} a & b \\ c & d \\ \end{bmatrix}+ \begin{bmatrix} 1 & 2\\ 3 & 4 \\ \end{bmatrix}= \begin{bmatrix} a+1 & b+2\\ c+3 & d+4 \\ \end{bmatrix}$$

Hadamard Product of Matrices

Multiplication of two tensors/matrices corresponds to an element-wise product or Hadamard product.

Hadamard product of matrices is an elementwise operation. Values that correspond positionally are multiplied to produce a new matrix.

Consider matrix the X and Y with the same size. The Hadamard product corresponds to multiplying each of the elements at the same position, that is, multiplying elements with the same color together. The result is a new matrix that is the same size as matrix X and Y as shown in the following figure: $$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \quad Y = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$$

$$X \odot Y = \begin{bmatrix} (1)\; 5 & (2)\; 6 \\ (3)\; 7 & (4) \;8 \end{bmatrix}$$

Matrix Tranpose

Neural networks frequently process weights and inputs of different sizes where the dimensions do not meet the requirements of matrix multiplication. Matrix transpose provides a way to “rotate” one of the matrices so that the operation complies with multiplication requirements and can continue. There are two steps to transpose a matrix:

Rotate the matrix right 90°

Reverse the order of elements in each row (e.g. [a b c] becomes [c b a])

As an example, transpose matrix M into T: $$\begin{bmatrix} a & b \\ c & d \\ e & f \\ \end{bmatrix} \quad \Rightarrow \quad \begin{bmatrix} a & c & e \\ b & d & f \\ \end{bmatrix}$$

Matrix Multiplication

Matrix multiplication is simply the dot product of rows of first matrix with columns of second matrix.

$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \times$ $\begin{bmatrix} b_{11} & b_{21} \\ b_{12} & b_{22} \end{bmatrix} =$ $\begin{bmatrix} a_{11}b_{11}+a_{12}b_{12} & a_{11}b_{21}+a_{12}b_{22} \\ a_{21}b_{11}+a_{22}b_{12} & a_{21}b_{21}+a_{22}b_{22}\end{bmatrix}$

In maths, matrices should have comparable size to multiply them.

$$C(m,k) = A(m,n) * B(n,k)$$$$C_{m,k}=\sum_{n} A_{m,n} \; B_{n,k}$$

$A$ has n columns and $B$ has n rows, and for each index of m and k for $C$ we sum the multiplication $A$ and $B$ at the nth row or column.

Rules

Not all matrices are eligible for multiplication. In addition, there is a requirement on the dimensions of the resulting matrix output.

The number of columns of the 1st matrix must equal the number of rows of the 2nd

The product of an M x N matrix and an N x K matrix is an M x K matrix. The new matrix takes the rows of the 1st and columns of the 2nd

Matrix Multiplication Notation

We know that:

$$C_{i,j}=\sum_{k} A_{i,k} \; B_{k,j}$$

$A$ has k columns and $B$ has k rows, and for each index of i and j for $C$ we sum the multiplication A and B at the kth row or column.

Let us take a look at one specific element in the product C=AB, namely the element on position (i,j), i.e. in the ith row and jth column.

To obtain this element, you multiply all elements of the ith row of A pairwise with all the elements of the jth column of B and you add these n products. You have to repeat this procedure for every element of C, but let's zoom in on that one (but arbitrary) specific element for now:

$$\begin{pmatrix} a_{11} &\ldots &a_{1n}\\ \vdots& \ddots &\vdots\\ \color{blue}{\mathbf{a_{i1}}} &\color{blue}{\rightarrow} &\color{blue}{\mathbf{a_{in}}}\\ \vdots& \ddots &\vdots\\ a_{m1} &\ldots &a_{mn} \end{pmatrix} \cdot \begin{pmatrix} b_{11}&\ldots &\color{red}{\mathbf{b_{1j}}} &\ldots &b_{1p}\\ \vdots& \ddots &\color{red}{\downarrow} & \ddots &\vdots\\ b_{n1}&\ldots &\color{red}{\mathbf{b_{nj}}}&\ldots &b_{np} \end{pmatrix}= \begin{pmatrix} c_{11}&\ldots& c_{1j} &\ldots &c_{1p}\\ \vdots& \ddots & & &\vdots\\ c_{i1}& & \color{purple}{\mathbf{c_{ij}}} & &c_{ip}\\ \vdots& & & \ddots &\vdots\\ c_{m1} &\ldots& c_{mj} &\ldots &c_{mp} \end{pmatrix}$$

with element $\color{purple}{\mathbf{c_{ij}}}$ equal to:

$$\mathbf{\color{purple}{c_{ij}} = \color{blue}{a_{i1}} \color{red}{b_{1j}} + \color{blue}{a_{i2}} \color{red}{b_{2j}} + \cdots + \color{blue}{a_{in}} \color{red}{b_{nj}}}$$

Now notice that in the sum above, the left outer index is always i (ith row of A) and the right outer index is always j (jth row of B). The inner indices run from 1 to n so you can introduce a summation index k and write this sum compactly using summation notation:

$$\color{purple}{\mathbf{c_{ij}}}=\sum_{k=1}^{n} \color{blue}{\mathbf{a_{ik}}}\color{red}{\mathbf{b_{kj}}}$$

The formule above thus gives you the element on position (i,j) in the product matrix C=AB and therefore completely defines C by letting i=1,...,m and j=1,...,p.

The illustration above should give you an idea of the general formula, but here's a concrete example where we took 3 \times 3 matrices for A and B and focus on the element on position (2,3):

$$\begin{pmatrix} a_{11} & a_{12} &a_{13}\\ \color{blue}{1} &\color{blue}{2} &\color{blue}{3}\\ a_{31} & a_{32} &a_{33} \end{pmatrix}\cdot \begin{pmatrix} b_{11}&b_{12} &\color{red}{6}\\ b_{21}&b_{22} &\color{red}{5}\\ b_{31}&b_{32} &\color{red}{4} \end{pmatrix}= \begin{pmatrix} c_{11}& c_{12} &c_{13}\\ c_{21}& c_{22} &\color{purple}{\mathbf{c_{23}}}\\ c_{31}& c_{32} &c_{33} \end{pmatrix}$$

with element $\color{purple}{\mathbf{c_{23}}}$ equal to:

$$\begin{array}{rccccccc} \color{purple}{c_{23}} & = & \color{blue}{a_{21}} \color{red}{b_{13}} &+& \color{blue}{a_{22}} \color{red}{b_{23}} &+& \color{blue}{a_{23}} \color{red}{b_{33}} & = & \displaystyle \sum_{k=1}^{3} \color{blue}{a_{2k}}\color{red}{b_{k3}} \\ & = & \color{blue}{1} \cdot \color{red}{6} &+& \color{blue}{2} \cdot \color{red}{5} &+& \color{blue}{3} \cdot \color{red}{4} \\[8pt] & = & 6&+&10&+&12 & =& 28 \end{array}$$

Vectorization/Matrix Representation

Most common form of data organization for machine vector organization for machine learning is a 2D array. And it is natural to think of each sample as a vector of attributes, and whole array as a matrix.

The system of linear equations:

$$w_{11}x_1 + w_{12} x_2 +···+ w_{1n} x_n = h_1, \\ w_{21} x_1 + w_{22} x_2 +···+ w_{2n} x_n = h_2,\\ . . . \\ w_{m1} x_1 + w_{m2}x_2 +···+ w_{mn}x_n = h_m$$

can be simply rewritten using vector and matrix symbols as a matrix equation.

$$wx = y$$

where

$$ w = \begin{bmatrix} w_{11} \; w_{12}\;...\;w_{1n} \\ w_{21} \; w_{22}\;...\;w_{2n} \\ \;\;\;\;\;\;...\\ w_{m1}\;w_{m2}\; ···\;w_{mn} \end{bmatrix} x = \begin{bmatrix} x_1\\ x_2\\ ...\\ x_n \end{bmatrix} y = \begin{bmatrix} h_1\\ h_2\\ ...\\ h_m \end{bmatrix} $$

Vectorization involves avoiding vanilla loops for multiplying matrices but instead just directly doing $w$ * $x$ using very efficient routine to multiply two matrices.

The matrix multiplication uses appropriate vectorized implementations. By organizing our parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.

In maths, matrices should have comparable size to multiply them. That’s why we need to transpose the weights.

If we have two functions, we can also organize their gradients into a matrix by stacking the gradients. When we do so, we get the Jacobian matrix (or just the Jacobian) where the gradients are rows:

$$J = \begin {bmatrix} \nabla f(x,y) \\ \nabla g(x,y) \end{bmatrix} = \begin {bmatrix} \frac{\partial \mathbf{f(x,y)}} {\partial \mathbf{x}} \; \frac{\partial \mathbf{f(x,y)}} {\partial \mathbf{y}} \\ \frac{\partial \mathbf{g(x,y)}} {\partial \mathbf{x}} \; \frac{\partial \mathbf{g(x,y)}} {\partial \mathbf{y}} \end {bmatrix}$$

Note that there are multiple ways to represent the Jacobian. We are using the so-called numerator layout but many papers and software will use the denominator layout. This is just transpose of the numerator layout Jacobian.

Artificial Neural Networks

Introduction

Deep learning algorithms (neural networks) in their simplest form are just a sequence of two operations composed to some depth:

  • A linear transformation (i.e. a matrix-vector multiplication)
  • Followed by the element-wise application of a reasonably well-behaved non-linear function (called the activation function).

Together the linear transformation and the non-linear function are called a "layer" of the network, and the composition of many layers forms a deep neural network. Both the depth and the non-linearity of deep neural networks are crucial for their generalizability and learning power.

Since we generally use "bias units" in the linear portion of a layer, every layer can actually perform affine transformations (linear transformation + a translation). We know from linear algebra that affine transformations can only uniformly stretch and shrink (scale), rotate, sheer and translate vectors in a vector space. If we think of some data of interest as a point cloud in some N-dimensional space, then if it is non-random data, it will have some sort of shape.

The activation function ReLU is just max(0,x), and has two "pieces," the flat line for when x<0 and the sloped line for when x≥0.

Concept of Single Neuron

A single artificial neuron is a dot product between w and x, to which a bias is added, the result is the passed to an activation function that will add some non-linearity.

$$Y = W.X + B\\$$$$Output = Weight \times Input + Bias\\$$$$Single \;Neuron \;Output = Activation \;[ Weight \times Input + Bias ]\\$$

The artificial neuron can have multiple inputs.

$$Single \;Neuron \;Output = Activation \;[\;( W_1\cdot X_1 + W_2\cdot X_2 + W_3\cdot X_3 \dots ) + Bias ]\\$$$$Single \;Neuron \;Output = Activation\; (\;\sum_{i=0}^{n} W_i\cdot X_i + Bias )\\$$

The activation function depending on the threshold value, activates or deactivates that particular neuron. Thus the neuron can be represented as follows.

Activation functions are nonlinear, continuous functions and bounded.

  • Nonlinear means that the output of the function varies nonlinearly with the input
  • Continuity of the functions implies that there are no sharp peaks or gaps in the function, so that they can be differentiated throughout, making it possible to implement the delta rule to adjust both input-hidden and hidden-output layer weights in backpropagation of errors
  • The term “bounded” means that the output never reaches very large values, regardless of the input.

Dot product in ANNs

Dot products are used in neural nets work, and are conceptually related to above concept using scalars.

A single unit in a typical neural net receives inputs $\{x_1, \dots, x_n\}$ from other units, and produces an output y. To compute the output, we multiply each input by a corresponding weight $\{w_1, \dots, w_n\}$. The weights determine the strength of the connection from each input. We sum the weighted inputs to obtain the total amount of input, then add a bias term b. The final output is obtained by running this sum through an activation function f, which describes the way that the unit responds to the total input. The activation function is typically nonlinear, e.g. a sigmoidal or rectified linear function. So we have:

$$y = f \left ( \sum_{i=1}^n w_i x_i + b \right )$$

The weighted sum can be re-written as a dot product, which is more convenient notation, and can be computed more efficiently. Let the vector $x = [x_1, \dots, x_n]$ contain the inputs, and the vector $w = [w_1, \dots, w_n]$ contain the corresponding weights. By the definition of the dot product:

$$\sum_{i=1}^n w_i x_i = w \cdot x$$

Plug this back into the equation for the output:

$$y = f \left ( w \cdot x + b \right )$$

In practice, we wouldn't compute the outputs one by one, but we would compute it for an entire layer of the neural net simultaneously. This would use matrix multiplication (called vectorization) rather than individual dot products, which can be implemented more efficiently using numerical linear algebra libraries.

Layers of Neurons

A single neuron cannot do much work alone. Many neurons can be connected together forming a neural network.

In a neural network a collection of neurons is called a layer. Various connected layers of neurons form the neural network.

The output of each layer is treated as set of features and these features are further multiplied by weights as they move into the next hidden layer.

We can interpret these weights as forming templates of the output classes. They form representations of the objects they are trained on, and these representations are useful for complex classification or prediction.

Neural networks as classified as either shallow neural networks or deep neural networks. Deep neural networks have more than two hidden layers.

Activation functions can be applied to whole of the layers as well to each weighted sum in the neuron.

Input neurons are however many needed to accept the input data, neurons in output layer are however many needed by output.

Matrix dimensions and Calculations

Consider the following network. How do we represent in matrix form and calculate the correct dimensions?

Feature arrangement

There are two types we can arrange the data in a matrix form:

1). Arrange features as columns

For example, a matrix with n-features and 1-sample looks like:

$$ [x_1^1\; x_1^2\; x_1^3\; x_1^4\; \cdots \; \cdots \;x_1^n\; ] $$

Thus, a matrix with n-features and n-sample looks like:

2). Arrange features as rows

For example, a matrix with n-features and 1-sample looks like:

$$ \begin{bmatrix} x_1^1\; \\ x_1^2\; \\ x_1^3\; \\ x_1^4\; \\ \cdots \; \\ \cdots \;\\ x_1^n\; \end{bmatrix} $$

Thus, a matrix with n-features and n-sample looks like:

The order of operations depends on how we arrange the data.

If the data is arranged as features as columns: $$X*W + B$$

If the data is arranged as features as rows: $$W*X + B$$

Where, $W$ = Weight Matrix, $X$ = Input Matrix, $B$ = Bias Matrix

Dimensions of Matrices

Given the above deep Neural Network with 2-Hidden layers where:

  • Layer 0 has 4 inputs and 6 outputs
  • Layer 1 has 6 inputs and 6 outputs
  • Layer 2 has 6 inputs and 2 outputs
  • Layer 3 has 2 inputs and 2 outputs

The dimensions of matrices can be calculated as:

(a) When features arranged as columns:

Basic Equation: $$ Y = X \cdot W + B $$

For weights: $$ W_{nodes \;in \;layer\;(n-1)\; \times \;nodes \;in \;layer \;(n) }$$

For bias: $$ B_{1\; \times \; nodes \; in\; layer\; (n)} $$

Where $n$ = 1, 2, 3, …., output layer number

(b) When features arranged as rows:

Basic Equation: $$ Y = W \cdot X + B $$

For weights: $$ W_{nodes \;in \;layer\;(n)\; \times \;nodes \;in \;layer \;(n-1) }$$

For bias: $$ B_{1\; \times \; nodes \; in\; layer\; (n)} $$

Where $n$ = 1, 2, 3, …., output layer number

We can see that dimensions of W & B of “Features arranged as rows” is just the Transpose of W & B of “Features arranged as columns”.

Note: Using Features arranged as columns makes perfect sense. Because, By default W is a vector of weights and in maths a vector is considered a column, not a row.

Layer-wise arrangement of Matrices

We consider the features arranged as columns for input and output dimensions for calculating the dimensions of the matrices in the above example.

At a time, computer can only use 1 sample per thread/core. So, if we have 1024 cores(GPU), then it can use 1 sample x 1024 nos simultaneously.

Dimensions of Matrices for Layer— 1:

$$ X_{1 \times 4} \cdot W_{4 \times 6} + B_{1 \times 6} = Y_{1\times 6}$$

Dimensions of Matrices for Layer— 2:

$$ X_{1 \times 6} \cdot W_{6 \times 6} + B_{1 \times 6} = Y_{1\times 6}$$

Dimensions of Matrices for Layer— 3:

$$ X_{1 \times 6} \cdot W_{6 \times 2} + B_{1 \times 2} = Y_{1\times 2}$$

Finally we get the output $Y$ with $1\times2$ dimensions

Thus for Layer— 1 with 6-hidden neurons:

$$ [x_1\; x_2\; x_3\; x_4\; ] \times \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} & w_{1,5} & w_{1,6}\;\\ w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} & w_{2,5} & w_{2,6}\;\\ w_{3,1} & w_{3,2} & w_{3,3} & w_{3,4} & w_{3,5} & w_{3,6}\;\\ w_{4,1} & w_{4,2} & w_{4,3} & w_{4,4} & w_{4,5} & w_{4,6}\;\\ \end{bmatrix} = \\$$$$ \begin{bmatrix} x_1 \times w_{1,1}+ x_2 \times w_{2,1} + x_3 \times w_{3,1} + x_4 \times w_{4,1} & x_1 \times w_{1,2}+ x_2 \times w_{2,2} + x_3 \times w_{3,2} + x_4 \times w_{4,2} \dots \end{bmatrix} \\$$$$= \begin{bmatrix} h_1 & h_2 & h_3 & h_4 & h_5 & h_6 & \end{bmatrix} + \begin{bmatrix} b_1 & b_2 & b_3 & b_4 & b_5 & b_6 \end{bmatrix} = Y_{1 \times 6} $$

Each column above corresponds to output at a hidden neuron.

Matrix multiplication can be done faster on both CPU (e.g. BLAS) and GPU (e.g. cuBLAS).

Matrix multiplication with numpy is much faster than Python loops.

Intitution for Matrix Calculations

The following figure demonstrates how real values represented in matrix form travel through the various layers.

Graph Theory

Computational graph theory is a “language” describing a function where:

Node: variable (scalar, vector, tensor, or other type)

Edge: operation (simple function of one or more variables, functions more complex obtained by composing operations)

Computational Graph of $xy$ -- If variable $z$ is computed by applying some operation to variable $x$ and $y$ then we can draw the computational graph of $ z=xy$ as in figure (a).

Computational Graph of Logistic Regression -- For expression $y = \sigma (x^T w+b)$ variables in graph $u^{(1)}$ and $u^{(2)}$ are not in original expression, but are needed in graph.

Computational Graph of Logistic Regression -- The expression $H=max$ { $0,XW+b$ } computes a design matrix of Rectified linear unit activations $H$ given design matrix of minibatch of inputs $X$.

Backpropogration using Graph Theory

Multi-variable functions are generally of the form $$f_1(f_2(f_3(f_4(x)))) $$

If we were required to find $\frac {\partial f_1} {\partial x}$ then it can be found as:

$$\frac {\partial f_1} {\partial x} = \frac {\partial f_1} {\partial f_2} \cdot \frac {\partial f_2} {\partial f_3} \cdot \frac {\partial f_3} {\partial f_4} \cdot \frac {\partial f_4} {\partial x}$$

The above equation can be understood by applying chain rule in mathematics. As can be seen above the function $f_1$ doesnt directly change based on $x$ but propogates slowly first changing with $f_2$ and then subsequently $f_2$ changing with respect to $f_3$ and so on and so forth.

Backpropagation, is essentially the chain rule of calculus applied to computational graphs.

$$\\ \huge \xrightarrow{x} \bigcirc \xrightarrow{A} \bigcirc \xrightarrow{B} \bigcirc \xrightarrow{C} \bigcirc \xrightarrow{y}$$

Say we wanted to find the partial derivative of the variable $y$ with respect to $x$ in the figure above. We can’t find that out directly because there are 3 other variables involved in the computational graph.

Thus,

$$ \frac{\partial y} {\partial x} = \frac {\partial y} {\partial C} \cdot \frac {\partial C} {\partial B} \cdot \frac {\partial B} {\partial A} \cdot \frac {\partial A} {\partial x}$$

Intuitively for neural networks this can be understood as follows using a water pipeline analogy where the final water flow is dependent on various valves at previous locations.

$$\require {enclosed} \fbox{input water flow} \xrightarrow{} \bigotimes \xrightarrow{} \bigotimes \xrightarrow{} \bigotimes \xrightarrow{} \bigotimes \xrightarrow{} \fbox{output water flow}$$

Partial Derivatives using Graph Theory (Pointwise)

To calculate a derivative of node $a$ w.r.t. node $b$ we following rules:

  1. Find an unvisited path from $a$ to $b$
  2. Multiply all edge values along the above path
  3. Add to the resulting derivative and repeat the process till all paths are exhausted

Thus for the above function:

$$\frac {\partial p}{\partial x_1} = \frac {\partial p}{\partial z_1} \frac {\partial z_1}{\partial x_1} + \frac {\partial p}{\partial z_2} \frac {\partial z_2}{\partial x_1}$$$$\frac {\partial p}{\partial x_2} = \frac {\partial p}{\partial z_1} \frac {\partial z_1}{\partial x_2} + \frac {\partial p}{\partial z_2} \frac {\partial z_2}{\partial x_2}$$

The above is called the chain rule.

For every additional layers/composite functions we follow a very similar process.

Thus:

$$\frac {\partial p}{\partial x_1} = \frac {\partial p}{\partial h_1} \frac {\partial h_1}{\partial z_1} \frac {\partial z_1}{\partial x_1} + \frac {\partial p}{\partial h_1} \frac {\partial h_1}{\partial z_2} \frac {\partial z_2}{\partial x_1} + \frac {\partial p}{\partial h_2} \frac {\partial h_2}{\partial z_1} \frac {\partial z_1}{\partial x_1} + \frac {\partial p}{\partial h_2} \frac {\partial h_2}{\partial z_2} \frac {\partial z_2}{\partial x_1}$$

Dimension Balancing for Vectorized Values

Supposedly we have 3 inputs and 2 hidden neurons with linear activation, and no bias term and 1 output.

Using features represented in the columns, and the samples in the rows:

$$ \begin{bmatrix} x_{1} & x_{2} & x_{3} \end{bmatrix}\cdot \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} = (z_{1} \; z_{2})$$

which is equivalent to:

$$ z_{1} = x_{1}w_{11} + x_{2}w_{21} + x_{3}w_{31}$$$$ z_{2} = x_{1}w_{12} + x_{2}w_{22} + x_{3}w_{32}$$

and is written as: $$ x\cdot w =z $$

  • Matrix multiplication can be done faster on both CPU (e.g. BLAS) and GPU (e.g. cuBLAS).

  • Matrix multiplication with numpy is much faster than Python loops.

  • L is the scalar loss and we need to find the loss with respect to all the weights as that gives the partial derivatives required for the gradient descent algorithm for weight updates. Thus

$$\nabla L = (\frac {\partial{L}}{\partial w_1},\frac {\partial{L}}{\partial w_2}, \cdots, \frac {\partial{L}}{\partial w_l})$$

And in matrix form it can be expressed as:

$$\frac {\partial L}{\partial w} = \begin{bmatrix} \frac {\partial{L}} {\partial w_{11}} & \frac {\partial{L}}{\partial w_{12}} \\ \frac {\partial{L}}{\partial w_{21}} & \frac {\partial{L}}{\partial w_{22}} \\ \frac {\partial{L}}{\partial w_{31}} & \frac {\partial{L}}{\partial w_{32}} \end{bmatrix}$$

As per chain rule:

$$ \frac {\partial L}{\partial w_{11}} = \frac {\partial L}{\partial z_{1}} \frac {\partial z_1}{\partial w_{11}}$$

$$\frac {\partial L}{\partial w_{12}} = \frac {\partial L}{\partial z_{2}} \frac {\partial z_2}{\partial w_{12}}$$

$$\frac {\partial L}{\partial w_{21}} = \frac {\partial L}{\partial z_{1}} \frac {\partial z_1}{\partial w_{21}}$$$$\frac {\partial L}{\partial w_{22}} = \frac {\partial L}{\partial z_{2}} \frac {\partial z_2}{\partial w_{22}}$$$$ \frac {\partial L}{\partial w_{31}} = \frac {\partial L}{\partial z_{1}} \frac {\partial z_1}{\partial w_{31}}$$$$\frac {\partial L}{\partial w_{32}} = \frac {\partial L}{\partial z_{2}} \frac {\partial z_2}{\partial w_{32}}$$

And

$$ \frac {\partial z_1}{\partial w_{11}} = \frac {\partial({x_1}w_{11}+x_2w_{21}+x_3w_{31})}{\partial w_{11}} = x_1$$$$ \frac {\partial z_2}{\partial w_{12}} = \frac {\partial (x_{1}w_{12} + x_{2}w_{22} + x_{3}w_{32})}{\partial w_{12}} = x_1$$$$ \frac {\partial z_1}{\partial w_{21}} = \frac {\partial({x_1}w_{11}+x_2w_{21}+x_3w_{31})}{\partial w_{21}} = x_2$$$$ \frac {\partial z_2}{\partial w_{22}} = \frac {\partial (x_{1}w_{12} + x_{2}w_{22} + x_{3}w_{32})}{\partial w_{22}} = x_2$$$$ \frac {\partial z_1}{\partial w_{31}} = \frac {\partial({x_1}w_{11}+x_2w_{21}+x_3w_{31})}{\partial w_{31}} = x_3$$$$ \frac {\partial z_2}{\partial w_{32}} = \frac {\partial (x_{1}w_{12} + x_{2}w_{22} + x_{3}w_{32})}{\partial w_{32}} = x_3$$$$\frac {\partial L}{\partial w} = \begin{bmatrix} \frac {\partial{L}} {\partial z_{1}}x_1 & \frac {\partial{L}}{\partial z_{2}} x_1 \\ \frac {\partial{L}}{\partial z_{1}} x_2& \frac {\partial{L}}{\partial z_{2}}x_2 \\ \frac {\partial{L}}{\partial z_{1}}x_3 & \frac {\partial{L}}{\partial z_{2}}x_3 \end{bmatrix}$$$$\frac {\partial L}{\partial w} = \begin{bmatrix} x_1 \\ x_2 \\x_3 \end{bmatrix} \cdot\begin{bmatrix} \frac{\partial L}{\partial z_1}& \frac{\partial L}{\partial z_2}\end{bmatrix} = x^T \cdot \frac{\partial L}{\partial z}$$

The reason for doing the transpose is simply to make the dimensions work out, that matrix needs to be transposed in order to guarantee that multiplying it with the error vector afterwards results in the correct dimensionality.

For Batch of 2 Samples:

$$ \begin{bmatrix} x_{11} & x_{12} & x_{13} \\ x_{21} & x_{22} & x_{23} \end{bmatrix}\cdot \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ w_{31} & w_{32} \end{bmatrix} = \begin{bmatrix} z_{11} & z_{12} \\ z_{21} & z_{22} \end{bmatrix}$$

which is equivalent to:

$$ z_{11} = x_{11}w_{11} + x_{12}w_{21} + x_{13}w_{31}$$$$ z_{12} = x_{11}w_{12} + x_{12}w_{22} + x_{13}w_{32}$$$$ z_{21} = x_{21}w_{11} + x_{22}w_{21} + x_{23}w_{31}$$$$ z_{22} = x_{21}w_{12} + x_{22}w_{22} + x_{23}w_{32}$$

Thus:

$$\frac {\partial L_b}{\partial w} = \begin{bmatrix} x_{11} & x_{21}\\ x_{12} & x_{22} \\x_{13} & x_{23} \end{bmatrix} \cdot\begin{bmatrix} \frac{\partial L}{\partial z_{11}}& \frac{\partial L}{\partial z_{12}} \\ \frac{\partial L}{\partial z_{21}}& \frac{\partial L}{\partial z_{22}} \end{bmatrix} = x^T \cdot \frac{\partial L}{\partial z}$$

The Calculation Loop

Input

Forward Propogation (Calcuate prediction values)

Loss/Cost Function

Backward Step (Calculate Derivatives)

Weight Update (Gradient Descent)

Solving for ANNs

We will use a fully-connected ReLU network as our running example. The network will have a single hidden layer, i.e, a two-layer network and will be trained with gradient descent to fit random data by minimizing the Euclidean distance between the network output and the true output.

$$\begin{bmatrix} x_{1} & x_{2} \end{bmatrix}_{1,2} \cdot \begin{bmatrix} w_{11} & w_{12} \\ w_{13} & w_{14} \end{bmatrix}_{2,2}\to \begin{bmatrix} h_{1} & h_{2} \end{bmatrix}_{1,2} \to \begin{bmatrix} h_{relu1} & h_{relu2} \end{bmatrix}_{1,2} \cdot \begin{bmatrix} w_{21} \\ w_{22} \end{bmatrix}_{2,1} \to y_{pred} \to {Loss}$$

Initialize Weights:

x = np.random.randn(1, 2)

y = np.random.randn(1, 1)

w1 = np.random.randn(2, 2)

w2 = np.random.randn(2, 1)

The forward propogation:

$$h = x \cdot w1$$$$h_{relu} = max \;of \;(h,0)$$$$ y_{pred} = h_{relu} \cdot w2$$

where:

$$ h_1 = x_1w_{11}+x_2w_{13}$$$$ h_2 = x_1w_{12}+x_2w_{14}$$$$h_{relu1} = \sigma(h_1)$$$$h_{relu2} = \sigma(h_2)$$$$y_{pred} = h_{relu1} w_{21} + h_{relu2} w_{22}$$

Cost function: $$ Loss = \frac{1}{\;T} \sum_{i=1}^{T} (y_{pred} - y)^2 $$

Backpropogation:

$$ grad\_y_{pred} = (\frac {\partial L}{\partial y_{pred}})_{ij} = 2 (y_{pred_{ij}} -y_{ij})$$

We need to find all partial derivatives of loss with respect to weights:

$$\frac {\partial L}{\partial w{1}} = \begin{bmatrix} \frac {\partial{L}} {\partial w_{11}} & \frac {\partial{L}}{\partial w_{12}} \\ \frac {\partial{L}}{\partial w_{13}} & \frac {\partial{L}}{\partial w_{14}}\end{bmatrix}$$$$\frac {\partial L}{\partial w{2}} = \begin{bmatrix} \frac {\partial{L}} {\partial w_{21}} \\ \frac {\partial{L}}{\partial w_{22}}\end{bmatrix}$$

Where: $$ \frac {\partial L}{\partial w_{21}}= \frac{\partial L}{\partial y_{pred}} \frac{\partial y_{pred}}{\partial w_{21}}$$

$$ \frac {\partial L}{\partial w_{22}}= \frac{\partial L}{\partial y_{pred}} \frac{\partial y_{pred}}{\partial w_{22}}$$$$\frac {\partial L}{\partial w{2}} = \frac{\partial L}{\partial y_{pred}} \begin{bmatrix} \frac {\partial{ y_{pred}}} {\partial w_{21}} \\ \frac {\partial{ y_{pred}}}{\partial w_{22}}\end{bmatrix}$$$$\frac{ \partial y_{pred}} {\partial w_{21}} = \frac {\partial (h_{relu1} w_{21}) + \partial (h_{relu2} w_{22}) } {\partial w_{21}} = h_{relu1}$$$$\frac{ \partial y_{pred}} {\partial w_{22}} = \frac {\partial ( h_{relu1} w_{21}) + \partial (h_{relu2} w_{22}) } {\partial w_{22}} = h_{relu2}$$$$\frac {\partial L}{\partial w{2}} = \frac{\partial L}{\partial y_{pred}} \begin{bmatrix} h_{relu1} \\ h_{relu2} \end{bmatrix} = \frac{\partial L}{\partial y_{pred}} \begin{bmatrix} h_{relu1} & h_{relu2} \end{bmatrix}^T =\frac{\partial L}{\partial y_{pred}}h_{relu}^T$$

Since 1x1 matrix cannot be multiplied by 2x1 matrix and the expected weight matrix is 2x1 we do dimensionality correction:

$$ \frac {\partial L}{\partial w{2}}= h_{relu}^T \frac{\partial L}{\partial y_{pred}} = h_{relu}^T \cdot grad\_y_{pred}$$

Dimension balancing is the “cheap” but efficient approach to gradient calculations in most practical settings.

Also:

$$ \frac {\partial L}{\partial w_{11}}= \frac{\partial L}{\partial y_{pred}} \frac{\partial y_{pred}}{\partial h_{relu1}} \frac{\partial h_{relu1}}{\partial h_1} \frac{\partial h_1}{\partial w_{11}} = \frac{\partial L}{\partial y_{pred}} \;w_{21}\; \frac{\partial h_{relu1} }{\partial h_1}\; x_1$$$$ \frac {\partial L}{\partial w_{12}}= \frac{\partial L}{\partial y_{pred}} \frac{\partial y_{pred}}{\partial h_{relu2}} \frac{\partial h_{relu2}}{\partial h_2} \frac{\partial h_2}{\partial w_{12}} = \frac{\partial L}{\partial y_{pred}} \;w_{22}\; \frac{\partial h_{relu2} }{\partial h_2}\; x_1$$$$ \frac {\partial L}{\partial w_{13}}= \frac{\partial L}{\partial y_{pred}} \frac{\partial y_{pred}}{\partial h_{relu1}} \frac{\partial h_{relu1}}{\partial h_1} \frac{\partial h_1}{\partial w_{13}} = \frac{\partial L}{\partial y_{pred}} \;w_{21}\; \frac{\partial h_{relu1} }{\partial h_1}\; x_2$$$$ \frac {\partial L}{\partial w_{14}}= \frac{\partial L}{\partial y_{pred}} \frac{\partial y_{pred}}{\partial h_{relu2}} \frac{\partial h_{relu2}}{\partial h_2} \frac{\partial h_2}{\partial w_{14}} = \frac{\partial L}{\partial y_{pred}} \;w_{22}\; \frac{\partial h_{relu2} }{\partial h_2}\; x_2$$$$\frac {\partial L}{\partial w{1}} = \begin{bmatrix} \frac{\partial L}{\partial y_{pred}}w_{21}\frac {\partial h_{relu1}} {\partial h_1} x_1 & \frac{\partial L}{\partial y_{pred}}w_{22}\frac {\partial h_{relu2}} {\partial h_2} x_1 \\ \frac{\partial L}{\partial y_{pred}}w_{21}\frac {\partial h_{relu1}} {\partial h_1} x_2 & \frac{\partial L}{\partial y_{pred}}w_{22}\frac {\partial h_{relu2}} {\partial h_2} x_2\end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \cdot \begin{bmatrix} \frac{\partial L}{\partial y_{pred}}w_{21}\frac {\partial h_{relu1}} {\partial h_1} & \frac{\partial L}{\partial y_{pred}}w_{22}\frac {\partial h_{relu2}} {\partial h_2} \\ \frac{\partial L}{\partial y_{pred}}w_{21}\frac {\partial h_{relu1}} {\partial h_1} & \frac{\partial L}{\partial y_{pred}}w_{22}\frac {\partial h_{relu2}} {\partial h_2} \end{bmatrix}$$$$ =\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \cdot \frac{\partial L}{\partial y_{pred}} \cdot \begin{bmatrix} w_{21}\frac {\partial h_{relu1}} {\partial h_1} & w_{22}\frac {\partial h_{relu2}} {\partial h_2} \\ \ w_{21}\frac {\partial h_{relu1}} {\partial h_1} & w_{22}\frac {\partial h_{relu2}} {\partial h_2} \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \cdot \frac{\partial L}{\partial y_{pred}} \cdot \begin{bmatrix} w_{21}\frac {\partial h_{relu1}} {\partial h_1} & w_{22}\frac {\partial h_{relu2}} {\partial h_2} \end{bmatrix}$$

Note that derivative of a ReLU function is 0 when h < 0 and 1 when h > 0.

$$ ReLU' (h) = \begin{cases} 1 & h > 0\\ 0 & h \le 0 \end{cases}$$

For the case where the derivative of ReLu function is 1 (for all h >0) the partial derivative of $\partial L$ with respect to $\partial w1$ becomes:

$$\frac {\partial L}{\partial w{1}} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\frac{\partial L}{\partial y_{pred}} \begin{bmatrix} w_{21} & w_{22} \end{bmatrix} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}_{2,1} \cdot {grad\_y_{pred}}_{1,1} \cdot \begin{bmatrix} w_{21} & w_{22} \end{bmatrix}_{1,2} $$

Since the expected weight matrix is a 2x2 matrix, we do dimensionality correction by changing the multiplication order:

$$\frac {\partial L}{\partial w{1}} = \begin{bmatrix} x_1 & x_2 \end{bmatrix}^T_{2,1} \cdot {grad\_y_{pred}}_{1,1} \cdot \begin{bmatrix} w_{21} \\ w_{22} \end{bmatrix}^T_{1,2} = x^T_{2,1} \cdot {grad\_y_{pred}}_{1,1} \cdot W2^T_{1,2}$$

However not all the values of h are greater than 0 and for such cases, we need consider the case where the matrix h is less than 0 and make all such values tend to zero so that weight matrix has been correctly multiplied by zeros.

Update weights:

$$w1 -= learning\_rate * \frac {\partial L}{\partial w{1}} $$$$w2 -= learning\_rate * \frac {\partial L}{\partial w{2}} $$

Let us consider a larger case where: $$ Input\; x_{1,1000}\; with \;1000 \;features \\ Output \; y_{1,10}\;with\;10\;outputs$$

Additionally there is the hidden layer with 100 neurons. We consider 64-batch of samples as inputs.

Initialize matrices with correct dimensions and random values:

$$ x_{64,1000} $$$$ w1_{1000,100}$$$$ w2_{100,10}$$$$ y_{64,10} $$

Forward propogation:

$$ h_{64,100} = x_{64,1000} \cdot w1_{1000,100} $$$$ h\_relu_{64,100} = max(h_{64,100},0) $$$$ y_{64,10} = h\_relu_{64,100} \cdot w2_{100,10}$$

Cost:

We calculate cost.

Backpropogation:

$$\frac {\partial L}{\partial w{2}} =(\frac{\partial L}{\partial y_{pred}})_{64,10} \cdot h\_{relu}_{100,64}^T$$

$$ \frac {\partial L}{\partial w{2}}= h\_{relu}_{100,64}^T \cdot (\frac{\partial L}{\partial y_{pred}})_{64,10} = h\_{relu}_{100,64}^T \cdot grad\_y\_{pred}_{64,10}$$

When all the values of h>0 then all the relu derivatives are 1 hence:

$$\frac {\partial L}{\partial w{1}} = x^T_{1000,64} \cdot grad\_y\_{pred}_{64,10} \cdot W2^T_{10,100}$$

However this is not the case when some values for h<0 are zero. Hence we need to take $h_{64,100}$ and check for the condition where h<0 and make it zero and pass it as an index to $w2_{10,100}^T$.

However for fancy indexing to work, the dimensions must match, hence pass it on as an index to the product $$grad\_y\_{pred}_{64,10} \cdot W2^T_{10,100} = grad\_h\_relu_{64,100} $$

Thus

$$grad\_h\_relu_{64,100} (h_{64,100} < 0) = 0$$

Weights:

We finally update weights.

Example1


In [0]:
import numpy as np

# Create random input and output data
x = np.random.randn(1, 2)
y = np.random.randn(1, 1)

# Randomly initialize weights
w1 = np.random.randn(2, 2)
w2 = np.random.randn(2, 1)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
   
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

Example2


In [0]:
import numpy as np

# N is batch size(sample size); D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 4, 3, 2, 1

# Create random input and output data
x = np.array([[0, 0, 1], 
              [0, 1, 1], 
              [1, 0, 1], 
              [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 0.02
loss_col = []
for t in range(200):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0) #- 0.001*np.maximum(-h, 0)  # using ReLU as activate function
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum() # loss function
    loss_col.append(loss)
    print(t, " : ", loss)
    print(y_pred)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y) # the last layer's error
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T) # the second layer's error
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0 # the derivate of ReLU
    print("Activated Hidden Layer Error: ")
    print(grad_h_relu)
    print("Unactivated Hidden Layer (Inputs x Weights): ")
    print(h)
    print("Derived Error of Unactivated Hidden Layer: ")
    print(grad_h)
    grad_w1 = x.T.dot(grad_h)
    print("Delta Input to Hidden Weights:")
    print(grad_w1)
    print()

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

Example3


In [0]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)  #64x1000
y = np.random.randn(N, D_out) #64x10

# Randomly initialize weights
w1 = np.random.randn(D_in, H)  #1000 x 100
w2 = np.random.randn(H, D_out) #100 x 10

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1) 
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

Deep Learning Hardware

Tensor Cores and their associated data paths are custom-crafted to dramatically increase floating-point compute throughput at only modest area and power costs. Clock gating is used extensively to maximize power savings.

Each Tensor Core provides a 4x4x4 matrix processing array which performs the operation $\textbf{D} = \textbf{A} \times \textbf{B} + \textbf{C}$, where $\textbf{A}, \textbf{B}, \textbf{C}$, and $\textbf{D}$ are 4×4 matrices as Figure below shows. The matrix multiply inputs $\textbf{A}$ and $\textbf{B}$ are FP16 matrices, while the accumulation matrices $\textbf{C}$ and $\textbf{D}$ may be FP16 or FP32 matrices.

Each Tensor Core performs 64 floating point FMA mixed-precision operations per clock (FP16 input multiply with full-precision product and FP32 accumulate, as Figure 8 shows) and 8 Tensor Cores in an SM perform a total of 1024 floating point operations per clock. Tensor Cores operate on FP16 input data with FP32 accumulation. The FP16 multiply results in a full precision result that is accumulated in FP32 operations with the other products in a given dot product for a 4x4x4 matrix multiply, as Figure below shows

Homework

Solve for ANN by including the bias term and write the python code.