Before learning about regression and classification, we will review key mathematical concepts from linear algebra and calculus. These fundamentals will be helpful for understanding some of the theoretical material in the next few guides. Additionally, we will introduce Numpy.
We will be working a lot with Numpy, a Python library which adds support for large vectors and matrices, along with fast and efficient implementations of many important mathematical functions that operate on them.
Numpy is a very large library with many convenient functions, and a full review of them is beyond the scope of this chapter. We will introduce relevant Numpy functions in future guides as we need them. A good (non-comprehensive) overview can also be found here. If you are unfamiliar with Numpy, it will be very helpful to go through the exercises in that tutorial, which also contains a short Python review.
In the following section, we will briefly review some of the most common operations, focusing on the ones that will help us in the next section.
To start, import numpy (frequently imported as np to make calls shorter).
In [1]:
import numpy as np
Numpy has many convenience functions for generating arrays of numbers. For example, to create an array of all integers from 0 up to (but not including) 10:
In [2]:
np.arange(0, 10)
Out[2]:
The numpy.linspace function gives you n evenly spaced numbers between two endpoints (inclusive of both).
In [3]:
np.linspace(0, 10, 8)
Out[3]:
In [4]:
np.array([2, 3, 1])
Out[4]:
When two vectors of equal length are added, the elements are added point-wise.
$$ \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} + \begin{bmatrix} 0 \\ 2 \\ -2 \end{bmatrix} = \begin{bmatrix} 2 \\ 5 \\ -1 \end{bmatrix} $$
In [5]:
a = np.array([2, 3, 1])
b = np.array([0, 2, -2])
c = a + b
print(c)
A vector can be multiplied element-wise by a number (called a "scalar"). For example:
$$ 3 \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 6 \\ 9 \\ 3 \end{bmatrix} $$
In [6]:
3 * np.array([2,3,1])
Out[6]:
The dot product is defined as the sum of the element-wise products of two equal-length vectors. For two vectors $a$ and $b$, it is denoted as $a \cdot b$ or as $a b^T$ (where $T$ refers to the transpose operation, introduced further down this notebook).
$$ \begin{bmatrix} 1 & -2 & 2 \end{bmatrix} \begin{bmatrix} 0 \\ 2 \\ 3 \end{bmatrix} = 2 $$In other words, it's:
$$ 1 \cdot 0 + -2 \cdot 2 + 2 \cdot 3 = 2 $$This can be calculated with the numpy.dot function:
In [7]:
a = np.array([1,-2,2])
b = np.array([0,2,3])
c = np.dot(a, b)
print(c)
Or the shorter way:
In [8]:
c = a.dot(b)
print(c)
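For 1-D numpy arrays, Python's @ (matrix multiplication) operator computes the same dot product. A quick check, redefining the vectors for completeness:
a = np.array([1, -2, 2])
b = np.array([0, 2, 3])
print(a @ b)  # same result as np.dot(a, b): 2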
A matrix is a rectangular array of numbers. For example, consider the following 2x3 matrix:
$$ \begin{bmatrix} 2 & 3 & 1 \\ 0 & 4 & -2 \end{bmatrix} $$Note that we always denote the size of the matrix as rows x columns, so a 2x3 matrix has two rows and three columns.
Numpy can create matrices from normal Python lists using numpy.matrix. For example:
In [9]:
np.matrix([[2,3,1],[0, 4,-2]])
Out[9]:
To instantiate a matrix of all zeros:
In [10]:
np.zeros((3, 3))
Out[10]:
To instantiate a matrix of all ones:
In [11]:
np.ones((2, 2))
Out[11]:
In linear algebra, a square matrix whose elements are all zeros, except the diagonals, which are ones, is called an "identity matrix."
For example:
$$ \mathbf I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} $$is a 3x3 identity matrix. The reason it is called an identity matrix is that it is analogous to multiplying a scalar by 1: a matrix multiplied by an identity matrix is unchanged.
$$ \mathbf I v = v $$To instantiate an identity matrix, use numpy.eye. For example:
In [12]:
np.eye(3)
Out[12]:
Notice that when you multiply an identity matrix by another matrix, the result is the same as the original matrix, and this holds in either order. In effect, multiplying by the identity matrix is like multiplying by $1$.
In [13]:
M = np.matrix([[9,5,6],[-1,0,5],[-2,4,2]])
I = np.eye(3)
print("original matrix = \n", M)
M2 = I * M
print("I * M = \n", M2)
M3 = M * I
print("M * I = \n", M3)
To instantiate a matrix of random elements (between 0 and 1), you can use numpy.random:
In [14]:
A = np.random.random((2, 3))
print(A)
Transposition flips a matrix over its diagonal, swapping its rows and columns. So the element at i,j in the transposed matrix is equal to the element at j,i in the original. The transpose of a matrix $A$ is denoted as $A^T$.
In [15]:
A_transpose = np.transpose(A)
print(A_transpose)
It can also be done with the shorthand .T, as in:
In [16]:
A_transpose = A.T
print(A_transpose)
Like regular vectors, matrices are added point-wise (or element-wise) and must be of the same size. So for example:
$$ \begin{bmatrix} 4 & 3 \\ 3 & -1 \\ -2 & 1 \end{bmatrix} + \begin{bmatrix} -2 & 1 \\ 5 & 3 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 4 \\ 8 & 2 \\ -1 & 1 \end{bmatrix} $$
In [17]:
a = np.matrix([[4, 3],[3,-1],[-2,1]])
b = np.matrix([[-2, 1],[5,3],[1,0]])
c = a + b
print(c)
Also like vectors, matrices can be multiplied element-wise by a scalar.
$$ -2 \begin{bmatrix} 1 & -2 & 0 \\ 6 & 4 & -2 \end{bmatrix} = \begin{bmatrix} -2 & 4 & 0 \\ -12 & -8 & 4 \end{bmatrix} $$
In [18]:
a = np.matrix([[1,-2,0],[6,4,-2]])
-2 * a
Out[18]:
To multiply two matrices together, each element of the product is the dot product of a row of the first matrix with a column of the second. So in order to multiply matrices $A$ and $B$ together, as in $C = A \cdot B$, $A$ must have the same number of columns as $B$ has rows. For example:
$$ \begin{bmatrix} 1 & -2 & 0 \\ 6 & 4 & -2 \end{bmatrix} * \begin{bmatrix} 4 & -1 \\ 0 & -2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 4 & 3 \\ 22 & -20 \end{bmatrix} $$
In [19]:
a = np.matrix([[1,-2,0],[6,4,-2]])
b = np.matrix([[4,-1],[0,-2],[1,3]])
c = a * b
print(c)
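Note that * performs matrix multiplication here only because a and b were created with numpy.matrix. If you build the same matrices with numpy.array, use numpy.dot or the @ operator for the matrix product instead; a minimal sketch:
a = np.array([[1, -2, 0], [6, 4, -2]])
b = np.array([[4, -1], [0, -2], [1, 3]])
print(np.dot(a, b))  # same matrix product as above
print(a @ b)         # the @ operator is equivalent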
The Hadamard product of two matrices differs from ordinary matrix multiplication in that it is simply the element-wise product of the two matrices.
$$ \mathbf A \odot B = \begin{bmatrix} A_{1,1} B_{1,1} & \dots & A_{1,n} B_{1,n} \\ \vdots & \dots & \vdots \\ A_{m,1} B_{m,1} & \dots & A_{m,n} B_{m,n} \end{bmatrix} $$So for example:
$$ \begin{bmatrix} 3 & 1 \\ 0 & 5 \end{bmatrix} \odot \begin{bmatrix} -2 & 4 \\ 1 & -2 \end{bmatrix} = \begin{bmatrix} -6 & 4 \\ 0 & -10 \end{bmatrix} $$To calculate this with numpy, instantiate the matrices with numpy.array instead of numpy.matrix; for arrays, numpy.multiply (and the * operator) performs element-wise multiplication:
In [20]:
a = np.array([[3,1],[0,5]])
b = np.array([[-2,4],[1,-2]])
np.multiply(a,b)
Out[20]:
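Since these were created as numpy.array objects, the * operator gives the same element-wise result; a quick check, redefining the arrays for completeness:
a = np.array([[3, 1], [0, 5]])
b = np.array([[-2, 4], [1, -2]])
print(a * b)  # identical to np.multiply(a, b)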
A function is an expression whose value depends on one or more variables. For example:
$$ f(x) = 3x^2 - 5x + 9 $$At $x=2$, for instance, $f(2)=11$. We will be encountering functions constantly; a neural network is one very big function.
With functions, in machine learning, we often make a distinction between "variables" and "parameters". The variable is the part of the equation which can vary and on which the output depends; the above function depends on $x$. The coefficients in the above function (3, -5, 9) are sometimes called parameters because they characterize the shape of the function but are held fixed.
In [21]:
def f(x):
    return 3*(x**2) - 5*x + 9
f(2)
Out[21]:
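To make the variable/parameter distinction explicit, we could also write the coefficients as arguments that we hold fixed (a hypothetical rewrite, just for illustration):
def f(x, a=3, b=-5, c=9):
    # x is the variable; a, b, c are the (fixed) parameters
    return a*(x**2) + b*x + c

print(f(2))  # 11, same as before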
The derivative of a function $f(x)$ is the instantaneous slope of the function at a given point, and is denoted as $f^\prime(x)$.
$$f^\prime(x) = \lim_{\Delta x\to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} $$The derivative of $f$ with respect to $x$ can also be denoted as $\frac{df}{dx}$.
The derivative can be interpreted as the slope of the function at a given point; as $\Delta x$ approaches 0, the limit above converges upon the true slope at that point.
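We can check this convergence numerically. Using $f(x) = 3x^2 - 5x + 9$ from above, whose true slope at $x=2$ turns out to be $7$ (as the power rule introduced next will confirm), a small illustrative sketch:
def f(x):
    return 3*(x**2) - 5*x + 9

# finite-difference approximation of the slope at x = 2
for dx in [1.0, 0.1, 0.01, 0.001]:
    slope = (f(2 + dx) - f(2)) / dx
    print(dx, slope)  # approaches the true slope, 7, as dx shrinks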
The derivative of a simple power function is given by the power rule:
$$ f(x) = a x ^ b $$$$ \frac{df}{dx} = b a x^{b-1} $$For example, let:
$$ f(x) = -2 x^3 $$then:
$$ \frac{df}{dx} = -6 x^2 $$
In [22]:
def f(x):
    return -2*(x**3)

def f_deriv(x):
    return -6*(x**2)
print(f(2))
print(f_deriv(2))
The derivative of any constant is 0. To see why, let:
$$ f(x) = C $$Then:
$$ f^\prime(x) = \lim_{\Delta x\to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \\ f^\prime(x) = \lim_{\Delta x\to 0} \frac{C - C}{\Delta x} \\ f^\prime(x) = \lim_{\Delta x\to 0} \frac{0}{\Delta x} \\ f^\prime(x) = 0 $$Differentiation is also additive: the derivative of a sum is the sum of the derivatives. In other words, let $g$ and $h$ be functions. Then:
$$ \frac{d}{dx}(g + h) = \frac{dg}{dx} + \frac{dh}{dx} $$
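As a quick numerical sanity check of this sum rule, here is a minimal sketch using two simple power functions:
def g(x):
    return -2*(x**3)        # g'(x) = -6x^2

def h(x):
    return 3*(x**2)         # h'(x) = 6x

# finite-difference derivative of (g + h) at x = 2,
# compared against the sum of the individual derivatives
dx = 1e-6
approx = ((g(2 + dx) + h(2 + dx)) - (g(2) + h(2))) / dx
print(approx)               # approximately -12
print(-6*(2**2) + 6*2)      # g'(2) + h'(2) = -24 + 12 = -12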
Similarly, constants can be factored out of derivatives, using the following property:
$$ \frac{d}{dx}(C f(x)) = C \frac{df}{dx} $$Functions can be composites of multiple functions. For example, consider the function:
$$ f(x) = (4x-5)^3 $$This function can be broken down by letting:
$$ h(x) = 4x-5 \\ g(x) = x^3 \\ f(x) = g(h(x)) $$The chain rule states that the derivative of a composite function $g(h(x))$ is:
$$ f^\prime(x) = g^\prime(h(x)) h^\prime(x) $$Another way of expressing this is:
$$ \frac{df}{dx} = \frac{dg}{dh} \frac{dh}{dx} $$Since $g$ and $h$ are both simple polynomials, we can easily calculate that:
$$ g^\prime(x) = 3x^2 \\ h^\prime(x) = 4 $$and therefore:
$$ f^\prime(x) = g^\prime(h(x)) h^\prime(x) \\ f^\prime(x) = g^\prime(4x-5) \cdot 4 \\ f^\prime(x) = 3 \cdot (4x-5)^2 \cdot 4 \\ f^\prime(x) = 12 \cdot (4x-5)^2 $$The chain rule is extremely important to the study of neural networks, because it is what allows us to find the derivative of the network's cost function analytically. We will see more about this in the next notebook.
In [23]:
def h(x):
    return 4*x - 5

def g(x):
    return x**3

def f(x):
    return g(h(x))

def h_deriv(x):
    return 4

def g_deriv(x):
    return 3*(x**2)

def f_deriv(x):
    return g_deriv(h(x)) * h_deriv(x)
In [24]:
f(4)
Out[24]:
In [25]:
f_deriv(2)
Out[25]:
A function may have more than one variable. For example:
$$ f(X) = w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + w_n x_n + b $$or using sum notation:
$$ f(X) = b + \sum_i w_i x_i $$One useful trick to simplify this formula is to append a $1$ to the input vector $X$, so that:
$$ X = \begin{bmatrix} x_1 & x_2 & ... & x_n & 1 \end{bmatrix} $$and let $b$ just be an element in the weights vector, so:
$$ W = \begin{bmatrix} w_1 & w_2 & ... & w_n & b \end{bmatrix} $$So then we can rewrite the function as:
$$ f(X) = W X^T $$A partial derivative of a multivariable function is the derivative of the function with respect to just one of the variables, holding all the others constant.
The partial derivative of $f$ with respect to $x_i$ is denoted as $\frac{\partial f}{\partial x_i}$.
The gradient of a function is the vector containing each of its partial derivatives, evaluated at a given point.
$$ \nabla f(X) = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n} \right] $$We will look more closely at the gradient later when we get into how neural networks are trained.
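Here is a minimal sketch of the bias trick and the gradient for the linear function above, using made-up weights and inputs (the values are arbitrary, just for illustration). For this $f$, the partial derivative with respect to $x_i$ is simply $w_i$, so the gradient is just the weight vector:
w = np.array([2.0, -1.0, 0.5])   # weights w_1..w_3
b = 4.0                          # bias
x = np.array([1.0, 3.0, -2.0])   # inputs x_1..x_3

# the bias trick: append a 1 to X and fold b into the weight vector
X = np.append(x, 1.0)            # [x_1, x_2, x_3, 1]
W = np.append(w, b)              # [w_1, w_2, w_3, b]

print(np.dot(W, X))              # w_1*x_1 + w_2*x_2 + w_3*x_3 + b = 2.0
print(w)                         # gradient of f with respect to [x_1, x_2, x_3]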
We will also use the matplotlib library to plot functions and data. For example, to plot $y = \sin(x)$:
In [27]:
import matplotlib.pyplot as plt
X = np.arange(-5, 5, 0.1)
Y = np.sin(X)
# make the figure
plt.figure(figsize=(6,6))
plt.plot(X, Y)
plt.xlabel('x')
plt.ylabel('y = sin(x)')
plt.title('My plot title')
Out[27]: