We've covered a lot of fundamentals in the last 4 posts about Linear Regression and in this post we will cover another important idea, "Vectorization". In this context we will rewrite the equations for Linear Regression in matrix/vector form and derive the direct/exact solution to find the model parameters by solving a simple matrix equation. We will also generalize the linear regression problem to multiple feature variables, **Multi-Variate Linear Regression**. This is a natural extension from expressing the equations in matrix form.

In the paragraph above I used the word

"simple"in regard to the matrix equations. I should qualify that!Everything is simple when you know how to do it!This post is a bit more advanced than the previous posts since it is using some linear algebra.Linear algebrais one of the important corner-stones of understanding machine learning, and actually, is a corner-stone of nearly all numerical computing.

The basic mathematics prerequisites for understanding Machine Learning are Calculus-I,II,III, Linear Algebra, and, Probability and Statistics. A computer science student that is interested in Machine Learning would be well advised to get a minor in Mathematics (or just get a degree in Mathematics instead!) You don't need to know these subjects at the level of a mathematician rather more like an engineer. This is especially true for linear algebra. Linear algebra is a great first course for a theorem proving course (real mathematics). What is needed for machine learning and a computer scientist is more like "applied numerical linear algebra".

You don't need to understand all of the math used in this post! The main take-aways are the finial equations and the "idea" of rewriting formulas in matrix vector form. You can think of vectors and matrices as "programming data structures" and then realize how they are used by libraries like numpy.

If you don't have much mathematics background don't be intimidated by the statements above! It's not that hard and you can learn a lot by just following along as best you can. I wont explain everything in detail but will try to give insight into what's going on.

For background the previous 4 posts contain a lots of good stuff!

- Linear Regression Part1: Introduction
- Linear Regression Part2: Getting and Evaluating Data
- Linear Regression Part3: Model and Cost Function
- Linear Regression Part 4: Parameter Optimization by Gradient Descent

These posts along with the current one were converted to html from Jupyter notebooks. The notebooks are available at https://github.com/dbkinghorn/blog-jupyter-notebooks

Here is very brief (and inadequate!) introduction to some of the structures and operations in linear algebra.

A scalar is just a single number. For example, $$ \alpha = 2 $$

A vector is a column array of scalars. It has 1 index ($a_1$ is the first element of $a$, -- or maybe that could be $a_0$!)

$$ a = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} $$where $a_1$ and $a_2$ are scalars.

A matrix is a 2 dimensional array. The following matrix has 2 rows and 2 columns. You can think of the columns as vectors.

$$ A = \begin{bmatrix} a_{11} & a_{12}\\ a_{21} & a_{22} \end{bmatrix} $$A matrix has 2 index values ($i,j$). The "1,2 element" of $A$ is $a_{12}$"

We would say the the matrix $A$ is a "2 by 2" or $2 \times 2$ matrix. In general it could be $n \times m$ i.e. $n$ by $m$, where are $n$ and $m$ are scalars. The vector $a$ is "a 2 vector" i.e. has length 2. You could also think of it as a 2 by 1 matrix.

The transpose of a vector or matrix has it's "indices switched". It is usuall written as a superscript $'$ or $T$. For example,

$$ a' = \begin{bmatrix} a_1 & a_2 \end{bmatrix} $$and

$$ A' = \begin{bmatrix} a_{11} & a_{21}\\ a_{12} & a_{22} \end{bmatrix} $$The elements around the diagonal of $A$ are switched. In general $A' \ne A$.

Scalars , vectors and matrices can be multiplied.

$$ \alpha a' = \begin{bmatrix} \alpha a_1 & \alpha a_2 \end{bmatrix} $$Matrices and vectors are multiplied so that each element of the result is a "row times column with the result summed". The following vector product is also called a vector "dot" product

$$ a' b = \begin{bmatrix} a_{1} & a_{2} \end{bmatrix}\begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} = a_1 b_1 + a_2 b_2$$The "1 by 2" row vector $a$ times the "2 by 1" column vector $b$ is the "1 by 1" scalar $a_1 b_1 + a_2 b_2$.

$$ A B = \begin{bmatrix} a_{11} & a_{21}\\ a_{12} & a_{22} \end{bmatrix} \begin{bmatrix} b_{11} & b_{21}\\ b_{12} & b_{22} \end{bmatrix} = \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{bmatrix} $$For matrices think "row times column summed" for each element of the result matrix.

When you perform operations with vectors and matrices you need to keep their "dimensions" in mind. For conventional matrix multiplication **the inner dimensions must be conformable**. For example you can multiply a $1 \times 2$ vector with a $2 \times 3$ matrix and the result would be a $ 1 \times 3$ dimensional vector. You can multiply a matrix or vector by a scalar element-wise and you can add a scalar to a matrix or vector element-wise. You can also add 2 vectors or matrices element-wise if they have the same dimensions.

You have to be careful when using matrices and vectors. It's always good to check your work by looking at the dimensions of all operations to be sure you haven't forgotten a transpose or something. A couple of other characteristics to be aware of is that matrix multiplication is not commutative, that is, in general $AB \neq BA$ and the transpose or inverse of a product may not be what you think, $(AB)' = B'A'$ and $(AB)^{-1} = B^{-1}A^{-1}$

That is not enough explanation! If you understand the above you are set, if not, then you might want to look at some of the great Wikipedia pages.

We have been thinking of our model function in two ways. One was as our "predictor equation",

$$ h(x) = a_0 + a_1 x$$where, given the optimal values for $a_0, a_1$ we can use $h(x)$ as a predictor for the value of a house of size $x$ sqft. The other way we were looking at $h(x)$ was in terms of our data set,

$$h_a(x^{(i)}) = a_0 + a_1 x^{(i)} $$where we were considering the $i^{th}$ element in our housing data set. Recall we use $m$ for the number of elements in our dataset so the complete column of values $x$ (and thus $h(x)$) are vectors with $m$ elements.

We can consider $h(x)$ as a vector valued function of the complete column of $m$ numbers that are house sizes in our dataset i.e. $x$ is a vector of house sizes. We had also introduced $a$ as being a 2 vector containing the parameters $a_0, a_1$. So, without using any special change in notation we can think of $h$ as a vector function of $x$.

$$ h_a(x) = a_0 + a_1 x = \begin{bmatrix} 1 & x^{(1)} \\ 1 & x^{(2)} \\ \vdots & \vdots \\ 1 & x^{(m)} \end{bmatrix} \begin{bmatrix} a_{0} \\ a_{1} \end{bmatrix} = \begin{bmatrix} a_0 + a_1 x^{(1)} \\ a_0 + a_1 x^{(2)} \\ \vdots \\ a_0 + a_1 x^{(m)} \end{bmatrix} = X a $$I have introduced the **"augmented matrix" $X$** that has a column of 1's to take care of the constant term $a_0$. The model function $h$ on the full set of training data is now just a simple matrix vector multiplication!

We originally wrote our cost function as the sum of squared errors between the model function and the dataset values for the house prices $y$.

$$ J(a_0, a_1) = \frac{1}{2m} \sum_{i=1}^{m}((a_0 + a_1x^{(i)}) - y^{(i)})^2 $$A sum of squares is know as a "quadratic form" and we can write it in matrix form using the vector expression for $h_a(X)$ and the full column vector of house prices $y$. This will replace the summation, $\Sigma$, with matrix/vector multiplication.

Note, the 3 terms are just quadratic forms, they are scalers i.e. just numbers. That means that each of those terms is equal to it's transpose. (just in case you derive it an wonder where $-2y'Xa$ came from).

In the last post (part4) in order to find the optimal parameters $a$ we needed the gradient of $J(a)$ when we used the gradient descent algorithm. We also need the gradient (or derivative) to find the "matrix" solution to the equations. **(I will do the derivation here but, this would be considered advanced even though it is a relatively easy problem ... I just happen to know matrix differential calculus :-)**

The differential is,

$$ \begin{align} d[J(a)] & = \frac{1}{2m} ( da'[X'Xa] + [a'X'X]da - 2da'[X'y] ) \\ \\ & = \frac{1}{2m} ( 2[a'X'X]da - 2[y'X]da ) \\ \\ & = \frac{1}{m} ([a'X'X -y'X]da) \end{align}$$This implies that the derivative is (the stuff in front of $da$ ... a row vector)

$$ \frac{\partial J}{\partial a'} = \frac{1}{m} (a'X'X -y'X)$$and the gradient is ( the transpose of the derivative ... a column vector)

We could use that matrix form of the gradient vector in an optimization calculation, like gradient descent, to find the best values of $a$. We can also now find the solution without doing an optimization!

The cost function $J(a)$ is a quadratic function (it's bowl shaped). That means it is "convex" and that means that it has a single minimum and that minimum is the global minimum i.e. the "lowest point of the function". At the minimum the gradient is equal to zero. We have the gradient so we can set it equal to zero and solve for $a$!

$$ \nabla_aJ(a) = \frac{1}{m} (X'Xa - X'y) = 0 $$implies that,

$$ X'Xa = X'y $$applying the inverse of $X'X$ to both sides gives,

$$ (X'X)^{-1}X'Xa = (X'X)^{-1}X'y $$therefore we have,

**That's it!** That's the matrix solution to find the least squares linear regression parameters $a$.

The term

$$ (X'X)^{-1}X' $$is known as the **Moore-Penrose (pseudo) inverse of $X$** and is sometimes written as $X^+$. Most numerical programing libraries have an efficient function to find that inverse. It's often something like **pinv(X)**.

Now that we have the regression equations in matrix form it is trivial to extend linear regression to the case where we have more than one feature variable in our model function. For 1 feature our model was a straight line. For 2 features the model would be a flat plane. For more than 2 features the model would be a "hyper-plane" (higher dimensional planer surface). It is much more difficult to visualize multi-variate linear regression but all of the characteristics of the simple 1 variable case remain. The multi-variate extension can be very powerful for complicated models!

Generalizing the matrix/vector equations the the case where we have $n$ feature variables would look like the following.

$$ \begin{align} h_a(X) & = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \\ \\ & = \begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & \ldots & x^{(1)}_n \\ 1 & x^{(2)}_1 & x^{(2)}_2 & \ldots & x^{(2)}_n \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(m)}_1 & x^{(m)}_2 & \ldots & x^{(m)}_n \end{bmatrix} \begin{bmatrix} a_{0} \\ a_{1} \\a_{2} \\ \vdots \\ a_{n} \end{bmatrix} \\ \\ \end{align} $$**These formulas are the same as for the single feature variable versions!** For the multi-variate version the matrix $X$ has a column for each feature and the vector $a$ has a parameter for each feature (plus the constant term $a_0$).

```
In [1]:
```import pandas as pd # data handeling
import numpy as np # numeriacal computing
import matplotlib.pyplot as plt # plotting core
import seaborn as sns # higher level plotting tools
%matplotlib inline
sns.set()

```
In [2]:
```df_98039 = pd.read_csv("df_98039.csv")
x = df_98039['sqft_living']
y = df_98039['price']
m = len(x) # Number of data pairs

```
In [3]:
```x_mean = x.mean()
x_std = x.std()
xs = (x - x_mean)/x_std
y_mean = y.mean()
y_std = y.std()
ys = (y - y_mean)/y_std

```
In [4]:
```Xs = np.column_stack( (np.ones((m,1)) , xs ) ) # Construct the augmented matrix X
a = np.dot( np.linalg.pinv(Xs), ys) # a = pinv(X)*y
print( 'a0 = %.6f, a1 = %.6f ' %(a[0],a[1]) ) # print the values of a

```
```

```
In [5]:
```X = np.column_stack( (np.ones((m,1)) , x ) ) # Construct the augmented matrix X
a = np.dot( np.linalg.pinv(X), y) # a = pinv(X)*y
print( 'a0 = %.6f, a1 = %.6f ' %(a[0],a[1]) ) # print the values of a

```
```

Those are the values we obtained with the back transformation in the last post. Doing mean normalization is not as important when using the matrix solution.

Here's a plot of the data with those parameters for a.

```
In [6]:
```fig, ax = plt.subplots()
ax.plot(x,y, 'o', color='g', label='training data')
ax.plot(x, a[0] + a[1]*x, label='h(x) = %.2f + %.2f x' %(a[0], a[1]))
plt.legend();

```
```

To finish lets check the cost of an 8000 sqft house;

```
In [7]:
```def h(x,a0,a1) : # The model function h
h = a0 + a1*x
return h

```
In [8]:
```print( '${:,.2f}'.format( h(8000,a[0],a[1]) ) )

```
```

Indeed! I still can't afford it!

I hope the value of converting problems to matrix form is clear from this post. I feel it is much easier to work with matrix equations and it makes implementing efficient code straight forward. It's generally the case that having formulas in matrix form and using optimized mathematical libraries will give excellent performance. In anaconda Python numpy is compiled against Intel's excellent MKL (Math Kernel Library) this gives Python the potential for some of the performance capability of code built with lower level compiled compiled languages like C/C++ or Fortran.

In the next post I'll finish up Linear Regression with a look at using non-linear feature variables (like polynomial regression). This will be a chance to illustrate the very important problem of over/under fitting data.

**Happy computing! --dbk**