Linear Regression (OLS)

Key Equation: $X\beta \approx y ~~ \text{where} ~~ X ~ \text{is} ~ n \times (p+1)$

Linear regression by Ordinary Least Squares (OLS) is the most basic form of supervised learning. We have a target variable ($y$) and we want to establish a linear relationship between it and a set of features ($x_1, x_2, x_3, \ldots$).

Let's take a simple example to illustrate this problem:

We have price ('000 INR) and mileage (kmpl) for 7 hatchback cars as below

price = [199 , 248 , 302 , 363 , 418 , 462 , 523 ]
kmpl  = [23.9, 22.7, 21.1, 20.5, 19.8, 20.4, 18.6]

We want to predict the target variable price, given the input variable kmpl


In [3]:
import numpy as np

In [20]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)

import ipywidgets as widgets
from ipywidgets import interact, interactive

In [5]:
price = np.array([199, 248, 302, 363, 418, 462, 523])

In [6]:
kmpl = np.array([23.9, 22.7, 21.1, 20.5, 19.8, 20.4, 18.6])

In [7]:
plt.scatter(kmpl, price, s = 150)
plt.xlabel('kmpl')
plt.ylabel('price')


Out[7]:
<matplotlib.text.Text at 0x112a3b898>

Thinking Linear Algebra Way

The basic problem in linear regression is solving $n$ linear equations in $p + 1$ unknowns, where $p + 1 < n$.

So a linear relationship can be written as:

$$ price = \beta_{0} + \beta_{1} kmpl $$

We have added an intercept to the equation, so that the line does not need to pass through the origin.

So we are trying to solve these $n = 7$ equations for the $p + 1 = 2$ unknowns:

$$ 199 = \beta_{0} + \beta_{1} \, 23.9 ~~~~ \text{(eq 1)} $$
$$ 248 = \beta_{0} + \beta_{1} \, 22.7 ~~~~ \text{(eq 2)} $$
$$ 302 = \beta_{0} + \beta_{1} \, 21.1 ~~~~ \text{(eq 3)} $$
$$ 363 = \beta_{0} + \beta_{1} \, 20.5 ~~~~ \text{(eq 4)} $$
$$ 418 = \beta_{0} + \beta_{1} \, 19.8 ~~~~ \text{(eq 5)} $$
$$ 462 = \beta_{0} + \beta_{1} \, 20.4 ~~~~ \text{(eq 6)} $$
$$ 523 = \beta_{0} + \beta_{1} \, 18.6 ~~~~ \text{(eq 7)} $$

The key thing to remember here is that we are solving for $\beta_{0}$ and $\beta_{1}$.

Now if we plot these equations as lines in the $(\beta_{0}, \beta_{1})$ plane, it is clear that there is no single point of intersection, as there would be if we had only 2 equations.


In [8]:
b0 = np.arange(-500,4000, 100)

for i in range(7):
    b1 = (price[i] - b0)/kmpl[i]
    plt.plot(b0, b1, linewidth = 1)
    plt.text(b0[-10], b1[-10], 'eq %s'% (i + 1), fontsize = 8 )

plt.axhline(0, color='grey', linewidth=2)
plt.axvline(0, color='grey', linewidth=2)

plt.xlabel('beta0')
plt.ylabel('beta1')

plt.ylim(-150,50)


Out[8]:
(-150, 50)

Now we don't have an exact solution. But we can see that $\beta_{0}$ lies roughly in $[1500, 1700]$ and $\beta_{1}$ roughly in $[-70, -50]$. So one possible line is

$$ price = 1600 - 60 * kmpl $$

But we can clearly see that this is probably not the best possible line!!
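
To make "not the best" concrete, we can measure how far the guessed line is from the data. A minimal sketch (the variable names here are illustrative, and it reuses the price and kmpl arrays defined above) computing the mean squared residual for the guess $\beta_0 = 1600$, $\beta_1 = -60$:


In [ ]:
# Sketch: how good is the guessed line price = 1600 - 60 * kmpl ?
guess = 1600 - 60 * kmpl        # predicted prices under the guessed line
residuals = price - guess       # distance of each car from the line
np.mean(residuals ** 2)         # OLS will find the beta that makes this as small as possible

Trying a few nearby values of $\beta_0$ and $\beta_1$ and comparing this number is exactly what the sliders below let you do by eye.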


In [26]:
beta_0_list = widgets.IntSlider(min=1500, max=1700, step=10, value=1600)
beta_1_list = widgets.IntSlider(min=-70, max=-50, step=2, value=-60)

In [14]:
beta_0 = 1600 
beta_1 = -60

In [15]:
def plot_line(beta_0, beta_1):

    plt.scatter(kmpl, price, s = 150)
    plt.xlabel('kmpl')
    plt.ylabel('price')
    y = beta_0 + beta_1 * kmpl
    plt.plot(kmpl, y, '-')

Let's change the values of beta_0 and beta_1 and see if we can find the best line by hand.


In [27]:
interactive(plot_line, beta_0 = beta_0_list, beta_1 = beta_1_list )


Adding Error Term

The linear relationship therefore needs to be modeled with an error term $\epsilon_{i}$, an unobserved random variable that adds noise to the linear relationship between the target variable and the input variables.

If we have $p$ input variables then, for observation $i$,

$$ y_{i} = \beta_{0} + \sum_{j=1}^p \beta_{j} x_{ij} + \epsilon_{i} $$

We can add $x_{i0} = 1$ to the equation, absorbing the intercept into the sum:

$$ y_{i} = \sum_{j=0}^p \beta_{j} x_{ij} + \epsilon_{i} $$
$$ y_{i} = x_{i}^T \beta + \epsilon_{i} $$
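
As a quick numerical sketch of what absorbing the intercept looks like (the names X_demo and beta_demo are illustrative, reusing the kmpl array above), prepending a column of ones lets a single dot product produce $\beta_0 + \beta_1 \, kmpl$:


In [ ]:
# Sketch: prepend a column of ones (x0 = 1) so the intercept is just another coefficient
X_demo = np.c_[np.ones(len(kmpl)), kmpl]   # shape (7, 2)
beta_demo = np.array([1600, -60])          # the guessed [beta_0, beta_1]
X_demo @ beta_demo                         # identical to 1600 - 60 * kmpl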

In [8]:
plt.scatter(kmpl, price, s = 150)
plt.xlabel('kmpl')
plt.ylabel('price')
y = 1600 - 60 * kmpl                           # predicted prices from the guessed line
yerrL = y - price                              # vertical gap between the line and each data point
yerrB = y - y                                  # zero-length bar on the other side
plt.errorbar(kmpl,y, fmt = 'o', yerr= [yerrL, yerrB], c= 'r')
plt.plot(kmpl, y,linewidth = 2)


Out[8]:
[<matplotlib.lines.Line2D at 0x1104c98d0>]

Represent Matrix Way

If we write this in matrix form

$$ y = X\beta + \epsilon $$
$$ \text{where} ~~~~ X = \begin{bmatrix} - x_{1}^T- \\ - x_{2}^T- \\ \vdots \\ - x_{n}^T- \end{bmatrix} ~~ \text{,} ~~ y = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} ~~ \text{and} ~~ \epsilon = \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{bmatrix} $$

For our specific example, the matrix looks like:

$$ \begin{bmatrix}199 \\ 248 \\ 302 \\ 363 \\ 418 \\ 462 \\ 523 \end{bmatrix} = \begin{bmatrix} 1 & 23.9 \\ 1 & 22.7 \\ 1 & 21.1 \\ 1 & 20.5 \\ 1 & 19.8 \\ 1 & 20.4 \\ 1 & 18.6 \end{bmatrix} \begin{bmatrix} \beta_{0} \\ \beta_{1} \end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \\ \epsilon_{6} \\ \epsilon_{7} \end{bmatrix} $$

Minimize Error - Ordinary Least Squares

The error we will aim to minimize is the squared error:

$$ E(\beta)= \frac {1}{n} \sum _{i=1}^{n}(\epsilon_{i})^2 $$

This is why this technique is called Ordinary Least Squares (OLS) regression.

$$ E(\beta)= \frac {1}{n} \sum _{i=1}^{n}(y_{i}-x_{i}^{T}\beta)^{2} $$

which in matrix way is equal to:

$$ E(\beta)= \frac {1}{n} (y-X\beta)^{T}(y-X\beta) $$
$$ E(\beta)= \frac {1}{n} (y^{T} - \beta^{T}X^{T})(y-X\beta) $$
$$ E(\beta)= \frac {1}{n} (y^{T}y - \beta^{T}X^{T}y - y^{T}X\beta + \beta^{T}X^{T}X\beta) $$

Now, $ y^{T}X\beta = (\beta^{T}X^{T}y)^T $ is a $1 \times 1$ scalar, so it is equal to its own transpose, and hence $ y^{T}X\beta = \beta^{T}X^{T}y $.

$$ E(\beta)= \frac {1}{n} (y^{T}y - 2\beta^{T}X^{T}y + \beta^{T}X^{T}X\beta) $$
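
Before minimizing this analytically, here is a small sketch of $E(\beta)$ in code (the function name and local variables are illustrative; it builds its own design matrix from the kmpl and price arrays above, independently of the X matrix constructed later):


In [ ]:
# Sketch: E(beta) = (1/n) * ||y - X beta||^2 as a function of (beta_0, beta_1)
def squared_error(beta_0, beta_1):
    X_local = np.c_[np.ones(len(kmpl)), kmpl]
    resid = price - X_local @ np.array([beta_0, beta_1])
    return np.mean(resid ** 2)

squared_error(1600, -60)   # error of the hand-picked guess

Minimizing this function over $\beta_0$ and $\beta_1$ is exactly what the derivation below does in closed form.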

To find the minimum of this error function, we differentiate with respect to $\beta$ and set the gradient to zero.

$$ \nabla E(\beta) = 0 $$
$$ \nabla E(\beta) ={\frac {dE(\beta)}{d\beta}} = {\frac {d}{d\beta}}{\bigg (}{ \frac {1}{n} ||y - X\beta||}^2{\bigg )} = 0 $$
$$ {\frac {d}{d\beta}}{\bigg (}{ y^{T}y - 2\beta^{T}X^{T}y + \beta^{T}X^{T}X\beta}{\bigg )} = 0 $$
$$ - 2 X^Ty + 2X^{T}X\beta = 0 $$
$$ X^T X\beta = X^T y $$

(The constant factor $\frac{1}{n}$ does not change where the minimum is, so it is dropped.)

So the solution to OLS:

$$ \beta = X^†y ~~ \text{where} ~~ X^† = (X^T X)^{-1} X^T $$
$$ X^† ~~ \text{is the pseudo-inverse of} ~~ X $$
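
Before forming the pseudo-inverse explicitly, note that the normal equations $X^T X \beta = X^T y$ can also be solved directly with NumPy. A minimal sketch (local variable names are illustrative; it builds its own design matrix, while the pseudo-inverse route is worked out step by step in the cells below):


In [ ]:
# Sketch: solve the normal equations X^T X beta = X^T y directly
X_ne = np.c_[np.ones(len(kmpl)), kmpl]
beta_ne = np.linalg.solve(X_ne.T @ X_ne, X_ne.T @ price)
print(beta_ne)      # should be close to [1662.56, -62.06]

# np.linalg.lstsq solves the same least-squares problem without forming X^T X
beta_lstsq, *_ = np.linalg.lstsq(X_ne, price, rcond=None)
print(beta_lstsq)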

Calculate Pseudo Inverse

$$ X^† = (X^T X)^{−1} X^T $$

$X^†$, the pseudo-inverse of $X$, has good properties. Tracking the matrix dimensions:

$$ X^† = \left( \begin{bmatrix} ~ \\ (p+1) \times n \\ ~ \end{bmatrix} \begin{bmatrix} ~ \\ n \times (p+1) \\ ~ \end{bmatrix} \right)^{-1} \begin{bmatrix} ~ \\ (p+1) \times n \\ ~ \end{bmatrix} $$
$$ X^† = \left( \begin{bmatrix} ~ \\ (p+1) \times (p+1) \\ ~ \end{bmatrix} \right)^{-1} \begin{bmatrix} ~ \\ (p+1) \times n \\ ~ \end{bmatrix} $$
$$ X^† = \begin{bmatrix} ~ \\ (p+1) \times n \\ ~ \end{bmatrix} $$
$$ X^†_{(p+1) \times n} = {\big(X^T_{(p+1) \times n} ~ X_{n \times (p+1)}\big)}^{-1} ~ X^T_{(p+1) \times n} $$

In [9]:
n = 7

In [10]:
x0 = np.ones(n)
x0


Out[10]:
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.])

In [11]:
x1 = kmpl
x1


Out[11]:
array([ 23.9,  22.7,  21.1,  20.5,  19.8,  20.4,  18.6])

In [12]:
# Create the X matrix
X = np.c_[x0, x1]
X = np.asmatrix(X)
X


Out[12]:
matrix([[  1. ,  23.9],
        [  1. ,  22.7],
        [  1. ,  21.1],
        [  1. ,  20.5],
        [  1. ,  19.8],
        [  1. ,  20.4],
        [  1. ,  18.6]])

In [13]:
# Create the y matrix
y = np.asmatrix(price.reshape(-1,1))
y


Out[13]:
matrix([[199],
        [248],
        [302],
        [363],
        [418],
        [462],
        [523]])

In [14]:
y.shape


Out[14]:
(7, 1)

In [15]:
X_T = np.transpose(X)
X_T


Out[15]:
matrix([[  1. ,   1. ,   1. ,   1. ,   1. ,   1. ,   1. ],
        [ 23.9,  22.7,  21.1,  20.5,  19.8,  20.4,  18.6]])

In [16]:
X_T * X


Out[16]:
matrix([[    7.  ,   147.  ],
        [  147.  ,  3106.12]])

In [17]:
X_pseudo = np.linalg.inv(X_T * X) * X_T
X_pseudo


Out[17]:
matrix([[-3.0422893 , -1.72429767,  0.03302451,  0.69202032,  1.46084877,
          0.80185296,  2.77884041],
        [ 0.15167364,  0.08891213,  0.00523013, -0.02615063, -0.06276151,
         -0.03138075, -0.12552301]])

In [18]:
beta = X_pseudo * y
beta


Out[18]:
matrix([[ 1662.55977286],
        [  -62.06066946]])
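
As a cross-check (a sketch, not part of the derivation above), NumPy's built-in np.linalg.pinv computes the same pseudo-inverse, so it should reproduce the same coefficients:


In [ ]:
# Cross-check: NumPy's built-in pseudo-inverse gives the same beta
# (X and y are the matrices built in the cells above)
beta_check = np.linalg.pinv(X) * y
beta_check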

OLS Solution

Hence we now know that the best-fit line has $\beta_0 \approx 1662$ and $\beta_1 \approx -62$

$$ price = 1662 - 62 * kmpl $$

In [19]:
beta_0 = 1662 
beta_1 = -62
plt.scatter(kmpl, price, s = 150)
plt.xlabel('kmpl')
plt.ylabel('price')
y = beta_0 + beta_1 * kmpl
plt.plot(kmpl, y, '-')


Out[19]:
[<matplotlib.lines.Line2D at 0x110376ac8>]
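
To quantify how much better this line is than the earlier hand-picked guess, we can compare the two mean squared errors (a sketch reusing the price and kmpl arrays above):


In [ ]:
# Sketch: mean squared error of the OLS line vs. the earlier guess price = 1600 - 60 * kmpl
mse_ols   = np.mean((price - (1662 - 62 * kmpl)) ** 2)
mse_guess = np.mean((price - (1600 - 60 * kmpl)) ** 2)
mse_ols, mse_guess    # the OLS line should come out smaller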

Exercise 1

We had price ('000 INR), mileage (kmpl) and now we have one more input variable - horsepower (bhp) for the 7 cars

price = [199 , 248 , 302 , 363 , 418 , 462 , 523 ]
kmpl  = [23.9, 22.7, 21.1, 20.5, 19.8, 20.4, 18.6]
bhp   = [38  , 47  , 55  , 67  , 68  , 83  , 82  ]

We want to predict the value of price, given the variables kmpl and bhp.


In [20]:
bhp = np.array([38, 47, 55, 67, 68, 83, 82])

In [21]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(bhp, kmpl, price, c='r', marker='o', s = 200)
ax.view_init(azim=30)


So a linear relationship can be written as:

$$ price = \beta_{0} + \beta_{1} kmpl + \beta_{2} bhp $$

We have added an intercept to the equation, so that the plane does not need to pass through the origin.

So we are trying to solve these $n = 7$ equations for the $p + 1 = 3$ unknowns:

$$ 199 = \beta_{0} + \beta_{1} \, 23.9 + \beta_{2} \, 38 + \epsilon_{1} ~~~~ \text{(eq 1)} $$
$$ 248 = \beta_{0} + \beta_{1} \, 22.7 + \beta_{2} \, 47 + \epsilon_{2} ~~~~ \text{(eq 2)} $$
$$ 302 = \beta_{0} + \beta_{1} \, 21.1 + \beta_{2} \, 55 + \epsilon_{3} ~~~~ \text{(eq 3)} $$
$$ 363 = \beta_{0} + \beta_{1} \, 20.5 + \beta_{2} \, 67 + \epsilon_{4} ~~~~ \text{(eq 4)} $$
$$ 418 = \beta_{0} + \beta_{1} \, 19.8 + \beta_{2} \, 68 + \epsilon_{5} ~~~~ \text{(eq 5)} $$
$$ 462 = \beta_{0} + \beta_{1} \, 20.4 + \beta_{2} \, 83 + \epsilon_{6} ~~~~ \text{(eq 6)} $$
$$ 523 = \beta_{0} + \beta_{1} \, 18.6 + \beta_{2} \, 82 + \epsilon_{7} ~~~~ \text{(eq 7)} $$

or in matrix form - we can write it as

$$ \begin{bmatrix}199 \\ 248 \\ 302 \\ 363 \\ 418 \\ 462 \\ 523 \end{bmatrix} = \begin{bmatrix} 1 & 23.9 & 38 \\ 1 & 22.7 & 47 \\ 1 & 21.1 & 55 \\ 1 & 20.5 & 67 \\ 1 & 19.8 & 68 \\ 1 & 20.4 & 83 \\ 1 & 18.6 & 82 \end{bmatrix} \begin{bmatrix}\beta_{0} \\ \beta_{1} \\ \beta_{2}\end{bmatrix} + \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \epsilon_{3} \\ \epsilon_{4} \\ \epsilon_{5} \\ \epsilon_{6} \\ \epsilon_{7} \end{bmatrix}$$

Develop the $X$ matrix for this problem.


In [ ]:


In [ ]:

Develop the $y$ matrix for this problem.


In [ ]:

Calculate the pseudo inverse of $X$.


In [ ]:

Find the $\beta$ for the best-fit plane.


In [ ]:

Plot the price, kmpl and bhp and the best-fit plane.


In [22]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(bhp, kmpl, price, c='r', marker='o', s = 200)

xrange = np.arange(min(bhp), max(bhp), 1)
yrange = np.arange(min(kmpl), max(kmpl), 1)
x, y = np.meshgrid(xrange, yrange)
z = 524 - 22 * y + 4 * x     # approximate best-fit plane (compare with the sklearn coefficients below)
ax.plot_surface(x, y, z, color ='blue', alpha = 0.5)
ax.view_init(azim=60)


Using a package: sklearn

Run the Ordinary Least Squares regression using the package sklearn.


In [23]:
import pandas as pd
df = pd.read_csv("data/cars_sample.csv")

In [24]:
from sklearn import linear_model

In [25]:
y = df.price

In [26]:
X = df[['kmpl', 'bhp']]

In [27]:
model_sklearn = linear_model.LinearRegression()

In [28]:
model_sklearn.fit(X, y)


Out[28]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [29]:
model_sklearn.coef_


Out[29]:
array([-21.72670412,   4.64289836])

In [30]:
model_sklearn.intercept_


Out[30]:
523.70717536052405
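
So the fitted model is approximately $price = 523.7 - 21.7 \, kmpl + 4.6 \, bhp$, roughly matching the plane plotted in the 3D figure above. A small sketch of using the fitted model (assuming the model_sklearn, X and y objects from the cells above):


In [ ]:
# Sketch: predict with the fitted model and inspect the residuals
pred = model_sklearn.predict(X)
print(pred)       # fitted prices for the 7 cars
print(y - pred)   # residuals; OLS chose the coefficients to minimize the sum of their squares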

In [31]:
model_sklearn_norm = linear_model.LinearRegression(normalize = True)

In [32]:
model_sklearn_norm.fit(X, y)


Out[32]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=True)

In [33]:
model_sklearn_norm.coef_


Out[33]:
array([-21.72670412,   4.64289836])

In [34]:
model_sklearn_norm.intercept_


Out[34]:
523.70717536052439

Non Linear Transformation

What happens when we apply non-linear transforms to the features?

What if we want to predict $price$ based on $kmpl$, $bhp$, $kmpl^2$ and $bhp / kmpl$?

The thing to remember is that non-linear transforms of the features do not change the machinery of linear regression, because the relationship only needs to be linear in $\beta$, not in the features.

We can write this as:

$$ price = \beta_{0} + \beta_{1} kmpl + \beta_{2} bhp + \beta_{3} kmpl^2 + \beta_{4} bhp/kmpl $$
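
In matrix terms, each transformed feature simply becomes another column of $X$. Row $i$ of the design matrix becomes (a sketch of the layout; filling in the numbers is part of Exercise 2 below):

$$ x_{i}^T = \begin{bmatrix} 1 & kmpl_{i} & bhp_{i} & kmpl_{i}^2 & bhp_{i}/kmpl_{i} \end{bmatrix} $$

$\beta$ now has 5 entries, and the same pseudo-inverse solution $\beta = (X^T X)^{-1} X^T y$ applies unchanged.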

In [35]:
df['kmpl2'] = np.power(df.kmpl,2)

In [36]:
plt.scatter(df.kmpl2, df.price, s = 150)
plt.xlabel('kmpl2')
plt.ylabel('price')


Out[36]:
<matplotlib.text.Text at 0x114864b38>

In [37]:
df['bhp_kmpl'] = np.divide(df.bhp, df.kmpl)

In [38]:
plt.scatter(df.bhp_kmpl, df.price, s = 150)
plt.xlabel('bhp/kmpl')
plt.ylabel('price')


Out[38]:
<matplotlib.text.Text at 0x1148fbc18>

In [39]:
df


Out[39]:
     brand     model  price  kmpl  bhp       type   kmpl2  bhp_kmpl
0     Tata      Nano    199  23.9   38  Hatchback  571.21  1.589958
1   Suzuki  Alto 800    248  22.7   47  Hatchback  515.29  2.070485
2  Hyundai       EON    302  21.1   55  Hatchback  445.21  2.606635
3   Suzuki   Wagon-R    363  20.5   67  Hatchback  420.25  3.268293
4  Hyundai       i10    418  19.8   68  Hatchback  392.04  3.434343
5   Suzuki     Swift    462  20.4   83  Hatchback  416.16  4.068627
6  Hyundai       i20    523  18.6   82  Hatchback  345.96  4.408602

Exercise 2

Run a linear regression: $$ price = \beta_{0} + \beta_{1} kmpl + \beta_{2} bhp + \beta_{3} kmpl^2 + \beta_{4} bhp/kmpl $$

Using Pseudo-Inverse Matrix:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:

Using sklearn package:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]:


In [ ]: