So far we have been looking at solving for vector $x$ when there is a known matrix $A$ and vector $b$, such that
$$ Ax = b $$
The first approach solves for one unique solution (or finds that none exists) when the number of observations equals the number of unknowns: with $p$ features plus an intercept, $n = p + 1$, so $A$ is an $n \times n$ matrix.
The second approach is OLS - ordinary least squares linear regression - used when $n > p + 1$.
Fitting this overdetermined system with OLS can give an over-fitted solution: it fits the in-sample data we have well but does not generalise well when we extend it outside the sample.
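As a quick illustration of the difference between the two approaches, here is a minimal sketch with made-up numbers (not the cars data): `np.linalg.solve` handles the square, exactly determined case, while the normal equations give the least-squares solution for the overdetermined case.
In [ ]:
import numpy as np

# Square case: n = p + 1, one exact solution (if A is invertible)
A = np.array([[1., 10.],
              [1., 15.]])
b = np.array([900., 600.])
x_exact = np.linalg.solve(A, b)              # solves Ax = b exactly

# Overdetermined case: n > p + 1, no exact solution in general
A = np.array([[1., 10.],
              [1., 15.],
              [1., 20.]])
b = np.array([900., 600., 500.])
x_ols = np.linalg.inv(A.T @ A) @ A.T @ b     # least-squares solution (A'A)^-1 A'b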
Let us take the OLS cars example. Our sample was 7 cars for which we had $price$ and $kmpl$ data. However, the entire population is 42 cars. We want to see how well the OLS fit on 7 cars does when we extend it to the entire set of 42 cars.
In [2]:
import numpy as np
import pandas as pd
In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)
In [4]:
pop = pd.read_csv('data/cars_small.csv')
In [5]:
pop.shape
Out[5]:
In [6]:
pop.head()
Out[6]:
In [6]:
sample_rows = [35,17,11,25,12,22,13]
In [7]:
sample = pop.loc[sample_rows,:]
In [8]:
sample
Out[8]:
Let us plot the entire population ($n = 42$), the sample ($n = 7$) and our original prediction line.
$$ price = 1662 - 62 * kmpl ~~~~ \textit{(sample = 7)}$$
In [9]:
# Plot the population and the sample
plt.scatter(pop.kmpl, pop.price, s = 150, alpha = 0.5 )
plt.scatter(sample.kmpl, sample.price, s = 150, alpha = 0.5, c = 'r')
# Plot the OLS Line - Sample
beta_0_s, beta_1_s = 1662, -62
x = np.arange(min(pop.kmpl),max(pop.kmpl),1)
plt.xlabel('kmpl')
plt.ylabel('price')
y_s = beta_0_s + beta_1_s * x
plt.text(x[-1], y_s[-1], 'sample')
plt.plot(x, y_s, '-')
Out[9]:
Let us find the best-fit OLS line for the population
In [10]:
def ols(df):
    n = df.shape[0]
    x0 = np.ones(n)                            # column of ones for the intercept
    x1 = df.kmpl
    X = np.c_[x0, x1]
    X = np.asmatrix(X)
    y = np.transpose(np.asmatrix(df.price))
    X_T = np.transpose(X)
    X_pseudo = np.linalg.inv(X_T * X) * X_T    # pseudo-inverse (X'X)^-1 X'
    beta = X_pseudo * y
    return beta
In [11]:
ols(sample)
Out[11]:
In [12]:
ols(pop)
Out[12]:
So the two OLS lines are:
$$ price = 1662 - 62 * kmpl ~~~~ \textit{(sample = 7)} $$
$$ price = 1158 - 36 * kmpl ~~~~ \textit{(population = 42)} $$
Let us plot this data:
In [13]:
# Plot the population and the sample
plt.scatter(pop.kmpl, pop.price, s = 150, alpha = 0.5 )
plt.scatter(sample.kmpl, sample.price, s = 150, alpha = 0.5, c = 'r')
# Plot the OLS Line - sample and population
beta_0_s, beta_1_s = 1662, -62
beta_0_p, beta_1_p = 1158, -36
x = np.arange(min(pop.kmpl),max(pop.kmpl),1)
plt.xlabel('kmpl')
plt.ylabel('price')
y_s = beta_0_s + beta_1_s * x
plt.text(x[-1], y_s[-1], 'sample')
y_p = beta_0_p + beta_1_p * x
plt.text(x[-1], y_p[-1], 'population')
plt.plot(x, y_s, '-')
plt.plot(x, y_p, '-')
Out[13]:
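To put a number on how well the sample line generalises, one could also compare mean squared errors in and out of sample. This is a small sketch (not part of the original outputs) that reuses the `sample` and `pop` DataFrames and the two sets of coefficients above; `mse` is just an illustrative helper name.
In [ ]:
def mse(df, beta_0, beta_1):
    # Mean squared error of the line price = beta_0 + beta_1 * kmpl on df
    resid = df.price - (beta_0 + beta_1 * df.kmpl)
    return (resid ** 2).mean()

print('sample line on sample        :', mse(sample, 1662, -62))
print('sample line on population    :', mse(pop, 1662, -62))
print('population line on population:', mse(pop, 1158, -36))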
In [14]:
# Randomly select 7 cars from this dataset
sample_random = pop.sample(n=7)
sample_random
Out[14]:
Let us write some code to randomly draw a sample of 7, repeat this $z$ times, and look at the resulting OLS lines and coefficients.
In [15]:
ols(sample_random)
Out[15]:
In [16]:
def random_cars_ols(z):
    beta = []
    for i in range(z):
        # Select a sample and run OLS
        sample_random = pop.sample(n=7)
        b = ols(sample_random)
        beta.append([b[0,0], b[1,0]])
        # Get the OLS line
        x = np.arange(min(pop.kmpl), max(pop.kmpl), 1)
        y = b[0,0] + b[1,0] * x
        # Set the plotting area
        plt.subplot(1, 2, 1)
        plt.tight_layout()
        a = round(1/np.log(z), 2)   # line transparency shrinks as z grows
        # Plot the OLS line
        plt.plot(x, y, '-', linewidth = 1.0, c = 'b', alpha = a)
        plt.xlabel('kmpl')
        plt.ylabel('price')
        plt.ylim(0, 1000)
        # Plot the intercept and coefficients
        plt.subplot(1, 2, 2)
        plt.scatter(beta[i][1], beta[i][0], s = 50, alpha = a)
        plt.xlim(-120, 60)
        plt.ylim(-500, 3000)
        plt.xlabel('beta_1')
        plt.ylabel('beta_0')
    # Plot the Population line
    plt.subplot(1, 2, 1)
    beta_0_p, beta_1_p = 1158, -36
    x = np.arange(min(pop.kmpl), max(pop.kmpl), 1)
    y_p = beta_0_p + beta_1_p * x
    plt.plot(x, y_p, '-', linewidth = 4, c = 'r')
Let us do this 500 times, $ z = 500 $
In [17]:
random_cars_ols(500)
Now, to prevent our $\beta$ from swinging all over the place from sample to sample, we need to constrain $\beta$:
$$ \beta^{T} \beta < C $$
For OLS our error term was:
$$ E_{ols}(\beta) = \frac{1}{n} (y - X\beta)^{T}(y - X\beta) $$
So now we add this constraint on $\beta$ as a penalty term in our minimization function:
$$ E_{reg}(\beta) = \frac{1}{n} (y - X\beta)^{T}(y - X\beta) + \frac{\alpha}{n} \beta^{T}\beta $$
To find the minimum of this error function, we differentiate with respect to $\beta$ and set the gradient to zero:
$$ \nabla E_{reg}(\beta) = \frac{dE_{reg}(\beta)}{d\beta} = \frac{2}{n} X^T(X\beta - y) + \frac{2\alpha}{n} \beta = 0 $$
$$ X^T X\beta + \alpha \beta = X^T y $$
So our $\beta$ for the regularized error function is
$$ \beta_{reg} = (X^T X + \alpha I)^{-1}X^T y $$
When $\alpha = 0$, this reduces to OLS:
$$ \beta_{ols} = (X^T X)^{-1}X^T y $$
In [18]:
def ridge(df, alpha):
    n = df.shape[0]
    x0 = np.ones(n)
    x1 = df.kmpl
    X = np.c_[x0, x1]
    X = np.asmatrix(X)
    y = np.asmatrix(df.price.values.reshape(-1, 1))
    X_T = np.transpose(X)
    I = np.identity(2)
    beta = np.linalg.inv(X_T * X + alpha * I) * X_T * y   # (X'X + alpha*I)^-1 X'y
    return beta
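As a side note, explicitly inverting $X^TX + \alpha I$ is fine at this size, but a slightly more numerically stable variant would let NumPy solve the linear system directly. This is just a sketch of the same closed form (`ridge_solve` is an illustrative name, not used elsewhere in the notebook):
In [ ]:
def ridge_solve(df, alpha):
    # Same closed form as ridge(), but via np.linalg.solve instead of an explicit inverse
    X = np.c_[np.ones(df.shape[0]), df.kmpl]
    y = df.price.values.reshape(-1, 1)
    return np.linalg.solve(X.T @ X + alpha * np.identity(2), X.T @ y)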
Let us run this with alpha = 0, which gives OLS.
In [19]:
ridge(sample, 0)
Out[19]:
Let us increase alpha to constrain $\beta$ and see the result.
In [20]:
def ridge_plot(df, alphas, func):
    plt.scatter(df.kmpl, df.price, s = 150, alpha = 0.5)
    plt.xlabel('kmpl')
    plt.ylabel('price')
    # Plot the ridge line for each alpha
    for a in alphas:
        beta = func(df, a)
        x = np.arange(min(df.kmpl), max(df.kmpl), 1)
        y = beta[0,0] + beta[1,0] * x
        plt.plot(x, y, '-', linewidth = 1, c = 'b')
        plt.text(x[-1], y[-1], '%s' % a, size = "smaller")
In [21]:
ridge_plot(sample, [0, 0.005, 0.01, 0.02, 0.03, 0.05, 0.1], ridge)
Plot the Ridge Regression for different values of $\alpha$
Plot the overfitting by taking samples of size $n = 20$.
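One possible approach is sketched below (`random_cars_ols_n` is a hypothetical variant of `random_cars_ols` with the sample size as a parameter; it only overlays the fitted lines, without the coefficient scatter):
In [ ]:
def random_cars_ols_n(z, n_sample):
    # Draw z random samples of size n_sample and overlay their OLS lines
    x = np.arange(min(pop.kmpl), max(pop.kmpl), 1)
    a = round(1/np.log(z), 2)
    for i in range(z):
        b = ols(pop.sample(n=n_sample))
        plt.plot(x, b[0,0] + b[1,0] * x, '-', linewidth = 1, c = 'b', alpha = a)
    # Population line for reference
    plt.plot(x, 1158 - 36 * x, '-', linewidth = 4, c = 'r')
    plt.xlabel('kmpl')
    plt.ylabel('price')
    plt.ylim(0, 1000)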
Plot the overfitting by taking $n = 42$ (the entire population).
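With the same hypothetical helper, these two exercises would just be `random_cars_ols_n(500, 20)` and `random_cars_ols_n(500, 42)`. Since a sample of 42 drawn without replacement is the whole population, every line in the second case should collapse onto the population line, i.e. the sampling variance disappears.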
In [22]:
from sklearn import linear_model
def ridge_sklearn(df, alpha):
    y = df.price
    X = df.kmpl.values.reshape(-1, 1)
    X = np.c_[np.ones((X.shape[0], 1)), X]               # add the intercept column ourselves
    model = linear_model.Ridge(alpha = alpha, fit_intercept = False)
    model.fit(X, y)
    beta = np.array([model.coef_]).T                     # return as a column vector, like ols/ridge
    return beta
In [23]:
ridge_sklearn(pop, 0)
Out[23]:
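Note that we pass `fit_intercept = False` and add the column of ones ourselves, so sklearn penalizes the intercept as well, exactly like our closed form. As a quick sanity check (a sketch; both functions are defined above and should agree up to numerical precision for the same non-zero alpha):
In [ ]:
np.allclose(ridge(sample, 0.02), ridge_sklearn(sample, 0.02))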
In [24]:
ridge_plot(sample, [0, 0.005, 0.01, 0.02, 0.03, 0.05, 0.1], ridge_sklearn)
Let us now see how ridge regression helps in reducing this overfitting.
In [25]:
def random_cars_ridge(z, alpha, func):
    beta = []
    for i in range(z):
        # Select a sample and run ridge regression
        sample_random = pop.sample(n=7)
        b = func(sample_random, alpha)
        beta.append([b[0,0], b[1,0]])
        # Get the ridge line
        x = np.arange(min(pop.kmpl), max(pop.kmpl), 1)
        y = b[0,0] + b[1,0] * x
        # Set the plotting area
        plt.subplot(1, 2, 1)
        plt.tight_layout()
        a = round(1/np.log(z), 2)
        # Plot the ridge line
        plt.plot(x, y, '-', linewidth = 1, c = 'b', alpha = a)
        plt.xlabel('kmpl')
        plt.ylabel('price')
        plt.ylim(0, 1000)
        # Plot the intercept and coefficients
        plt.subplot(1, 2, 2)
        plt.scatter(beta[i][1], beta[i][0], s = 50, alpha = a)
        plt.xlim(-120, 60)
        plt.ylim(-500, 3000)
        plt.xlabel('beta_1')
        plt.ylabel('beta_0')
    # Plot the Population line
    plt.subplot(1, 2, 1)
    beta_0_p, beta_1_p = 1158, -36
    x = np.arange(min(pop.kmpl), max(pop.kmpl), 1)
    y_p = beta_0_p + beta_1_p * x
    plt.plot(x, y_p, '-', linewidth = 4, c = 'r')
In [26]:
random_cars_ridge (500, 0.02, ridge)
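One could re-run this with a larger $\alpha$ (or pass `ridge_sklearn` as the function) to see the spread of lines and coefficients tighten further, at the cost of pulling the fits away from the population line, for example:
In [ ]:
random_cars_ridge(500, 0.1, ridge)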