Linear Regression (Ridge)

So far we have been looking at solving for vector $x$ when there is a known matrix $A$ and vector $b$, such that

$$ Ax = b $$

The first approach solves for a single unique solution (or none) when the number of equations matches the number of unknowns: with $p$ features plus an intercept, $n = p + 1$ gives a square $n \times n$ matrix $A$.

The second approach uses OLS (ordinary least squares) linear regression when $ n > p + 1 $, i.e. when the system is overdetermined.
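
As a quick refresher, here is a minimal sketch of both approaches in NumPy, using made-up matrices rather than the cars data: np.linalg.solve handles the square case exactly, while np.linalg.lstsq returns the least-squares solution for the overdetermined case.


In [ ]:
import numpy as np

# Square system (n = p + 1): one exact solution, provided A is invertible
A = np.array([[1.0, 2.0],
              [3.0, 5.0]])
b = np.array([1.0, 2.0])
x_exact = np.linalg.solve(A, b)

# Overdetermined system (n > p + 1): no exact solution, take the least-squares one
A_tall = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])
b_tall = np.array([1.0, 2.0, 2.0])
x_ls = np.linalg.lstsq(A_tall, b_tall, rcond=None)[0]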

Overfitting in OLS

Ordinary least squares estimation on a small sample tends to over-fit: the solution fits the in-sample data well but does not generalise well when we extend it outside the sample.

Let's take the OLS cars example. Our sample was 7 cars for which we had $price$ and $kmpl$ data; however, the entire population is 42 cars. We want to see how well the OLS line fitted on 7 cars does when we extend it to the full set of 42 cars.


In [2]:
import numpy as np
import pandas as pd

In [3]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (10, 6)

In [4]:
pop = pd.read_csv('data/cars_small.csv')

In [5]:
pop.shape


Out[5]:
(42, 6)

In [6]:
pop.head()


Out[6]:
brand model price kmpl bhp type
0 Chevrolet Beat 421 18.6 79 Hatchback
1 Chevrolet Sail 551 18.2 82 Sedan
2 Chevrolet Sail Hatchback 468 18.2 82 Hatchback
3 Chevrolet Spark 345 16.2 62 Hatchback
4 Fiat Linea Classic 612 14.9 89 Sedan

In [6]:
sample_rows = [35,17,11,25,12,22,13]

In [7]:
sample = pop.loc[sample_rows,:]

In [8]:
sample


Out[8]:
brand model price kmpl bhp type
35 Tata Nano 199 23.9 38 Hatchback
17 Suzuki Alto 800 248 22.7 47 Hatchback
11 Hyundai EON 302 21.1 55 Hatchback
25 Suzuki Wagon-R 363 20.5 67 Hatchback
12 Hyundai i10 418 19.8 68 Hatchback
22 Suzuki Swift 462 20.4 83 Hatchback
13 Hyundai i20 523 18.6 82 Hatchback

Let's plot the entire population (n = 42), the sample (n = 7), and our original prediction line.

$$ price = 1662 - 62 * kmpl ~~~~ \textit{(sample = 7)}$$


In [9]:
# Plot the population and the sample
plt.scatter(pop.kmpl, pop.price, s = 150, alpha = 0.5 )
plt.scatter(sample.kmpl, sample.price, s = 150, alpha = 0.5, c = 'r')

# Plot the OLS Line - Sample
beta_0_s, beta_1_s = 1662, -62 
x = np.arange(min(pop.kmpl),max(pop.kmpl),1)

plt.xlabel('kmpl')
plt.ylabel('price')

y_s = beta_0_s + beta_1_s * x
plt.text(x[-1], y_s[-1], 'sample')

plt.plot(x, y_s, '-')


Out[9]:
[<matplotlib.lines.Line2D at 0x11ae65be0>]

Let us find the best-fit OLS line for the population


In [10]:
def ols(df):
    # Build the design matrix X = [1, kmpl] and the response vector y = price
    n = df.shape[0]
    x0 = np.ones(n)
    x1 = df.kmpl
    X = np.c_[x0, x1]
    X = np.asmatrix(X)
    y = np.transpose(np.asmatrix(df.price))
    # OLS via the pseudo-inverse: beta = (X^T X)^{-1} X^T y
    X_T = np.transpose(X)
    X_pseudo = np.linalg.inv(X_T * X) * X_T
    beta = X_pseudo * y
    return beta

In [11]:
ols(sample)


Out[11]:
matrix([[ 1662.55977286],
        [  -62.06066946]])

In [12]:
ols(pop)


Out[12]:
matrix([[ 1157.81674084],
        [  -36.12619236]])

So the two OLS lines are: $$ price = 1662 - 62 * kmpl ~~~~ \textit{(sample = 7)}$$ $$ price = 1158 - 36 * kmpl ~~~~ \textit{(population = 42)}$$

Let us plot this data:


In [13]:
# Plot the population and the sample
plt.scatter(pop.kmpl, pop.price, s = 150, alpha = 0.5 )
plt.scatter(sample.kmpl, sample.price, s = 150, alpha = 0.5, c = 'r')

# Plot the OLS Line - sample and population
beta_0_s, beta_1_s = 1662, -62 
beta_0_p, beta_1_p = 1158, -36 
x = np.arange(min(pop.kmpl),max(pop.kmpl),1)

plt.xlabel('kmpl')
plt.ylabel('price')

y_s = beta_0_s + beta_1_s * x
plt.text(x[-1], y_s[-1], 'sample')

y_p = beta_0_p + beta_1_p * x
plt.text(x[-1], y_p[-1], 'population')

plt.plot(x, y_s, '-')
plt.plot(x, y_p, '-')


Out[13]:
[<matplotlib.lines.Line2D at 0x11b06f7b8>]

Understanding Overfitting

Overfitting happens because our original line depends heavily on which 7 observations end up in the sample. If we change the sample, we get a different answer every time!


In [14]:
# Randomly select 7 cars from this dataset
sample_random = pop.sample(n=7)
sample_random


Out[14]:
brand model price kmpl bhp type
2 Chevrolet Sail Hatchback 468 18.2 82 Hatchback
0 Chevrolet Beat 421 18.6 79 Hatchback
13 Hyundai i20 523 18.6 82 Hatchback
3 Chevrolet Spark 345 16.2 62 Hatchback
24 Suzuki SX4 715 16.5 103 Sedan
1 Chevrolet Sail 551 18.2 82 Sedan
34 Tata Indigo 499 14.0 65 Sedan

Let us write some code that randomly draws a sample of 7 cars, repeats this $z$ times, and plots the resulting OLS lines and coefficients.


In [15]:
ols(sample_random)


Out[15]:
matrix([[ 553.18091306],
        [  -2.91160758]])

In [16]:
def random_cars_ols (z):
    beta = []
    for i in range(z):
        
        # Select a sample and run OLS
        sample_random = pop.sample(n=7)
        b = ols(sample_random)
        beta.append([b[0,0], b[1,0]])
        
        # Get the OLS line
        x = np.arange(min(pop.kmpl), max(pop.kmpl), 1)
        y = b[0,0] + b[1,0] *x
        
        # Set the plotting area
        plt.subplot(1, 2, 1)
        plt.tight_layout()
        a = round(1/np.log(z), 2)
        
        # Plot the OLS line
        plt.plot(x,y, '-', linewidth = 1.0, c = 'b', alpha = a)
        plt.xlabel('kmpl')
        plt.ylabel('price')
        plt.ylim(0,1000)

        # Plot the intercept and coefficients
        plt.subplot(1,2,2)
        plt.scatter(beta[i][1],beta[i][0], s = 50, alpha = a)
        plt.xlim(-120,60)
        plt.ylim(-500,3000)
        plt.xlabel('beta_1')
        plt.ylabel('beta_0')
    
    # Plot the population line
    plt.subplot(1, 2, 1)
    beta_0_p, beta_1_p = 1158, -36 
    x = np.arange(min(pop.kmpl),max(pop.kmpl),1)
    y_p = beta_0_p + beta_1_p * x
    plt.plot(x, y_p, '-', linewidth =4, c = 'r')

Let us do this 500 times, $ z = 500 $


In [17]:
random_cars_ols(500)


L2 Regularization - Ridge Regression

Now, to prevent our $\beta$ from going all over the place to fit the line, we need to constrain $\beta$:

$$ \beta^{T} \beta < C $$

For OLS our error term was:

$$ E_{ols}(\beta)= \frac {1}{n} (y-X\beta)^{T}(y-X\beta) $$

So we add this constraint on $\beta$ to our minimization function as a penalty term, with multiplier $\alpha$:

$$ E_{reg}(\beta)= \frac {1}{n} (y-X\beta)^{T}(y-X\beta) + \frac {\alpha}{n} \beta^{T}\beta$$

To get the minimum of this error function, we differentiate with respect to $\beta$ and set the gradient to zero:

$$ \nabla E_{reg}(\beta) = 0 $$

$$ \nabla E_{reg}(\beta) = \frac {dE_{reg}(\beta)}{d\beta} = \frac {2}{n} X^T(X\beta - y) + \frac {2\alpha}{n} \beta = 0 $$

$$ X^T X\beta + \alpha \beta = X^T y $$

So our $\beta$ for a regularized function is

$$ \beta_{reg} = (X^T X + \alpha I)^{-1}X^Ty$$

When $ \alpha = 0 $, this reduces to OLS:

$$ \beta_{ols} = (X^T X)^{-1}X^Ty$$

Direct Calculation


In [18]:
def ridge(df, alpha):
    # Build the design matrix X = [1, kmpl] and the response vector y = price
    n = df.shape[0]
    x0 = np.ones(n)
    x1 = df.kmpl
    X = np.c_[x0, x1]
    X = np.asmatrix(X)
    y = np.asmatrix(df.price.values.reshape(-1,1))
    X_T = np.transpose(X)
    # Ridge closed form: beta = (X^T X + alpha I)^{-1} X^T y
    I = np.identity(2)
    beta = np.linalg.inv(X_T * X + alpha * I) * X_T * y
    return beta

Let us run this with alpha = 0, which is OLS.


In [19]:
ridge(sample, 0)


Out[19]:
matrix([[ 1662.55977286],
        [  -62.06066946]])

Let's increase alpha to constrain the coefficients and see the result.


In [20]:
def ridge_plot(df, alphas, func):
    plt.scatter(df.kmpl, df.price, s = 150, alpha = 0.5)
    plt.xlabel('kmpl')
    plt.ylabel('price')
    
    # Plot the Ridge line
    for a in alphas: 
        beta = func(df, a)
        x = np.arange(min(df.kmpl), max(df.kmpl), 1)
        y = beta[0,0] + beta[1,0] * x
        plt.plot(x,y, '-', linewidth = 1, c = 'b')
        plt.text(x[-1], y[-1], '%s' % a, size = "smaller")

In [21]:
ridge_plot(sample, [0, 0.005, 0.01, 0.02, 0.03, 0.05, 0.1], ridge)


Exercises

Run a Ridge Linear Regression: $$ price = \beta_{0} + \beta_{1} kmpl + \beta_{2} bhp + \beta_{3} kmpl^2 + \beta_{4} bhp/kmpl $$
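
A possible starting point (a sketch only; the alpha value is arbitrary and the columns simply mirror the equation above) is to build the extended design matrix and reuse the same closed-form ridge formula, now with a $5 \times 5$ identity:


In [ ]:
# Sketch: design matrix for price ~ kmpl, bhp, kmpl^2, bhp/kmpl (plus intercept)
X = np.c_[np.ones(pop.shape[0]), pop.kmpl, pop.bhp, pop.kmpl**2, pop.bhp/pop.kmpl]
y = pop.price.values.reshape(-1, 1)

alpha = 0.01                      # arbitrary illustrative value
I5 = np.identity(X.shape[1])
beta = np.linalg.inv(X.T @ X + alpha * I5) @ X.T @ y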

Run the Ridge Regression using the pseudo-inverse.


In [ ]:


In [ ]:


In [ ]:

Plot the Ridge Regression for different values of $\alpha$


In [ ]:


In [ ]:

Plot the overfitting by taking samples of size $n = 20$.


In [ ]:


In [ ]:

Plot the overfitting by taking $n = 42$ (the entire population).


In [ ]:

Using sklearn


In [22]:
from sklearn import linear_model

def ridge_sklearn(df, alpha):
    y = df.price
    # Add an explicit intercept column and set fit_intercept=False, so the
    # intercept is penalised too, matching the direct calculation above
    X = df.kmpl.values.reshape(-1,1)
    X = np.c_[np.ones((X.shape[0],1)), X]
    model = linear_model.Ridge(alpha = alpha, fit_intercept = False)
    model.fit(X, y)
    beta = np.array([model.coef_]).T
    return beta

In [23]:
ridge_sklearn(pop, 0)


Out[23]:
array([[ 1157.81674084],
       [  -36.12619236]])
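
As a quick sanity check (a small sketch), the sklearn coefficients for a non-zero alpha should match the direct calculation above, since both solve the same penalised normal equations $(X^T X + \alpha I)\beta = X^T y$:


In [ ]:
# Compare the closed-form ridge with the sklearn estimate for the same alpha
print(ridge(sample, 0.02))
print(ridge_sklearn(sample, 0.02))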

In [24]:
ridge_plot(sample, [0, 0.005, 0.01, 0.02, 0.03, 0.05, 0.1], ridge_sklearn)


Let us now run this and see how ridge regression helps in reducing overfitting.


In [25]:
def random_cars_ridge (z, alpha, func):
    beta = []
    for i in range(z):
        
        # Select a sample and run the ridge estimator
        sample_random = pop.sample(n=7)
        b = func(sample_random, alpha)
        beta.append([b[0,0], b[1,0]])
        
        # Get the fitted ridge line
        x = np.arange(min(pop.kmpl), max(pop.kmpl), 1)
        y = b[0,0] + b[1,0] *x
        
        # Set the plotting area
        plt.subplot(1, 2, 1)
        plt.tight_layout()
        a = round(1/np.log(z), 2)
        
        # Plot the fitted ridge line
        plt.plot(x,y, '-', linewidth = 1, c = 'b', alpha = a)
        plt.xlabel('kmpl')
        plt.ylabel('price')
        plt.ylim(0,1000)

        # Plot the intercept and coefficients
        plt.subplot(1,2,2)
        plt.scatter(beta[i][1],beta[i][0], s = 50, alpha = a)
        plt.xlim(-120,60)
        plt.ylim(-500,3000)
        plt.xlabel('beta_1')
        plt.ylabel('beta_0')
    
    # Plot the Population line
    plt.subplot(1, 2, 1)
    beta_0_p, beta_1_p = 1158, -36 
    x = np.arange(min(pop.kmpl),max(pop.kmpl),1)
    y_p = beta_0_p + beta_1_p * x
    plt.plot(x, y_p, '-', linewidth =4, c = 'r')

In [26]:
random_cars_ridge (500, 0.02, ridge)
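
Since random_cars_ridge takes the estimator as an argument, the same experiment can also be run with the sklearn version, which shares the (df, alpha) signature:


In [ ]:
random_cars_ridge(500, 0.02, ridge_sklearn)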


