03 - Linear Regression

by Alejandro Correa Bahnsen and Jesus Solano

version 1.4, January 2019

Part of the class Practical Machine Learning

This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Rick Muller, Sandia National Laboratories (https://github.com/justmarkham).


In [1]:
# Import libraries.

import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import itertools

plt.style.use('ggplot')

In [2]:
# Download and load the dataset.
urlDataset = 'https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/houses_prices_prediction.csv.zip'
data = pd.read_csv(urlDataset)
data.head()


Out[2]:
area bedroom price
0 2104 3 399900
1 1600 3 329900
2 2400 3 369000
3 1416 2 232000
4 3000 4 539900

In [3]:
data.columns


Out[3]:
Index(['area', 'bedroom', ' price'], dtype='object')

In [4]:
plt.style.use('bmh')
y = data[' price'].values
X = data['area'].values
plt.scatter(X, y)
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


Normalize data

$$ x = \frac{x -\overline x}{\sigma_x} $$


In [5]:
y_mean, y_std = y.mean(), y.std()
X_mean, X_std = X.mean(), X.std()

y = (y - y_mean)/ y_std
X = (X - X_mean)/ X_std

plt.scatter(X, y)
plt.xlabel('Area')
plt.ylabel('Price')


Out[5]:
Text(0, 0.5, 'Price')

Form of linear regression

$$h_\beta(x) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$

  • $h_\beta(x)$ is the response
  • $\beta_0$ is the intercept
  • $\beta_1$ is the coefficient for $x_1$ (the first feature)
  • $\beta_n$ is the coefficient for $x_n$ (the nth feature)

The $\beta$ values are called the model coefficients:

  • These values are estimated (or "learned") during the model fitting process using the least squares criterion.
  • Specifically, we find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors").
  • And once we've learned these coefficients, we can use the model to predict the response.

In the diagram above:

  • The black dots are the observed values of x and y.
  • The blue line is our least squares line.
  • The red lines are the residuals, which are the vertical distances between the observed values and the least squares line.
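
As a quick illustration, a minimal matplotlib sketch (with synthetic data and illustrative variable names) that reproduces such a diagram:

# Sketch of the least squares diagram described above (synthetic data)
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
x_demo = np.linspace(0, 10, 15)
y_demo = 2 + 0.8 * x_demo + rng.normal(0, 1.5, x_demo.shape[0])

b1, b0 = np.polyfit(x_demo, y_demo, 1)         # least squares slope and intercept
y_fit = b0 + b1 * x_demo

plt.scatter(x_demo, y_demo, c='k')              # black dots: observed values
plt.plot(x_demo, y_fit, 'b')                    # blue line: least squares line
plt.vlines(x_demo, y_fit, y_demo, colors='r')   # red lines: residuals
plt.xlabel('x')
plt.ylabel('y')
plt.show()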

Cost function

The goal becomes to estimate the parameters $\beta$ that minimize the sum of squared residuals:

$$J(\beta_0, \beta_1)=\frac{1}{2n}\sum_{i=1}^n (h_\beta(x_i)-y_i)^2$$


In [6]:
# Add a column of ones (the intercept term) to X
n_samples = X.shape[0]
X_ = np.c_[np.ones(n_samples), X]

Let's suppose the following betas:


In [7]:
beta_ini = np.array([-1, 1])

In [8]:
# Hypothesis function h_beta(x)
def lr_h(beta,x):
    return np.dot(beta, x.T)

In [9]:
# scatter plot
plt.scatter(X, y)

# Plot the linear regression
x = np.c_[np.ones(2), [X.min(), X.max()]]
plt.plot(x[:, 1], lr_h(beta_ini, x), 'r', lw=5)
plt.xlabel('Area')
plt.ylabel('Price')
plt.show()


Let's calculate the error of such a regression.


In [10]:
# Cost function
def lr_cost_func(beta, x, y):
    # Can be vectorized
    res = 0
    for i in range(x.shape[0]):
        res += (lr_h(beta,x[i, :]) - y[i]) ** 2
    res *= 1 / (2*x.shape[0])
    return res
lr_cost_func(beta_ini, X_, y)


Out[10]:
0.6450124071218747
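
As the comment suggests, the loop in lr_cost_func can be vectorized; a sketch of an equivalent version (the name lr_cost_func_vec is illustrative):

# Vectorized version of the cost function
def lr_cost_func_vec(beta, x, y):
    errors = np.dot(x, beta) - y                 # h_beta(x_i) - y_i for all i at once
    return (errors ** 2).sum() / (2 * x.shape[0])

lr_cost_func_vec(beta_ini, X_, y)  # should match the looped version above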

Understanding the cost function

Let's see what the cost function looks like for different values of $\beta$.


In [11]:
beta0 = np.arange(-15, 20, 1)
beta1 = 2

In [12]:
cost_func=[]
for beta_0 in beta0:
    cost_func.append(lr_cost_func(np.array([beta_0, beta1]), X_, y) )

plt.plot(beta0, cost_func)
plt.xlabel('beta_0')
plt.ylabel('J(beta)')


Out[12]:
Text(0, 0.5, 'J(beta)')

In [13]:
beta0 = 0
beta1 = np.arange(-15, 20, 1)

In [14]:
cost_func=[]
for beta_1 in beta1:
    cost_func.append(lr_cost_func(np.array([beta0, beta_1]), X_, y) )

plt.plot(beta1, cost_func)
plt.xlabel('beta_1')
plt.ylabel('J(beta)')


Out[14]:
Text(0, 0.5, 'J(beta)')

Analyzing both at the same time


In [15]:
beta0 = np.arange(-5, 7, 0.2)
beta1 = np.arange(-5, 7, 0.2)

In [16]:
cost_func = pd.DataFrame(index=beta0, columns=beta1)

for beta_0 in beta0:
    for beta_1 in beta1:
        cost_func.loc[beta_0, beta_1] = lr_cost_func(np.array([beta_0, beta_1]), X_, y)

In [17]:
betas = np.transpose([np.tile(beta0, beta1.shape[0]), np.repeat(beta1, beta0.shape[0])])
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.plot_trisurf(betas[:, 0], betas[:, 1], cost_func.T.values.flatten().astype('float'), cmap=cm.jet, linewidth=0.1)
ax.set_xlabel('beta_0')
ax.set_ylabel('beta_1')
ax.set_zlabel('J(beta)')
plt.show()


It can also be seen as a contour plot


In [18]:
contour_levels = [0, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 5, 7, 10, 12, 15, 20]
plt.contour(beta0, beta1, cost_func.T.values, contour_levels)
plt.xlabel('beta_0')
plt.ylabel('beta_1')


Out[18]:
Text(0, 0.5, 'beta_1')

Let's see how different values of the betas appear on the contour plot.


In [19]:
betas = np.array([[0, 0],
                 [-1, -1],
                 [-5, 5],
                 [3, -2]])

In [20]:
for beta in betas:
    print('\n\nLinear Regression with betas ', beta)
    f, (ax1, ax2) = plt.subplots(1,2, figsize=(12, 6))
    ax2.contour(beta0, beta1, cost_func.T.values, contour_levels)
    ax2.set_xlabel('beta_0')
    ax2.set_ylabel('beta_1')
    ax2.scatter(beta[0], beta[1], s=50)

    # scatter plot
    ax1.scatter(X, y)

    # Plot the linear regression
    x = np.c_[np.ones(2), [X.min(), X.max()]]
    ax1.plot(x[:, 1], lr_h(beta, x), 'r', lw=5)
    ax1.set_xlabel('Area')
    ax1.set_ylabel('Price')
    plt.show()



Linear Regression with betas  [0 0]

Linear Regression with betas  [-1 -1]

Linear Regression with betas  [-5  5]

Linear Regression with betas  [ 3 -2]

Gradient descent

Have some function $J(\beta_0, \beta_1)$

Want $\min_{\beta_0, \beta_1}J(\beta_0, \beta_1)$

Process:

  • Start with some $\beta_0, \beta_1$

  • Keep changing $\beta_0, \beta_1$ to reduce $J(\beta_0, \beta_1)$ until hopefully end up at a minimum

Gradient descent algorithm

Repeat until convergence{

$$ \beta_j := \beta_j - \alpha \frac{\partial }{\partial \beta_j} J(\beta_0, \beta_1)$$

}

while simultaneously updating $j=0$ and $j=1$

$\alpha$ is referred to as the learning rate

For the particular case of linear regression with one variable and one intercept the gradient is calculated as:

$$\frac{\partial }{\partial \beta_j} J(\beta_0, \beta_1) = \frac{\partial }{\partial \beta_j} \frac{1}{2n}\sum_{i=1}^n (h_\beta(x_i)-y_i)^2$$

$$\frac{\partial }{\partial \beta_j} J(\beta_0, \beta_1) = \frac{\partial }{\partial \beta_j} \frac{1}{2n}\sum_{i=1}^n (\beta_0 + \beta_1x_i-y_i)^2$$

$ j = 0: \frac{\partial }{\partial \beta_0} J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1x_i-y_i)$

$ j = 1: \frac{\partial }{\partial \beta_1} J(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1x_i-y_i) \cdot x_i$

Gradient descent algorithm

Repeat until convergence{

$ \beta_0 := \beta_0- \alpha \frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1x_i-y_i)$

$ \beta_1 := \beta_1- \alpha \frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1x_i-y_i) \cdot x_i$

}

simultaneously!
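
Here "simultaneously" means both partial derivatives are evaluated at the current values of $\beta_0, \beta_1$ before either coefficient is overwritten. A tiny sketch of the distinction on a toy cost (illustrative, not the regression cost):

# Toy cost J(b0, b1) = b0**2 + b1**2; alpha_demo is an arbitrary step size
alpha_demo = 0.1
b0, b1 = 3.0, -2.0

grad_b0 = 2 * b0   # dJ/db0 evaluated at the current point
grad_b1 = 2 * b1   # dJ/db1 evaluated at the current point

# Both gradients are computed before either parameter is overwritten
b0, b1 = b0 - alpha_demo * grad_b0, b1 - alpha_demo * grad_b1

# A non-simultaneous update would overwrite b0 first and then evaluate
# dJ/db1 at the already-updated point, which is not the same algorithm.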

Calculate gradient


In [21]:
# gradient calculation
beta_ini = np.array([-1.5, 0.])

def gradient(beta, x, y):
    # Not vectorized
    gradient_0  = 1 / x.shape[0] * ((lr_h(beta, x) - y).sum())
    gradient_1  = 1 / x.shape[0] * ((lr_h(beta, x) - y)* x[:, 1]).sum()

    return np.array([gradient_0, gradient_1])

gradient(beta_ini, X_, y)


Out[21]:
array([-1.5       , -0.85498759])
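
As a sanity check, the analytical gradient can be compared with a central finite-difference approximation of the cost function (a minimal sketch reusing lr_cost_func; eps is an illustrative step size):

# Numerical gradient check via central differences
def numerical_gradient(beta, x, y, eps=1e-6):
    num_grad = np.zeros_like(beta, dtype=float)
    for j in range(beta.shape[0]):
        step = np.zeros_like(beta, dtype=float)
        step[j] = eps
        num_grad[j] = (lr_cost_func(beta + step, x, y) -
                       lr_cost_func(beta - step, x, y)) / (2 * eps)
    return num_grad

numerical_gradient(beta_ini, X_, y)  # should be close to gradient(beta_ini, X_, y)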

Gradient descent algorithm


In [22]:
def gradient_descent(x, y, beta_ini, alpha, iters):
    # Store beta and the cost J(beta) at every iteration
    betas = np.zeros((iters, beta_ini.shape[0] + 1))

    beta = beta_ini.astype(float)  # copy, so the caller's array is not modified in place
    for iter_ in range(iters):

        betas[iter_, :-1] = beta
        betas[iter_, -1] = lr_cost_func(beta, x, y)
        beta -= alpha * gradient(beta, x, y)

    return betas

In [23]:
iters = 100
alpha = 0.05
beta_ini = np.array([-4., -4.])

betas =  gradient_descent(X_, y, beta_ini, alpha, iters)

Let's see the evolution of the cost per iteration.


In [24]:
plt.plot(range(iters), betas[:, -1])
plt.xlabel('iteration')
plt.ylabel('J(beta)')


Out[24]:
Text(0, 0.5, 'J(beta)')

Understanding what the algorithm does in each iteration


In [25]:
betas_ = betas[range(0, iters, 10), :-1]
for i, beta in enumerate(betas_):
    print('\n\nLinear Regression with betas ', beta)
    f, (ax1, ax2) = plt.subplots(1,2, figsize=(12, 6))
    ax2.contour(beta0, beta1, cost_func.T.values, contour_levels)
    ax2.set_xlabel('beta_0')
    ax2.set_ylabel('beta_1')
    ax2.scatter(beta[0], beta[1], c='r', s=50)
    
    if i > 0:
        for beta_ in betas_[:i]:
            ax2.scatter(beta_[0], beta_[1], s=50)

    # scatter plot
    ax1.scatter(X, y)

    # Plot the linear regression
    x = np.c_[np.ones(2), [X.min(), X.max()]]
    ax1.plot(x[:, 1], lr_h(beta, x), 'r', lw=5)
    ax1.set_xlabel('Area')
    ax1.set_ylabel('Price')
    plt.show()



Linear Regression with betas  [-4. -4.]

Linear Regression with betas  [-2.39494776 -2.05187282]

Linear Regression with betas  [-1.43394369 -0.88545711]

Linear Regression with betas  [-0.85855506 -0.18708094]

Linear Regression with betas  [-0.51404863  0.23106267]

Linear Regression with betas  [-0.3077799   0.48142069]

Linear Regression with betas  [-0.1842792   0.63131929]

Linear Regression with betas  [-0.11033476  0.72106912]

Linear Regression with betas  [-0.0660615   0.77480566]

Linear Regression with betas  [-0.03955346  0.8069797 ]

Estimated Betas


In [26]:
betas[-1, :-1]


Out[26]:
array([-0.02492854,  0.82473065])

Normal equations (aka OLS)

$$ \beta = (X^T X)^{-1} X^T Y $$
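
This expression follows from writing the cost in matrix form and setting its gradient to zero:

$$J(\beta) = \frac{1}{2n}(X\beta - y)^T(X\beta - y), \qquad \nabla_\beta J(\beta) = \frac{1}{n}X^T(X\beta - y)$$

$$\nabla_\beta J(\beta) = 0 \;\Rightarrow\; X^TX\beta = X^Ty \;\Rightarrow\; \beta = (X^TX)^{-1}X^Ty$$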


In [27]:
beta = np.dot(np.linalg.inv(np.dot(X_.T, X_)), np.dot(X_.T, y))

In [28]:
beta


Out[28]:
array([-6.75111258e-17,  8.54987593e-01])
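
Explicitly inverting $X^TX$ works here, but a least squares solver is usually more numerically stable; a sketch with np.linalg.lstsq (same X_ and y as above):

# Solve the least squares problem directly, without forming the inverse
beta_lstsq, residuals, rank, sv = np.linalg.lstsq(X_, y, rcond=None)
beta_lstsq  # should agree with the normal-equation solution above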

Estimating the regression using sklearn

Using OLS


In [29]:
# import
from sklearn.linear_model import LinearRegression

In [30]:
# Initialize
linreg = LinearRegression(fit_intercept=False)

In [31]:
# Fit
linreg.fit(X_, y)


Out[31]:
LinearRegression(copy_X=True, fit_intercept=False, n_jobs=None,
         normalize=False)

In [32]:
linreg.coef_


Out[32]:
array([-9.71656032e-17,  8.54987593e-01])

Using (Stochastic) Gradient Descent*

*Differs from batch gradient descent by updating the weights with each example, which converges faster on large datasets.
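
A minimal sketch of this idea, updating the coefficients one example at a time (the function name, learning rate, and number of epochs are illustrative; sklearn's SGDRegressor uses its own learning-rate schedule):

# Minimal stochastic gradient descent: one update per (shuffled) example
def sgd_epochs(x, y, beta, alpha=0.01, epochs=50, seed=42):
    rng = np.random.RandomState(seed)
    beta = beta.astype(float)                   # work on a copy
    for _ in range(epochs):
        for i in rng.permutation(x.shape[0]):   # shuffle each epoch
            error = np.dot(beta, x[i]) - y[i]   # residual for one example
            beta -= alpha * error * x[i]        # update using that example only
    return beta

sgd_epochs(X_, y, np.array([0., 0.]))  # roughly comparable to the batch solution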


In [33]:
# import
from sklearn.linear_model import SGDRegressor

In [34]:
# Initialize
linreg2 = SGDRegressor(fit_intercept=False, max_iter=500, tol=1e-7)

In [35]:
# Fit
linreg2.fit(X_, y)


Out[35]:
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
       eta0=0.01, fit_intercept=False, l1_ratio=0.15,
       learning_rate='invscaling', loss='squared_loss', max_iter=500,
       n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, tol=1e-07, validation_fraction=0.1,
       verbose=0, warm_start=False)

In [36]:
linreg2.coef_


Out[36]:
array([1.82493731e-04, 8.54304344e-01])

Comparing OLS and GD

Gradient descent:
  • Need to choose $\alpha$
  • Needs many iterations
  • Works well even when $k$ is large

Normal equation:
  • No need to choose $\alpha$
  • No need to iterate
  • Need to compute $(X^TX)^{-1}$, which is slow if $k$ is very large

Linear regression with multiple variables

Let's create a new feature $area^2$.


In [37]:
data['area2'] = data['area'] ** 2
data.head()


Out[37]:
area bedroom price area2
0 2104 3 399900 4426816
1 1600 3 329900 2560000
2 2400 3 369000 5760000
3 1416 2 232000 2005056
4 3000 4 539900 9000000

Notation review

  • n = n_samples = number of examples
  • k = number of features
  • y = price
  • $x^{(i)}$ = features of the $i$-th example

In [38]:
i = 2
data.loc[i, ['area', 'area2']]


Out[38]:
area        2400
area2    5760000
Name: 2, dtype: int64
  • $x_j^{(i)}$ = value of the $j$-th feature of the $i$-th example

In [39]:
i = 2
j = 2
data.loc[i, 'area2']


Out[39]:
5760000

Hypothesis:

  • Previously:

$$ h_\beta(x) = \beta_0 + \beta_1 x_1 $$

where $x_1$ = area

  • New:

$$ h_\beta(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

where $x_2$ = $area^2$

Create new matrix X and scale


In [40]:
X = data[['area', 'area2']].values
X[0:5]


Out[40]:
array([[   2104, 4426816],
       [   1600, 2560000],
       [   2400, 5760000],
       [   1416, 2005056],
       [   3000, 9000000]])

In [41]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler(with_mean=True, with_std=True)
ss.fit(X.astype(float))
X = ss.transform(X.astype(float))
ss.mean_, ss.scale_


Out[41]:
(array([2.00068085e+03, 4.62083843e+06]),
 array([7.86202619e+02, 4.05394589e+06]))

In [42]:
X[0:5]


Out[42]:
array([[ 0.13141542, -0.04786014],
       [-0.5096407 , -0.50835371],
       [ 0.5079087 ,  0.28100069],
       [-0.74367706, -0.64524355],
       [ 1.27107075,  1.08022201]])

In [43]:
X_ = np.c_[np.ones(n_samples), X]
X_[0:5]


Out[43]:
array([[ 1.        ,  0.13141542, -0.04786014],
       [ 1.        , -0.5096407 , -0.50835371],
       [ 1.        ,  0.5079087 ,  0.28100069],
       [ 1.        , -0.74367706, -0.64524355],
       [ 1.        ,  1.27107075,  1.08022201]])

Cost function

The goal becomes to estimate the parameters $\beta$ that minimize the sum of squared residuals:

$$J(\beta)=\frac{1}{2n}\sum_{i=1}^n (h_\beta(x^{(i)})-y_i)^2$$

$$h_\beta(x^{(i)}) = \sum_{j=0}^k \beta_j x_j^{(i)}$$

$$J(\beta)=\frac{1}{2n}\sum_{i=1}^n \left( \left( \sum_{j=0}^k \beta_j x_j^{(i)}\right) -y_i \right)^2$$

Note that $x_0^{(i)}$ refers to the column of ones.

Gradient descent algorithm

Repeat until convergence{

$$ \beta_j := \beta_j - \alpha \frac{\partial }{\partial \beta_j} J(\beta)$$

}

while simultaneously updating $j = 0, \dots, k$

$\alpha$ is referred to as the learning rate


In [44]:
beta_ini = np.array([0., 0., 0.])

# gradient calculation
def gradient(beta, x, y):
    return 1 / x.shape[0] * np.dot((lr_h(beta, x) - y).T, x)

gradient(beta_ini, X_, y)


Out[44]:
array([ 9.44870659e-17, -8.54987593e-01, -8.33162685e-01])

In [45]:
beta_ini = np.array([0., 0., 0.])
alpha = 0.005
iters = 100
betas = gradient_descent(X_, y, beta_ini, alpha, iters)

# Plot iteration vs J(beta)
plt.plot(range(iters), betas[:, -1])
plt.xlabel('iteration')
plt.ylabel('J(beta)')


Out[45]:
Text(0, 0.5, 'J(beta)')

Apparently the cost function has not converged yet.

Let's increase alpha and the number of iterations.


In [46]:
beta_ini = np.array([0., 0., 0.])
alpha = 0.5
iters = 1000
betas = gradient_descent(X_, y, beta_ini, alpha, iters)

# Plot iteration vs J(beta)
plt.plot(range(1,iters), betas[1:, -1])
plt.xlabel('iteration')
plt.ylabel('J(beta)')


Out[46]:
Text(0, 0.5, 'J(beta)')

In [47]:
print('betas using gradient descent\n', betas[-1, :-1])


betas using gradient descent
 [-9.21248893e-17  8.91147493e-01 -3.70307030e-02]

Using the normal equations


In [48]:
betas_ols = np.dot(np.linalg.inv(np.dot(X_.T, X_)), np.dot(X_.T, y))
betas_ols


Out[48]:
array([-8.21301897e-17,  8.91150925e-01, -3.70341353e-02])

Difference


In [49]:
betas_ols - betas[-1, :-1]


Out[49]:
array([ 9.99469961e-18,  3.43234288e-06, -3.43234289e-06])

Making predictions

Predict the price when the area is 3000

Note: remember that the matrix X is scaled, so the new observation must be scaled the same way.


In [50]:
x = np.array([3000., 3000.**2])

# scale
x_scaled = ss.transform(x.reshape(1, -1))
x_ = np.c_[1, x_scaled]
x_


Out[50]:
array([[1.        , 1.27107075, 1.08022201]])

In [51]:
y_pred = lr_h(betas_ols, x_)
y_pred


Out[51]:
array([1.09271078])

In [52]:
y_pred = y_pred * y_std + y_mean
y_pred


Out[52]:
array([475583.75451797])

Using sklearn


In [53]:
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import LinearRegression

clf1 = LinearRegression()
clf2 = SGDRegressor(max_iter=10000, tol=None)

When using sklearn there is no need to add the intercept column manually.

Also, sklearn works directly with pandas objects.


In [54]:
clf1.fit(data[['area', 'area2']], data[' price'])


Out[54]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [55]:
clf2.fit(X, y)


/home/jesus/.local/lib/python3.6/site-packages/sklearn/linear_model/stochastic_gradient.py:183: FutureWarning: max_iter and tol parameters have been added in SGDRegressor in 0.19. If max_iter is set but tol is left unset, the default value for tol in 0.19 and 0.20 will be None (which is equivalent to -infinity, so it has no effect) but will change in 0.21 to 1e-3. Specify tol to silence this warning.
  FutureWarning)
Out[55]:
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
       eta0=0.01, fit_intercept=True, l1_ratio=0.15,
       learning_rate='invscaling', loss='squared_loss', max_iter=10000,
       n_iter=None, n_iter_no_change=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, tol=None, validation_fraction=0.1,
       verbose=0, warm_start=False)

Making predictions


In [56]:
clf1.predict(x.reshape(1, -1))


Out[56]:
array([475583.75451797])

In [57]:
clf2.predict(x_scaled.reshape(1, -1)) * y_std + y_mean


Out[57]:
array([475465.40017581])

Evaluation metrics for regression problems

Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. We need evaluation metrics designed for comparing continuous values.

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

In [58]:
y_pred = clf1.predict(data[['area', 'area2']])

In [59]:
from sklearn import metrics
import numpy as np
print('MAE:', metrics.mean_absolute_error(data[' price'], y_pred))
print('MSE:', metrics.mean_squared_error(data[' price'], y_pred))
print('RMSE:', np.sqrt(metrics.mean_squared_error(data[' price'], y_pred)))


MAE: 51990.96151069319
MSE: 4115290102.059942
RMSE: 64150.52690399309
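
The same numbers can be computed directly from the formulas above (a short numpy check, reusing y_pred):

# Direct numpy computation of the three metrics
errors = data[' price'].values - y_pred
print('MAE: ', np.abs(errors).mean())
print('MSE: ', (errors ** 2).mean())
print('RMSE:', np.sqrt((errors ** 2).mean()))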

Comparing these metrics:

  • MAE is the easiest to understand, because it's the average error.
  • MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
  • RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

Comparing linear regression with other models

Advantages of linear regression:

  • Simple to explain
  • Highly interpretable
  • Model training and prediction are fast
  • No tuning is required (excluding regularization)
  • Features don't need scaling
  • Can perform well with a small number of observations
  • Well-understood

Disadvantages of linear regression:

  • Presumes a linear relationship between the features and the response
  • Performance is (generally) not competitive with the best supervised learning methods due to high bias
  • Can't automatically learn feature interactions