Linear Regression is perhaps the most straightforward supervised learning technique for regression problems. This notebook walks through the basic algorithm behind the model and builds this learning tool from scratch in Python.

1) We first define the notation:

$m$ - Number of training examples

$n$ - Number of features

$X$ - Features

$y$ - Target

$\theta$ - Parameters

$h_\theta(x)$ - Hypothesis
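
For linear regression, the hypothesis is simply a linear combination of the features, with $x_0 = 1$ acting as the intercept term:

$$h_\theta(x)=\theta_0 x_0+\theta_1 x_1+\dots+\theta_n x_n=\theta^T x$$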

2) The cost function (with an L2 regularization term) is:

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\Big(h_\theta(x^{(i)})-y^{(i)}\Big)^2+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

3) The gradient is:

$$\frac{\partial}{\partial\theta_j}J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\Big(h_\theta(x^{(i)})-y^{(i)}\Big)x_j^{(i)}+\frac{\lambda}{m}\theta_j$$

where the regularization term applies only for $j \ge 1$; the intercept $\theta_0$ is not regularized, consistent with the cost function above (whose penalty sum starts at $j=1$).

We want to minimize the cost function with respect to $\theta$. Once we have the cost function and the gradient, we can apply Gradient Descent or other more advanced optimization algorithms to learn the parameter values.
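
For reference, here is a minimal batch gradient descent sketch. It is illustrative only: the class below instead hands the cost and gradient to scipy's optimizer, and the learning rate alpha and iteration count here are arbitrary choices for demonstration.

# A minimal batch gradient descent sketch (illustrative only; not used by the class below)
import numpy as np

def gradient_descent(X, y, lamda=0.0, alpha=0.01, num_iters=1000):
    # Assumes X already includes a leading column of ones and is feature-normalized
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        error = X.dot(theta) - y                      # (m,) residuals
        grad = X.T.dot(error) / m                     # gradient of the squared-error term
        grad[1:] += float(lamda) / m * theta[1:]      # L2 penalty, skipping the intercept
        theta -= alpha * grad                         # simultaneous update of all parameters
    return theta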

The following class LinearRegression() builds the model:


In [1]:
# Linear Regression

import numpy as np
import scipy.optimize

class LinearRegression():

    def __init__(self, lamda):

        self._lamda = lamda
        self._mu = None
        self._sigma = None
        self._coef = None

    def _feature_norm(self, X):

        # Normalize features (zero mean, unit variance) to improve optimization conditioning
        mu = np.mean(X, axis=0)
        sigma = np.std(X, axis=0)
        X_norm = (X - mu) / sigma

        return X_norm, mu, sigma

    def _cost_calc(self, theta, X, y):

        # Formulate cost function
        m, n = X.shape
        y = y.reshape((m, 1))
        theta = theta.reshape((n, 1))
        J = 1.0 / (2 * m) * np.sum((X.dot(theta) - y)**2) \
            + self._lamda / (2.0 * m) * np.sum(theta[1:]**2)

        return J

    def _gradient_calc(self, theta, X, y):

        # Formulate the gradient of the cost function
        m, n = X.shape
        y = y.reshape((m, 1))
        theta = theta.reshape((n, 1))
        grad = np.zeros((n, 1))
        grad[0] = 1.0 / m * np.sum(X.dot(theta) - y)
        grad[1:] = 1.0 / m * X[:, 1:].T.dot(X.dot(theta) - y) \
                   + float(self._lamda) / m * theta[1:]

        return grad.ravel()

    def fit(self, X, y):

        # Fit the model
        m, n = X.shape
        X, self._mu, self._sigma = self._feature_norm(X)
        X = np.c_[np.ones((m, 1)), X]
        theta = np.zeros(X.shape[1])
        result = scipy.optimize.minimize(fun=self._cost_calc, x0=theta, args=(X, y),
                                         method='BFGS', jac=self._gradient_calc,
                                         options={"maxiter": 100, "disp": False})
        
        self._coef = result.x

    def predict(self, X):

        # predict with the fitted model
        m, n = X.shape
        X = np.c_[np.ones((m, 1)), (X - self._mu) / self._sigma]
        y_pred = X.dot(self._coef.reshape((n+1, 1)))

        return y_pred

Now we can use the Boston Housing data for a demo.

The sklearn.datasets module already includes this dataset, so we can load it directly from there:

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html


In [2]:
from sklearn.datasets import load_boston

boston = load_boston()

X = boston['data']
y = boston['target']

print X.shape
print y.shape


(506L, 13L)
(506L,)

We can then randomly split the data into training and test sets.


In [3]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

Finally, we fit the model on the training set using the LinearRegression() class, and then predict on the test set to check the Root Mean Squared Error (RMSE).


In [4]:
lr = LinearRegression(lamda=0)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print "RMSE is {:.4}".format(np.sqrt(np.mean((y_test.reshape((len(y_test), 1)) - y_pred) ** 2)))


RMSE is 4.557
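
As an optional sanity check, one could compare against scikit-learn's built-in Ridge regression. Note that its alpha parameter plays a role similar to lambda here, although the penalty scaling conventions differ slightly, so the fitted coefficients will not match exactly:

# Optional sanity check: compare against sklearn's Ridge (alpha plays a role similar to lambda)
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=0.0)   # alpha=0 reduces to ordinary least squares
ridge.fit(X_train, y_train)
y_pred_sk = ridge.predict(X_test)
print "sklearn RMSE is {:.4}".format(np.sqrt(np.mean((y_test - y_pred_sk) ** 2)))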

Other things to consider:

  • An L1 (Lasso) regularization term, which produces sparse coefficients (implicit feature selection)
  • Elastic Net, which combines the L1 (Lasso) and L2 (Ridge) penalties (see the sketch after this list)
  • Stochastic Gradient Descent vs. Batch Gradient Descent
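
For reference, one common form of the Elastic Net cost function simply adds both penalties to the squared-error term (the exact scaling conventions vary across references and libraries):

$$J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}\Big(h_\theta(x^{(i)})-y^{(i)}\Big)^2+\frac{\lambda_1}{m}\sum_{j=1}^{n}|\theta_j|+\frac{\lambda_2}{2m}\sum_{j=1}^{n}\theta_j^2$$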