Linear regression

Linear regression in Python can be done in several ways, from coding it yourself to using a ready-made function from a statistics module.

Here we will do both.

Coding with numpy

From Wikipedia, we see that simple linear regression can be expressed as: $$ y = \alpha + \beta x $$ where: $$ \beta = \frac{\overline{xy} -\bar x \bar y}{\overline{x^2} - \bar{x}^2}=\frac{\mathrm{Cov}[x,y]}{\mathrm{Var}[x]} $$ and $\alpha=\overline y - \beta \bar x$

We first import the basic modules and generate some data with noise.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [5]:
x = np.arange(10.)
y = 5*x+3
np.random.seed(3)
y+= np.random.normal(scale=10,size=x.size)
plt.scatter(x,y);



In [6]:
def lin_reg(x,y):
    """
    Perform a linear regression of x vs y.
    x, y are 1 dimensional numpy arrays
    returns alpha and beta for the model y = alpha + beta*x
    """
    beta = np.mean(x*y)-np.mean(x)*np.mean(y)
    beta /= np.mean(x**2)-np.mean(x)**2
    alpha = np.mean(y)-beta*np.mean(x)
    return alpha, beta

lin_reg(x,y)


Out[6]:
(7.1454545454545411, 3.6787878787878796)
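
To see how well this does, we can overlay the fitted line on the scatter plot. A quick sketch, reusing the alpha and beta we just computed:

In [ ]:
alpha, beta = lin_reg(x, y)
plt.scatter(x, y)             # the noisy data
plt.plot(x, alpha + beta*x)   # the fitted line y = alpha + beta*x
plt.xlabel('x')
plt.ylabel('y');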

We could also implement it with the numpy covariance function: np.cov(x, y) returns the covariance matrix, whose diagonal terms are the variances of x and y and whose off-diagonal term is their covariance.


In [9]:
def lin_reg2(x,y):
    """
    Perform a linear regression of x vs y. Uses covariances.
    x, y are 1 dimensional numpy arrays
    returns alpha and beta for the model y = alpha + beta*x
    """
    c = np.cov(x,y)
    beta = c[0,1]/c[0,0]
    alpha = np.mean(y)-beta*np.mean(x)
    return alpha, beta

lin_reg2(x,y)


Out[9]:
(7.1454545454545446, 3.6787878787878787)
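
Note that np.cov normalises by $N-1$ (sample covariance) by default, whereas the formulas above use plain means; the common factor cancels in the ratio c[0,1]/c[0,0], so the result is the same. A quick check of what np.cov returns:

In [ ]:
c = np.cov(x, y)
print(c)                            # [[Var[x], Cov[x,y]], [Cov[x,y], Var[y]]]
print(c[0, 0], np.var(x, ddof=1))   # the diagonal term is the sample variance of x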

Coding as a least square problem

The previous methods only work for a single explanatory variable. We can generalize by writing the fit as a least-squares problem: $$ \bf y = \bf A \boldsymbol \beta $$ Note that $\bf A$ is $\bf X$ with an extra column of ones representing the independent term, previously called $\alpha$, which now corresponds to $\beta_{N+1}$: $$ \bf A = \left[\bf X , \bf 1 \right] $$


In [10]:
def lin_reg3(x,y):
    """
    Perform a linear regression of x vs y. Uses least squares.
    x, y are 1 dimensional numpy arrays
    returns alpha and beta for the model y = alpha + beta*x
    """
    A = np.vstack([x,np.ones_like(x)])
    return np.linalg.lstsq(A.T,y)[0]

lin_reg3(x,y)


Out[10]:
array([ 3.67878788,  7.14545455])
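
The same formulation handles several explanatory variables: stack the columns of $\bf X$ together with a column of ones and solve with lstsq. A minimal sketch with a hypothetical second feature (x squared), not part of the notebook's data:

In [ ]:
# hypothetical two-variable example: columns x and x**2 plus the intercept column
X = np.column_stack([x, x**2])
A = np.column_stack([X, np.ones(len(x))])
np.linalg.lstsq(A, y)[0]   # returns [beta_1, beta_2, intercept]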

The simple ways

numpy

As usual, for a task as common as linear regression, several packages already provide implementations. In numpy, we can use polyfit, which fits a polynomial of degree $N$.


In [11]:
np.polyfit(x,y, 1)


Out[11]:
array([ 3.67878788,  7.14545455])
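
polyfit returns the coefficients with the highest degree first, which is why the slope comes before the intercept here. np.poly1d turns those coefficients into a callable polynomial, handy for evaluating or plotting the fit; a quick sketch:

In [ ]:
p = np.poly1d(np.polyfit(x, y, 1))
p(0), p(1)   # the intercept, and intercept + slope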

scipy

scipy has a statistics module whose linregress function returns the fit parameters together with the correlation coefficient:


In [12]:
import scipy.stats as stats

In [13]:
beta, alpha, r, *__ = stats.linregress(x,y)
alpha, beta, r


Out[13]:
(7.1454545454545446, 3.6787878787878792, 0.81763690029241587)
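
The r returned by linregress is the Pearson correlation coefficient, so it should agree with numpy's corrcoef; a quick check:

In [ ]:
np.corrcoef(x, y)[0, 1]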

scikit-learn

One of the most powerful packages for data analysis and machine learning is scikit-learn. Its documentation on linear models is a good reference.


In [63]:
from sklearn import linear_model

In [64]:
clf = linear_model.LinearRegression()

In [68]:
clf.fit(x[:, np.newaxis], y)


Out[68]:
LinearRegression(copy_X=True, fit_intercept=True, normalize=False)

In [73]:
clf.coef_, clf.intercept_


Out[73]:
(array([ 3.67878788]), 7.1454545454545446)
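
Fitted scikit-learn estimators expose a predict method, so the model can be evaluated on new inputs directly (note that, as for fit, the input must be a 2-D array); a quick sketch:

In [ ]:
clf.predict(np.array([[0.], [1.]]))   # predictions at x = 0 and x = 1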

Efficiency

As an exercise, test the speed of these implementations on a larger dataset.
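
One possible starting point, a sketch using IPython's %timeit on a larger random dataset (the size is just illustrative):

In [ ]:
x_big = np.random.rand(10**6)
y_big = 5*x_big + 3 + np.random.normal(scale=10, size=x_big.size)
%timeit lin_reg(x_big, y_big)
%timeit lin_reg2(x_big, y_big)
%timeit lin_reg3(x_big, y_big)
%timeit np.polyfit(x_big, y_big, 1)
%timeit stats.linregress(x_big, y_big)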


In [ ]: