Linear regression

Linear regression in Python can be done in several ways, from coding it yourself to using a function from a statistics module.

Here we will do both.

Coding with numpy

From Wikipedia, we see that simple linear regression can be expressed as: $$ y = \alpha + \beta x $$ where: $$ \beta = \frac{\overline{xy} -\bar x \bar y}{\overline{x^2} - \bar{x}^2}=\frac{\mathrm{Cov}[x,y]}{\mathrm{Var}[x]} $$ and $\alpha=\overline y - \beta \bar x$.

We first import the basic modules and generate some data with noise.


In [6]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [8]:
x = np.arange(10.)
y = 5*x+3
np.random.seed(3)
y+= np.random.normal(scale=10,size=x.size)
plt.scatter(x,y);



In [6]:
def lin_reg(x, y):
    """
    Perform a linear regression of x vs y.
    x, y are 1 dimensional numpy arrays
    returns alpha and beta for the model y = alpha + beta*x
    """
    # beta = (mean(xy) - mean(x)mean(y)) / (mean(x^2) - mean(x)^2)
    beta = (np.mean(x*y) - np.mean(x)*np.mean(y)) / (np.mean(x**2) - np.mean(x)**2)
    alpha = np.mean(y) - beta*np.mean(x)
    return alpha, beta

lin_reg(x, y)


Out[6]:
(7.1454545454545411, 3.6787878787878796)
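As a sanity check, the closed-form estimates should agree with numpy's polyfit, used later in this notebook; a minimal sketch:

```python
import numpy as np

# Regenerate the same data as above.
x = np.arange(10.)
y = 5*x + 3
np.random.seed(3)
y += np.random.normal(scale=10, size=x.size)

# Closed-form estimates from the formulas above.
beta = (np.mean(x*y) - np.mean(x)*np.mean(y)) / (np.mean(x**2) - np.mean(x)**2)
alpha = np.mean(y) - beta*np.mean(x)

# polyfit returns [slope, intercept] for a degree-1 fit.
slope, intercept = np.polyfit(x, y, 1)
print(np.allclose([alpha, beta], [intercept, slope]))  # True
```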

We could also implement it with numpy's covariance function, np.cov, which returns the full covariance matrix; its diagonal terms are the variances.


In [9]:
def lin_reg2(x, y):
    """
    Perform a linear regression of x vs y. Uses covariances.
    x, y are 1 dimensional numpy arrays
    returns alpha and beta for the model y = alpha + beta*x
    """
    c = np.cov(x, y)
    beta = c[0, 1] / c[0, 0]              # Cov[x, y] / Var[x]
    alpha = np.mean(y) - beta*np.mean(x)
    return alpha, beta

lin_reg2(x, y)


Out[9]:
(7.1454545454545446, 3.6787878787878787)
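To see the structure np.cov returns, a quick check (synthetic data with its own seed, purely for illustration):

```python
import numpy as np

rng = np.random.RandomState(0)  # illustrative seed, not the one used above
a = np.arange(10.)
b = 5*a + 3 + rng.normal(scale=10, size=a.size)

c = np.cov(a, b)   # 2x2 symmetric matrix
# Diagonal terms are the (unbiased, ddof=1) variances; off-diagonal is the covariance.
# The 1/(n-1) factors cancel in the ratio Cov/Var, so beta is unaffected by ddof.
print(np.allclose(c[0, 0], np.var(a, ddof=1)))  # True
print(np.allclose(c[0, 1], c[1, 0]))            # True
```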

Coding as a least square problem

The previous methods only work for a single variable. We can generalize by posing it as a least squares problem: $$ \bf y = \bf A \boldsymbol \beta $$ Note that $\bf A$ is $\bf X$ with an extra column of ones to represent the independent term, previously called $\alpha$, which now corresponds to $\beta_{N+1}$. $$ \bf A = \left[\bf X , \bf 1 \right] $$


In [10]:
def lin_reg3(x, y):
    """
    Perform a linear regression of x vs y. Uses least squares.
    x, y are 1 dimensional numpy arrays
    returns [beta, alpha] for the model y = alpha + beta*x
    """
    A = np.vstack([x, np.ones(x.size)]).T   # A = [X, 1]
    return np.linalg.lstsq(A, y, rcond=None)[0]

lin_reg3(x, y)


Out[10]:
array([ 3.67878788,  7.14545455])

The simple ways

numpy

As usual, for a task as common as linear regression, solutions are already implemented in several packages. In numpy, we can use polyfit, which can fit polynomials of degree $N$.


In [11]:
np.polyfit(x, y, 1)   # degree 1: returns [slope, intercept], i.e. [beta, alpha]


Out[11]:
array([ 3.67878788,  7.14545455])

scipy

scipy has a statistics module whose linregress function returns the fit parameters together with the correlation coefficient:


In [12]:
import scipy.stats as stats

In [13]:
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
slope, intercept, r_value


Out[13]:
(3.6787878787878792, 7.1454545454545446, 0.81763690029241587)

scikit-learn

The most powerful module for data analysis and machine learning is scikit-learn. There is good documentation on its linear models.


In [63]:
from sklearn import linear_model

In [64]:
regr = linear_model.LinearRegression()
regr.fit(x[:, np.newaxis], y)   # sklearn expects a 2D array of features
regr.intercept_, regr.coef_[0]

Efficiency

As an exercise, test the speed of these implementations on a larger dataset.


In [11]:
x = np.arange(10.)
y = 5*x+3
np.random.seed(3)
y += np.random.normal(scale=10, size=x.size)
# Note: scatter has no format-string argument like plot's '.'; passing '.' as
# the third (size) argument raises a TypeError. Use marker='.' instead.
plt.scatter(x, y, marker='.');
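One way to approach the exercise, a sketch comparing the hand-coded formula against np.polyfit on a much larger array (timings are machine-dependent):

```python
import time
import numpy as np

def lin_reg(x, y):
    # Closed-form fit, as derived earlier in this notebook.
    beta = (np.mean(x*y) - np.mean(x)*np.mean(y)) / (np.mean(x**2) - np.mean(x)**2)
    return np.mean(y) - beta*np.mean(x), beta

# A much larger dataset than the 10-point example above.
x = np.arange(1e6)
np.random.seed(3)
y = 5*x + 3 + np.random.normal(scale=10, size=x.size)

for name, fit in [('lin_reg', lambda: lin_reg(x, y)),
                  ('polyfit', lambda: np.polyfit(x, y, 1))]:
    t0 = time.perf_counter()
    result = fit()
    print('%-8s %.4f s ->' % (name, time.perf_counter() - t0), result)
```

With this many points the noise averages out, so both fits should recover a slope very close to 5 and an intercept close to 3.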

In [ ]: