All code was executed in Python 3.4


I am really looking forward to your comments and suggestions to improve and
extend this little collection! Just send me a quick note
via Twitter: @rasbt
or Email: bluewoodtree@gmail.com


Implementing the least squares fit method for linear regression and speeding it up via Cython



Sections



Introduction

Linear regression via the least squares method is the simplest approach to performing a regression analysis of a dependent and a explanatory variable. The objective is to find the best-fitting straight line through a set of points that minimizes the sum of the squared offsets from the line.
The offsets come in 2 different flavors: perpendicular and vertical - with respect to the line.

Here, we will use the more common approach: minimizing the sum of the perpendicular offsets.

In more mathematical terms, our goal is to compute the best fit to n points $(x_i, y_i)$ with $i=1,2,...n,$ via linear equation of the form
$f(x) = a\cdot x + b$.
Here, we assume that the y-component is functionally dependent on the x-component.
In a cartesian coordinate system, $b$ is the intercept of the straight line with the y-axis, and $a$ is the slope of this line.

In order to obtain the parameters for the linear regression line for a set of multiple points, we can re-write the problem as matrix equation
$\pmb X \; \pmb a = \pmb y$

$\Rightarrow\Bigg[ \begin{array}{cc} x_1 & 1 \\ ... & 1 \\ x_n & 1 \end{array} \Bigg]$ $\bigg[ \begin{array}{c} a \\ b \end{array} \bigg]$ $=\Bigg[ \begin{array}{c} y_1 \\ ... \\ y_n \end{array} \Bigg]$

With a little bit of calculus, we can rearrange the term in order to obtain the parameter vector $\pmb a = [a\;b]^T$

$\Rightarrow \pmb a = (\pmb X^T \; \pmb X)^{-1} \pmb X^T \; \pmb y$


The more classic approach to obtain the slope parameter $a$ and y-axis intercept $b$ would be:

$a = \frac{S_{x,y}}{\sigma_{x}^{2}}\quad$ (slope)

$b = \bar{y} - a\bar{x}\quad$ (y-axis intercept)

where

$S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\quad$ (covariance)

$\sigma{_x}^{2} = \sum_{i=1}^{n} (x_i - \bar{x})^2\quad$ (variance)



Least squares fit implementations

1. The matrix approach

First, let us implement the equation:

$\pmb a = (\pmb X^T \; \pmb X)^{-1} \pmb X^T \; \pmb y$

which I will refer to as the "matrix approach".


In [1]:
import numpy as np

def lin_lstsqr_mat(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    X = np.vstack([x, np.ones(len(x))]).T
    return (np.linalg.inv(X.T.dot(X)).dot(X.T)).dot(y)

2. The classic approach

Next, we will calculate the parameters separately, using only standard library functions in Python, which I will call the "classic approach".

$a = \frac{S_{x,y}}{\sigma_{x}^{2}}\quad$ (slope)

$b = \bar{y} - a\bar{x}\quad$ (y-axis intercept)


In [2]:
def classic_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    x_avg = sum(x)/len(x)
    y_avg = sum(y)/len(y)
    var_x = sum([(x_i - x_avg)**2 for x_i in x])
    cov_xy = sum([(x_i - x_avg)*(y_i - y_avg) for x_i,y_i in zip(x,y)])
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc)

3. Using the lstsq numpy function

For our convenience, numpy has a function that can also compute the leat squares solution of a linear matrix equation. For more information, please refer to the documentation.


In [3]:
def numpy_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    X = np.vstack([x, np.ones(len(x))]).T
    return np.linalg.lstsq(X,y)[0]

4. Using the linregress scipy function

The last approach is using scipy.stats.linregress(), which returns a tuple of 5 different attributes, where the 1st value in the tuple is the slope, and the second value is the y-axis intercept, respectively. The documentation for this function can be found here.


In [4]:
import scipy.stats

def scipy_lstsqr(x,y):
    """ Computes the least-squares solution to a linear matrix equation. """
    return scipy.stats.linregress(x, y)[0:2]



Generating sample data and benchmarking

In order to test our different least squares fit implementation, we will generate some sample data:

  • 500 sample points for the x-component within the range [0,500)
  • 500 sample points for the y-component within the range [100,600)

where each sample point is multiplied by a random value within the range [0.8, 12).


In [5]:
import random
random.seed(12345)

x = [x_i*random.randrange(8,12)/10 for x_i in range(500)]
y = [y_i*random.randrange(8,12)/10 for y_i in range(100,600)]



Visualization

To check how our dataset is distributed, and how the straight line, which we obtain via the least square fit method, we will plot it in a scatter plot.
Note that we are using our "matrix approach" here for simplicity, but after plotting the data, we will check whether all of the four different implementations yield the same parameters.


In [6]:
%pylab inline
from matplotlib import pyplot as plt

slope, intercept = lin_lstsqr_mat(x, y)

line_x = [round(min(x)) - 1, round(max(x)) + 1]
line_y = [slope*x_i + intercept for x_i in line_x]

plt.figure(figsize=(8,8))
plt.scatter(x,y)
plt.plot(line_x, line_y, color='red', lw='2')

plt.ylabel('y')
plt.xlabel('x')
plt.title('Linear regression via least squares fit')

ftext = 'y = ax + b = {:.3f} + {:.3f}x'\
        .format(slope, intercept)
plt.figtext(.15,.8, ftext, fontsize=11, ha='left')

plt.show()


Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['random']
`%matplotlib` prevents importing * from pylab and numpy



Comparing the results from the different implementations

As mentioned above, let us confirm that the different implementation computed the same parameters (i.e., slope and y-axis intercept) as solution for the linear equation.


In [7]:
import prettytable

params = [appr(x,y) for appr in [lin_lstsqr_mat, classic_lstsqr, numpy_lstsqr, scipy_lstsqr]]

print(params)

fit_table = prettytable.PrettyTable(["", "slope", "y-intercept"])
fit_table.add_row(["matrix approach", params[0][0], params[0][1]])
fit_table.add_row(["classic approach", params[1][0], params[1][1]])
fit_table.add_row(["numpy function", params[2][0], params[2][1]])
fit_table.add_row(["scipy function", params[3][0], params[3][1]])

print(fit_table)


[array([   0.95181895,  107.01399744]), (0.95181895319126741, 107.01399744459181), array([   0.95181895,  107.01399744]), (0.95181895319126764, 107.01399744459175)]
+------------------+----------------+---------------+
|                  |     slope      |  y-intercept  |
+------------------+----------------+---------------+
| matrix approach  | 0.951818953191 | 107.013997445 |
| classic approach | 0.951818953191 | 107.013997445 |
|  numpy function  | 0.951818953191 | 107.013997445 |
|  scipy function  | 0.951818953191 | 107.013997445 |
+------------------+----------------+---------------+



Initial performance comparison

For a first impression how the performances of the different least squares implementations compare against each other, let us do a quick benchmark using the timeit module via IPython's %timeit magic.


In [8]:
import timeit

for lab,appr in zip(["matrix approach","classic approach",
                     "numpy function","scipy function"],
                    [lin_lstsqr_mat, classic_lstsqr, 
                     numpy_lstsqr, scipy_lstsqr]):
    print("\n{}: ".format(lab), end="")
    %timeit appr(x, y)


matrix approach: 1000 loops, best of 3: 273 µs per loop

classic approach: 100 loops, best of 3: 2.78 ms per loop

numpy function: 1000 loops, best of 3: 379 µs per loop

scipy function: 1000 loops, best of 3: 598 µs per loop


The timing above indicates, that the "classic" approach (Python's standard library functions only) is significantly worse in performance than the other implemenations - roughly by a magnitude of 10.



Compiling the Python code via Cython in the IPython notebook

Maybe we can speed things up a little bit via Cython's C-extensions for Python. Cython is basically a hybrid between C and Python and can be pictured as compiled Python code with type declarations.
Since we are working in an IPython notebook here, we can make use of the IPython magic: It will automatically convert it to C code, compile it, and load the function.
Just to be thorough, let us also compile the other functions, which already use numpy objects.


In [9]:
%load_ext cythonmagic

In [10]:
%%cython
import numpy as np
import scipy.stats
cimport numpy as np

def cy_lin_lstsqr_mat(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    X = np.vstack([x, np.ones(len(x))]).T
    return (np.linalg.inv(X.T.dot(X)).dot(X.T)).dot(y)

def cy_classic_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    x_avg = sum(x)/len(x)
    y_avg = sum(y)/len(y)
    var_x = sum([(x_i - x_avg)**2 for x_i in x])
    cov_xy = sum([(x_i - x_avg)*(y_i - y_avg) for x_i,y_i in zip(x,y)])
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc)

def cy_numpy_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    X = np.vstack([x, np.ones(len(x))]).T
    return np.linalg.lstsq(X,y)[0]

def cy_scipy_lstsqr(x,y):
    """ Computes the least-squares solution to a linear matrix equation. """
    return scipy.stats.linregress(x, y)[0:2]



Comparing the compiled Cython code to the original Python code


In [11]:
import timeit

for lab,appr in zip(["matrix approach","classic approach",
                     "numpy function","scipy function"],
                    [(lin_lstsqr_mat, cy_lin_lstsqr_mat), 
                     (classic_lstsqr, cy_classic_lstsqr),
                     (numpy_lstsqr, cy_numpy_lstsqr),
                     (scipy_lstsqr, cy_scipy_lstsqr)]):
    print("\n\n{}: ".format(lab))
    %timeit appr[0](x, y)
    %timeit appr[1](x, y)



matrix approach: 
1000 loops, best of 3: 276 µs per loop
1000 loops, best of 3: 266 µs per loop


classic approach: 
100 loops, best of 3: 2.77 ms per loop
1000 loops, best of 3: 210 µs per loop


numpy function: 
1000 loops, best of 3: 536 µs per loop
1000 loops, best of 3: 372 µs per loop


scipy function: 
1000 loops, best of 3: 613 µs per loop
1000 loops, best of 3: 596 µs per loop



As we've seen before, our "classic" implementation of the least square method is pretty slow compared to using numpy's functions. This is not surprising, since numpy is highly optmized and uses compiled C/C++ and Fortran code already. This explains why there is no significant difference if we used Cython to compile the numpy-objects-containing functions.
However, we were able to speed up the "classic approach" quite significantly, roughly by 1500%.

The following 2 code blocks are just to visualize our results in a bar plot.


In [18]:
import timeit

funcs =  ['classic_lstsqr', 'cy_classic_lstsqr', 
          'lin_lstsqr_mat', 'numpy_lstsqr', 'scipy_lstsqr']
labels = ['classic approach','classic approach (cython)', 
          'matrix approach', 'numpy function', 'scipy function']

times = [timeit.Timer('%s(x,y)' %f, 
                      'from __main__ import %s, x, y' %f).timeit(1000)
         for f in funcs]

times_rel = [times[0]/times[i+1] for i in range(len(times[1:]))]

In [20]:
%pylab inline
import matplotlib.pyplot as plt

x_pos = np.arange(len(funcs))
plt.bar(x_pos, times, align='center', alpha=0.5)
plt.xticks(x_pos, labels, rotation=45)
plt.ylabel('time in ms')
plt.title('Performance of different least square fit implementations')
plt.show()

x_pos = np.arange(len(funcs[1:]))
plt.bar(x_pos, times_rel, align='center', alpha=0.5, color="green")
plt.xticks(x_pos, labels[1:], rotation=45)
plt.ylabel('relative performance gain')
plt.title('Performance gain compared to the classic least square implementation')
plt.show()


Populating the interactive namespace from numpy and matplotlib



Bonus: How to use Cython without the IPython magic

IPython's notebook is really great for explanatory analysis and documentation, but what if we want to compile our Python code via Cython without letting IPython's magic doing all the work?
These are the steps you would need.



1. Creating a .pyx file containing the the desired code or function.


In [76]:
%%file ccy_classic_lstsqr.pyx

def ccy_classic_lstsqr(x, y):
    """ Computes the least-squares solution to a linear matrix equation. """
    x_avg = sum(x)/len(x)
    y_avg = sum(y)/len(y)
    var_x = sum([(x_i - x_avg)**2 for x_i in x])
    cov_xy = sum([(x_i - x_avg)*(y_i - y_avg) for x_i,y_i in zip(x,y)])
    slope = cov_xy / var_x
    y_interc = y_avg - slope*x_avg
    return (slope, y_interc)


Overwriting ccy_classic_lstsqr.pyx



2. Creating a simple setup file


In [77]:
%%file setup.py

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("ccy_classic_lstsqr", ["ccy_classic_lstsqr.pyx"])]
)


Overwriting setup.py



3. Building and Compiling


In [78]:
!python3 setup.py build_ext --inplace


running build_ext
cythoning ccy_classic_lstsqr.pyx to ccy_classic_lstsqr.c
building 'ccy_classic_lstsqr' extension
/usr/bin/clang -fno-strict-aliasing -Werror=declaration-after-statement -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -arch i386 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.4/include/python3.4m -c ccy_classic_lstsqr.c -o build/temp.macosx-10.6-intel-3.4/ccy_classic_lstsqr.o
ccy_classic_lstsqr.c:2040:28: warning: unused function '__Pyx_PyObject_AsString'
      [-Wunused-function]
static CYTHON_INLINE char* __Pyx_PyObject_AsString(PyObject* o) {
                           ^
ccy_classic_lstsqr.c:2037:32: warning: unused function
      '__Pyx_PyUnicode_FromString' [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyUnicode_FromString(char* c_str) {
                               ^
ccy_classic_lstsqr.c:2104:26: warning: unused function '__Pyx_PyObject_IsTrue'
      [-Wunused-function]
static CYTHON_INLINE int __Pyx_PyObject_IsTrue(PyObject* x) {
                         ^
ccy_classic_lstsqr.c:2159:33: warning: unused function '__Pyx_PyIndex_AsSsize_t'
      [-Wunused-function]
static CYTHON_INLINE Py_ssize_t __Pyx_PyIndex_AsSsize_t(PyObject* b) {
                                ^
ccy_classic_lstsqr.c:2188:33: warning: unused function '__Pyx_PyInt_FromSize_t'
      [-Wunused-function]
static CYTHON_INLINE PyObject * __Pyx_PyInt_FromSize_t(size_t ival) {
                                ^
ccy_classic_lstsqr.c:1584:32: warning: unused function '__Pyx_PyInt_From_long'
      [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_long(long value) {
                               ^
ccy_classic_lstsqr.c:1631:27: warning: function '__Pyx_PyInt_As_long' is not
      needed and will not be emitted [-Wunneeded-internal-declaration]
static CYTHON_INLINE long __Pyx_PyInt_As_long(PyObject *x) {
                          ^
ccy_classic_lstsqr.c:1731:26: warning: function '__Pyx_PyInt_As_int' is not
      needed and will not be emitted [-Wunneeded-internal-declaration]
static CYTHON_INLINE int __Pyx_PyInt_As_int(PyObject *x) {
                         ^
8 warnings generated.
ccy_classic_lstsqr.c:2040:28: warning: unused function '__Pyx_PyObject_AsString'
      [-Wunused-function]
static CYTHON_INLINE char* __Pyx_PyObject_AsString(PyObject* o) {
                           ^
ccy_classic_lstsqr.c:2037:32: warning: unused function
      '__Pyx_PyUnicode_FromString' [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyUnicode_FromString(char* c_str) {
                               ^
ccy_classic_lstsqr.c:2104:26: warning: unused function '__Pyx_PyObject_IsTrue'
      [-Wunused-function]
static CYTHON_INLINE int __Pyx_PyObject_IsTrue(PyObject* x) {
                         ^
ccy_classic_lstsqr.c:2159:33: warning: unused function '__Pyx_PyIndex_AsSsize_t'
      [-Wunused-function]
static CYTHON_INLINE Py_ssize_t __Pyx_PyIndex_AsSsize_t(PyObject* b) {
                                ^
ccy_classic_lstsqr.c:2188:33: warning: unused function '__Pyx_PyInt_FromSize_t'
      [-Wunused-function]
static CYTHON_INLINE PyObject * __Pyx_PyInt_FromSize_t(size_t ival) {
                                ^
ccy_classic_lstsqr.c:1584:32: warning: unused function '__Pyx_PyInt_From_long'
      [-Wunused-function]
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_long(long value) {
                               ^
ccy_classic_lstsqr.c:1631:27: warning: function '__Pyx_PyInt_As_long' is not
      needed and will not be emitted [-Wunneeded-internal-declaration]
static CYTHON_INLINE long __Pyx_PyInt_As_long(PyObject *x) {
                          ^
ccy_classic_lstsqr.c:1731:26: warning: function '__Pyx_PyInt_As_int' is not
      needed and will not be emitted [-Wunneeded-internal-declaration]
static CYTHON_INLINE int __Pyx_PyInt_As_int(PyObject *x) {
                         ^
8 warnings generated.
/usr/bin/clang -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -g build/temp.macosx-10.6-intel-3.4/ccy_classic_lstsqr.o -o /Users/sebastian/Github/pattern_classification/data_fitting/regression/ccy_classic_lstsqr.so

4. Importing and running the code


In [72]:
import ccy_classic_lstsqr

%timeit classic_lstsqr(x, y)
%timeit cy_classic_lstsqr(x, y)
%timeit ccy_classic_lstsqr.c_classic_lstsqr(x, y)


100 loops, best of 3: 2.86 ms per loop
1000 loops, best of 3: 218 µs per loop
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-72-2ce35504d680> in <module>()
      3 get_ipython().magic('timeit classic_lstsqr(x, y)')
      4 get_ipython().magic('timeit cy_classic_lstsqr(x, y)')
----> 5 get_ipython().magic('timeit ccy_classic_lstsqr.ccy_classic_lstsqr(x, y)')

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/IPython/core/interactiveshell.py in magic(self, arg_s)
   2203         magic_name, _, magic_arg_s = arg_s.partition(' ')
   2204         magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2205         return self.run_line_magic(magic_name, magic_arg_s)
   2206 
   2207     #-------------------------------------------------------------------------

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/IPython/core/interactiveshell.py in run_line_magic(self, magic_name, line)
   2124                 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
   2125             with self.builtin_trap:
-> 2126                 result = fn(*args,**kwargs)
   2127             return result
   2128 

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/IPython/core/magics/execution.py in timeit(self, line, cell)

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    191     # but it's overkill for just that one bit of state.
    192     def magic_deco(arg):
--> 193         call = lambda f, *a, **k: f(*a, **k)
    194 
    195         if callable(arg):

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/site-packages/IPython/core/magics/execution.py in timeit(self, line, cell)
   1011             number = 1
   1012             for _ in range(1, 10):
-> 1013                 if timer.timeit(number) >= 0.2:
   1014                     break
   1015                 number *= 10

/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/timeit.py in timeit(self, number)
    176         gc.disable()
    177         try:
--> 178             timing = self.inner(it, self.timer)
    179         finally:
    180             if gcold:

<magic-timeit> in inner(_it, _timer)

AttributeError: 'module' object has no attribute 'ccy_classic_lstsqr'

In [68]:
ccy_classic_lstsqr


Out[68]:
<module 'ccy_classic_lstsqr' (namespace)>

In [73]:
ccy_classic_lstsqr


Out[73]:
<module 'ccy_classic_lstsqr' (namespace)>

In [79]:
from ccy_classic_lstsqr import c_classic_lstsqr


---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-79-b4b6ef83886a> in <module>()
----> 1 from ccy_classic_lstsqr import c_classic_lstsqr

ImportError: cannot import name 'c_classic_lstsqr'

In [ ]: