PyBroom Example - Multiple Datasets - Scipy Robust Fit

This notebook is part of pybroom.

This notebook demonstrates using pybroom when fitting a set of curves (curve fitting) with robust fitting and scipy. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets. See pybroom-example-multi-datasets for an example using lmfit.Model instead of scipy directly.


In [ ]:
%matplotlib inline
%config InlineBackend.figure_format='retina'  # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm  # matplotlib.pylab.normpdf has been removed; scipy.stats.norm.pdf is used instead
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)

In [ ]:
sns.set_style('whitegrid')

In [ ]:
import pybroom as br

Create Noisy Data

We start by simulating N datasets which are identical except for the additive noise.


In [ ]:
N = 200

In [ ]:
x = np.linspace(-10, 10, 101)

In [ ]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2

In [ ]:
#params = model.make_params(p1_amplitude=1.5, p2_amplitude=1, 
#                           p1_sigma=1, p2_sigma=1)

In [ ]:
Y_data = np.zeros((N, x.size))
Y_data.shape, x.shape

In [ ]:
for i in range(Y_data.shape[0]):
    Y_data[i] = model.eval(x=x, p1_center=-1, p2_center=2, 
                           p1_sigma=0.5, p2_sigma=1, 
                           p1_height=1, p2_height=0.5)
Y_data += np.random.randn(*Y_data.shape)/10

Add some outliers:


In [ ]:
num_outliers = int(Y_data.size * 0.05)
idx_ol = np.random.randint(low=0, high=Y_data.size, size=num_outliers)
Y_data.reshape(-1)[idx_ol] = (np.random.rand(num_outliers) - 0.5)*4

In [ ]:
plt.plot(x, Y_data.T, 'ok', alpha=0.1);
plt.title('%d simulated datasets, with outliers' % N);

Model Fitting

curve_fit()


In [ ]:
import scipy.optimize as so

In [ ]:
from collections import namedtuple

In [ ]:
# Model PDF to be fitted to each dataset
def model_pdf(x, a1, a2, mu1, mu2, sig1, sig2):
    return (a1 * norm.pdf(x, mu1, sig1) +
            a2 * norm.pdf(x, mu2, sig2))

In [ ]:
result = so.curve_fit(model_pdf, x, Y_data[0])

In [ ]:
type(result), type(result[0]), type(result[1])

In [ ]:
result[0]

Using a namedtuple is a clean way to assign names to an array of parameters:


In [ ]:
Params = namedtuple('Params', 'a1 a2 mu1 mu2 sig1 sig2')

In [ ]:
p = Params(*result[0])
p
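
As a quick sanity check (a small addition, not part of the original notebook), we can evaluate the model with the fitted parameters and overlay it on the first dataset:

In [ ]:
plt.plot(x, Y_data[0], 'ok', alpha=0.3, label='data')
plt.plot(x, model_pdf(x, *result[0]), 'r-', lw=2, label='curve_fit')
plt.legend();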

Unfortunately, curve_fit does not return much data: just a 2-element tuple containing:

  • the array of best-fit parameters
  • the estimated covariance matrix of the parameters

Therefore curve_fit is not very useful for a detailed comparison of fit results. A better interface for curve fitting would be lmfit.Model (see this other notebook).
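
That said, the covariance matrix does allow a quick estimate of the parameter uncertainties. A minimal sketch (assuming the fit above converged):

In [ ]:
popt, pcov = result
perr = np.sqrt(np.diag(pcov))  # 1-sigma uncertainties of the best-fit parameters
Params(*perr)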

In the current notebook we keep exploring further options offered by scipy.optimize.

least_squares()

As an example, we use the least_squares function, which supports robust loss functions and bound constraints.

We need to define the residuals:


In [ ]:
def residuals(p, x, y):
    return y - model_pdf(x, *p)

Then, we fit the N datasets with different loss functions, storing the results in a dict of lists:


In [ ]:
losses = ('linear', 'huber', 'cauchy')
Results = {}
for loss in losses:
    Results[loss] = [so.least_squares(residuals, (1,1,0,1,1,1), args=(x, y), loss=loss, f_scale=0.5)
                     for y in Y_data]

**NOTE**: For more details on robust fitting and on the different loss functions see [Robust nonlinear regression in scipy](http://scipy-cookbook.readthedocs.io/items/robust_regression.html).
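
As a rough illustration (following the loss definitions in the scipy.optimize.least_squares documentation), the three losses damp large squared residuals z quite differently:

In [ ]:
z = np.linspace(0, 25, 200)                                   # z = squared residual
plt.plot(z, z, label='linear')
plt.plot(z, np.where(z <= 1, z, 2*np.sqrt(z) - 1), label='huber')
plt.plot(z, np.log1p(z), label='cauchy')
plt.xlabel('z'); plt.ylabel('rho(z)'); plt.legend();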

In [ ]:
# result = Results['cauchy'][0]
# for k in result.keys():
#     print(k, type(result[k]))

Tidying the results

Now we tidy the results, combining the fits for the different loss functions into a single DataFrame (one DataFrame per pybroom function).

We start with the glance function, which returns one row per fit result:


In [ ]:
dg_tot = br.glance(Results, var_names=['loss', 'dataset'])
dg_tot.head()

In [ ]:
dg_tot.success.all()

Then we apply tidy, which returns one row per parameter.

Since the OptimizeResult object returned by scipy.optimize only contains an array of parameter values, we need to pass the parameter names as an additional argument:


In [ ]:
pnames = 'a1 a2 mu1 mu2 sig1 sig2'
dt_tot = br.tidy(Results, var_names=['loss', 'dataset'], param_names=pnames)
dt_tot.head()
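
For instance (a small illustrative query, not in the original notebook), the tidy DataFrame makes it easy to pull out a single parameter across all datasets for one loss function:

In [ ]:
dt_tot.query('loss == "cauchy" and name == "mu1"')['value'].describe()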

Finally, we cannot apply the augment function, since the OptimizeResult object does not include per-data-point information beyond, at most, the array of residuals.
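
As a rough workaround (a hypothetical snippet, not part of pybroom), per-point quantities can be reconstructed by hand from the residuals that least_squares stores in the fun attribute of each OptimizeResult:

In [ ]:
res = Results['cauchy'][0]                     # one OptimizeResult
pd.DataFrame({'x': x,
              'data': Y_data[0],
              'residual': res.fun,              # residuals at the solution
              'fitted': Y_data[0] - res.fun}    # model evaluated at the best-fit parameters
             ).head()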

Plots

First we plot the distributions of the peak positions and sigmas:


In [ ]:
kws = dict(bins=np.arange(-2, 4, 0.1), histtype='step', lw=2)
for loss in losses:
    dt_tot.query('(name == "mu1" or name == "mu2") and loss == "%s"' % loss)['value'].hist(label=loss, **kws)
    kws['ax'] = plt.gca()
plt.title('Distribution of peak centers')
plt.legend();

In [ ]:
kws = dict(bins=np.arange(0, 4, 0.1), histtype='step', lw=2)
for loss in losses:
    dt_tot.query('(name == "sig1" or name == "sig2") and loss == "%s"' % loss)['value'].hist(label=loss, **kws)
    kws['ax'] = plt.gca()
plt.title('Distribution of peak sigmas')
plt.legend();

A more complete overview of all the fit parameters can be obtained with a factorplot (renamed catplot in newer seaborn versions):


In [ ]:
sns.factorplot(x='loss', y='value', data=dt_tot, col='name', hue='loss',
               col_wrap=4, kind='box', sharey=False);

From all the previous plots we see that, as expected, robust fitting with stronger damping of outliers (i.e. cauchy vs. huber or linear) gives more accurate fit results.

Finally, we can have a peek at the comparison of raw data and fitted models for a few datasets.

Since OptimizeResult does not include "augmented" data, we need to generate it by evaluating the model with the best-fit parameters. We use seaborn's FacetGrid, passing a custom function _plot for model evaluation:


In [ ]:
def _plot(names, values, x, label=None, color=None):
    df = pd.concat([names, values], axis=1)
    kw_pars = br.tidy_to_dict(df)
    y = model_pdf(x, **kw_pars)
    plt.plot(x, y, lw=2, color=color, label=label)

In [ ]:
grid = sns.FacetGrid(dt_tot.query('dataset < 9'), col='dataset', hue='loss', col_wrap=3)
grid.map(_plot, 'name', 'value', x=x)
grid.add_legend()
for i, ax in enumerate(grid.axes):
    ax.plot(x, Y_data[i], 'o', ms=3, color='k')
plt.ylim(-1, 1.5)

For comparison, the ModelResult object returned by lmfit contains not only the evaluated model but also the evaluation of its individual components (each single peak in this case). Therefore, the above plot can be generated more straightforwardly using the "augmented" data. See the notebook pybroom-example-multi-datasets for an example.