This notebook is part of pybroom.
This notebook demonstrates using pybroom when fitting a set of curves (curve fitting) with robust fitting and scipy. We will show that pybroom greatly simplifies comparing, filtering and plotting fit results from multiple datasets. See pybroom-example-multi-datasets for an example using lmfit.Model instead of scipy directly.
In [ ]:
%matplotlib inline
%config InlineBackend.figure_format='retina' # for hi-dpi displays
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm  # norm.pdf replaces matplotlib.pylab.normpdf, removed in recent matplotlib
import seaborn as sns
from lmfit import Model
import lmfit
print('lmfit: %s' % lmfit.__version__)
In [ ]:
sns.set_style('whitegrid')
In [ ]:
import pybroom as br
In [ ]:
N = 200
In [ ]:
x = np.linspace(-10, 10, 101)
In [ ]:
peak1 = lmfit.models.GaussianModel(prefix='p1_')
peak2 = lmfit.models.GaussianModel(prefix='p2_')
model = peak1 + peak2
In [ ]:
#params = model.make_params(p1_amplitude=1.5, p2_amplitude=1,
# p1_sigma=1, p2_sigma=1)
In [ ]:
Y_data = np.zeros((N, x.size))
Y_data.shape, x.shape
In [ ]:
for i in range(Y_data.shape[0]):
    Y_data[i] = model.eval(x=x, p1_center=-1, p2_center=2,
                           p1_sigma=0.5, p2_sigma=1,
                           p1_height=1, p2_height=0.5)
Y_data += np.random.randn(*Y_data.shape)/10
Add some outliers:
In [ ]:
num_outliers = int(Y_data.size * 0.05)
idx_ol = np.random.randint(low=0, high=Y_data.size, size=num_outliers)
Y_data.reshape(-1)[idx_ol] = (np.random.rand(num_outliers) - 0.5)*4
In [ ]:
plt.plot(x, Y_data.T, 'ok', alpha=0.1);
plt.title('%d simulated datasets, with outliers' % N);
In [ ]:
import scipy.optimize as so
In [ ]:
from collections import namedtuple
In [ ]:
# Model PDF: sum of two Gaussian peaks (fitted by least squares below)
def model_pdf(x, a1, a2, mu1, mu2, sig1, sig2):
    return (a1 * norm.pdf(x, mu1, sig1) +
            a2 * norm.pdf(x, mu2, sig2))
In [ ]:
result = so.curve_fit(model_pdf, x, Y_data[0])
In [ ]:
type(result), type(result[0]), type(result[1])
In [ ]:
result[0]
Using a namedtuple is a clean way to assign names to an array of parameters:
In [ ]:
Params = namedtuple('Params', 'a1 a2 mu1 mu2 sig1 sig2')
In [ ]:
p = Params(*result[0])
p
Unfortunately, not much data is returned by curve_fit: just a 2-element tuple containing the array of best-fit parameters (popt) and the estimated covariance matrix of those parameters (pcov). Therefore curve_fit is not very useful for detailed comparison of fit results.
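For completeness, here is a minimal sketch (not part of the original notebook) of how those two arrays are typically used: the 1-sigma parameter uncertainties are the square roots of the diagonal of the covariance matrix.
In [ ]:
popt, pcov = result
perr = np.sqrt(np.diag(pcov))   # 1-sigma standard errors of the parameters
Params(*popt), Params(*perr)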
A better interface for curve fitting would be lmfit.Model (see this other notebook). In the current notebook we keep exploring further options offered by scipy.optimize. As an example, we use the least_squares function, which supports robust loss functions and bound constraints.
We need to define the residuals:
In [ ]:
def residuals(p, x, y):
    return y - model_pdf(x, *p)
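As a quick illustration (a sketch, not part of the original analysis), a single robust fit of the first dataset shows the main attributes of the OptimizeResult returned by least_squares:
In [ ]:
res = so.least_squares(residuals, (1, 1, 0, 1, 1, 1), args=(x, Y_data[0]),
                       loss='cauchy', f_scale=0.5)
res.x, res.cost, res.success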
Then, we fit the N datasets with different loss functions, storing the results in a dict of lists:
In [ ]:
losses = ('linear', 'huber', 'cauchy')
Results = {}
for loss in losses:
    Results[loss] = [so.least_squares(residuals, (1, 1, 0, 1, 1, 1), args=(x, y), loss=loss, f_scale=0.5)
                     for y in Y_data]
In [ ]:
# result = Results['cauchy'][0]
# for k in result.keys():
# print(k, type(result[k]))
In [ ]:
dg_tot = br.glance(Results, var_names=['loss', 'dataset'])
dg_tot.head()
In [ ]:
dg_tot.success.all()
Then we apply tidy, which returns one row per parameter. Since the OptimizeResult object returned by scipy.optimize only contains an array of parameter values (with no names), we need to pass the names as an additional argument:
In [ ]:
pnames = 'a1 a2 mu1 mu2 sig1 sig2'
dt_tot = br.tidy(Results, var_names=['loss', 'dataset'], param_names=pnames)
dt_tot.head()
Finally, we cannot apply the augment function, since the OptimizeResult object does not include much per-data-point information (it only contains the array of residuals).
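If per-datapoint quantities are needed, they can be assembled by hand. Below is a minimal sketch (not part of the original analysis) building an augmented-like table for one fit, using the residual array stored in the fun attribute and the best-fit parameters stored in x:
In [ ]:
res = Results['cauchy'][0]
pd.DataFrame({'x': x,
              'data': Y_data[0],
              'best_fit': model_pdf(x, *res.x),
              'residual': res.fun}).head()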
First we plot the distributions of the fitted peak positions and sigmas:
In [ ]:
kws = dict(bins=np.arange(-2, 4, 0.1), histtype='step', lw=2)
for loss in losses:
    dt_tot.query('(name == "mu1" or name == "mu2") and loss == "%s"' % loss)['value'].hist(label=loss, **kws)
    kws['ax'] = plt.gca()
plt.title('Distribution of peak centers')
plt.legend();
In [ ]:
kws = dict(bins=np.arange(0, 4, 0.1), histtype='step', lw=2)
for loss in losses:
    dt_tot.query('(name == "sig1" or name == "sig2") and loss == "%s"' % loss)['value'].hist(label=loss, **kws)
    kws['ax'] = plt.gca()
plt.title('Distribution of peak sigmas')
plt.legend();
A more complete overview of all the fit parameters can be obtained with a seaborn categorical plot (catplot):
In [ ]:
sns.catplot(x='loss', y='value', data=dt_tot, col='name', hue='loss',
            col_wrap=4, kind='box', sharey=False);
From all the previous plots we see that, as expected, robust fitting with stronger suppression of outliers (i.e. the cauchy loss vs the huber or linear losses) gives more accurate fit results.
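As a quick numerical check (a sketch, not part of the original notebook), we can compare the spread of the fitted center of the first peak across datasets for each loss function; a smaller standard deviation indicates better robustness to the outliers:
In [ ]:
# Std. dev. of the fitted mu1 across the N datasets, per loss function
dt_tot.query('name == "mu1"').groupby('loss')['value'].std()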
Finally, we can take a peek at the comparison of raw data and fitted models for a few datasets. Since OptimizeResult does not include "augmented" data, we need to generate it by evaluating the model with the best-fit parameters. We use seaborn's FacetGrid, passing a custom function _plot for the model evaluation:
In [ ]:
def _plot(names, values, x, label=None, color=None):
    df = pd.concat([names, values], axis=1)
    kw_pars = br.tidy_to_dict(df)
    y = model_pdf(x, **kw_pars)
    plt.plot(x, y, lw=2, color=color, label=label)
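For reference, br.tidy_to_dict (used inside _plot above) converts the name/value columns of a tidy table into a plain dict that can be expanded into model_pdf. A small illustration for a single fit (df_one is just a throwaway variable):
In [ ]:
df_one = dt_tot.query('loss == "cauchy" and dataset == 0')
br.tidy_to_dict(df_one[['name', 'value']])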
In [ ]:
grid = sns.FacetGrid(dt_tot.query('dataset < 9'), col='dataset', hue='loss', col_wrap=3)
grid.map(_plot, 'name', 'value', x=x)
grid.add_legend()
for i, ax in enumerate(grid.axes):
    ax.plot(x, Y_data[i], 'o', ms=3, color='k')
plt.ylim(-1, 1.5)
For comparison, the ModelResult object returned by lmfit contains not only the evaluated model but also the evaluation of the single components (each single peak in this case). Therefore the above plot can be generated more straightforwardly using the "augmented" data. See the notebook pybroom-example-multi-datasets for an example.
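For reference, a minimal lmfit sketch (assuming the two-Gaussian model defined at the top of this notebook; see the linked notebook for the full workflow) showing how ModelResult exposes the individual components:
In [ ]:
params = model.make_params(p1_amplitude=1, p2_amplitude=1,
                           p1_center=-1, p2_center=2,
                           p1_sigma=0.5, p2_sigma=1)
result_lm = model.fit(Y_data[0], params, x=x)
components = result_lm.eval_components(x=x)  # dict keyed by prefix: 'p1_', 'p2_'
list(components.keys())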