Comparing experimental data to theory is a fundamental skill for any experimentalist. In this tutorial we will introduce fitting methods in Python, simulating some 'experimental data' using random numbers.
We will be using the curve_fit function, which is part of scipy and based on the Levenberg-Marquardt method of least-squares minimisation (see Ch. 7 of Measurements and their uncertainties, Hughes and Hase (2010), OUP).
Let's generate the data - imagine we have some data that we think follows a power law, and we want to know exactly what the power is.
In [388]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display
# x-data is evenly spaced - this might represent the time axis
# on an oscilloscope trace, for example
x = np.linspace(2,80,100)
# y-data roughly follows a quadratic dependence
# but with some extra noise added
a = 1
b = -0.5
c = 2000
y = a*x**2 + b*x + c + np.random.randn(len(x))*450
#y-error bars also have some random element
y_er = 450 + np.random.uniform(-50,50,len(x))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x, y, yerr=y_er,
            # options for colors of errorbars and point markers...
            color='k', marker='s', mfc='w', mec='k', mew=1.25, linestyle='None')
plt.draw()
Now let's try and fit that data. We first need an analytic function to fit to, so let's define a quadratic function:
In [389]:
def quaddy(x, a, b, c):
    return a*x**2 + b*x + c
Now we use curve_fit to fit the function. In most cases only a few arguments need to be passed, but there are a lot of optional keyword arguments that may help in certain cases - check out the official documentation here for more information.
For now though, we just need to pass the x-data, y-data and y-error bars, along with the function we want it to fit to. For more involved functions it's a good idea to pass an initial guess at the correct parameters, but in this simple case we should be fine without.
curve_fit returns an array of the optimised parameters, and the covariance matrix, which can be used to find the errors on the fit parameters (which we will do) and information about how correlated the parameters are (which we won't go into this time).
In [390]:
from scipy.optimize import curve_fit
p_opt, p_cov = curve_fit(quaddy, x, y,
                         sigma=y_er,           # error bars array
                         absolute_sigma=True)
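If a fit struggles to converge, an initial guess for the parameters can also be supplied via the p0 keyword argument. As a minimal sketch (the guess values below are purely illustrative, taken from the parameters we used to generate the data - they are not needed in this simple case):
# optional: supply a starting guess for (a, b, c) via p0
p_opt, p_cov = curve_fit(quaddy, x, y,
                         p0=[1, -0.5, 2000],   # illustrative starting guess
                         sigma=y_er,
                         absolute_sigma=True)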
Important Note!! The absolute_sigma option is relatively new (May 2014) and requires scipy version 0.14.0 or later, so you may have to update your installation to take advantage of it. You can check which version of scipy you are using by running an interactive Python session (type python and press enter in a command line / terminal window), then import scipy and look at scipy.__version__ (note the two underscores on either side of version).
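Alternatively, a quick check from within a script or notebook (just a two-line sketch):
import scipy
print(scipy.__version__)  # should be 0.14.0 or later for absolute_sigma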
Setting the absolute_sigma option to True means that the error bars are treated as absolute, one-standard-deviation uncertainties on the data. Otherwise, curve_fit uses the sigma option only as a relative weighting, i.e. if all error bars are the same size then the result is the same as not having error bars at all. Needless to say, you should use absolute_sigma=True in most cases!!
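To see what difference this makes, here is a minimal sketch (the p_abs/p_rel names are just for this comparison, and we peek ahead and use the diagonal of the covariance matrix, explained below, to extract the parameter errors). The best-fit values come out essentially identical, but the quoted uncertainties differ:
# sigma treated as absolute one-standard-deviation errors on the data
p_abs, cov_abs = curve_fit(quaddy, x, y, sigma=y_er, absolute_sigma=True)
# sigma treated only as relative weights (the default behaviour)
p_rel, cov_rel = curve_fit(quaddy, x, y, sigma=y_er, absolute_sigma=False)
print('errors with absolute_sigma=True: ', np.sqrt(cov_abs.diagonal()))
print('errors with absolute_sigma=False:', np.sqrt(cov_rel.diagonal()))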
The errors on the fit parameters are given by the square root of the diagonal elements of the covariance matrix (Hughes and Hase, Ch. 7). So we quote results as p_opt[i] +/- np.sqrt(p_cov[i][i]), or we can use the .diagonal() method of numpy 2D arrays to make things a bit easier:
In [391]:
p_errs = np.sqrt(p_cov.diagonal())
for paramlabel, p, err in zip(['a', 'b', 'c'], p_opt, p_errs):
    print(paramlabel + ':', p, '+/-', err)
Let's add the best-fit line to the plot. We make use of the shorthand for calling functions where the arguments are unpacked from a list using *:
In [392]:
x_fit = np.arange(0,x.max(),0.1)
y_fit = quaddy(x_fit, *p_opt) # shorthand for passing arguments using *<list>
ax.plot(x_fit,y_fit,'b--',lw=3)
display(fig)
Not too bad, but we should also look at the residuals, i.e. the difference between the data and the fit line. Structure in the residuals can hint that the fit is bad, even when the reduced chi-squared value looks reasonable (close to 1).
To highlight this, let's look at what happens when we try to fit a curve of the form $\sqrt{a^2+bx^2}$ to the data:
In [393]:
def lin(x, a, b):
    # asymptotically linear in x for large x, hence the name
    return np.sqrt(a**2 + b*x**2)

p_opt_lin, p_cov_lin = curve_fit(lin, x, y, sigma=y_er, absolute_sigma=True)
y_fit_lin = lin(x_fit,*p_opt_lin)
ax.plot(x_fit,y_fit_lin,'r--',lw=2.5)
display(fig)
# calculate reduced chi-squared for both the quadratic and 'linear' fits:
chisq_quad = (y - quaddy(x, *p_opt))**2 / y_er**2     # as array
chisq_quad_reduced = chisq_quad.sum()/(len(chisq_quad) - 3)   # 3 fit parameters
print('Reduced chi-squared (quadratic fit):', chisq_quad_reduced)
chisq_lin = (y - lin(x, *p_opt_lin))**2 / y_er**2     # as array
chisq_lin_reduced = chisq_lin.sum()/(len(chisq_lin) - 2)      # 2 fit parameters
print('Reduced chi-squared (linear fit):', chisq_lin_reduced)
Both reduced chi-squared values are reasonable in this case (the quadratic is slightly better), so we turn to the (normalised) residuals to try and identify which function is more representative of the data:
In [394]:
fig = plt.figure()
axQ = fig.add_subplot(211)
axL = fig.add_subplot(212)
# horizontal lines marking 0, +/- 2
for pos in [-2, 0, 2]:
    axQ.axhline(pos, linestyle='dashed', color='k', lw=1.25)
    axL.axhline(pos, linestyle='dashed', color='k', lw=1.25)
# normalised residuals
axQ.plot(x,(y-quaddy(x,*p_opt))/y_er,'ro')
axL.plot(x,(y-lin(x,*p_opt_lin))/y_er,'bo')
axL.set_ylabel('Norm. Res. (linear)')
axQ.set_ylabel('Norm. Res. (quadratic)')
plt.draw()
Now we clearly* see some structure in the residuals from the second fit, whereas the quadratic fit has no structure and most of the points are within +/- 2 error bars.
* Usually - though the simulation uses random numbers so occasionally it's not so clear...
We can go further by plotting histograms of the normalised residuals. If the function fits well and the central limit theorem applies, then the normalised residuals should follow a Gaussian (normal) distribution, which we can also fit using an appropriate function and curve_fit. However, this is left as an exercise for the reader in this particular case...
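As a hint towards that exercise, here is one possible sketch (not the only approach - the gaussian function and bin choice below are just illustrative assumptions): histogram the normalised residuals of the quadratic fit and fit a Gaussian to the bin counts with curve_fit.
def gaussian(x, amp, mu, sigma):
    # simple (un-normalised) Gaussian for fitting the histogram
    return amp * np.exp(-(x - mu)**2 / (2 * sigma**2))

norm_res = (y - quaddy(x, *p_opt)) / y_er
counts, edges = np.histogram(norm_res, bins=10)
centres = 0.5 * (edges[:-1] + edges[1:])

# initial guess: amplitude ~ peak count, mean ~ 0, width ~ 1
p_gauss, p_gauss_cov = curve_fit(gaussian, centres, counts, p0=[counts.max(), 0, 1])

fig = plt.figure()
ax = fig.add_subplot(111)
ax.bar(centres, counts, width=edges[1] - edges[0], color='0.8', edgecolor='k')
x_smooth = np.linspace(edges[0], edges[-1], 200)
ax.plot(x_smooth, gaussian(x_smooth, *p_gauss), 'r-', lw=2)
ax.set_xlabel('Normalised residual')
ax.set_ylabel('Count')
plt.draw()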
Further reading:
Another example of fitting can be found in the Lineshape Comparison and Analysis tutorial.
Wikipedia entry on the covariance matrix
Wolfram MathWorld entry on Covariance
Chapter 6 of Measurements and their uncertainties, Hughes and Hase (2010), OUP