Linear Regression, Optimization and Curve Fitting

Some of this lesson was taken from materials developed by Eric Ayars at CSU Chico for their Computational Physics Course. He wrote a nice textbook using Python. Appendix C lays out the derivation of least squares fitting that you see here. Some of the data and examples are from the SciPy Lecture Notes and materials developed by Matt Moelter and Jodi Christiansen for PHYS 202 at Cal Poly.


Instructions: Create a new directory called Optimization with a notebook called OptimizationTour. Give it a heading 1 cell title Linear Regression, Optimization and Curve Fitting. Read this page, typing in the code in the code cells and executing them as you go.

Do not copy/paste.

Type the commands yourself to get the practice of doing it. This will also slow you down so you can think about the commands and what they are doing as you type them.

Save your notebook when you are done, then try the accompanying exercises.



In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

Linear Regression

It is very common in physics —and scientific fields in general— to have a set of ($x, y$) data points for which we want to know the equation $y(x)$. We usually know the form of $y(x)$ (linear, quadratic, exponential, power, and so on) but we’d like to know the exact parameters in the function.

The simplest non-trivial case is the linear case, and it turns out that if you solve the linear case you can often solve others with some appropriate algebraic work, so we’ll start with the linear case and work from there.

Imagine that you have some data such as that shown in the figure. Although there’s noise in the data, it’s clear that the data is linear: $y = mx + b$. The best line through this data would be the line for which the sum of the distances between the line and the data points is a minimum.

As it turns out, though, the distance between the line and the points could be either positive or negative. This means that there are some astonishingly bad lines that would have the same sum of deviations as the “right” line.

Taking the sum of the absolute values of the deviations would seem to be better, but the absolute value function causes severe mathematical difficulties when you attempt to take derivatives. The solution to these problems is to minimize the sum of the squares of the deviations of the data points from the line. This technique is thus known as "Least-squares" minimization.

The figure shows somewhat noisy ($x,y$) data and the ideal line through that data.

We want to minimize the sum of the deviations of the data points from the line. The line is given by

$$y=mx+b$$

so the vertical deviation of the $i^{th}$ data point ($x_i,y_i$) from the line is

$$ \delta_i \equiv y_i - (mx_i + b).$$

The sum of the squares of all these deviations is then

$$\chi^2 \equiv \sum\limits_{i=1}^{n}\delta_i^2 = \sum\limits_{i=1}^{n}\left[y_i^2 - 2mx_iy_i - 2by_i + m^2x_i^2 + 2mbx_i + b^2\right]$$

Note of course that this vertical deviation of a data point from the line is not generally the same as the perpendicular distance between the point and the line. An equivalent derivation using the perpendicular distance is considerably more complicated, and gives no significant improvement in results.

We want to minimize $\chi^2$, so we take the derivatives of $\chi^2$ with respect to both $m$ and $b$ and set them equal to zero.

$$\frac{\partial\chi^2}{\partial m} = 0 = -2\sum\limits_{i=1}^n x_iy_i + 2m\sum\limits_{i=1}^n x_i^2 + 2b\sum\limits_{i=1}^n x_i$$

$$\frac{\partial\chi^2}{\partial b} = 0 = -2\sum\limits_{i=1}^n y_i + 2m\sum\limits_{i=1}^n x_i + 2\sum\limits_{i=1}^n b$$

To simplify these expressions, let's divide both equations by $n$ and make the following substitutions:

$$ \langle x \rangle \equiv \frac{1}{n}\sum\limits_{i=1}^n x_i\\ \langle y \rangle \equiv \frac{1}{n}\sum\limits_{i=1}^n y_i\\ \langle x^2 \rangle \equiv \frac{1}{n}\sum\limits_{i=1}^n x_i^2\\ \langle xy \rangle \equiv \frac{1}{n}\sum\limits_{i=1}^n x_iy_i $$

Rewriting our equations for the derivatives of $\chi^2$ gives

$$m\langle x^2 \rangle + b\langle x \rangle = \langle xy \rangle$$

and

$$m\langle x \rangle + b = \langle y \rangle$$

Once the averages $\langle x\rangle$, $\langle y\rangle$, $\langle xy\rangle$, and $\langle x^2\rangle$ are calculated, these expressions form a system of two equations with two unknowns, $m$ and $b$. These equations can be written in matrix form as

$$\begin{bmatrix} \langle x^2 \rangle & \langle x \rangle \\\\ \langle x \rangle & 1 \end{bmatrix} \begin{bmatrix} m \\\\ b \end{bmatrix} = \begin{bmatrix} \langle xy \rangle \\\\ \langle y \rangle \end{bmatrix} $$

The solution of this matrix equation is

$$\begin{bmatrix} m \\\\ b \end{bmatrix} = \begin{bmatrix} \langle x^2 \rangle & \langle x \rangle \\\\ \langle x \rangle & 1 \end{bmatrix}^{-1} \begin{bmatrix} \langle xy \rangle \\\\ \langle y \rangle \end{bmatrix} $$

where

$$\begin{bmatrix} \langle x^2 \rangle & \langle x \rangle \\\\ \langle x \rangle & 1 \end{bmatrix}^{-1} = \frac{1}{\langle x^2 \rangle - \langle x \rangle^2} \begin{bmatrix} 1 & - \langle x \rangle \\\\ - \langle x \rangle & \langle x^2 \rangle \end{bmatrix} $$

which gives

$$ m = \frac{\langle xy \rangle - \langle x \rangle \langle y \rangle}{\langle x^2 \rangle - \langle x \rangle^2 }, \hspace{1cm} b = \frac{\langle x^2 \rangle\langle y \rangle-\langle x \rangle\langle xy \rangle}{\langle x^2 \rangle - \langle x \rangle^2 }$$
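
Before writing a full fitting function below, note that you could equally well get $m$ and $b$ numerically by building the $2\times 2$ matrix above and letting NumPy solve the system. This is only a sketch of that idea (it assumes the imports from the first cell and that x and y are equal-length NumPy arrays):


In [ ]:
def linear_fit_matrix(x, y):
    """Solve the 2x2 matrix equation above for the slope and intercept."""
    M = np.array([[(x*x).mean(), x.mean()],
                  [x.mean(),     1.0     ]])
    v = np.array([(x*y).mean(), y.mean()])
    m, b = np.linalg.solve(M, v)   # equivalent to applying the matrix inverse
    return m, b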

We can also calculate our uncertainty in $m$ and $b$. The deviation of any point ($x_i,y_i$) from the line $y(x_i)$ is given by

$$\delta_i = y_i - (mx_i +b).$$

The uncertainty in $y$ can be expressed

$$\sigma_y = \sqrt{\frac{1}{(n-2)}\sum\delta_i^2}.$$

The reason for the factor $(n-2)$ is due to the number of degrees of freedom in the problem, which is the number of data points $n$ minus the number of parameters in the fitted line, in this case 2 ($m$ and $b$). The derivation of this form is left for another course, such as PHYS 340.

Likewise, we can calculate the uncertainties in our fitted parameters

$$ \sigma_m = \sqrt{\frac{1}{(n-2)}\frac{\langle \delta^2\rangle}{\langle x^2\rangle - \langle x \rangle ^2}}, \hspace{1cm} \sigma_b = \sqrt{\frac{1}{(n-2)}\frac{\langle \delta^2\rangle\langle x^2 \rangle}{\langle x^2\rangle - \langle x \rangle ^2}} $$

where $\langle \delta^2\rangle \equiv \frac{\sum\delta_i^2}{n}$. Again, we'll leave the details of the derivation of these equations to another course, such as PHYS 340.

As you can imagine, this technique lends itself well to computation. One can write a program to calculate a least-squares fit by calculating the various average values $\\{\langle x\rangle, \langle x^2\rangle, \langle \delta^2\rangle, . . .\\}$ and then combining these averages to report $m$, $b$, $\sigma_m$, and $\sigma_b$. In fact, let's do that now.

We can write a function that takes as arguments arrays representing the $x,y$ values for some linearly varying data and returns a tuple of the resulting slope, intercept, and uncertainties.

Note: Here is where numpy arrays and their built-in functions really shine. We should not have to write any for loops to calculate any of the average quantities.


In [ ]:
def linear_lsq_fit(x,y):
    """Take in arrays representing (x,y) values for a set of linearly varying data and 
    perform a linear least squares regression.  Return the resulting slope and intercept
    parameters of the best fit line with their uncertainties."""

    xave = x.mean()
    yave = y.mean()
    x2ave = (x*x).mean()
    xyave = (x*y).mean()
    denom = (x2ave - xave**2)
    m = (xyave - xave*yave)/denom
    b = (x2ave*yave - xave*xyave)/denom
    delta = y - (m*x + b)
    d2ave = (delta*delta).mean()
    n = len(y)
    merr = np.sqrt(d2ave/((n-2)*denom))
    berr = np.sqrt(d2ave*x2ave/((n-2)*denom))
    
    return m,b,merr,berr

Now let's try it out on some data:


In [ ]:
time = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.]) #time in seconds
position = np.array([7.75, 7.33, 6.89, 6.45, 5.96, 5.55, 5.10, 4.49, 3.93, 3.58]) #position in meters
uncpos = np.array([0.02, 0.03, 0.03, 0.04, 0.05, 0.06, 0.08, 0.11, 0.14, 0.17]) #uncertainty in position, in meters

In [ ]:
m1,b1,m1err,b1err = linear_lsq_fit(time,position)
print "slope= %.4f +/- %.4f" % (m1, m1err)
print "intercept= %.4f +/- %.4f" % (b1, b1err)

Ok, but how does it look? Let's plot the results. We'll use a solid line for the fit solution $y=mx+b$, and dotted lines to represent the 1 standard deviation range of possible solutions: $y_+ = (m+\sigma_m)x+(b+\sigma_b)$ and $y_- = (m-\sigma_m)x+(b-\sigma_b)$.


In [ ]:
#Create a function to plot the fit function
f = lambda x,m,b: m*x + b
mylabel="y=%.4f x + %.4f" % (m1,b1)

#Plot the data and the fit line
fig = plt.figure(2,(10,6))
plt.scatter(time,position,label="data")
plt.plot(time,f(time,m1,b1),label=mylabel)

#Now plot a band around the result at the one-sigma error level
plt.plot(time,f(time,m1-m1err,b1-b1err),'b:')
plt.plot(time,f(time,m1+m1err,b1+b1err),'b:')

plt.xlim(0.,11.)
plt.ylim(3.,9.)

plt.legend(loc="best")
plt.xlabel("time (s)",fontsize=20)
plt.ylabel("position (m)",fontsize=20)

plt.show()

Weighted Linear Least Squares

When the data we are trying to fit also have experimental uncertainties, we need a way to take into account these measurement uncertainties in computing our best fit line. The figure below represents the same set of data as before but now with uncertainties that vary - the size of the error bars on each data point is unique. This could be a set of measurements taken in a laboratory with different equipment or data from different experiments taken on different days or by different groups being combined to obtain an overall best result.


In [ ]:
fig = plt.figure(3,(10,6))
plt.errorbar(time,position,uncpos,fmt="bo",label="data with errors")

plt.xlim(0.,11.)
plt.ylim(3.,9.)

plt.legend(loc="best")
plt.xlabel("time (s)",fontsize=20)
plt.ylabel("position (m)",fontsize=20)

plt.show()

Obviously, we want the data points with smaller errors to "count" more than data points with big errors. The easiest way to accomplish this is to use the uncertainties on the measurements to "weight" the points when we calculate the averages. Weighting by one over the square of the uncertainties, $w_i = 1/{\sigma_y}_i^2$, is the standard method of weighting.

Leaving out the derivation details, the equations to solve for our slope and intercept parameters can be compactly written

$$m = \frac{\sum w\sum wxy - \sum wx\sum wy}{\sum w\sum wx^2 - (\sum wx)^2}, \hspace{1cm} b = \frac{\sum wx^2\sum wy - \sum wx\sum wxy}{\sum w \sum wx^2 - (\sum wx)^2}$$

If the uncertainties on all of the fitted data are the same, the weights can all be set to 1. In this case, the fit parameters from the weighted linear least squares method are identical to the results from the non-weighted case.

The same cannot be said of the uncertainties on the fitted parameters. The expressions for the weighted linear least squares (WLSQ) uncertainties are

$$\sigma_m = \sqrt{\frac{\sum w}{\sum w\sum wx^2 - (\sum wx)^2}}, \hspace{1cm} \sigma_b = \sqrt{\frac{\sum wx^2}{\sum w\sum wx^2 - (\sum wx)^2}}.$$

These expressions do not reduce to the unweighted results, even when all of the weights are set to 1, because the unweighted formulas estimate the data scatter from the fit residuals (hence the $(n-2)$ factor), while the weighted formulas take the stated measurement uncertainties at face value.
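
To give a flavor of how these weighted sums translate into NumPy, here is a small sketch that only computes the sums (the accompanying exercises will have you assemble the complete fitter, so treat this as a starting point; x, y, and sigma are assumed to be equal-length arrays):


In [ ]:
def weighted_sums(x, y, sigma):
    """Compute the weighted sums that appear in the WLSQ formulas."""
    w = 1.0/sigma**2   # weights w_i = 1/sigma_i^2
    return w.sum(), (w*x).sum(), (w*y).sum(), (w*x*x).sum(), (w*x*y).sum()

#Example use: Sw, Swx, Swy, Swxx, Swxy = weighted_sums(time, position, uncpos)
#The slope is then m = (Sw*Swxy - Swx*Swy)/(Sw*Swxx - Swx**2), and so on.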

In the accompanying exercises you will code a weighted linear least squares fitter using these equations and compare its results to the non-weighted case.


Linearizing Equations

So now we know how to use linear regression to fit data that lies along a straight line to determine the "best fit" slope and intercept parameters. But the equations we used only work for linear data, and there are lots of examples of functions that are nonlinear that we might wish to fit, e.g.

$y = Ae^{Bx}$

How do we fit such curves to data to determine the best-fit parameters $(A, B)$?

There are two possibilities. First, if the function we are trying to fit can be linearized, we can make that transformation and use the linear_lsq_fit function we already created.

For this example, just take the natural logarithm of both sides to obtain an equation that has the same form as $y = mx +b$:

$$\ln y = \ln A + B x$$

Here, $\ln A$ is the y-intercept and $B$ is the slope of the line. To use this equation with the linear_lsq_fit function, take the natural log of your $y$ data array (but not $x$), then convert the returned slope and intercept back into the parameters of the original equation.
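
Explicitly, if the fit returns a slope $B$ and an intercept $\ln A$, with uncertainties $\sigma_B$ and $\sigma_{\ln A}$, the original equation parameters are recovered as

$$A = e^{\ln A}, \hspace{1cm} \sigma_A = A\,\sigma_{\ln A}$$

where the expression for $\sigma_A$ follows from propagation of errors applied to the exponential; $B$ and $\sigma_B$ carry over directly.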

Download the linked file radioactivedecay.dat, which contains time-series data for the radioactive decay of an isotope of uranium, and read in the data with loadtxt. We will linearize the data and fit it with our linear_lsq_fit function.


In [ ]:
!cat radioactivedecay.dat

In [ ]:
data = np.loadtxt('radioactivedecay.dat')
t = data[:,0]
num = data[:,1]

First plot the data as is:


In [ ]:
plt.plot(t,num,"b*")
plt.xlabel("time (s)",fontsize=20)
plt.ylabel("N (remaining isotopes)",fontsize=20)

plt.show()

Now let's linearize it:


In [ ]:
#numpy's natural logarithm is np.log()
plt.plot(t,np.log(num),"b*")
plt.xlabel("time (s)",fontsize=20)
plt.ylabel("log($N_{remaining}$)",fontsize=20)

plt.show()

Now let's fit that linearized data with our linear_lsq_fit function and report the fit parameters, $\ln A$ and $B$, and the derived parameter $A = e^{\ln A}$.


In [ ]:
B,lnA,Berr,lnAerr = linear_lsq_fit(t,np.log(num))
A = np.exp(lnA)
Aerr = A*lnAerr  #propagation of errors: sigma_A = A*sigma_lnA

print("B   = %.4f +/- %.4f" % (B, Berr))
print("lnA = %.4f +/- %.4f" % (lnA, lnAerr))
print("A   = %.4f +/- %.4f" % (A, Aerr))

Now plot the fit function on top of the linearized data:


In [ ]:
plt.plot(t,np.log(num),"b*",label="data")
plt.xlabel("time (s)",fontsize=20)
plt.ylabel("log($N_{remaining}$)",fontsize=20)

plt.plot(t,f(t,B,lnA),label="linear fit")
plt.legend()

plt.show()

And show it with the exponential function on linear scale for the y-axis:


In [ ]:
f2 = lambda x, B, A: A*np.exp(B*x)

plt.plot(t,num,"b*",label="data")
plt.xlabel("time (s)",fontsize=20)
plt.ylabel("$N_{remaining}$",fontsize=20)

plt.plot(t,f2(t,B,A),label="fit")
plt.legend()

plt.show()

Generic Curve Fitting

Suppose that you want to fit a set of data points $(x_i,y_i)$, where $i=1,2,...,N$, to a function that cannot be linearized. For example, the function could be a second-order polynomial,

$$f(x,a,b,c)=ax^2 +bx+c .$$

There isn’t an analytical expression for finding the best-fit parameters $(a, b, c)$ as there is for linear regression.

The usual approach is to optimize the parameters to minimize the sum of the squares of the differences between the data and the function. If the uncertainty in $y_i$ is $\sigma_i$, then the quantity minimized is

$$\chi^2 = \sum\limits_{i=1}^{N}\left[\frac{y_i - f(x_i,a,b,...)}{\sigma_i}\right]^2$$
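
Written as code, the quantity to be minimized is just an uncertainty-weighted sum of squared residuals. A minimal sketch (yfit is the model evaluated at each $x_i$):


In [ ]:
def chi_squared(y, yfit, sigma):
    """Sum of squared residuals, each weighted by its uncertainty."""
    return np.sum(((y - yfit)/sigma)**2)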

The scipy library for Python contains numerous functions for scientific computing and data analysis. It includes the function curve_fit, which performs a least squares fit of a function to data.

For advanced applications, one usually wants more than just the uncertainties on the fit parameters; one also wants to know how they are correlated, so curve_fit returns the full covariance matrix along with the best-fit parameters. If all you want are the uncorrelated uncertainties, you extract them from the covariance matrix: they are the square roots of its diagonal elements.

Making an initial guess at the parameters of the function is optional; if no guess is given, the parameters all start at one. It is useful to plot the data beforehand to help make reasonable guesses. It is also optional to pass a list of uncertainties into the fit as weights; if none are given, they are taken to be one.

Curve fitting example


In [ ]:
#Import the curve fitter from the scipy optimize package
from scipy.optimize import curve_fit

In [ ]:
# Define some data (including uncertainties)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([-1.78, 4.09, 8.85, 17.9, 26.1, 35.2])
yerr = np.array([0.460, 0.658, 0.528, 1.34, 1.09, 0.786])

In [ ]:
#plot the data to estimate the shape and parameters
plt.errorbar(x,y,yerr,fmt="bo")
plt.xlim(-0.2,7)
plt.ylim(-3.,40.)

plt.show()

Hmmm... It looks kind of linear, but maybe also a bit curvy. Let's try using our linear_lsq_fit function on this data and see what the result looks like.


In [ ]:
m,b,merr,berr = linear_lsq_fit(x,y)

plt.errorbar(x,y,yerr,fmt="bo",label="data")
plt.plot(x,f(x,m,b),"r-",label="linear fit")

plt.xlim(-0.2,7)
plt.ylim(-3.,40.)

plt.legend(loc="best")
plt.show()

What do you think? Does this data look linear? Let's try a different model. How about

$$y = ax^2+bx+c$$

This can't be linearized, so let's use curve_fit to fit the data.


In [ ]:
#Define the fit function
def func(x, a, b, c):
    return (a*x**2 + b*x + c)

In [ ]:
# Make initial guess at parameters
p0 = [1.0, 3.0, -2.0]  
     
#Call the curve fitter and have it return the optimized parameters (popt) and covariance matrix (pcov)
popt, pcov = curve_fit(func, x, y, p0=p0, sigma=yerr)

In [ ]:
#Compute the parameter uncertainties from the covariance matrix
punc = np.zeros(len(popt))
for i in range(len(popt)):
    punc[i] = np.sqrt(pcov[i,i])

#Print the result
print("optimal parameters:          ", popt)
print("uncertainties of parameters: ", punc)

Plotting data with error bars and a best-fit function together gives some idea of whether or not the fit is good. If the curve passes within most of the error bars, the fit is probably reasonably good.


In [ ]:
#plot the fit result with the data
fitresult = func(x,popt[0],popt[1],popt[2])

plt.errorbar(x,y,yerr,fmt="b*",label="data")
plt.plot(x,fitresult,'b-',label="quadratic fit")
plt.plot(x,f(x,m,b),'r-',label="linear fit")

plt.xlim(-0.2,7)
plt.ylim(-3.,40.)

plt.legend(loc="best")
plt.show()

The quadratic fit is better because it passes within the error bars of all data points, whereas the linear fit does not. However, judging by eye is probably not the best way to evaluate the fits. Usually, if the theoretical expectation for the data is known, one uses that function to fit the data. In cases where it is not, one uses a quantitative measure like the reduced chi squared and the number of degrees of freedom to ascertain which fit function better describes the data. This can help decide which of two competing theoretical models is "right" by determining which one best fits the experimental data. We'll leave that exploration for another course, such as PHYS 340.
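
As a preview of that kind of quantitative comparison (the details belong in PHYS 340), here is a rough sketch of the reduced chi squared, $\chi^2$ divided by the number of degrees of freedom, for the two fits above. Values near 1 generally indicate a reasonable fit:


In [ ]:
#chi^2 per degree of freedom: dof = (number of points) - (number of fit parameters)
chi2_lin  = np.sum(((y - f(x, m, b))/yerr)**2)/(len(y) - 2)
chi2_quad = np.sum(((y - func(x, *popt))/yerr)**2)/(len(y) - 3)
print("reduced chi^2 (linear fit):   ", chi2_lin)
print("reduced chi^2 (quadratic fit):", chi2_quad)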

A final example

The file named waveform_1.npy is a binary data file in a format that numpy understands (hence the file extension ".npy"). You can download it by right-clicking on the link and selecting "Save Link As". If you try to open the file with nano, or use !cat or !head, it will look like garbage. However, you can read it in using numpy's load function.


In [ ]:
!cat waveform_1.npy

In [ ]:
waveform = np.load('waveform_1.npy')
t = np.arange(len(waveform))

In [ ]:
plt.plot(t, waveform)
plt.xlabel("Time (ns)",fontsize=20)
plt.ylabel("Intensity (bins)",fontsize=20)

plt.show()

The signal is very simple and can be modeled as a single Gaussian function and an offset corresponding to the background noise. To fit the signal with the function, we must define the model, propose an initial solution and call the curve_fit routine.

Let the data be modeled by the function

$$B + A \exp { \left[ -\left( \frac{t-\mu}{\sigma}\right) ^2 \right] }$$

where $B$ represents the random noise floor, $A$ is the amplitude of the Gaussian curve, $\mu$ is the location of the mean of the Gaussian and $\sigma$ is the width.


In [ ]:
#Gaussian plus offset
def model(t, B, A, mu, sigma):
    return B + A * np.exp( - ((t-mu)/sigma)**2 )

By inspecting the graph, you can choose a decent initial guess for each of the parameters and create an array with those guesses:


In [ ]:
x0 = np.array([3.,30.,15.,1.])

Then call the curve fitting routine, passing it the function, the data and our initial guesses:


In [ ]:
popt,pcov = curve_fit(model,t,waveform,x0)

plt.plot(t, waveform,label="data")
plt.plot(t, model(t, popt[0],popt[1],popt[2],popt[3]),"r-",label="fit",linewidth=2)
plt.legend(loc="best")

plt.xlabel("Time (ns)",fontsize=20)
plt.ylabel("Intensity (bins)",fontsize=20)

plt.show()

In [ ]:
#Compute the parameter uncertainties from the covariance matrix
punc = np.zeros(len(popt))
for i in range(len(popt)):
    punc[i] = np.sqrt(pcov[i,i])

#Print the result
print("optimal parameters:          ", popt)
print("uncertainties of parameters: ", punc)

Not too bad. You can also pass the uncertainties on the data to the curve_fit routine as an optional sigma argument; these are used as weights in the fit. If no uncertainties are provided, the weights are all taken to be one. Look at the help on curve_fit for more information.
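
For instance, if the waveform came with an array of per-point uncertainties (call it waveform_err; it is hypothetical here, since this data file does not provide any), the call would look something like this:


In [ ]:
#waveform_err is a hypothetical array of per-point uncertainties on the waveform
popt, pcov = curve_fit(model, t, waveform, p0=x0, sigma=waveform_err)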


All content is under a modified MIT License, and can be freely used and adapted. See the full license text here.