Fitting Models

Learning Objectives: learn to fit models to data using linear and non-linear regression.

This material is licensed under the MIT license and was developed by Brian Granger. It was adapted from material from Jake VanderPlas and Jennifer Klay.


In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize as opt

In [3]:
from ipywidgets import interact



Introduction

In Data Science it is common to start with data and develop a model of that data. Such models can help to explain the data and make predictions about future observations. In fields like Physics, these models are often given in the form of differential equations, whose solutions explain and predict the data. In most other fields, such differential equations are not known. Often, models have to include sources of uncertainty and randomness. Given a set of data, fitting a model to the data is the process of tuning the parameters of the model to best explain the data.

When a model has a linear dependence on its parameters, such as $a x^2 + b x + c$, this process is known as linear regression. When a model has a non-linear dependence on its parameters, such as $ a e^{bx} $, this process is known as non-linear regression. Thus, fitting data to the straight-line model $m x + b$ is linear regression, because the model depends linearly on the parameters $m$ and $b$ (its dependence on $x$ is not what matters).

Fitting a straight line

A classical example of fitting a model is finding the slope and intercept of a straight line that goes through a set of data points $\{x_i,y_i\}$. For a straight line the model is:

$$ y_{model}(x) = mx + b $$

Given this model, we can define a metric, or cost function, that quantifies the error the model makes. One commonly used metric is $\chi^2$, which depends on the deviation of the model from each data point ($y_i - y_{model}(x_i)$) and the measured uncertainty of each data point $ \sigma_i$:

$$ \chi^2 = \sum_{i=1}^N \left(\frac{y_i - y_{model}(x_i)}{\sigma_i}\right)^2 $$

When $\chi^2$ is small, the model's predictions will be close to the data points. Likewise, when $\chi^2$ is large, the model's predictions will be far from the data points. Given this, our task is to minimize $\chi^2$ with respect to the model parameters $\theta = [b, m]$ in order to find the best fit.
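As an aside, for this straight-line model the minimum of $\chi^2$ can also be written down analytically by setting its partial derivatives with respect to the parameters to zero:

$$ \frac{\partial \chi^2}{\partial b} = -2\sum_{i=1}^N \frac{y_i - m x_i - b}{\sigma_i^2} = 0 \qquad \frac{\partial \chi^2}{\partial m} = -2\sum_{i=1}^N \frac{x_i \left(y_i - m x_i - b\right)}{\sigma_i^2} = 0 $$

Solving these two linear equations gives the weighted least squares solution in closed form. In this notebook we will instead minimize $\chi^2$ numerically, because the same workflow carries over unchanged to non-linear models.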

To illustrate linear regression, let's create a synthetic data set with a known slope and intercept, but random noise that is additive and normally distributed.


In [4]:
N = 50
m_true = 2
b_true = -1
dy = 2.0 # uncertainty of each point

np.random.seed(0)
xdata = 10 * np.random.random(N) # don't use regularly spaced data
ydata = b_true + m_true * xdata + np.random.normal(0.0, dy, size=N) # our errors are additive

plt.errorbar(xdata, ydata, dy, fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');


Fitting by hand

It is useful to see visually how changing the model parameters changes the value of $\chi^2$. By using the interact function from ipywidgets, we can create a user interface that allows us to pick a slope and intercept interactively and see the resulting line and $\chi^2$ value.

Here is the function we want to minimize. Note how we have combined the two parameters into a single parameter vector $\theta = [b, m]$, which is the first argument of the function:


In [5]:
def chi2(theta, x, y, dy):
    # theta = [b, m]
    return np.sum(((y - theta[0] - theta[1] * x) / dy) ** 2)

In [6]:
def manual_fit(b, m):
    modely = m*xdata + b
    plt.plot(xdata, modely)
    plt.errorbar(xdata, ydata, dy, fmt='.k', ecolor='lightgray')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.text(1, 15, 'b={0:.2f}'.format(b))
    plt.text(1, 12.5, 'm={0:.2f}'.format(m))
    plt.text(1, 10.0, r'$\chi^2$={0:.2f}'.format(chi2([b, m], xdata, ydata, dy)))

In [7]:
interact(manual_fit, b=(-3.0,3.0,0.01), m=(0.0,4.0,0.01));


Go ahead and play with the sliders and try to:

  • Find the lowest value of $\chi^2$
  • Find the "best" line through the data points.

You should see that these two conditions coincide.

Minimize $\chi^2$ using scipy.optimize.minimize

Now that we have seen how minimizing $\chi^2$ gives the best parameters in a model, let's perform this minimization numerically using scipy.optimize.minimize. We have already defined the function we want to minimize, chi2, so we only have to pass it to minimize along with an initial guess and the additional arguments (the raw data):


In [8]:
theta_guess = [0.0, 1.0]
result = opt.minimize(chi2, theta_guess, args=(xdata, ydata, dy))

Here are the values of $b$ and $m$ that minimize $\chi^2$:


In [9]:
theta_best = result.x
print(theta_best)


[-1.01441991  1.93854654]

These values are close to the true values of $b=-1$ and $m=2$. They differ because our data set has a limited number of points. In general, we expect the fitted parameters to converge to the true values as the number of data points increases. Having a limited number of data points is not a problem, however; it is a reality of most data collection processes.
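As a quick sanity check of this claim, here is a minimal sketch (using the hypothetical names N_big, x_big, y_big) that regenerates a much larger synthetic data set and refits it; the recovered parameters should land closer to the true values:

# Sketch: refit with many more synthetic points to see the estimates tighten up
N_big = 5000
x_big = 10 * np.random.random(N_big)
y_big = b_true + m_true * x_big + np.random.normal(0.0, dy, size=N_big)
print(opt.minimize(chi2, theta_guess, args=(x_big, y_big, dy)).x)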

We can plot the raw data and the best fit line:


In [10]:
xfit = np.linspace(0,10.0)
yfit = theta_best[1]*xfit + theta_best[0]

plt.plot(xfit, yfit)
plt.errorbar(xdata, ydata, dy,
             fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');
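As an additional check, the reduced $\chi^2$, i.e. the minimum $\chi^2$ divided by the number of degrees of freedom $N - 2$, should come out near 1 when the error bars correctly describe the scatter of the data. A minimal sketch, reusing the chi2 function and best-fit parameters from above:

# Sketch: reduced chi-squared as a rough goodness-of-fit check
print(chi2(theta_best, xdata, ydata, dy) / (N - 2))  # roughly 1 if dy matches the true scatter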


Minimize $\chi^2$ using scipy.optimize.leastsq

Performing regression by minimizing $\chi^2$ is known as least squares regression, because we are minimizing the sum of the squares of the deviations. The linear version of this is known as linear least squares. For this case, SciPy provides a purpose-built function, scipy.optimize.leastsq. Instead of taking the $\chi^2$ function to minimize, leastsq takes a function that computes the deviations:


In [11]:
def deviations(theta, x, y, dy):
    return (y - theta[0] - theta[1] * x) / dy

result = opt.leastsq(deviations, theta_guess, args=(xdata, ydata, dy), full_output=True)

Here we have passed the full_output=True option. When this option is passed, the covariance matrix $\Sigma_{ij}$ of the model parameters is also returned. The uncertainties (as standard deviations) of the parameters are the square roots of the diagonal elements of the covariance matrix:

$$ \sigma_i = \sqrt{\Sigma_{ii}} $$

A proof of this is beyond the scope of the current notebook.


In [12]:
theta_best = result[0]
theta_cov = result[1]
print('b = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('m = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))


b = -1.014 +/- 0.626
m = 1.939 +/- 0.104

We can again plot the raw data and best fit line:


In [13]:
yfit = theta_best[0] + theta_best[1] * xfit

plt.errorbar(xdata, ydata, dy,
             fmt='.k', ecolor='lightgray');
plt.plot(xfit, yfit, '-b');


Fitting using scipy.optimize.curve_fit

SciPy also provides a general curve fitting function, curve_fit, that can handle both linear and non-linear models. This function:

  • Allows you to directly specify the model as a function, rather than the cost function (it assumes $\chi^2$).
  • Returns the covariance matrix of the parameters, which provides estimates of the errors in each of the parameters.

Let's apply curve_fit to the above data. First we define a model function. The first argument should be the independent variable of the model.


In [14]:
def model(x, b, m):
    return m*x+b

Then call curve_fit, passing the model function and the raw data. The uncertainties of the data points are provided with the sigma keyword argument (one value per data point). If there are no uncertainties, this can be omitted. By default the uncertainties are treated as relative. To treat them as absolute, pass the absolute_sigma=True argument.


In [15]:
theta_best, theta_cov = opt.curve_fit(model, xdata, ydata, sigma=np.full_like(xdata, dy))

Again, display the optimal values of $b$ and $m$ along with their uncertainties:


In [16]:
print('b = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('m = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))


b = -1.014 +/- 0.591
m = 1.939 +/- 0.098
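These uncertainties are slightly smaller than the leastsq values above because sigma is treated as relative by default, so the covariance matrix is rescaled by the reduced $\chi^2$. Here is a minimal sketch of the same fit with the uncertainties treated as absolute (using the hypothetical names theta_abs and theta_abs_cov); it should essentially reproduce the leastsq error estimates:

# Sketch: same fit, but treating the per-point uncertainties as absolute
theta_abs, theta_abs_cov = opt.curve_fit(model, xdata, ydata,
                                         sigma=np.full_like(xdata, dy),
                                         absolute_sigma=True)
print(np.sqrt(np.diag(theta_abs_cov)))  # absolute 1-sigma parameter uncertainties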

We can again plot the raw data and best fit line:


In [17]:
xfit = np.linspace(0,10.0)
yfit = theta_best[1]*xfit + theta_best[0]

plt.plot(xfit, yfit)
plt.errorbar(xdata, ydata, dy,
             fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');


Non-linear models

So far we have been using the linear model $y_{model}(x) = m x + b$. Remember, this model is linear not because of its dependence on $x$, but because of its linear dependence on the parameters $b$ and $m$. A non-linear model has a non-linear dependence on its model parameters. Examples are $A e^{B x}$, $A \cos{B x}$, etc. In this section we will generate data for the following non-linear model:

$$y_{model}(x) = Ae^{Bx}$$

and fit that data using curve_fit. Let's start out by using this model to generate a data set to use for our fitting:


In [20]:
npoints = 20
Atrue = 10.0
Btrue = -0.2
xdata = np.linspace(0.0, 20.0, npoints)
dy = np.random.normal(0.0, 0.1, size=npoints)
ydata = Atrue*np.exp(Btrue*xdata) + dy

Plot the raw data:


In [21]:
plt.plot(xdata, ydata, 'k.')
plt.xlabel('x')
plt.ylabel('y');


Let's see if we can use non-linear regression to recover the true values of our model parameters. First define the model:


In [22]:
def exp_model(x, A, B):
    return A*np.exp(x*B)

Then use curve_fit to fit the model:


In [24]:
theta_best, theta_cov = opt.curve_fit(exp_model, xdata, ydata)
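Note that curve_fit defaults to an initial guess of all ones for the parameters, which happens to work here. Non-linear fits can be sensitive to the starting point, so for harder problems you can supply your own guess through the p0 argument. A minimal sketch with an explicit (arbitrarily chosen) starting guess:

# Sketch: the same fit with an explicit starting guess via p0
theta_p0, theta_p0_cov = opt.curve_fit(exp_model, xdata, ydata, p0=[5.0, -0.1])
print(theta_p0)  # should agree with the fit above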

Our optimized parameters are close to the true values of $A=10$ and $B=-0.2$:


In [25]:
print('A = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('B = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))


A = 9.945 +/- 0.077
B = -0.198 +/- 0.002

Plot the raw data and fitted model:


In [26]:
xfit = np.linspace(0,20)
yfit = exp_model(xfit, theta_best[0], theta_best[1])
plt.plot(xfit, yfit)
plt.plot(xdata, ydata, 'k.')
plt.xlabel('x')
plt.ylabel('y');


A note about transforming to a linear model

Another approach to dealing with non-linear models is to linearize them with a transformation. For example, the exponential model used above,

$$y_{model}(x) = Ae^{Bx},$$

can be linearized by taking the natural log of both sides:

$$ \ln(y) = \ln(A) + B x $$

This model is linear in the parameters $\ln(A)$ and $B$ and can be treated as a standard linear regression problem. This approach is used in most introductory physics laboratories. However, in most cases, transforming to a linear model will give a poor fit. The reasons for this are a bit subtle, but here is the basic idea:

  • Least squares regression assumes that the errors are symmetric, additive, and normally distributed. This assumption has held throughout this notebook, because we generated data by adding normally distributed noise using np.random.normal.
  • Transforming the data with a non-linear transformation, such as the square root, exponential, or logarithm, transforms the errors as well, so they no longer satisfy this assumption.
  • However, in the rare case that the original data set has no (or only minimal) random errors, the transformation approach will give nearly the same result as non-linear regression on the original model.

Here is a nice discussion of this in the Matlab documentation.
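To make this concrete, here is a minimal sketch of the transformation approach applied to the exponential data generated above. It has to discard any non-positive y values before taking the logarithm, and it effectively reweights the points, which is why its estimates can differ from those of the direct non-linear fit:

# Sketch: linearize by fitting a straight line to log(y)
mask = ydata > 0                      # the log is undefined for non-positive values
B_lin, lnA_lin = np.polyfit(xdata[mask], np.log(ydata[mask]), 1)
print('A =', np.exp(lnA_lin), 'B =', B_lin)  # compare with the curve_fit values above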

Model selection

In all of the examples in this notebook, we started with a model and used that model to generate data. This was done to make it easy to compare the fitted model parameters with the true values used to create the data set. However, in the real world, you almost never know the model underlying the data. Because of this, there is an additional step, called model selection, where you have to figure out a way to pick a good model. This is a notoriously difficult problem, especially when the randomness in the data is large. A few guidelines:

  • Pick the simplest possible model. In general, a more complex model will fit the existing data better, but it will generalize poorly and make poor predictions about future observations. This is known as overfitting.
  • Whenever possible, pick a model that has an underlying theoretical foundation or motivation. For example, in Physics, most of our models come from well tested differential equations.
  • There are more advanced methods (AIC, BIC) that can assist in this model selection process. A good discussion can be found in this notebook by Jake VanderPlas.
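As an illustration of the idea behind AIC (assuming Gaussian errors and using the hypothetical helper aic below), here is a minimal sketch that compares the exponential model fitted above with a straight line on the same data; the model with the lower AIC is preferred:

# Sketch: compare two candidate models with AIC (lower is better)
def aic(y, y_model, n_params):
    # AIC for a least-squares fit with Gaussian errors, up to an additive constant
    n = len(y)
    rss = np.sum((y - y_model) ** 2)
    return n * np.log(rss / n) + 2 * n_params

aic_exp = aic(ydata, exp_model(xdata, *theta_best), 2)                    # exponential model
aic_line = aic(ydata, np.polyval(np.polyfit(xdata, ydata, 1), xdata), 2)  # straight line
print(aic_exp, aic_line)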