Learning Objectives: learn to fit models to data using linear and non-linear regression.
This material is licensed under the MIT license and was developed by Brian Granger. It was adapted from material from Jake VanderPlas and Jennifer Klay.
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize as opt
In [3]:
from IPython.html.widgets import interact
In Data Science it is common to start with data and develop a model of that data. Such models can help to explain the data and make predictions about future observations. In fields like Physics, these models are often given in the form of differential equations, whose solutions explain and predict the data. In most other fields, such differential equations are not known. Often, models have to include sources of uncertainty and randomness. Given a set of data, fitting a model to the data is the process of tuning the parameters of the model to best explain the data.
When a model has a linear dependence on its parameters, such as $a x^2 + b x + c$, this process is known as linear regression. When a model has a non-linear dependence on its parameters, such as $a e^{bx}$, this process is known as non-linear regression. Thus, fitting data to a straight line model of $m x + b$ is linear regression, because of its linear dependence on $m$ and $b$ (rather than $x$).
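As a quick illustration of what "linear in the parameters" means, the quadratic model $a x^2 + b x + c$ can be fit with an ordinary linear least-squares routine such as `np.polyfit`. This is only a sketch; the data values below are made up for the example:
In [ ]:
# Sketch: a*x**2 + b*x + c is linear in (a, b, c), so np.polyfit (a linear
# least-squares routine) can fit it even though the model is non-linear in x.
x_demo = np.linspace(0, 5, 30)
y_demo = 1.5 * x_demo**2 - 2.0 * x_demo + 0.5 + np.random.normal(0, 0.5, size=x_demo.shape)
print(np.polyfit(x_demo, y_demo, 2))  # approximately [1.5, -2.0, 0.5]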
A classical example of fitting a model is finding the slope and intercept of a straight line that goes through a set of data points $\{x_i,y_i\}$. For a straight line the model is:
$$ y_{model}(x) = mx + b $$

Given this model, we can define a metric, or cost function, that quantifies the error the model makes. One commonly used metric is $\chi^2$, which depends on the deviation of the model from each data point ($y_i - y_{model}(x_i)$) and the measured uncertainty of each data point $\sigma_i$:
$$ \chi^2 = \sum_{i=1}^N \left(\frac{y_i - y_{model}(x_i)}{\sigma_i}\right)^2 $$

When $\chi^2$ is small, the model's predictions will be close to the data points. Likewise, when $\chi^2$ is large, the model's predictions will be far from the data points. Given this, our task is to minimize $\chi^2$ with respect to the model parameters $\theta = [b, m]$ in order to find the best fit.
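As an aside, for a straight-line model with equal uncertainties ($\sigma_i = \sigma$) this minimization can also be done in closed form: setting $\partial \chi^2 / \partial m = \partial \chi^2 / \partial b = 0$ gives the normal equations

$$ m = \frac{N\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{N\sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad b = \frac{1}{N}\left(\sum_i y_i - m \sum_i x_i\right). $$

Below, though, we minimize $\chi^2$ numerically, which works just as well for models that have no closed-form solution.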
To illustrate linear regression, let's create a synthetic data set with a known slope and intercept, but random noise that is additive and normally distributed.
In [4]:
N = 50
m_true = 2
b_true = -1
dy = 2.0 # uncertainty of each point
np.random.seed(0)
xdata = 10 * np.random.random(N) # don't use regularly spaced data
ydata = b_true + m_true * xdata + np.random.normal(0.0, dy, size=N) # our errors are additive
plt.errorbar(xdata, ydata, dy, fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');
It is useful to see visually how changing the model parameters changes the value of $\chi^2$. By using IPython's `interact` function, we can create a user interface that allows us to pick a slope and intercept interactively and see the resulting line and $\chi^2$ value.
Here is the function we want to minimize. Note how we have combined the two parameters into a single parameter vector $\theta = [b, m]$, which is the first argument of the function:
In [5]:
def chi2(theta, x, y, dy):
    # theta = [b, m]
    return np.sum(((y - theta[0] - theta[1] * x) / dy) ** 2)
In [6]:
def manual_fit(b, m):
    modely = m * xdata + b
    plt.plot(xdata, modely)
    plt.errorbar(xdata, ydata, dy, fmt='.k', ecolor='lightgray')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.text(1, 15, 'b={0:.2f}'.format(b))
    plt.text(1, 12.5, 'm={0:.2f}'.format(m))
    plt.text(1, 10.0, r'$\chi^2$={0:.2f}'.format(chi2([b, m], xdata, ydata, dy)))
In [7]:
interact(manual_fit, b=(-3.0,3.0,0.01), m=(0.0,4.0,0.01));
Go ahead and play with the sliders and try to:

- Find the values of $b$ and $m$ that give the smallest value of $\chi^2$.
- Find the line that, by eye, best fits the data.

You should see that these two conditions coincide.
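If you want to check this without the sliders, a brute-force scan of $\chi^2$ over a grid of candidate parameters makes the same point. This is only a rough sketch; the grid ranges and resolution are arbitrary:
In [ ]:
# Evaluate chi2 on a grid of (b, m) values and locate the minimum.
b_grid = np.linspace(-3.0, 3.0, 61)
m_grid = np.linspace(0.0, 4.0, 81)
chi2_grid = np.array([[chi2([b, m], xdata, ydata, dy) for m in m_grid] for b in b_grid])
i, j = np.unravel_index(np.argmin(chi2_grid), chi2_grid.shape)
print('grid minimum near b = {0:.2f}, m = {1:.2f}'.format(b_grid[i], m_grid[j]))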
Now that we have seen how minimizing $\chi^2$ gives the best parameters in a model, let's perform this minimization numerically using `scipy.optimize.minimize`. We have already defined the function we want to minimize, `chi2`, so we only have to pass it to `minimize` along with an initial guess and the additional arguments (the raw data):
In [8]:
theta_guess = [0.0,1.0]
result = opt.minimize(chi2, theta_guess, args=(xdata,ydata,dy))
Here are the values of $b$ and $m$ that minimize $\chi^2$:
In [9]:
theta_best = result.x
print(theta_best)
These values are close to the true values of $b=-1$ and $m=2$. The reason our values are different is that our data set has a limited number of points. In general, we expect that as the number of points in our data set increases, the model parameters will converge to the true values. But having a limited number of data points is not a problem - it is a reality of most data collection processes.
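A quick way to see this convergence is to refit on a larger synthetic data set. This is only a sketch; the sample size here is arbitrary:
In [ ]:
# Refit with many more points drawn from the same model; the result should be
# noticeably closer to [b_true, m_true] = [-1, 2].
x_big = 10 * np.random.random(2000)
y_big = b_true + m_true * x_big + np.random.normal(0.0, dy, size=x_big.size)
print(opt.minimize(chi2, theta_guess, args=(x_big, y_big, dy)).x)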
We can plot the raw data and the best fit line:
In [10]:
xfit = np.linspace(0,10.0)
yfit = theta_best[1]*xfit + theta_best[0]
plt.plot(xfit, yfit)
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');
Performing regression by minimizing $\chi^2$ is known as least squares regression, because we are minimizing the sum of squares of the deviations. The linear version of this is known as linear least squares. For this case, SciPy provides a purpose-built function, `scipy.optimize.leastsq`. Instead of taking the $\chi^2$ function to minimize, `leastsq` takes a function that computes the deviations:
In [11]:
def deviations(theta, x, y, dy):
    return (y - theta[0] - theta[1] * x) / dy

result = opt.leastsq(deviations, theta_guess, args=(xdata, ydata, dy), full_output=True)
Here we have passed the `full_output=True` option. When this is passed, the covariance matrix $\Sigma_{ij}$ of the model parameters is also returned. The uncertainties (as standard deviations) in the parameters are the square roots of the diagonal elements of the covariance matrix. A proof of this is beyond the scope of the current notebook.
In [12]:
theta_best = result[0]
theta_cov = result[1]
print('b = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('m = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))
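One informal way to sanity-check these error bars is to refit many synthetic data sets drawn from the true model and compare the scatter of the fitted parameters to the quoted uncertainties. This is only a sketch; the number of trials is arbitrary:
In [ ]:
# Monte Carlo check: the standard deviations of the refitted parameters should
# be comparable to the square roots of the covariance matrix diagonal above.
trials = []
for _ in range(200):
    y_sim = b_true + m_true * xdata + np.random.normal(0.0, dy, size=N)
    theta_sim, _ = opt.leastsq(deviations, theta_guess, args=(xdata, y_sim, dy))
    trials.append(theta_sim)
trials = np.array(trials)
print('scatter in b = {0:.3f}, scatter in m = {1:.3f}'.format(trials[:, 0].std(), trials[:, 1].std()))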
We can again plot the raw data and best fit line:
In [13]:
yfit = theta_best[0] + theta_best[1] * xfit
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray');
plt.plot(xfit, yfit, '-b');
SciPy also provides a general curve fitting function, `curve_fit`, that can handle both linear and non-linear models. It takes a model function, the raw data, and (optionally) the uncertainties of the data points, and returns the best-fit parameters along with their covariance matrix. Let's apply `curve_fit` to the above data. First we define a model function. The first argument should be the independent variable of the model.
In [14]:
def model(x, b, m):
    return m*x + b
Then call `curve_fit`, passing the model function and the raw data. The uncertainties of each data point are provided with the `sigma` keyword argument. If there are no uncertainties, this can be omitted. By default the uncertainties are treated as relative. To treat them as absolute, pass the `absolute_sigma=True` argument.
In [15]:
theta_best, theta_cov = opt.curve_fit(model, xdata, ydata, sigma=dy)
Again, display the optimal values of $b$ and $m$ along with their uncertainties:
In [16]:
print('b = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('m = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))
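Since `dy` here is a known absolute uncertainty rather than a relative weight, you could also pass `absolute_sigma=True`. A minimal sketch (with `sigma` given as a per-point array):
In [ ]:
# Same fit, but with the uncertainties treated as absolute values.
sigma_arr = dy * np.ones(len(xdata))
theta_abs, theta_abs_cov = opt.curve_fit(model, xdata, ydata, sigma=sigma_arr,
                                         absolute_sigma=True)
print('b = {0:.3f} +/- {1:.3f}'.format(theta_abs[0], np.sqrt(theta_abs_cov[0, 0])))
print('m = {0:.3f} +/- {1:.3f}'.format(theta_abs[1], np.sqrt(theta_abs_cov[1, 1])))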
We can again plot the raw data and best fit line:
In [17]:
xfit = np.linspace(0,10.0)
yfit = theta_best[1]*xfit + theta_best[0]
plt.plot(xfit, yfit)
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');
So far we have been using a linear model $y_{model}(x) = m x + b$. Remember, this model was linear not because of its dependence on $x$, but because of its dependence on $m$ and $b$. A non-linear model has a non-linear dependence on the model parameters. Examples are $A e^{B x}$, $A \cos{(B x)}$, etc. In this section we will generate data for the following non-linear model:
$$y_{model}(x) = Ae^{Bx}$$

and fit that data using `curve_fit`. Let's start out by using this model to generate a data set to use for our fitting:
In [20]:
npoints = 20
Atrue = 10.0
Btrue = -0.2
xdata = np.linspace(0.0, 20.0, npoints)
dy = np.random.normal(0.0, 0.1, size=npoints)
ydata = Atrue*np.exp(Btrue*xdata) + dy
Plot the raw data:
In [21]:
plt.plot(xdata, ydata, 'k.')
plt.xlabel('x')
plt.ylabel('y');
Let's see if we can use non-linear regression to recover the true values of our model parameters. First define the model:
In [22]:
def exp_model(x, A, B):
    return A*np.exp(x*B)
Then use `curve_fit` to fit the model:
In [24]:
theta_best, theta_cov = opt.curve_fit(exp_model, xdata, ydata)
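Here `curve_fit` found the answer from its default starting point; for more strongly non-linear models you will often need to provide an initial guess with the `p0` argument. A minimal sketch (the guess values below are arbitrary):
In [ ]:
# Same fit, but starting from an explicit (rough) initial guess for [A, B].
theta_p0, theta_p0_cov = opt.curve_fit(exp_model, xdata, ydata, p0=[5.0, -0.5])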
Our optimized parameters are close to the true values of $A=10$ and $B=-0.2$:
In [25]:
print('A = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('B = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))
Plot the raw data and fitted model:
In [26]:
xfit = np.linspace(0,20)
yfit = exp_model(xfit, theta_best[0], theta_best[1])
plt.plot(xfit, yfit)
plt.plot(xdata, ydata, 'k.')
plt.xlabel('x')
plt.ylabel('y');
Another approach to dealing with non-linear models is to linearize them with a transformation. For example, the exponential model used above,
$$y_{model}(x) = Ae^{Bx},$$

can be linearized by taking the natural log of both sides:

$$ \ln(y) = \ln(A) + B x $$

This model is linear in the parameters $\ln(A)$ and $B$ and can be treated as a standard linear regression problem. This approach is used in most introductory physics laboratories. **However, in most cases, transforming to a linear model will give a poor fit.** The reasons for this are a bit subtle, but here is the basic idea: least squares regression assumes that the errors in the data are additive and normally distributed (which is exactly how the noise was generated above with `np.random.normal`). Taking the log of the data transforms those errors non-linearly, so they are no longer normally distributed and the assumptions behind least squares no longer hold, especially for small values of $y$. Here is a nice discussion of this in the MATLAB documentation.
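To see the difference concretely, here is a quick comparison of the direct non-linear fit with a straight-line fit of $\ln(y)$ versus $x$. This is only a sketch using `np.polyfit`; points with $y \le 0$ are dropped because the log is undefined there:
In [ ]:
# Compare the linearized fit to the direct non-linear fit from above.
mask = ydata > 0                      # keep only points where log(y) is defined
B_lin, lnA_lin = np.polyfit(xdata[mask], np.log(ydata[mask]), 1)
print('linearized fit : A = {0:.3f}, B = {1:.3f}'.format(np.exp(lnA_lin), B_lin))
print('non-linear fit : A = {0:.3f}, B = {1:.3f}'.format(theta_best[0], theta_best[1]))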
In all of the examples in this notebook, we started with a model and used that model to generate data. This was done to make it easy to check the predicted model parameters with the true values used to create the data set. However, in the real world, you almost never know the model underlying the data. Because of this, there is an additional step called model selection where you have to figure out a way to pick a good model. This is a notoriously difficult problem, especially when the randomness in the data is large.