**Learning Objectives:** learn to fit models to data using linear and non-linear regression.

This material is licensed under the MIT license and was developed by Brian Granger. It was adapted from material from Jake VanderPlas and Jennifer Klay.

```
In [2]:
```%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from scipy import optimize as opt

```
In [3]:
```from IPython.html.widgets import interact

```
```

In Data Science it is common to start with data and develop a *model* of that data. Such models can help to explain the data and make predictions about future observations. In fields like Physics, these models are often given in the form of differential equations, whose solutions explain and predict the data. In most other fields, such differential equations are not known. Often, models have to include sources of uncertainty and randomness. Given a set of data, *fitting* a model to the data is the process of tuning the parameters of the model to *best* explain the data.

When a model has a linear dependence on its parameters, such as $a x^2 + b x + c$, this process is known as *linear regression*. When a model has a non-linear dependence on its parameters, such as $ a e^{bx} $, this process in known as non-linear regression. Thus, fitting data to a straight line model of $m x + b $ is linear regression, because of its linear dependence on $m$ and $b$ (rather than $x$).

A classical example of fitting a model is finding the slope and intercept of a straight line that goes through a set of data points $\{x_i,y_i\}$. For a straight line the model is:

$$ y_{model}(x) = mx + b $$Given this model, we can define a metric, or *cost function*, that quantifies the error the model makes. One commonly used metric is $\chi^2$, which depends on the deviation of the model from each data point ($y_i - y_{model}(x_i)$) and the measured uncertainty of each data point $ \sigma_i$:

When $\chi^2$ is small, the model's predictions will be close the data points. Likewise, when $\chi^2$ is large, the model's predictions will be far from the data points. Given this, our task is to minimize $\chi^2$ with respect to the model parameters $\theta = [m, b]$ in order to find the best fit.

To illustrate linear regression, let's create a synthetic data set with a known slope and intercept, but random noise that is additive and normally distributed.

```
In [4]:
```N = 50
m_true = 2
b_true = -1
dy = 2.0 # uncertainty of each point
np.random.seed(0)
xdata = 10 * np.random.random(N) # don't use regularly spaced data
ydata = b_true + m_true * xdata + np.random.normal(0.0, dy, size=N) # our errors are additive
plt.errorbar(xdata, ydata, dy,fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');

```
```

It is useful to see visually how changing the model parameters changes the value of $\chi^2$. By using IPython's `interact`

function, we can create a user interface that allows us to pick a slope and intercept interactively and see the resulting line and $\chi^2$ value.

Here is the function we want to minimize. Note how we have combined the two parameters into a single parameters vector $\theta = [m, b]$, which is the first argument of the function:

```
In [5]:
```def chi2(theta, x, y, dy):
# theta = [b, m]
return np.sum(((y - theta[0] - theta[1] * x) / dy) ** 2)

```
In [6]:
```def manual_fit(b, m):
modely = m*xdata + b
plt.plot(xdata, modely)
plt.errorbar(xdata, ydata, dy,fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y')
plt.text(1, 15, 'b={0:.2f}'.format(b))
plt.text(1, 12.5, 'm={0:.2f}'.format(m))
plt.text(1, 10.0, '$\chi^2$={0:.2f}'.format(chi2([b,m],xdata,ydata, dy)))

```
In [7]:
```interact(manual_fit, b=(-3.0,3.0,0.01), m=(0.0,4.0,0.01));

```
```

Go ahead and play with the sliders and try to:

- Find the lowest value of $\chi^2$
- Find the "best" line through the data points.

You should see that these two conditions coincide.

`scipy.optimize.minimize`

. We have already defined the function we want to minimize, `chi2`

, so we only have to pass it to `minimize`

along with an initial guess and the additional arguments (the raw data):

```
In [8]:
```theta_guess = [0.0,1.0]
result = opt.minimize(chi2, theta_guess, args=(xdata,ydata,dy))

Here are the values of $b$ and $m$ that minimize $\chi^2$:

```
In [9]:
```theta_best = result.x
print(theta_best)

```
```

These values are close to the true values of $b=-1$ and $m=2$. The reason our values are different is that our data set has a limited number of points. In general, we expect that as the number of points in our data set increases, the model parameters will converge to the true values. But having a limited number of data points is not a problem - it is a reality of most data collection processes.

We can plot the raw data and the best fit line:

```
In [10]:
```xfit = np.linspace(0,10.0)
yfit = theta_best[1]*xfit + theta_best[0]
plt.plot(xfit, yfit)
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');

```
```

*least squares* regression, because we are minimizing the sum of squares of the deviations. The linear version of this is known as *linear least squares*. For this case, SciPy provides a purpose built function, `scipy.optimize.leastsq`

. Instead of taking the $\chi^2$ function to minimize, `leastsq`

takes a function that computes the deviations:

```
In [11]:
```def deviations(theta, x, y, dy):
return (y - theta[0] - theta[1] * x) / dy
result = opt.leastsq(deviations, theta_guess, args=(xdata, ydata, dy), full_output=True)

Here we have passed the `full_output=True`

option. When this is passed the covariance matrix $\Sigma_{ij}$ of the model parameters is also returned. The uncertainties (as standard deviations) in the parameters are the square roots of the diagonal elements of the covariance matrix:

A proof of this is beyond the scope of the current notebook.

```
In [12]:
```theta_best = result[0]
theta_cov = result[1]
print('b = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('m = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))

```
```

We can again plot the raw data and best fit line:

```
In [13]:
```yfit = theta_best[0] + theta_best[1] * xfit
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray');
plt.plot(xfit, yfit, '-b');

```
```

SciPy also provides a general curve fitting function, `curve_fit`

, that can handle both linear and non-linear models. This function:

- Allows you to directly specify the model as a function, rather than the cost function (it assumes $\chi^2$).
- Returns the covariance matrix for the parameters that provides estimates of the errors in each of the parameters.

Let's apply `curve_fit`

to the above data. First we define a model function. The first argument should be the independent variable of the model.

```
In [14]:
```def model(x, b, m):
return m*x+b

`curve_fit`

passing the model function and the raw data. The uncertainties of each data point are provided with the `sigma`

keyword argument. If there are no uncertainties, this can be omitted. By default the uncertainties are treated as relative. To treat them as absolute, pass the `absolute_sigma=True`

argument.

```
In [15]:
```theta_best, theta_cov = opt.curve_fit(model, xdata, ydata, sigma=dy)

Again, display the optimal values of $b$ and $m$ along with their uncertainties:

```
In [16]:
```print('b = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('m = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))

```
```

We can again plot the raw data and best fit line:

```
In [17]:
```xfit = np.linspace(0,10.0)
yfit = theta_best[1]*xfit + theta_best[0]
plt.plot(xfit, yfit)
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray')
plt.xlabel('x')
plt.ylabel('y');

```
```

So far we have been using a linear model $y_{model}(x) = m x +b$. Remember this model was linear, not because of its dependence on $x$, but on $b$ and $m$. A non-linear model will have a non-linear dependece on the model parameters. Examples are $A e^{B x}$, $A \cos{B x}$, etc. In this section we will generate data for the following non-linear model:

$$y_{model}(x) = Ae^{Bx}$$and fit that data using `curve_fit`

. Let's start out by using this model to generate a data set to use for our fitting:

```
In [20]:
```npoints = 20
Atrue = 10.0
Btrue = -0.2
xdata = np.linspace(0.0, 20.0, npoints)
dy = np.random.normal(0.0, 0.1, size=npoints)
ydata = Atrue*np.exp(Btrue*xdata) + dy

Plot the raw data:

```
In [21]:
```plt.plot(xdata, ydata, 'k.')
plt.xlabel('x')
plt.ylabel('y');

```
```

```
In [22]:
```def exp_model(x, A, B):
return A*np.exp(x*B)

Then use `curve_fit`

to fit the model:

```
In [24]:
```theta_best, theta_cov = opt.curve_fit(exp_model, xdata, ydata)

Our optimized parameters are close to the true values of $A=10$ and $B=-0.2$:

```
In [25]:
```print('A = {0:.3f} +/- {1:.3f}'.format(theta_best[0], np.sqrt(theta_cov[0,0])))
print('B = {0:.3f} +/- {1:.3f}'.format(theta_best[1], np.sqrt(theta_cov[1,1])))

```
```

Plot the raw data and fitted model:

```
In [26]:
```xfit = np.linspace(0,20)
yfit = exp_model(xfit, theta_best[0], theta_best[1])
plt.plot(xfit, yfit)
plt.plot(xdata, ydata, 'k.')
plt.xlabel('x')
plt.ylabel('y');

```
```

Another approach to dealing with non-linear models is to linearize them with a transformation. For example, the exponential model used above,

$$y_{model}(x) = Ae^{Bx},$$can be linearized by taking the natural log of both sides:

$$ ln(y) = ln(A) + B x $$This model is linear in the parameters $ln(A)$ and $B$ and can be treated as a standard linear regression problem. This approach is used in most introductory physics laboratories. **However, in most cases, transforming to a linear model will give a poor fit. The reasons for this are a bit subtle, but here is the basic idea:

- Least squares regression assumes that errors are symmetric, additive and normally distributed. This assumption has been present throughout this notebook, when we generated data by
*adding*a small amount of randomness to our data using`np.random.normal`

. - Transforming the data with a non-linear transformation, such as the square root, exponential or logarithm will not lead to errors that follow this assumption.
- However, in the rare case that there are no (minimal) random errors in the original data set, the transformation approach will give the same result as the non-linear regression on the original model.

Here is a nice discussion of this in the Matlab documentation.

In all of the examples in this notebook, we started with a model and used that model to generate data. This was done to make it easy to check the predicted model parameters with the true values used to create the data set. However, in the real world, you almost never know the model underlying the data. Because of this, there is an additional step called *model selection* where you have to figure out a way to pick a good model. This is a notoriously difficult problem, especially when the randomness in the data is large.

- Pick the simplest possible model. In general picking a more complex model will give a better fit. However, it won't be a useful model and will make poor predictions about future data. This is known as overfitting.
- Whenever possible, pick a model that has a underlying theoretical foundation or motivation. For example, in Physics, most of our models come from well tested differential equations.
- There are more advanced methods (AIC,BIC) that can assist in this model selection process. A good discussion can be found in this notebook by Jake VanderPlas.