This notebook was put together by Jake VanderPlas for UW's [Astro 599](http://www.astro.washington.edu/users/vanderplas/Astr599_2014/) course. Source and license info is on [GitHub](https://github.com/jakevdp/2014_fall_ASTR599/).
adapted from material by Andrew Becker
Today we're going to go over several methods for fitting a model to data.
A classic example of this is finding a "line of best fit": that is, given some data $\{x_i, y_i\}$, find a best-fit line defined by the slope $m$ and the intercept $b$. Below we'll think more generally about model fitting and show some means of doing model fitting in Python.
We'll present some of the formalism (you can read about the statistical details elsewhere), and show how to approach this in Python.
Let's create some fake data to try this out:
In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
N = 50
m_true = 2
b_true = -1
dy = 2.0
np.random.seed(0)
xdata = 10 * np.random.random(N)
ydata = np.random.normal(b_true + m_true * xdata, dy)
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray');
For our straight line, the model is given by
$$ y_{model}(x) = mx + b $$
Given this model, we can define a metric of "goodness of fit". One commonly used metric is the $\chi^2$, defined as follows:
$$ \chi^2 = \sum_{i=1}^N \left(\frac{y_i - y_{model}(x_i)}{\sigma_y}\right)^2 $$
(Note that if you're used to thinking in terms of likelihood: for Gaussian errors the likelihood is $\mathcal{L}\propto e^{-\chi^2/2}$, so minimizing the $\chi^2$ is equivalent to maximizing the likelihood.) Given this, our task is to minimize the $\chi^2$ with respect to the model parameters $\theta = [m, b]$ in order to find the best fit.
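Before reaching for an optimizer, it is instructive to note that we could simply evaluate the $\chi^2$ by brute force over a grid of candidate $(b, m)$ values and pick the minimum. A minimal sketch (the grid ranges here are arbitrary choices for illustration):
# Brute-force illustration: evaluate chi2 on a (b, m) grid and take the minimum.
# The grid ranges are arbitrary choices; they just need to bracket the true values.
b_grid = np.linspace(-5, 5, 101)
m_grid = np.linspace(0, 4, 101)
B, M = np.meshgrid(b_grid, m_grid)
chi2_grid = np.sum(((ydata - B[..., None] - M[..., None] * xdata) / dy) ** 2, axis=-1)
i, j = np.unravel_index(np.argmin(chi2_grid), chi2_grid.shape)
print("grid minimum: b ~ {0:.2f}, m ~ {1:.2f}".format(B[i, j], M[i, j]))
Of course, a grid search scales poorly with the number of parameters, which is where optimizers come in.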
The scipy.optimize
package has many optimization routines which are useful in a variety of situations:
In [2]:
from scipy import optimize
optimize?
The simplest go-to method is optimize.fmin
, which minimizes any given function
In [3]:
optimize.fmin?
We'll start by defining a function for our $\chi^2$:
In [4]:
def chi2(theta, x, y, dy):
# theta = [b, m]
return np.sum(((y - theta[0] - theta[1] * x) / dy) ** 2)
Next we call fmin
to optimize this function. We need to give an initial guess for $\theta$:
In [5]:
theta_guess = [0, 1]
theta_best = optimize.fmin(chi2, theta_guess, args=(xdata, ydata, dy))
print(theta_best)
Now let's plot our data and error:
In [6]:
xfit = np.linspace(0, 10)
yfit = theta_best[0] + theta_best[1] * xfit
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray');
plt.plot(xfit, yfit, '-k');
Now you can see that we've found a best-fit line for our data!
The function we used above, scipy.optimize.fmin
, is a general function which works for (nearly) any cost function.
Above we've used it for a linear model; it turns out that there are much simpler ways to find results for a linear model. One of these is scipy.optimize.leastsq
:
In [7]:
optimize.leastsq?
Here rather than computing the $\chi^2$, we compute an array of deviations from the model:
In [8]:
def deviations(theta, x, y, dy):
return (y - theta[0] - theta[1] * x) / dy
theta_best, ier = optimize.leastsq(deviations, theta_guess, args=(xdata, ydata, dy))
print(theta_best)
Again, we can plot the result:
In [9]:
yfit = theta_best[0] + theta_best[1] * xfit
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray');
plt.plot(xfit, yfit, '-k');
The nice thing about least-squares fitting is that it also includes a means of estimating the errors on the fit parameters, which we can obtain by passing the argument full_output=True:
In [10]:
results = optimize.leastsq(deviations, theta_guess,
args=(xdata, ydata, dy),
full_output=True)
theta_best = results[0]
covariance = results[1]
print(covariance)
This gives the covariance matrix of the uncertainties on the fit parameters; the square roots of the diagonal terms give the errors on the individual parameters (ignoring, for simplicity, the correlation encoded in the off-diagonal terms), so we can summarize the fit as follows:
In [11]:
print("y = ({0:.2f} +/- {1:.2f})x + ({2:.2f} +/- {3:.2f})"
"".format(theta_best[1], np.sqrt(covariance[1, 1]),
theta_best[0], np.sqrt(covariance[0, 0])))
Evidently the intercept is much less certain than the slope.
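As a side note, scipy.optimize.curve_fit wraps this kind of least-squares fit and returns both the best-fit parameters and their covariance matrix in one call. A minimal sketch (line_model is a helper defined here just for illustration):
# curve_fit returns the best-fit parameters and their covariance together.
# line_model is our own helper; absolute_sigma=True treats dy as absolute errors.
def line_model(x, b, m):
    return b + m * x

popt, pcov = optimize.curve_fit(line_model, xdata, ydata, p0=theta_guess,
                                sigma=dy * np.ones_like(ydata), absolute_sigma=True)
print(popt)
print(np.sqrt(np.diag(pcov)))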
Getting even more specialized, for a linear equation like the one above, it is possible to solve for the slope and intercept via a straightforward linear algebraic operation. For the case where all errors $dy$ are the same, we simply construct matrices
$$ X = \left[ \begin{array}{lllll} 1&1&1&\cdots&1\\ x_1&x_2&x_3&\cdots&x_N \end{array} \right]^T, \qquad y = [y_1, y_2, y_3, \cdots, y_N]^T $$
We want to solve the linear equation
$$ y = X\theta $$
for the optimal $\theta$. We can solve this equation using either np.linalg.solve
(for well-determined, full-rank problems) or np.linalg.lstsq
(for over- or under-determined least-squares problems):
In [12]:
np.linalg.solve?
In [13]:
np.linalg.lstsq?
In [14]:
Xdata = np.vstack([np.ones_like(xdata), xdata]).T
theta_best, resid, rank, singvals = np.linalg.lstsq(Xdata, ydata)
print(theta_best)
In [15]:
yfit = theta_best[0] + theta_best[1] * xfit
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray');
plt.plot(xfit, yfit, '-k');
It is also possible to construct appropriate $X$ and $y$ matrices for the case when the errors are not the same; that construction can be found in many introductory texts.
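One common construction, sketched below, is simply to divide each row of $X$ and each element of $y$ by the corresponding error before solving; with a constant dy this reproduces the result above:
# Weighted least squares: scale each row of X and each y by 1/sigma.
# Here dy is a constant; with per-point errors it would be an array dy[:, None].
Xw = Xdata / dy
yw = ydata / dy
theta_w, resid_w, rank_w, sv_w = np.linalg.lstsq(Xw, yw)
print(theta_w)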
As a side note, you should realize that np.linalg.lstsq
is essentially just a wrapper around some simple matrix operations; we can do them ourselves as follows:
In [16]:
# compute the least-squares solution by-hand!
theta_best = np.linalg.solve(np.dot(Xdata.T, Xdata),
np.dot(Xdata.T, ydata))
covariance = dy ** 2 * np.linalg.inv(np.dot(Xdata.T, Xdata))
print("y = ({0:.2f} +/- {1:.2f})x + ({2:.2f} +/- {3:.2f})"
"".format(theta_best[1], np.sqrt(covariance[1, 1]),
theta_best[0], np.sqrt(covariance[0, 0])))
Notice that this is the same result we got above. You can consult an intro statistics book to see where these expressions come from.
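For reference, these expressions come from minimizing $\chi^2 = |y - X\theta|^2 / \sigma^2$ with respect to $\theta$; setting the gradient to zero gives the normal equations
$$ \frac{\partial \chi^2}{\partial \theta} = 0 \quad\Rightarrow\quad X^T X\,\hat{\theta} = X^T y \quad\Rightarrow\quad \hat{\theta} = (X^T X)^{-1} X^T y, \qquad \Sigma_\theta = \sigma^2 (X^T X)^{-1} $$
which is exactly what the cell above computes.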
For general optimization problems, you can use scipy.optimize.fmin
.
For least-squares optimization problems, you can use scipy.optimize.leastsq
.
For linear least-squares optimization problems, you can use np.linalg.lstsq
, or you can go through the formalism and compute the result via standard linear algebraic operations.
How do you know what model to fit to data? Note that as your model gets more and more complex, you can do a better and better job fitting the data.
For example, rather than a linear model we could use
$$ y = \theta_0 + \theta_1 x + \theta_2 x^2 $$
Let's use np.polyfit
to fit this model to our data:
In [17]:
def fit_polynomial(deg=2):
p = np.polyfit(xdata, ydata, deg=deg)
yfit = np.polyval(p, xfit)
plt.errorbar(xdata, ydata, dy,
fmt='.k', ecolor='lightgray');
plt.plot(xfit, yfit);
rms = np.sqrt(np.mean((ydata - np.polyval(p, xdata)) ** 2))
print("rms error (deg={0}): {1:.2f}".format(deg, rms))
In [18]:
fit_polynomial(2)
As a rule, when we increase the degree of the polynomial, we get a better fit!
In [19]:
fit_polynomial(10)
But is this actually a better fit? There are well-understood ways to determine this.
As you add more complexity to the model, the goodness of fit will always improve (unless your code has a bug), until eventually $\chi^2 = 0$. The rule of thumb is that, for the extra complexity to be justified, the improvement in $\chi^2$ should be far greater than the number of terms you add.
For more statistical rigor (see Numerical Recipes section 15.6): under the null hypothesis that the added parameters are unnecessary, when adding N parameters to a model, $\Delta \chi^2_N$ is distributed as a $\chi^2$ with N degrees of freedom. The table below lists the $\Delta \chi^2_N$ threshold as a function of confidence level and N:
p | $\Delta \chi^2$ (N=1) | $\Delta \chi^2$ (N=2) | $\Delta \chi^2$ (N=3)
---|---|---|---
68.3% | 1.00 | 2.30 | 3.53
95.4% | 4.00 | 6.17 | 8.02
99.73% | 9.00 | 11.8 | 14.2
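These thresholds can be reproduced (or extended to other confidence levels and other N) from the $\chi^2$ distribution in scipy.stats; a quick check:
# Reproduce the Delta chi2 thresholds above using the inverse CDF (ppf)
# of a chi2 distribution with N degrees of freedom.
from scipy import stats
for p in [0.683, 0.954, 0.9973]:
    print(p, [round(stats.chi2.ppf(p, df=n), 2) for n in (1, 2, 3)])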
"Overfitting" is a common problem in model selection. You typically want a "parsimonious" model that captures the important information using a small number of parameters. This model is frequently selected by looking at the bias vs. variance trade-off.
Bias = $\langle \mathrm{model} - \mathrm{truth} \rangle$ : the systematic offset of the fitted model from the true relationship.
Variance = $\langle (\mathrm{model} - \langle \mathrm{model} \rangle)^2 \rangle$ : how much the fitted model changes from one realization of the data to the next.
(Here the original notes showed the classic 2×2 grid of bullseye diagrams illustrating the combinations of low/high bias and low/high variance.)
When determining model complexity, there are other metrics to use than $\Delta \chi^2$, especially when comparing different classes of models, e.g. the mean squared error:
MSE = bias$^2$ + variance
(Figure and further discussion: http://scott.fortmann-roe.com)
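To make the trade-off concrete, one can simulate many datasets from the same true line used at the start of the notebook, fit a low-order and a high-order polynomial to each, and compare how stable the predictions are. A rough sketch (the test point x=5 and the number of trials are arbitrary choices):
# Rough bias/variance illustration: refit many simulated datasets and compare
# the stability of deg=1 vs deg=10 predictions at a fixed test point.
np.random.seed(1)  # arbitrary seed, for reproducibility
x_test = 5.0
y_true = b_true + m_true * x_test
preds = {1: [], 10: []}
for trial in range(200):
    x_sim = 10 * np.random.random(N)
    y_sim = np.random.normal(b_true + m_true * x_sim, dy)
    for deg in preds:
        p = np.polyfit(x_sim, y_sim, deg=deg)
        preds[deg].append(np.polyval(p, x_test))
for deg in sorted(preds):
    vals = np.array(preds[deg])
    print("deg={0:2d}: bias = {1:6.3f}, variance = {2:6.3f}".format(
        deg, vals.mean() - y_true, vals.var()))
The thing to look for: the higher-degree fit is not systematically biased, but its predictions scatter more from one simulated dataset to the next.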
There are principled methods that help you decide the order of your model by adding a penalty term to the $\chi^2$. These include the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC), where Nterm is the number of fit parameters and Ndata is the number of data points:
BIC = $\chi^2$ + Nterm $\cdot$ ln(Ndata)
AIC = $\chi^2$ + Nterm $\cdot$ 2
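Written as a small helper function (a sketch; information_criteria is just a convenience name defined here):
# Convenience helper implementing the penalized statistics defined above.
def information_criteria(chi2_value, n_params, n_data):
    bic = chi2_value + n_params * np.log(n_data)
    aic = chi2_value + n_params * 2.0
    return bic, aic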
As an example, we will fit a Fourier series to a folded periodic variable star lightcurve from the LINEAR survey.
The code below requires the astroML package, which you can easily install by running
$ pip install astroML
We will examine how $\chi^2$, BIC, and AIC vary as we change the number of terms in the Fourier series. If we do not get through this whole exercise, your homework will be to finish the notebook and then repeat the analysis for another variable star from the data. As a function of the number of terms, I would like to see: $\chi^2$, $\Delta \chi^2$ compared to the previous model, the significance of that $\Delta \chi^2$, the AIC, and the BIC. Indicate the order of the series you would choose based on each of these numbers (they may be different for $\Delta \chi^2$, AIC, and BIC).
In [20]:
# Uncomment and run this to install astroML
# !pip install astroML
# !pip install astroML_addons
In [21]:
%matplotlib inline
import astroML.datasets
import numpy as np
import matplotlib.pyplot as plt
In [22]:
lcs = astroML.datasets.fetch_LINEAR_sample()
In [23]:
print(lcs)
In [24]:
time, flux, dflux = lcs[18525697].T
Plot the raw lightcurve
In [25]:
plt.errorbar(time, flux, yerr=dflux, fmt="ro")
plt.gca().invert_yaxis()
plt.xlabel("Mean Julian Day")
plt.ylabel("Mag")
plt.show()
In this case, we know it's a periodic variable star. Whether an object is periodic at all is the subject of a whole other lecture; here we take it as given and attempt to find the optimal period using a Lomb-Scargle analysis (itself the subject of yet another lecture).
The periodogram power essentially measures how much a sinusoidal fit at each trial period reduces the $\chi^2$, so the maximum of the periodogram corresponds to the best-fit period.
In [26]:
from astroML.time_series import lomb_scargle
periods = np.logspace(-1, 0, 10000, base=10)
periodogram = lomb_scargle(time, flux, dflux, omega=2 * np.pi / periods, generalized=True)
In [27]:
plt.plot(periods, periodogram)
plt.semilogx()
plt.xlabel("Period (days)")
plt.ylabel("Power")
plt.show()
Find the peak of this periodogram using argmax
In [28]:
idx = np.argmax(periodogram)
print(periods[idx], periodogram[idx])
Now "fold" the lightcurve at this best period. By "folding" I mean find the phase at which the star is in its oscillation (phase is defined here as running from 0 to 1), for every data point. This is the MJD / period (e.g. 81761.7165) minus the integer portion of this value MJD // period (e.g. 81761) = 0.7165.
In [30]:
period = periods[idx]
phase = time / period - time // period
print(min(phase), max(phase))
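Equivalently, the fold can be written with the modulo operator:
# Equivalent fold using the modulo operator
phase_alt = (time % period) / period
print(np.allclose(phase, phase_alt))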
In [31]:
plt.errorbar(phase, flux, yerr=dflux, fmt="ro")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Mag")
plt.show()
Now we need to get serious. We want a function we can call to return the $\chi^2$ for a model fit. It makes a lot of sense to have this be a method on a class; the class can then hold the data it is fitting as member variables (e.g. self.flux). We also want an evaluate() method that returns the model lightcurve for a given set of phases, so that we can plot it. Design a very basic class to do this. The lightcurve model that this class will fit to the data is a Fourier series of the form
$$ y_{model}(\phi) = A_0 + \sum_{n=1}^{N} A_n \cos(2\pi n \phi + \phi_n) $$
with model parameters [ A$_0$ A$_1$ $\phi_1$ A$_2$ $\phi_2$ ... A$_N$ $\phi_N$ ]
Let's create a class that implements this model:
In [32]:
class Fourier(object):
def __init__(self, phase, flux, dflux, nterms):
self.phase = phase
self.flux = flux
self.dflux = dflux
self.nterms = nterms
assert(self.nterms > 0)
def evaluate(self, phase, terms):
assert(len(terms) == 2 * self.nterms - 1)
model = terms[0] * np.ones(len(phase))
for i in range(self.nterms-1):
model += terms[2*i + 1] * np.cos(2 * np.pi * (i+1) * phase + terms[2*i + 2])
return model
def chi2(self, args):
model = self.evaluate(self.phase, args)
chi = (model - self.flux) / self.dflux
return np.sum(chi**2)
In [33]:
fourier = Fourier(phase, flux, dflux, 1)
mphase = np.arange(0, 1, 0.01)
model = fourier.evaluate(mphase, [10])
In [34]:
plt.plot(mphase, model, "k-")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Model mag")
plt.show()
Now for a more interesting model, with some actual values:
In [35]:
fourier = Fourier(phase, flux, dflux, 2)
mphase = np.arange(0, 1, 0.01)
model = fourier.evaluate(mphase, [10, 0.1, 0.0])
In [36]:
plt.plot(mphase, model, "k-")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Model mag")
plt.show()
Let's check what this looks like if we change the phase:
In [37]:
fourier = Fourier(phase, flux, dflux, 2)
mphase = np.arange(0, 1, 0.01)
model = fourier.evaluate(mphase, [10, 0.1, np.pi])
plt.plot(mphase, model, "k-")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Model mag")
plt.show()
Since we have a complicated model, let's use optimize.fmin
, as we did above.
We'll fit a 2 term model to the data (1 constant term, and 1 term with an amplitude and phase offset)
In [38]:
fourier = Fourier(phase, flux, dflux, 2)
guess = []  # one entry per fit parameter: here the mean magnitude, the amplitude, and the phase offset
guess.append(np.mean(fourier.flux))
guess.append(np.std(fourier.flux))
We'll initialize the phase offset using the phase of minimum brightness (i.e. maximum magnitude):
In [39]:
print(np.argmax(fourier.flux))
In [40]:
print(fourier.flux[np.argmax(fourier.flux)])
In [41]:
print(fourier.phase[np.argmax(fourier.flux)])
In [42]:
guess.append(2 * np.pi * (1 - fourier.phase[np.argmax(fourier.flux)]))
Finally, we have our initial guess:
In [43]:
print(guess)
Make it so!
In [44]:
optimize.fmin(fourier.chi2, x0=[guess], disp=0, full_output=1)
Capture the output and look at the "best fit" parameters, i.e. those that minimize the return value of fourier.chi2:
In [45]:
result = optimize.fmin(fourier.chi2, x0=[guess], disp=0, full_output=1)
best_fit = result[0]
print(best_fit)
Plot the model, compared to the data and the initial guess:
In [46]:
plt.errorbar(fourier.phase, fourier.flux, yerr=fourier.dflux, fmt="ro")
plt.plot(mphase, fourier.evaluate(mphase, guess), "b--", label="Initial guess")
plt.plot(mphase, fourier.evaluate(mphase, best_fit), "g-", label="Best fit")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Mag")
plt.legend()
plt.show()
Looks "eh". What is the goodness of fit of the model? The reduced chi2 (i.e. chi2 per degree of freedom) gives an absolute "goodness of fit" to your model: you expect this to be a value near 1.0:
In [47]:
chi2_1 = result[1]
chi2_2 = fourier.chi2(best_fit)
print(chi2_1, chi2_2, chi2_1 / (len(flux) - len(best_fit)))
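We can also turn the $\chi^2$ into a rough goodness-of-fit probability: if the model and the error bars are correct, the minimized $\chi^2$ should follow a $\chi^2$ distribution with Ndata - Nparams degrees of freedom. A sketch using scipy.stats:
# Rough goodness-of-fit check: chance of getting a chi2 this large or larger
# by accident, if the model and error bars are correct.
from scipy import stats
dof = len(flux) - len(best_fit)
print(stats.chi2.sf(chi2_1, dof))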
In [48]:
fourier2 = Fourier(phase, flux, dflux, 3)
guess2 = []  # one entry per fit parameter, now 2*3 - 1 = 5
guess2.append(np.mean(fourier2.flux))
guess2.append(np.std(fourier2.flux))
guess2.append(2 * np.pi * (1 - fourier2.phase[np.argsort(fourier2.flux)[-1]]))
guess2.append(0.0)
guess2.append(2 * np.pi * (1 - fourier2.phase[np.argsort(fourier2.flux)[-1]]))
result2 = optimize.fmin_bfgs(fourier2.chi2, x0=[guess2], disp=0, full_output=1)
best_fit2 = result2[0]
print(best_fit2)
In [49]:
plt.errorbar(fourier2.phase, fourier2.flux, yerr=fourier2.dflux, fmt="ro")
plt.plot(mphase, fourier2.evaluate(mphase, guess2), "b--", label="Initial guess")
plt.plot(mphase, fourier2.evaluate(mphase, best_fit2), "g-", label="Best fit")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Mag")
plt.show()
More cowbell!
In [50]:
fourier5 = Fourier(phase, flux, dflux, 5)
guess5 = []  # one entry per fit parameter, now 2*5 - 1 = 9
guess5.append(np.mean(fourier5.flux))
guess5.append(np.std(fourier5.flux))
guess5.append(2 * np.pi * (1 - fourier5.phase[np.argsort(fourier5.flux)[-1]]))
for i in range(2, 5):
guess5.append(0.0)
guess5.append(2 * np.pi * (1 - fourier5.phase[np.argsort(fourier5.flux)[-1]]))
result5 = optimize.fmin_bfgs(fourier5.chi2, x0=[guess5], disp=0, full_output=1)
best_fit5 = result5[0]
print(best_fit5)
In [51]:
plt.errorbar(fourier5.phase, fourier5.flux, yerr=fourier5.dflux, fmt="ro")
plt.plot(mphase, fourier5.evaluate(mphase, guess5), "b--", label="Initial guess")
plt.plot(mphase, fourier5.evaluate(mphase, best_fit5), "g-", label="Best fit")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Mag")
plt.show()
I'm getting tired of initializing "guess" by hand every time. We can add a method to the class that does this for us! Let's inherit from the main Fourier class, so that we only have to implement that one new method. Part of the point here is to demonstrate class inheritance.
In [52]:
class Fourier2(Fourier):
def __init__(self, phase, flux, dflux, nterms):
Fourier.__init__(self, phase, flux, dflux, nterms)
def makeGuess(self):
guess = []
guess.append(np.mean(self.flux))
if self.nterms > 1:
guess.append(np.std(self.flux))
guess.append(2 * np.pi * (1 - self.phase[np.argsort(self.flux)[-1]]))
for i in range(2, self.nterms):
guess.append(0.0)
guess.append(2 * np.pi * (1 - self.phase[np.argsort(self.flux)[-1]]))
return guess
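A quick sanity check that the automated guess matches the guess2 we built by hand for the 3-term model:
# The automated guess should reproduce the hand-built guess2 from above.
fourier_check = Fourier2(phase, flux, dflux, 3)
print(fourier_check.makeGuess())
print(guess2)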
Now that everything is pipelined, let's run this for a range of models and plot how the $\chi^2$ improves as you increase the number of model parameters. Note that it will always decrease as you add terms, right up to the limit where you have one parameter for every data point, at which point your $\chi^2$ should be exactly 0.
In [53]:
nterms = np.arange(1, 10)
chi2 = []
for nterm in nterms:
fourier = Fourier2(phase, flux, dflux, nterm)
guess = fourier.makeGuess()
print(guess)
chi2.append(optimize.fmin_bfgs(fourier.chi2, x0=[guess], disp=0, full_output=1)[1])
chi2 = np.array(chi2)
plt.plot(nterms, chi2)
plt.xlabel("N Terms")
plt.ylabel("Chi2")
plt.show()
And what does $\Delta \chi^2$ look like? Note that it converges toward zero, i.e. you get less and less improvement with the addition of every new term. This is where we have to start being careful.
In [54]:
delta_chi2 = chi2[:-1] - chi2[1:]
plt.plot(nterms[1:], delta_chi2)
plt.xlabel("N terms")
plt.ylabel("Delta chi2")
plt.show()
# You will need to figure out the significance of all these delta_chi2 as part of your homework!
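As a hint for that homework step: each added Fourier term adds two parameters (an amplitude and a phase), so each jump can be compared against a $\chi^2$ distribution with 2 degrees of freedom. A sketch:
# Hint: each extra Fourier term adds 2 parameters (amplitude + phase), so
# compare each delta_chi2 against a chi2 distribution with df=2.
from scipy import stats
for n, d in zip(nterms[1:], delta_chi2):
    print("nterms={0}: delta_chi2={1:10.1f}, p-value={2:.3g}".format(n, d, stats.chi2.sf(d, df=2)))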
Let's start to look at the penalties imposed by the BIC and the AIC.
In [55]:
nterms = np.arange(1, 10)
chi2 = []
BICpenalty = []
AICpenalty = []
for nterm in nterms:
fourier = Fourier2(phase, flux, dflux, nterm)
guess = fourier.makeGuess()
BICpenalty.append(len(guess) * np.log(len(fourier.phase)))
AICpenalty.append(len(guess) * 2.0)
chi2.append(optimize.fmin_bfgs(fourier.chi2, x0=[guess], disp=0, full_output=1)[1])
chi2 = np.array(chi2)
BICpenalty = np.array(BICpenalty)
AICpenalty = np.array(AICpenalty)
In [56]:
print(chi2)
print(BICpenalty)
print(AICpenalty)
In [57]:
plt.plot(nterms, chi2, label="chi2")
plt.plot(nterms, chi2 + BICpenalty, label="BIC")
plt.plot(nterms, chi2 + AICpenalty, label="AIC")
plt.xlabel("N Terms")
plt.ylabel("Chi2")
plt.semilogy()
plt.legend()
plt.show()
You can roughly see that the BIC/AIC are telling you to stop adding terms. Let's make it more apparent by going to higher order:
In [58]:
nterms = np.arange(1, 30, 2)
chi2 = []
BICpenalty = []
AICpenalty = []
for nterm in nterms:
fourier = Fourier2(phase, flux, dflux, nterm)
guess = fourier.makeGuess()
BICpenalty.append(len(guess) * np.log(len(fourier.phase)))
AICpenalty.append(len(guess) * 2.0)
chi2.append(optimize.fmin_bfgs(fourier.chi2, x0=[guess], disp=0, full_output=1)[1])
chi2 = np.array(chi2)
BICpenalty = np.array(BICpenalty)
AICpenalty = np.array(AICpenalty)
In [59]:
plt.plot(nterms, chi2, label="chi2")
plt.plot(nterms, chi2 + BICpenalty, label="BIC")
plt.plot(nterms, chi2 + AICpenalty, label="AIC")
plt.xlabel("N Terms")
plt.ylabel("Chi2")
plt.semilogy()
plt.legend()
plt.show()
Which of the fits has the minimum BIC/AIC?
In [60]:
print(np.argmin(chi2 + BICpenalty), np.argmin(chi2 + AICpenalty))
And what order model do they suggest?
In [61]:
# look up the model order at the minimum of each criterion
print(nterms[np.argmin(chi2 + BICpenalty)], nterms[np.argmin(chi2 + AICpenalty)])
Note that these will typically be different. This is when you say to the referee, "We examined the order of model selection using both the BIC and the AIC, and find that our conclusions are insensitive to which model is preferred. Accordingly, we chose the most parsimonious model for our final analysis." If instead you find that your results depend significantly on the model order at this level, you should rethink how you're doing things (e.g. is your model's functional form correct?).
Homework: repeat the exercise with a different star of your choosing. Find the order of the model that delta chi2, AIC, and BIC would suggest.
Turn in the homework again via github, using the filename HW3.ipynb
.