This notebook was put together by Andrew Becker for UW's [Astro 599](http://www.astro.washington.edu/users/vanderplas/Astr599/) course. Source and license info is on [GitHub](https://github.com/jakevdp/2013_fall_ASTR599/).
Q1) How do you fit a model to data?
Define a metric that represents the "goodness of fit" of the model to the data, e.g. $\chi^2$:

$\chi^2 = \sum_{i=1}^{N_{\rm data}} \left(\frac{\mathrm{data}_i - \mathrm{model}_i}{\mathrm{uncertainty}_i}\right)^2$

Then run an optimization algorithm whose sole purpose is to minimize (or maximize) this metric by tweaking the model parameters. This is the process of maximizing likelihood / minimizing $\chi^2$, since

Likelihood $\propto e^{-\chi^2/2}$
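As a concrete sketch of the quantity the optimizer works on, here is a minimal $\chi^2$ (and log-likelihood) function; the function and variable names are illustrative, not from the notebook:

```python
import numpy as np

def chi_squared(data, model, uncertainty):
    # chi^2 = sum_i ((data_i - model_i) / uncertainty_i)^2
    return np.sum(((data - model) / uncertainty) ** 2)

def log_likelihood(data, model, uncertainty):
    # up to an additive constant, ln(Likelihood) = -chi^2 / 2
    return -0.5 * chi_squared(data, model, uncertainty)
```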
Q2) How do you know when to stop?
As you add more complexity to the model, the goodness of fit will always improve (unless your code has a bug, and until $\chi^2 = 0$). Since each added parameter reduces $\chi^2$ by about 1 through chance alone, the rule of thumb is that the improvement in $\chi^2$ should be far greater than the number of terms you add.
For more statistical rigor (Numerical Recipes section 15.6): when adding N parameters to a model, the improvement $\Delta \chi^2_N$ is distributed as a $\chi^2$ distribution with N degrees of freedom. The table below gives $\Delta \chi^2_N$ as a function of confidence level p and N (a quick scipy check follows the table):
| p | $\Delta \chi^2$, N=1 | $\Delta \chi^2$, N=2 | $\Delta \chi^2$, N=3 |
|---|---|---|---|
| 68.3% | 1.00 | 2.30 | 3.53 |
| 95.4% | 4.00 | 6.17 | 8.02 |
| 99.73% | 9.00 | 11.8 | 14.2 |
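These thresholds can be reproduced from the $\chi^2$ quantile function; a quick check using `scipy.stats` (assuming it is installed alongside the other packages used below):

```python
import scipy.stats

# Delta chi^2 threshold for confidence level p (here 1, 2, 3 sigma)
# and N added parameters
for p in (0.6827, 0.9545, 0.9973):
    print(p, [round(scipy.stats.chi2.ppf(p, df=N), 2) for N in (1, 2, 3)])
# reproduces the table above, to small rounding differences in p
```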
Q3) How do you know when you've gone too far?
"Overfitting" is a common problem in model selection. You typically want a "parsimonious" model that captures the important information using a small number of parameters. This model is frequently selected by looking at the bias vs. variance trade-off.
Bias = $\langle \mathrm{data} - \mathrm{model} \rangle$

Variance = $\langle (\mathrm{data} - \mathrm{model} - \mathrm{Bias})^2 \rangle$
| | Low Variance | High Variance |
|---|---|---|
| Low Bias | accurate and precise: the goal | right on average, but scattered |
| High Bias | tightly clustered, but systematically off | systematically off and scattered |
When determining model complexity, there are other metrics to use than $\Delta \chi^2$, especially when comparing different classes of models, e.g. the mean squared error:

MSE = $\langle (\mathrm{data} - \mathrm{model})^2 \rangle$ = Bias$^2$ + Variance
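A toy numerical check of this decomposition (an illustrative sketch; the deliberately biased estimator and all numbers here are made up for the example):

```python
import numpy as np

rng = np.random.RandomState(0)
truth = 10.0

# Repeatedly "observe" the truth with noise and apply a biased estimator
estimates = np.array([0.9 * np.mean(truth + rng.normal(0, 1.0, size=50))
                      for _ in range(10000)])

bias = np.mean(estimates) - truth
variance = np.var(estimates)
mse = np.mean((estimates - truth) ** 2)
print(mse, bias**2 + variance)  # the two agree up to sampling noise
```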
There are principled methods that help you decide the order of your model by adding a penalty term to $\chi^2$. These include the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC):

BIC = $\chi^2$ + $N_{\rm term} \cdot \ln(N_{\rm data})$

AIC = $\chi^2$ + $2 \cdot N_{\rm term}$
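Written out as helper functions (the names are mine; the notebook computes these penalties inline further below):

```python
import numpy as np

def BIC(chi2, nterm, ndata):
    # chi^2 plus a penalty that grows with both model size and sample size
    return chi2 + nterm * np.log(ndata)

def AIC(chi2, nterm):
    # chi^2 plus a fixed penalty of 2 per parameter
    return chi2 + 2.0 * nterm
```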
In [1]:
    
%matplotlib inline
import astroML.datasets
import numpy as np
import matplotlib.pyplot as plt
    
In [2]:
    
lcs = astroML.datasets.fetch_LINEAR_sample()
    
In [3]:
    
print(lcs)
    
    
In [4]:
    
time, flux, dflux = lcs[18525697].T
    
In [5]:
    
plt.errorbar(time, flux, yerr=dflux, fmt="ro")
plt.gca().invert_yaxis()
plt.xlabel("MJD")
plt.ylabel("Mag")
plt.show()
    
    
In [6]:
    
from astroML.time_series import lomb_scargle
periods = np.logspace(-2, 0, 10000, base=10)
periodogram = lomb_scargle(time, flux, dflux, omega=2 * np.pi / periods, generalized=True)
    
In [7]:
    
plt.plot(periods, periodogram)
plt.semilogx()
plt.xlabel("Period (days)")
plt.ylabel("Power")
plt.show()
    
    
In [8]:
    
idx = np.argsort(periodogram)  # ascending, so idx[-1] is the peak
print(periods[idx[-1]], periodogram[idx[-1]])
    
    
In [9]:
    
period = periods[idx[-1]]
phase = time / period - time // period  # fractional part: the phase, in [0, 1)
print(min(phase), max(phase))
    
    
In [10]:
    
plt.errorbar(phase, flux, yerr=dflux, fmt="ro")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Mag")
plt.show()
    
    
Model = $A_0 + A_1 \cos(2\pi \cdot \mathrm{phase} + \phi_1) + A_2 \cos(2\pi \cdot 2\,\mathrm{phase} + \phi_2) + \cdots + A_N \cos(2\pi \cdot N\,\mathrm{phase} + \phi_N)$

with model parameters $[A_0,\ A_1,\ \phi_1,\ A_2,\ \phi_2,\ \ldots,\ A_N,\ \phi_N]$
In [11]:
    
class Fourier(object):
    def __init__(self, phase, flux, dflux, nterms):
        self.phase = phase
        self.flux = flux
        self.dflux = dflux
        self.nterms = nterms
        assert self.nterms > 0

    def evaluate(self, phase, terms):
        # terms = [A0, A1, phi1, A2, phi2, ...]; length 2 * nterms - 1
        assert len(terms) == 2 * self.nterms - 1
        model = terms[0] * np.ones(len(phase))
        for i in range(self.nterms - 1):
            # harmonic i+1: amplitude terms[2*i + 1], phase offset terms[2*i + 2]
            model += terms[2*i + 1] * np.cos(2 * np.pi * (i+1) * phase + terms[2*i + 2])
        return model

    def chi2(self, args):
        # goodness of fit of this model to the stored light curve
        model = self.evaluate(self.phase, args)
        chi   = (model - self.flux) / self.dflux
        return np.sum(chi**2)
    
In [12]:
    
fourier = Fourier(phase, flux, dflux, 1)
mphase = np.arange(0, 1, 0.01)
model = fourier.evaluate(mphase, [10])  # nterms=1: a constant model at 10 mag
    
In [13]:
    
plt.plot(mphase, model, "k-")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Model mag")
plt.show()
    
    
In [14]:
    
fourier = Fourier(phase, flux, dflux, 2)
mphase = np.arange(0, 1, 0.01)
model = fourier.evaluate(mphase, [10, 0.1])  # raises AssertionError: nterms=2 requires 2*2-1 = 3 parameters
    
    
In [15]:
    
fourier = Fourier(phase, flux, dflux, 2)
mphase = np.arange(0, 1, 0.01)
model = fourier.evaluate(mphase, [10, 0.1, 0.0])
    
In [16]:
    
plt.plot(mphase, model, "k-")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Model mag")
plt.show()
    
    
In [17]:
    
fourier = Fourier(phase, flux, dflux, 2)
mphase = np.arange(0, 1, 0.01)
model = fourier.evaluate(mphase, [10, 0.1, np.pi])
plt.plot(mphase, model, "k-")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Model mag")
plt.show()
    
    
In [18]:
    
import scipy.optimize as opt
    
In [19]:
    
help(opt.fmin)
    
    
In [20]:
    
fourier = Fourier(phase, flux, dflux, 2)
guess   = [] # one entry per model parameter: here the mean magnitude, the amplitude, and the phase of minimum brightness
guess.append(np.mean(fourier.flux))  # A0: mean magnitude
guess.append(np.std(fourier.flux))   # A1: rough amplitude
    
In [21]:
    
print(np.argsort(fourier.flux)[-1])  # index of the faintest observation (largest magnitude)
    
    
In [22]:
    
print(fourier.flux[np.argsort(fourier.flux)[-1]])
    
    
In [23]:
    
print(fourier.phase[np.argsort(fourier.flux)[-1]])
    
    
In [24]:
    
guess.append(2 * np.pi * (1 - fourier.phase[np.argsort(fourier.flux)[-1]]))  # phi1: align the cosine peak with the observed phase of minimum brightness
    
In [25]:
    
print(guess)
    
    
In [26]:
    
opt.fmin_bfgs(fourier.chi2, x0=[guess], disp=0, full_output=1)
    
In [27]:
    
result = opt.fmin_bfgs(fourier.chi2, x0=[guess], disp=0, full_output=1)
best_fit = result[0]
print(best_fit)
    
    
In [28]:
    
plt.errorbar(fourier.phase, fourier.flux, yerr=fourier.dflux, fmt="ro")
plt.plot(mphase, fourier.evaluate(mphase, guess), "b--", label="Initial guess")
plt.plot(mphase, fourier.evaluate(mphase, best_fit), "g-", label="Best fit")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Mag")
plt.legend()
plt.show()
    
    
$\chi^2_\nu = \frac{\chi^2}{N_{\rm pts} - N_{\rm terms} - 1}$
In [29]:
    
chi2_1 = result[1]
chi2_2 = fourier.chi2(best_fit)
print(chi2_1, chi2_2, chi2_1 / (len(flux) - len(best_fit) - 1))  # last value: reduced chi^2
    
    
In [30]:
    
fourier2 = Fourier(phase, flux, dflux, 3)
guess2   = [] # one entry per model parameter, now 2*3-1 = 5
guess2.append(np.mean(fourier2.flux))
guess2.append(np.std(fourier2.flux))
guess2.append(2 * np.pi * (1 - fourier2.phase[np.argsort(fourier2.flux)[-1]]))
guess2.append(0.0)
guess2.append(2 * np.pi * (1 - fourier2.phase[np.argsort(fourier2.flux)[-1]]))
result2 = opt.fmin_bfgs(fourier2.chi2, x0=[guess2], disp=0, full_output=1)
best_fit2 = result2[0]
print(best_fit2)
    
    
In [31]:
    
plt.errorbar(fourier2.phase, fourier2.flux, yerr=fourier2.dflux, fmt="ro")
plt.plot(mphase, fourier2.evaluate(mphase, guess2), "b--", label="Initial guess")
plt.plot(mphase, fourier2.evaluate(mphase, best_fit2), "g-", label="Best fit")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Mag")
plt.show()
    
    
In [32]:
    
fourier5 = Fourier(phase, flux, dflux, 5)
guess5   = [] # one entry per model parameter, now 2*5-1 = 9
guess5.append(np.mean(fourier5.flux))
guess5.append(np.std(fourier5.flux))
guess5.append(2 * np.pi * (1 - fourier5.phase[np.argsort(fourier5.flux)[-1]]))
for i in range(2, 5):
    guess5.append(0.0)
    guess5.append(2 * np.pi * (1 - fourier5.phase[np.argsort(fourier5.flux)[-1]]))
result5 = opt.fmin_bfgs(fourier5.chi2, x0=[guess5], disp=0, full_output=1)
best_fit5 = result5[0]
print(best_fit5)
    
    
In [33]:
    
plt.errorbar(fourier5.phase, fourier5.flux, yerr=fourier5.dflux, fmt="ro")
plt.plot(mphase, fourier5.evaluate(mphase, guess5), "b--", label="Initial guess")
plt.plot(mphase, fourier5.evaluate(mphase, best_fit5), "g-", label="Best fit")
plt.gca().invert_yaxis()
plt.xlabel("Phase")
plt.ylabel("Mag")
plt.show()
    
    
In [34]:
    
class Fourier2(Fourier):
    def __init__(self, phase, flux, dflux, nterms):
        Fourier.__init__(self, phase, flux, dflux, nterms)
                
    def makeGuess(self):
        # same starting-point heuristic as before: mean magnitude, then
        # (amplitude, phase-of-minimum) pairs for each harmonic
        guess = []
        guess.append(np.mean(self.flux))
        if self.nterms > 1:
            guess.append(np.std(self.flux))
            guess.append(2 * np.pi * (1 - self.phase[np.argsort(self.flux)[-1]]))
            for i in range(2, self.nterms):
                guess.append(0.0)
                guess.append(2 * np.pi * (1 - self.phase[np.argsort(self.flux)[-1]]))
        return guess
    
In [35]:
    
nterms = np.arange(1, 10)
chi2   = []
for nterm in nterms:
    fourier = Fourier2(phase, flux, dflux, nterm)
    guess   = fourier.makeGuess()
    print(guess)
    chi2.append(opt.fmin_bfgs(fourier.chi2, x0=[guess], disp=0, full_output=1)[1])
chi2 = np.array(chi2)
plt.plot(nterms, chi2)
plt.xlabel("N Terms")
plt.ylabel("Chi2")
plt.show()
    
    
    
    
In [36]:
    
delta_chi2 = chi2[:-1] - chi2[1:]
plt.plot(nterms[1:], delta_chi2)
plt.xlabel("N terms")
plt.ylabel("Delta chi2")
plt.show()
# You will need to figure out the significance of all these delta_chi2 as part of your homework!
    
    
In [37]:
    
nterms     = np.arange(1, 10)
chi2       = []
BICpenalty = []
AICpenalty = []
for nterm in nterms:
    fourier = Fourier2(phase, flux, dflux, nterm)
    guess   = fourier.makeGuess()
    BICpenalty.append(len(guess) * np.log(len(fourier.phase)))
    AICpenalty.append(len(guess) * 2.0)
    chi2.append(opt.fmin_bfgs(fourier.chi2, x0=[guess], disp=0, full_output=1)[1])
chi2 = np.array(chi2)
BICpenalty = np.array(BICpenalty)
AICpenalty = np.array(AICpenalty)
    
In [38]:
    
print(chi2)
print(BICpenalty)
print(AICpenalty)
    
    
In [39]:
    
plt.plot(nterms, chi2, label="chi2")
plt.plot(nterms, chi2 + BICpenalty, label="BIC")
plt.plot(nterms, chi2 + AICpenalty, label="AIC")
plt.xlabel("N Terms")
plt.ylabel("Chi2")
plt.semilogy()
plt.legend()
plt.show()
    
    
In [40]:
    
nterms     = np.arange(1, 30, 2)
chi2       = []
BICpenalty = []
AICpenalty = []
for nterm in nterms:
    fourier = Fourier2(phase, flux, dflux, nterm)
    guess   = fourier.makeGuess()
    BICpenalty.append(len(guess) * np.log(len(fourier.phase)))
    AICpenalty.append(len(guess) * 2.0)
    chi2.append(opt.fmin_bfgs(fourier.chi2, x0=[guess], disp=0, full_output=1)[1])
chi2 = np.array(chi2)
BICpenalty = np.array(BICpenalty)
AICpenalty = np.array(AICpenalty)
    
In [41]:
    
plt.plot(nterms, chi2, label="chi2")
plt.plot(nterms, chi2 + BICpenalty, label="BIC")
plt.plot(nterms, chi2 + AICpenalty, label="AIC")
plt.xlabel("N Terms")
plt.ylabel("Chi2")
plt.semilogy()
plt.legend()
plt.show()
    
    
In [42]:
    
print(np.argsort(chi2 + BICpenalty)[0], np.argsort(chi2 + AICpenalty)[0])  # indices of the BIC and AIC minima
    
    
In [43]:
    
print(nterms[4], nterms[6])  # the numbers of terms preferred by BIC and AIC