In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
Today we will focus on a relatively simple problem, while highlighting several challenges for the standard astronomical workflow, as a way of setting up the various lectures that will happen over the course of this week.
A lot of the lessons in this lecture are inspired by the paper Data Analysis Recipes: Fitting a Model to Data by Hogg, Bovy, & Lang. [This paper has been mentioned previously in the DSFP, though today we will only be able to scratch the surface of its content.]
In some sense - the goal right now is to make you really nervous about the work that you've previously done.
(Though this lecture should not be met with too much consternation –– we will discuss new approaches to fitting a line)
There are many things we (may) want to do with these data: reduce them, visualize them, model them, develop predictions from them, use them to infer fundamental properties of the universe (!).
Before we dive into that really fun stuff, we should start with some basics:
Problem 1a
What is data?
Take a few min to discuss this with your partner
Solution 1a
write your answer here
Problem 1b
Below, I provide some data (in the form of numpy arrays). As good data scientists, what is the first thing you should do with this data?
Feel free to create a new cell if necessary.
In [ ]:
y = np.array([203, 58, 210, 202, 198, 158,
165, 201, 157, 131, 166, 160,
186, 125, 218, 146])
x = np.array([495, 173, 479, 504, 510, 416,
393, 442, 317, 311, 400, 337,
423, 334, 533, 344])
Problem 1c
[You may have already done this] Now that we understand the origin of the data, make a scatter plot showing their distribution.
In [ ]:
plt.scatter( # complete
# complete
# complete
Problem 2a
Is the assumption of linearity valid for the above data?
Is it convenient?
Take a few min to discuss this with your partner
Solution 2a
write your answer here
Let us proceed with convenience and assume the data represent a linear relationship. In that case, in order to make predictions for future observations, we need to fit a line to the data.
The "standard" proceedure for doing so is least-squares fitting. In brief, least-squares minimizes the sum of the squared value of the residuals between the data and the fitting function.
I've often joked that all you need to be a good data scientist is google and stack overflow. Via those two tools, we can quickly deduce that the easiest way to perform a linear least-squares fit to the above data is with np.polyfit, which performs a least-squares polynomial fit to two numpy arrays.
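To make the call signature concrete, here is a quick, hedged sketch on made-up arrays (x_demo and y_demo are hypothetical values chosen only for illustration; np is the numpy alias imported at the top of the notebook):

x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# degree-1 (linear) fit; np.polyfit returns coefficients from highest power
# to lowest, so p_demo = [slope, intercept]
p_demo = np.polyfit(x_demo, y_demo, 1)

# np.polyval evaluates the resulting polynomial at the supplied x values
y_fit = np.polyval(p_demo, x_demo)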
Problem 2b
Use np.polyfit() to fit a line to the data. Overplot the best-fit line on the data.
In [ ]:
p = np.polyfit( # complete
plt.scatter( # complete
# complete
There is a very good chance (though, again, I am not specifically assuming anything) that in the previous plots you plotted x along the abscissa and y along the ordinate.
[Honestly, there's no one to blame if this is the case; this has essentially been drilled into all of us from the moment we started making plots. In fact, in matplotlib the abscissa label can only be changed via xlabel.]
This leads us to an important question, however. What if y does not depend on x and instead x depends on y? Does that in any way change the results for the fit?
Problem 2c
Perform a linear least-squares fit to x vs. y (or if you already fit this, then reverse the axes). As above, plot the data and the best-fit model.
To test if the relation is the same between the two fits, compare the predicted y value for both models corresponding to x = 300.
In [ ]:
p_yx = np.polyfit( # complete
plt.scatter( # complete
# complete
print("For y vs. x, then x=50 would predict y={:.2f}".format( # complete
print("For x vs. y, then x=50 would predict y={:.2f}".format( # complete
So we have now uncovered one of the peculiarities of least-squares: fitting y vs. x is not the same as fitting x vs. y.
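To see this asymmetry with numbers you can fiddle with, here is a small sketch on simulated data (x_demo, y_demo, the seed, and the evaluation point x = 5 are arbitrary choices for illustration):

rng = np.random.default_rng(42)
x_demo = np.linspace(0, 10, 50)
y_demo = 2 * x_demo + 1 + rng.normal(0, 2, size=x_demo.size)

m_yx, b_yx = np.polyfit(x_demo, y_demo, 1)  # fit y as a function of x
m_xy, b_xy = np.polyfit(y_demo, x_demo, 1)  # fit x as a function of y

# predict y at x = 5 from each model; the x-vs-y fit must be inverted
print("y vs. x predicts y = {:.2f}".format(m_yx * 5 + b_yx))
print("x vs. y (inverted) predicts y = {:.2f}".format((5 - b_xy) / m_xy))
# the two predictions generally differ unless the data are perfectly linear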
There are two essential assumptions that go into standard least-squares fitting:

1. The independent variable (x) is known perfectly (i.e., it has negligible uncertainty).
2. The uncertainties on the dependent variable (y) are Gaussian, with known variance.
These two conditions are rarely met for astronomical data. While condition 1 can be satisfied (e.g., time series data where there is essentially no uncertainty on the time of the observations), I contend that condition 2 is rarely, if ever, satisfied.
Speaking of uncertainties(1), we have not utilized any thus far. [I hope this has raised some warning bells.]
We will now re-organize our data to match what is originally in Hogg, Bovy, & Lang (previously x and y were swapped).
(1) There is an amazing footnote in Hogg, Bovy, & Lang about "errors" vs. "uncertainties" - I suggest everyone read this.
Problem 2d
Re-plot the data, including the uncertainties.
In [ ]:
x = np.array([203, 58, 210, 202, 198, 158,
165, 201, 157, 131, 166, 160,
186, 125, 218, 146])
y = np.array([495, 173, 479, 504, 510, 416,
393, 442, 317, 311, 400, 337,
423, 334, 533, 344])
sigma_y = np.array([21, 15, 27, 14, 30, 16,
14, 25, 52, 16, 34, 31,
42, 26, 16, 22])
plt.errorbar( # complete
We are now assuming that x has negligible uncertainties and that y has uncertainties that can be perfectly described by Gaussians of known variance.
A portion of the appeal of least-squares is that it provides a deterministic method for determining the best fit. To understand that we now need to do a little linear algebra.
We can arrange the data in the following matrices:
$$ \mathbf{Y} = \left[ {\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} } \right], $$

$$ \mathbf{A} = \left[ {\begin{array}{cc} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{array} } \right], $$

$$ \mathbf{C} = \left[ {\begin{array}{cccc} \sigma_{y_1}^2 & 0 & \dots & 0 \\ 0 & \sigma_{y_2}^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma_{y_N}^2 \\ \end{array} } \right], $$

where $\mathbf{Y}$ is a vector of the observations and $\mathbf{C}$ is the covariance matrix.
Ultimately, we need to solve the equation
$$\mathbf{Y} = \mathbf{A}\mathbf{X}.$$

I am skipping the derivation, but the solution to this equation is:
$$ \left[ {\begin{array}{c} b \\ m \\ \end{array} } \right] = \mathbf{X} = \left[ \mathbf{A}^T \mathbf{C}^{-1} \mathbf{A}\right]^{-1} \left[ \mathbf{A}^T \mathbf{C}^{-1} \mathbf{Y}\right].$$

As noted in Hogg, Bovy, & Lang, this procedure minimizes the $\chi^2$ function, which is the total squared error after appropriately scaling by the uncertainties:
$$ \chi^2 = \sum_{i = 1}^{N} \frac{[y_i - f(x_i)]^2}{\sigma_{y_i}^2} = \left[ \mathbf{Y} - \mathbf{A}\mathbf{X}\right]^{T} \mathbf{C}^{-1} \left[ \mathbf{Y} - \mathbf{A} \mathbf{X}\right].$$
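As a sketch of how these matrix operations translate into numpy (x_demo, y_demo, and sigma_demo below are made-up placeholders, not the data from this notebook):

x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = np.array([2.1, 3.9, 6.2, 8.1])
sigma_demo = np.array([0.2, 0.3, 0.2, 0.4])

Y = y_demo[:, np.newaxis]                        # column vector of observations
A = np.vstack([np.ones_like(x_demo), x_demo]).T  # design matrix with columns [1, x]
C = np.diag(sigma_demo ** 2)                     # diagonal covariance matrix

# X = [A^T C^-1 A]^-1 [A^T C^-1 Y]
Cinv = np.linalg.inv(C)
X = np.linalg.inv(A.T @ Cinv @ A) @ (A.T @ Cinv @ Y)
b_fit, m_fit = X.ravel()                         # intercept, slope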
Problem 2e

Using the linear algebra equations above (i.e., avoid np.polyfit or any other similar functions), determine the weighted least-squares best-fit values for $b$ and $m$, the intercept and slope, respectively.
Plot the results of the best-fit line. How does this compare to the above estimates?
In [ ]:
Y = # complete
A = # complete
C = # complete
X = # complete
best_fit = # complete
plt.errorbar( # complete
plt.plot( # complete
print("The best-fit value for the slope and intercept are: {:.4f} and {:.4f}".format( # complete
Problem 2f
Confirm the results of this fit are the same as those from np.polyfit.
Hint - be sure to include the uncertainties.
In [ ]:
p = np.polyfit( # complete
print("The best-fit value for the slope and intercept are: {:.4f} and {:.4f}".format( # complete
Previously we noted that there are two essential assumptions that are required for least-squares fitting to be correct. We are now going to examine the latter requirement, namely, that the uncertainties can be perfectly described as Gaussians with known variance.
Earlier I stated this assumption is rarely satisfied. Why might this be the case?
In my experience (meaning this is hardly universal), if it's astro, it's got systematics. While I cannot prove this, I contend that systematic uncertainties are rarely Gaussian. If you are lucky enough to be in a regime where you can be confident that the systematics are Gaussian, I further contend that it is extremely difficult to be certain that the variance of that Gaussian is known.
Then there's another (astro-specific) challenge: in many circumstances, we aren't actually working with data, but rather with the results of other models applied to the data.
Let's take an optical astronomy (biased, but this is LSST after all) example. What are the data? In many cases inference is being based on measurements of brightness, but the true data in this case is simply a bunch of electron counts in a CCD. The brightness (or mag) is based on the application of a model (e.g., PSF, aperture, Kron) that is applied to the data. Thus, to assume that a flux (or mag) measurement has Gaussian uncertainties with known variance is to assume that whatever flux-measurement model has been applied always produces perfectly Gaussian uncertainties (and a lot of different assumptions go into flux-measurement models...)
As a demonstration associated with the challenges of these assumptions, we will examine the data set presented in Hogg, Bovy, & Lang.
In [ ]:
x = np.array([201, 201, 287, 166, 58, 157, 146, 218, 203, 186, 160, 47, 210,
131, 202, 125, 158, 198, 165, 244])
y = np.array([592, 442, 402, 400, 173, 317, 344, 533, 495, 423, 337, 583, 479,
311, 504, 334, 416, 510, 393, 401])
sigma_y = np.array([61, 25, 15, 34, 15, 52, 22, 16, 21, 42, 31, 38, 27, 16, 14, 26, 16,
30, 14, 25])
Problem 3a
Using the least-squares methodology developed in Problem 2, determine the best-fit slope and intercept for a line fit to the data above.
Make a scatter plot of the data, and overplot the best-fit line. What, if anything, do you notice about the data and the fit?
In [ ]:
Y = # complete
# complete
# complete
plt.errorbar( # complete
plt.plot( # complete
print("The best-fit value for the slope and intercept are: {:.4f} and {:.4f}".format( # complete
Unlike the data in Problems 1 and 2, there appear to be some significant outliers (of course, this appearance of outliers is entirely dependent upon the assumption of linearity; there may actually be no outliers, just a complex relation between x and y). As such, it does not appear (to me) as though the best-fit line provides a good model for the data.
Problem 3b
Perform a least-squares 2nd-order polynomial fit to the data. Overplot the best-fit curve.
How does this compare to the linear model fit?
In [ ]:
Y = # complete
# complete
# complete
plt.errorbar( # complete
plt.plot( # complete
By eye (a metric that is hardly quantitative, but nevertheless worth developing because talks never provide all of the details), the quadratic fit appears "better" than the linear fit.
But, there are still "outliers," and in the realm of polynomial fitting it is always possible to get a better fit by increasing the degree of the polynomial (see the quick check below). Should we keep going here, or should we stop? (Again, we will discuss model selection with Adam on Friday.)
[As a reminder - in machine learning we'd call this low training error, but the generalization error is likely huge]
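As a quick, hedged sanity check of that claim, the weighted $\chi^2$ never increases as more polynomial terms are added, so a lower $\chi^2$ alone is not evidence of a better model. The snippet below uses the x, y, and sigma_y arrays defined above:

# chi^2 never increases as the polynomial degree grows
for deg in range(1, 4):
    p = np.polyfit(x, y, deg, w=1 / sigma_y)
    chi2 = np.sum(((y - np.polyval(p, x)) / sigma_y) ** 2)
    print("degree {:d}: chi^2 = {:.1f}".format(deg, chi2))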
How should we deal with these potential outliers?
(and to re-iterate, we cannot be certain that these points are, in fact, outliers)
Amazingly, if you scroll through the literature you can find solutions like the following: "We do not believe the data point at (x, y) for reasons A, B, C. Thus, we exclude that point from the fit."
This, obviously, lacks any sort of rigor. If data are going to be removed (and it is worth asking whether data should ever be removed), the decision should not be subject to the "beliefs" of an individual person or group.
A more common approach that you might encounter is known as $k\sigma$ clipping, which is an iterative procedure to identify and remove outliers from a data set. The procedure is as follows (one possible implementation is sketched below):

1. Fit the model to the data.
2. Calculate the residuals between the data and the model, scaled by the uncertainties.
3. Remove any points whose scaled residuals are larger than $k$.
4. Refit the model to the remaining data.
5. Repeat steps 2-4 until no further points are removed.
The motivation for this procedure is the following: in a small data set (such as the one above, 20 points), the likelihood of having large $\sigma$ deviations (let's say $k = 5$) is vanishingly small. Thus, it "makes sense" to remove those points.
Of course, this only makes sense if the uncertainties are truly Gaussian with known variance. So again, specific, and likely untrue, assumptions have to be made.
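For concreteness, here is a minimal sketch of one way this procedure might be implemented (the function name ksigma_clip_fit is made up, and keeping clipped points clipped on later iterations is just one possible convention):

def ksigma_clip_fit(x, y, sigma_y, k=5, max_iter=20):
    keep = np.ones(len(x), dtype=bool)
    for _ in range(max_iter):
        # weighted linear fit to the currently retained points
        m, b = np.polyfit(x[keep], y[keep], 1, w=1 / sigma_y[keep])
        # uncertainty-scaled residuals of every point relative to the current model
        resid = np.abs(y - (m * x + b)) / sigma_y
        # drop retained points that deviate by more than k sigma;
        # once a point is clipped it stays clipped (one possible convention)
        new_keep = keep & (resid < k)
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return m, b, keep

# example usage (assuming x, y, sigma_y as defined above):
# m_fit, b_fit, mask = ksigma_clip_fit(x, y, sigma_y, k=5)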
Problem 3c
Develop a $k\sigma$ clipping procedure to fit a line to the data set. Set $k = 5$ and determine the best-fit line to the data.
Overplot the results of the procedure on the data. How does this fit to the data look?
In [ ]:
# complete
By eye, the results above are not that satisfying. Several of those points do look like outliers, but there are also 2 points being rejected that are well within the other cluster of data.
Furthermore, the point at $(x, y) \approx (60, 170)$ was clipped in an early iteration of the algorithm, but now with the final model this point is actually within $k\sigma$ of the best-fit line. What should one do with points like this?
This is one of the great problems with $k\sigma$ clipping: how does one select the appropriate value for $k$?
Even if there was a good heuristic argument for selecting $k$, is there fundamentally any difference between an observation that is $k\sigma + \epsilon$ away from the model versus a point at $k\sigma - \epsilon$, where $\epsilon \ll 1$? Any choice of $k$ will automatically remove one of these points and not the other, which seems somewhat arbitrary...
Problem 3d
Perform the $k\sigma$ procedure on the data, but this time set $k = 7$. Plot the results as above.
How do these results compare to our previous fits?
In [ ]:
# complete
By eye, this appears superior to the previous fit. At the same time, we have not actually optimized anything to definitively show that this is the case.
If there are outliers in the data, then it stands to reason that the uncertainties are likely not correctly estimated.
Problem 3e
Perform the $k\sigma$ procedure on the data, with $k = 5$, but with the variance increased by a factor of 4 (i.e., sigma_y increased by a factor of 2). Plot the results as above.
How do these results compare to our previous fits?
In [ ]:
# complete
In other words, if the original uncertainties were underestimated (here, by a factor of 2), then none of the data would be clipped from the fit!
I assume (and I hope I've convinced you that assumptions are dangerous) that if you have worked with astronomical data, then at one point or another you have (i) encountered data that appear to have outliers, and/or (ii) plotted data in 2 dimensions and then performed a (least-squares) linear fit to that data.
If that is true, I sincerely hope you have the following two thoughts:

1. Some of the work I have done in the past may not have been as rigorous as it could have been.
2. There must be a better way to do this.
First, let's address point 1. Don't freak out. Yes, it is probably the case that some work somewhere was not completed in the absolute most rigorous fashion possible. This is okay (and I feel fairly confident saying everyone has done it; there are definitely examples of each of the "bad" things described above in my papers if you go back and look hard enough).
There are even some situations where the above prescriptions are appropriate and preferred (think engineering solutions as opposed to strict scientific inference).
More importantly - point 2: there is a better way. Especially for the data/problem that we have been investigating above.
We can approach this problem as a Bayesian.
This is Bayes' Theorem:

$$P(X|Y) = \frac{P(Y|X)\,P(X)}{P(Y)}.$$
$P(X|Y)$ is the posterior. This is the probability of the model parameters, $X$, given the observations $Y$. This is the quantity that we ultimately want to compute.
$P(Y|X)$ is the likelihood. This is essentially what we computed above: minimizing $\chi^2$ is equivalent to maximizing a Gaussian likelihood.
$P(X)$ is the prior. This is the probability of the model parameters, and it encompasses everything you believe about the model prior to running the experiment. The use of a prior is what makes the analysis Bayesian.
$P(Y)$ is the probability of the data, or as David Hogg says, "the fully marginalized likelihood (FML)". This is extremely hard to calculate (though it is possible), and it is often ignored as it is just a normalization factor.
For many astronomical applications we look to sample from the posterior while ignoring the probability of the data (ignoring the normalization factor means that model comparison is very difficult, though there are ways to do this and we will cover them in great detail in the future).
Take a Bayesian approach to the above problem, whereby you define a generative model that allows you to compute the likelihood of the observations. The computation of the likelihood or posterior probability distribution requires complex integrals, which can be tackled using your favorite Markov Chain Monte Carlo techniques.
The combination of each of these tools provides a far more satisfying result than the various procedures previously discussed in this lecture.
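As a rough illustration of what the pieces might look like in code, here is a minimal sketch. It assumes the emcee package is available (any MCMC sampler could be substituted), uses a simple Gaussian likelihood with broad flat priors on the intercept and slope (my own assumption), and does not include the outlier/mixture-model treatment developed in Hogg, Bovy, & Lang:

import numpy as np
import emcee  # assumed available; any MCMC sampler could be used instead

def log_prior(theta):
    b, m = theta
    # broad, flat priors on the intercept and slope (an assumption)
    if -1000 < b < 1000 and -10 < m < 10:
        return 0.0
    return -np.inf

def log_likelihood(theta, x, y, sigma_y):
    b, m = theta
    model = b + m * x
    # Gaussian likelihood with known variance (the same assumption as least squares)
    return -0.5 * np.sum(((y - model) / sigma_y) ** 2)

def log_posterior(theta, x, y, sigma_y):
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    # unnormalized posterior: prior times likelihood (the FML is ignored)
    return lp + log_likelihood(theta, x, y, sigma_y)

ndim, nwalkers = 2, 32
initial = np.array([30.0, 2.0])  # rough starting guess for (b, m)
p0 = initial + 1e-3 * np.random.randn(nwalkers, ndim)

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=(x, y, sigma_y))
sampler.run_mcmc(p0, 2000)
samples = sampler.get_chain(discard=500, flat=True)  # posterior samples for (b, m)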
In [ ]: