Yay! Time for a new topic: linear regression by the method of least squares.
We will do this topic in two parts: one which is generically about the errors that can arise in carrying out a numerical computation, and the other which is specifically about the linear regression problem.
This lab kicks off the last part of our course, which is a survey of data analysis and mining methods. We will assume you will learn the theory behind these methods in more detail in other classes; our focus will be on algorithmic and implementation concepts.
For today's topic, let's use the following dataset of US crime statistics from 1960: http://cse6040.gatech.edu/fa15/uscrime.csv
This dataset comes from: http://www.statsci.org/data/general/uscrime.html
Other useful references and downloads for today's class:

- cse6040utils.py
- Documentation for Python's `float` type: https://docs.python.org/2/tutorial/floatingpoint.html

Also, much of the discussion of round-off error is taken from an excellent book on numerical linear algebra. You may be able to use the Georgia Tech library's web proxy service to download electronic chapters of this book. Also, this book was written by Rich's former PhD advisor, so a referral thereto is probably not completely objective. :)
In [1]:
import numpy as np
import pandas as pd
from IPython.display import display
import cse6040utils as cse6040
In [3]:
df = pd.read_csv ('uscrime.csv', skiprows=1)
display (df.head ())
Each row of this dataset is a US State. The columns are described here: http://www.statsci.org/data/general/uscrime.html
Let's take a quick peek at the dataset.
In [4]:
df.describe ()
Out[4]:
In [5]:
import seaborn as sns
%matplotlib inline
In [6]:
# Look at a few relationships
sns.pairplot (df[['Crime', 'Wealth', 'Ed', 'U1']])
Out[6]:
Suppose we wish to build a model of some quantity, called the response variable, given some set of predictors. In the US crimes dataset, the response might be the crime rate (`Crime`), which we wish to predict from the predictors of income (`Wealth`), education (`Ed`), and the unemployment rate of young males (`U1`).
In a linear regression model, we posit that the response is a linear function of the predictors. That is, suppose there are $m$ observations in total and consider the $i$-th observation. Let $b_i$ be the response of that observation. Then denote the $n$ predictors for observation $i$ as $\{a_{i,1}, a_{i,2}, \ldots, a_{i,n}\}$. From this starting point, we might then posit a linear model of $b$ having the form,
$b_i = x_0 + a_{i,1} x_1 + a_{i,2} x_2 + \cdots + a_{i,n} x_n$,
where we wish to compute the "best" set of coefficients, $\{x_0, x_1, \ldots, x_n\}$. Note that this model includes a constant offset term, $x_0$. Since we want this model to hold for all $m$ observations, we effectively want to solve the system of $m$ equations in $n+1$ unknowns,
$$\left( \begin{array}{c} b_1 \\ b_2 \\ \vdots \\ b_m \end{array} \right) = \left( \begin{array}{ccccc} 1 & a_{1,1} & a_{1,2} & \ldots & a_{1,n} \\ 1 & a_{2,1} & a_{2,2} & \ldots & a_{2,n} \\ \vdots & \vdots & & & \vdots \\ 1 & a_{m,1} & a_{m,2} & \ldots & a_{m,n} \end{array} \right) \left( \begin{array}{c} x_0 \\ x_1 \\ \vdots \\ x_n \end{array} \right),$$
or just $Ax=b$. Typically, there are many more observations than parameters ($m \gg n$), in which case we say this linear system is overdetermined.
So how do we compute $x$? Most statistical software and libraries hide the details of computing $x$ from you. However, it's important to understand at least a little bit about what happens under the hood, which is today's topic.
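To make the matrix formulation concrete, here is a small sketch using synthetic data (the sizes and coefficient values are made up purely for illustration; this is not the crime dataset). It builds $A$ with a leading column of ones for the intercept term and solves the overdetermined system in the least-squares sense using `numpy.linalg.lstsq`, one standard library routine for this task:

```python
import numpy as np

rng = np.random.RandomState(0)             # synthetic, reproducible data
m, n = 50, 3                               # m observations, n predictors (m >> n)
x_true = np.array([2.0, -1.0, 0.5, 3.0])   # coefficients [x_0, x_1, ..., x_n]

A = np.hstack([np.ones((m, 1)),            # column of ones -> intercept term x_0
               rng.rand(m, n)])            # predictor values a_{i,j}
b = A.dot(x_true)                          # noise-free responses, so x is recoverable

x_fit, _, _, _ = np.linalg.lstsq(A, b, rcond=None)
print(x_fit)                               # matches x_true up to round-off
```

Because the responses here are generated without noise, the least-squares solution recovers the true coefficients almost exactly; with real, noisy data it instead minimizes $\|Ax - b\|_2$.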
The main task of data analysis is to help explain and understand some phenomenon by building a mathematical model from data, and then analyzing the model or using it to make predictions. During this process, there are several potential sources of error, including:

- measurement errors, i.e., noise or inaccuracies in the data itself;
- modeling errors, i.e., simplifying assumptions built into the mathematical model;
- truncation errors, which arise from approximating an infinite process (e.g., a series or an iteration) by a finite one; and
- rounding errors, which arise from representing real values in finite precision.

In this course, we will mostly leave measurement and modeling errors as topics for other courses. Instead, we will focus on truncation and rounding errors, and to a lesser extent, on assessing the potential impact of measurement error.
Exercise. Are there other kinds of errors not captured in the list above?
Exercise. Give examples of the kinds of errors that might arise in our motivating problem (i.e., building a linear regression model of crime rates).
@YOUSE: Enter your discussion / solutions here
Real values are typically stored in IEEE floating-point format. If you have programmed in other languages, you may have seen scalar data types for single-precision and double-precision formats (e.g., `float` and `double` in C/C++/Java). A "floating-point" encoding is basically a normalized scientific notation consisting of a base, a sign, a fractional mantissa, and an integer exponent. Let's look at an example to see how this might work.
Consider the value 0.125. In a normalized scientific notation, we would write this number as $+1.25 \times 10^{-1}$, where the base is 10, the mantissa is 1.25, and the exponent is -1. Conceptually, if we always used base 10 for all our floating-point values, then our floating-point encoding of this value would, conceptually, be a tuple $(+, 1.25, -1)$.
However, we cannot store an infinite number of digits for the mantissa and exponent values. Thus, we would normally also limit the number of digits that may appear in either. We might use, say, 6 digits for the mantissa and 2 digits (ignoring the sign) for the exponent, i.e., a tuple of the form $(\pm, m.mmmmm, \pm xx)$.
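As a sanity check on this hypothetical format, here is a small sketch of an encoder for it (the helper name `encode_6_2` and its truncation behavior are our own choices for illustration; a real format would round rather than truncate):

```python
from decimal import Decimal

def encode_6_2(x):
    """Hypothetical helper: encode x as (sign, 6-digit base-10 mantissa, exponent),
    mimicking the toy "6 + 2" format above. Mantissa digits beyond 6 are
    truncated; the exponent must fit in the 2-digit field."""
    sign, digits, exp = Decimal(repr(x)).as_tuple()
    e = exp + len(digits) - 1            # exponent once normalized to d.ddddd form
    mantissa = (digits + (0,) * 6)[:6]   # pad or truncate to 6 significant digits
    assert -99 <= e <= 99, "exponent out of range for the 2-digit field"
    return ('+' if sign == 0 else '-', mantissa, e)

print(encode_6_2(0.125))   # ('+', (1, 2, 5, 0, 0, 0), -1), i.e., +1.25000 x 10^-1
```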
Exercise. What is the largest value we could represent in this format? What is the smallest value? What is the smallest positive value we could represent? How would we encode these values?
@YOUSE: Enter your solutions here
Exercise. Encode the following values as tuples:

- 0.000001 == 10^(-6)

@YOUSE: Enter your solutions here
Exercise. Suppose we instead use a finite-precision floating-point encoding, using base-10 digits with 6 digits of precision for the mantissa and 2 digits for the exponent, plus separate sign "digits" for each. What is the final value of `t` in each of these two programs?

Program 1:
s = a - b
t = s + b

Program 2:
s = a + b
t = s - b
@YOUSE: Enter your solutions here
The preceding examples assume the digits are represented in base 10. However, computers encode all values using bits, which are base 2 digits. All the same ideas as above apply, but on base 2 values rather than base 10 values.
One consequence of this difference is that certain finite-precision decimal fractions cannot be represented exactly!
Can you see why? Consider the decimal value 0.1 represented in a binary format.
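You can check this directly in Python. The literal `0.1` is rounded to the nearest representable binary value, and converting that `float` to a `Decimal` displays the stored value exactly:

```python
from decimal import Decimal

# Decimal(float) shows the exact binary value that 0.1 rounds to:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

assert float(Decimal(0.1)) == 0.1        # same stored value...
assert Decimal(0.1) != Decimal('0.1')    # ...but not exactly one tenth
```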
In addition, the IEEE floating-point standard defines the encoding a little differently than we've used it. First, if the value is not 0, then the mantissa always has an implicit leading "1" digit; therefore, it needn't be stored explicitly, thereby saving a bit and effectively increasing the precision a little. Secondly, the range of the exponent is not symmetric. In our hypothetical base-10 "6 + 2" encoding, we assumed the exponent would range from -99 to 99, which is a symmetric interval; in IEEE floating-point, there is a slight asymmetry in this range. Part of the reason is that the IEEE floating-point encoding can also represent several kinds of special values, such as infinities and an odd bird called "not-a-number," or `NaN`. This latter value, which you may have seen if you have used any standard statistical packages, encodes certain kinds of floating-point exceptions that result when, for instance, you try to compute 0/0.
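A quick illustration of these special values in ordinary Python floats (which are IEEE doubles under the hood):

```python
import math

big = 1e308 * 10      # overflows the double-precision range -> +inf
nan = big - big       # inf - inf has no meaningful value -> NaN
print(big, nan)       # inf nan

assert math.isinf(big)
assert math.isnan(nan)
assert nan != nan     # NaN compares unequal even to itself
```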
In IEEE floating-point, there are two main encodings, known as single-precision (`float` in C/C++/Java) and double-precision (`double` in C/C++/Java). In brief, these differ as follows:
Single-precision: 32 bits total, with 24 bits for the mantissa and an exponent range of [-126, 127].
Double-precision: 64 bits total, with 53 bits for the mantissa and an exponent range of [-1022, 1023].
Exercise. What is the smallest positive value that can be represented in IEEE single-precision? What about in double-precision?
@YOUSE: Enter your solutions here
Exercise. Consider the smallest possible value greater than 1.0 that can be represented in floating-point. Let's call this value, $1.0 + \epsilon$.
Determine $\epsilon_s$ and $\epsilon_d$, the corresponding values of $\epsilon$ in single-precision and double-precision, respectively.
@YOUSE: Enter your solutions here
Another important consequence of the binary format is that when you print things in base ten, what you see may not be what you get! For instance, try running the code below.
In [8]:
from decimal import Decimal
x = 1.0 + 2.0**(-52)
print (x)
print (Decimal (x)) # What does this do?
Aside: If you ever need true decimal storage with no loss of precision, turn to the `Decimal` package. Just be warned it will come at a price:
In [9]:
a_native = 1.0
b_native = 2.0
a_decimal = Decimal ('1.0')
b_decimal = Decimal ('2.0')
%timeit a_native + b_native
%timeit a_decimal + b_decimal
For today's lesson, it will be helpful to occasionally peek at floating-point values "in the raw." For this purpose, we've provided you with a handy routine called `float_to_bin(x)`, which given a floating-point value `x` will return its IEEE representation as a binary string. (It is defined in the `cse6040utils` module.)

The following code uses `float_to_bin()` to define another function that dumps a floating-point number's complete `Decimal` form along with its binary form.
In [10]:
def print_float_bin (x, prefix=""):
    print ("%s: %s\n%s %s" % (prefix,
                              Decimal (x),
                              ' ' * len (prefix),
                              cse6040.float_to_bin (x)))
a = -1.0
b = 2.**(-52) # Recall: \epsilon_d
c = b / 2.
print_float_bin (a, prefix="a")
print_float_bin (b, prefix="b")
print_float_bin (c, prefix="c")
Exercise. Recall the two program fragments from above:
Program 1:
s = a - b
t = s + b
Program 2:
s = a + b
t = s - b
Let $a=1.0$ and $b=\epsilon_d / 2 \approx 1.11 \times 10^{-16}$. Write some Python code to evaluate these two programs and compare their outputs. (To look closely at their outputs, use the `print_float_bin()` function from above.)
In [13]:
a = 1.0
b = 2.**(-53) # What value goes here?
s1 = a - b
t1 = s1 + b
s2 = a + b
t2 = s2 - b
print_float_bin (s1, prefix="s1")
print_float_bin (t1, prefix="t1")
print ("\n")
print_float_bin (s2, prefix="s2")
print_float_bin (t2, prefix="t2")
print ("\n", t1, t2)
print ("\n(t1 == t2) == %s" % (t1 == t2))
By the way, in NumPy you can determine machine epsilon in a more portable way:
In [14]:
EPS_S = np.finfo (np.float32).eps
EPS_D = np.finfo (float).eps
print_float_bin (float (EPS_S), prefix="eps_s")
print_float_bin (float (EPS_D), prefix="eps_d")
Given the various sources of error, how do we know whether a given algorithm is "good" for computing the solution to a given problem? An important tool in the area of numerical analysis is perturbation theory.
To see perturbation theory in action, suppose we wish to determine by how much a "small" measurement error, $\Delta x$, affects the output of some function, $f(x)$. Barring any other information and assuming $f(x)$ is continuous and differentiable, we could try to estimate the error of evaluating $f(x + \Delta x)$ compared to $f(x)$ by the following linear approximation, which comes from a Taylor series expansion:
$$f(x + \Delta x) \approx f(x) + \Delta x \cdot f'(x),$$where $f'(x)$ is the first derivative of $f(x)$ at the point $x$.
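As a quick numerical sanity check of this linear approximation (the choice $f(x) = \sin x$ is ours, purely for illustration), the first-order term captures nearly all of the change for a small perturbation:

```python
import math

x, dx = 1.0, 1e-6
exact_change    = math.sin(x + dx) - math.sin(x)
linear_estimate = dx * math.cos(x)   # \Delta x * f'(x), with f = sin

# The discrepancy is the next Taylor term, roughly |dx|^2 / 2 ~ 5e-13:
print(exact_change, linear_estimate)
assert abs(exact_change - linear_estimate) < 1e-11
```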
Absolute condition numbers. From this relation, we can estimate the magnitude of the error:

$$\left|f(x + \Delta x) - f(x)\right| \approx \left|\Delta x\right| \cdot \left|f'(x)\right|.$$

This estimate says that the error depends not only on the measurement error, $\Delta x$, but also on the nature of the function itself at $x$, through the factor $\left|f'(x)\right|$. Indeed, we give this factor the special name of absolute condition number of evaluating $f$ at $x$. For any given computational problem, we will try to find condition numbers to help us quantify the "hardness" of the problem.
That is, for the problem of evaluating $f(x)$, the preceding analysis says that if this factor is not too large, then small measurement errors will lead to only small errors in the output. In this case, we say the problem is well-conditioned. If instead this factor is very large, then even very small errors will lead to large errors in the output. In this case, we say the problem is ill-conditioned.
Put differently, the problem of evaluating $f(x)$ when the condition number is large is inherently more difficult than doing so when the condition number is small.
Relative condition numbers. The error considered above is the absolute error, in contrast to the relative error,
$$\left|f(x + \Delta x) - f(x)\right| / \left|f(x)\right|.$$

For the problem of evaluating $f(x)$, let's rewrite this slightly as,
$$\frac{\left|f(x + \Delta x) - f(x)\right|} {\left|f(x)\right|} \approx \frac{|\Delta x|} {|x|} \cdot \underbrace{ \frac{\left|f'(x)\right| \cdot |x|} {\left|f(x)\right|} }_{\equiv\ \kappa_r(x)} ,$$where $\kappa_r(x)$ is the relative condition number of evaluating $f(x)$ at $x$.
Observe that this relation expresses the relative change in the output as a function of some relative change in the input ($|\Delta x| / |x|$).
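To see this concretely, here is a small illustration (the choice $f(x) = x - 1$ is ours, just for demonstration). Its relative condition number is $\kappa_r(x) = |x| / |x - 1|$, which blows up as $x \to 1$, so a tiny relative perturbation of an input near 1 produces a much larger relative change in the output:

```python
def f(x):
    return x - 1.0   # our example function

def kappa_rel(x):
    # relative condition number of f: |f'(x)| * |x| / |f(x)| = |x| / |x - 1|
    return abs(x) / abs(x - 1.0)

x  = 1.0 + 1e-8                  # input close to the "bad" point x = 1
dx = x * 1e-10                   # tiny relative perturbation: |dx| / |x| = 1e-10
rel_change = abs(f(x + dx) - f(x)) / abs(f(x))

print(kappa_rel(x))              # roughly 1e8: f is ill-conditioned here
print(rel_change)                # roughly 1e-2 = kappa_r(x) * (|dx| / |x|)
```

The relative change in the output is about $10^8 \times 10^{-10} = 10^{-2}$, exactly as the relation above predicts.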
Backward stability. Let's say someone devises an algorithm to compute $f(x)$. For a given value $x$, let's suppose this algorithm produces the value $\mathrm{alg}(x)$. One important question might be, is that output "good" or "bad?"
One property to measure "goodness" will be backward stability. In particular, we will say that $\mathrm{alg}(x)$ is a backward stable algorithm to compute $f(x)$ if, for all $x$, there exists a "small" $\Delta x$ such that
$$\mathrm{alg}(x) = f(x + \Delta x).$$That should look familiar! It means we can estimate the (absolute) backward error using our perturbation analysis from before, i.e.,
$$\left|\mathrm{alg}(x) - f(x)\right| \approx \left|f'(x)\right| \cdot \left|\Delta x\right|.$$

Round-off errors. We already know that numerical values can only be represented finitely, which introduces round-off error. Thus, at the very least we should hope that a scheme to compute $f(x)$ is as insensitive to round-off errors as possible.
To quantify sensitivity, the standard technique is to assume that every scalar floating-point operation incurs some bounded error. Let $a \odot b$ be the exact result of some mathematical operation on $a$ and $b$, and let $\mathrm{fl}(a \odot b)$ be the computed value, after rounding in finite-precision. We will model the difference by,
$$\mathrm{fl}(a \odot b) \equiv (a \odot b) (1 + \delta),$$where $|\delta| \leq \epsilon$, machine epsilon.
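We can check this model numerically for a single division, using Python's `fractions` module to get an exact reference value (the specific operands 1.0 and 3.0 are our choice; both are exactly representable, while their quotient is not):

```python
import sys
from fractions import Fraction

a, b = 1.0, 3.0                  # both exactly representable in binary64
fl = a / b                       # computed (rounded) result, fl(a / b)
exact = Fraction(1, 3)           # exact mathematical result a / b

delta = (Fraction(fl) - exact) / exact   # fl(a/b) = (a/b) * (1 + delta)
eps = sys.float_info.epsilon             # machine epsilon, 2**-52 for binary64

print(float(delta))
assert abs(delta) <= eps         # |delta| <= epsilon, as the model requires
```

In fact, for IEEE round-to-nearest arithmetic the sharper bound $|\delta| \le \epsilon / 2$ holds; the model above uses the looser $\epsilon$ bound for simplicity.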