Last time we motivated the linear regression modeling problem using some data. The next question is, how do you actually compute the model? Rather than relying on a canned library (which is what we recommend you do in practice), the goal of the previous lesson and this one is to give you a glimpse into the inner workings of these methods, which bring in a lot of subtle (but hopefully also fun and interesting!) numerical computing issues.
To complete today's notebook, you may find these resources helpful:
Let's recall a few of the "big ideas" from the previous notebook:
The first idea is to store real values in a standard floating-point format. Most machines implement the IEEE-754 standard. It specifies two main precisions (i.e., formats with a certain number of digits):
IEEE single-precision, which is a 32-bit format with an explicit sign bit, 24 bits of mantissa (23 bits stored explicitly, with an implicit leading 1 bit for all normalized values), and an exponent range of $[-126, 127]$. It has a machine epsilon of $\epsilon_s = 2^{-23} \approx 1.19 \times 10^{-7}$.
IEEE double-precision, which is a 64-bit format with an explicit sign bit, 53 bits of mantissa (52 bits stored explicitly), and an exponent range of $[-1022, 1023]$. It has a machine epsilon of $\epsilon_d = 2^{-52} \approx 2.22 \times 10^{-16}$.
Recall that "machine epsilon" refers to the smallest value $\epsilon$ such that $1.0 + \epsilon$ is a representable value strictly greater than $1.0$; equivalently, it is the gap between $1.0$ and the next larger representable value.
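To make these constants concrete, here is a quick check you can run. (This is just an illustrative sketch using NumPy's finfo; it is not needed for the rest of the notebook.)
In [ ]:
import numpy as np

# Machine epsilon for each IEEE precision, as reported by NumPy
print("single:", np.finfo(np.float32).eps)   # 2**(-23) ~= 1.19e-07
print("double:", np.finfo(np.float64).eps)   # 2**(-52) ~= 2.22e-16

# 1.0 + eps is representable and differs from 1.0 ...
print(1.0 + 2.**(-52) > 1.0)   # True
# ... whereas 1.0 + eps/2 rounds back to 1.0 (see the note on rounding below)
print(1.0 + 2.**(-53) == 1.0)  # True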
One consequence of a finite-precision representation of numbers is that the results of a sequence of arithmetic operations may depend on the order in which we perform them. For instance, let $\epsilon$ be machine epsilon for a given numerical type, and let $\mathtt{eps} \equiv \epsilon/2$ be half of it. Then the following two programs deliver different results, even though mathematically they appear to be equivalent.
Program 1:
s = 1.0 - eps
t = s + eps # Does 't' equal 1.0?
Program 2:
s = 1.0 + eps
t = s - eps # Does 't' equal 1.0?
Program 1 subtracts first. Both the values of s and t can be represented exactly in the floating-point encoding. By contrast, Program 2 adds first. Since the intermediate result, s, cannot be represented exactly (it rounds back to 1.0), the subsequent value of t will differ from that produced by the first program.
Note: The IEEE standard guarantees that given two finite-precision floating-point values, the result of applying any binary operator to them is the same as if the operator were applied in infinite-precision and then rounded back to finite-precision. The precise nature of rounding can be controlled by so-called rounding modes; the default rounding mode is "round-half-to-even."
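You can check this behavior directly in double precision. Below is a minimal sketch of Programs 1 and 2, taking eps to be half of double-precision machine epsilon, as above.
In [ ]:
eps = 2.**(-53)  # half of double-precision machine epsilon

# Program 1: subtract first -- every intermediate value is exactly representable
s = 1.0 - eps
t = s + eps
print("Program 1: t ==", t, "; t == 1.0?", t == 1.0)

# Program 2: add first -- the intermediate sum rounds back to 1.0
s = 1.0 + eps
t = s - eps
print("Program 2: t ==", t, "; t == 1.0?", t == 1.0)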
Last time, we defined a function called print_float_bin(), which you can use for debugging and for printing the raw bit-strings corresponding to a given floating-point value. This function is now part of the cse6040utils module, so grab the latest version of that module to use it.
In [ ]:
import cse6040utils as cse6040
a = -1.0
b = 2.**(-52) # Recall: \epsilon_d
c = b / 2.
cse6040.print_float_bin (a, prefix="a")
cse6040.print_float_bin (b, prefix="b")
cse6040.print_float_bin (c, prefix="c")
Lastly, how do we quantify errors mathematically? Here is a "simple" model: every scalar floating-point operation incurs some bounded relative error. Let $a \odot b$ be the exact result of some mathematical operation on $a$ and $b$, and let $\mathrm{fl}(a \odot b)$ be the computed value, after rounding in finite-precision. We will model their relationship by,
$$\mathrm{fl}(a \odot b) \equiv (a \odot b) (1 + \delta),$$where $|\delta| \leq \epsilon$, machine epsilon.
Note: Every operation in an algorithm or program might incur a different value of $\delta$; when you analyze a computation, you will allow these errors to accumulate individually, and then estimate an error bound by invoking the relationship that $|\delta| \leq \epsilon$.
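To see this model in action, you can measure the $\delta$ incurred by a single operation by comparing against exact rational arithmetic. The snippet below is only an illustration; it uses Python's built-in fractions module as the exact reference, and the particular operands are arbitrary.
In [ ]:
from fractions import Fraction

a, b = 0.1, 0.7
computed = a + b                               # fl(a + b): the rounded result
exact = Fraction(a) + Fraction(b)              # exact sum of the stored values
delta = (Fraction(computed) - exact) / exact   # so that fl(a + b) == (a + b)*(1 + delta)

print("delta   ==", float(delta))
print("epsilon ==", 2.**(-52))
print("|delta| <= epsilon?", abs(float(delta)) <= 2.**(-52))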
Given a computational problem, will it be "easy" or "hard" to compute numerically? Given a numerical algorithm, is its behavior "good" or "bad," given the intrinsic difficulty of the computational problem? Last time, we discussed how we might assess "goodness" analytically, using a mathematical tool from numerical analysis called perturbation theory.
The basic idea is simple. Given a target mathematical function that we wish to compute, $f(x)$, start by considering a perturbation to its input, $f(x + \Delta x)$; then, ask how far $f(x + \Delta x)$ is from $f(x)$. For example, suppose $f(x)$ is continuous and twice differentiable. Then, you might estimate this difference by the usual Taylor series expansion and truncation you learned in Calculus 101, namely,
$$f(x + \Delta x) = f(x) + \Delta x \cdot f'(x) + \mathcal{O}(\Delta x^2),$$where $f'(x)$ is the first derivative of $f(x)$ at the point $x$.
From the above relation, you can estimate the absolute value of the error to first order:
$$\left|f(x + \Delta x) - f(x)\right| \approx \left|\Delta x\right| \cdot \left|f'(x)\right|.$$This calculation says that the error depends not only on the measurement error, $\Delta x$, but also on the nature of the function itself at $x$, through the factor $\left|f'(x)\right|$. Indeed, we give this factor a special name: it is the absolute condition number of evaluating $f$ at $x$. For any given computational problem, we will try to find condition numbers to help us quantify the "hardness" of the problem.
By way of terminology, if the condition number is small we say the problem of computing $f(x)$ is well-conditioned, meaning it should be relatively "easy" to compute; otherwise, we say the problem is ill-conditioned, meaning it is relatively difficult to compute.
Besides the absolute value of the difference, you can also ask about the relative difference (error),
$$\left|f(x + \Delta x) - f(x)\right| / \left|f(x)\right|.$$For the same problem of evaluating $f(x)$, the relative error becomes,
$$\frac{\left|f(x + \Delta x) - f(x)\right|} {\left|f(x)\right|} \approx \frac{|\Delta x|} {|x|} \cdot \underbrace{ \frac{\left|f'(x)\right| \cdot |x|} {\left|f(x)\right|} }_{\equiv\ \kappa_r(x)} ,$$where $\kappa_r(x)$ is the relative condition number of evaluating $f(x)$ at $x$.
Observe that this relation expresses the relative change in the output as a function of some relative change in the input, $\frac{|\Delta x|}{|x|}$.
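For instance, consider two small examples (chosen just for illustration): $f(x) = \sqrt{x}$ has $\kappa_r(x) = \frac{1}{2}$ everywhere, so evaluating it is well-conditioned in the relative sense, whereas $f(x) = x - 1$ has $\kappa_r(x) = |x| / |x - 1|$, which blows up as $x \to 1$ (the classic case of cancellation).
In [ ]:
from math import sqrt

def kappa_r_sqrt(x):       # f(x) = sqrt(x), f'(x) = 1/(2*sqrt(x))  ==>  kappa_r == 1/2
    return abs(1.0 / (2.0*sqrt(x))) * abs(x) / abs(sqrt(x))

def kappa_r_minus_one(x):  # f(x) = x - 1, f'(x) = 1  ==>  kappa_r == |x| / |x - 1|
    return 1.0 * abs(x) / abs(x - 1.0)

print("kappa_r for sqrt(x) at x == 2.0:      ", kappa_r_sqrt(2.0))
print("kappa_r for x - 1   at x == 1 + 1e-8: ", kappa_r_minus_one(1.0 + 1e-8))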
Let's say someone devises an algorithm to compute $f(x)$. For a given value $x$, let's suppose this algorithm produces the value $\mathrm{alg}(x)$. Again, we might ask whether $\mathrm{alg}(x)$ produces a reasonable output.
One way to answer this question is to determine if the algorithm is backward stable. In particular, a backward stable algorithm is one that computes the exact answer to a slightly different input, i.e.,
$$\mathrm{alg}(x) = f(x + \Delta x).$$That should look familiar! The perturbation $\Delta x$ is called the backward error. If an algorithm is backward stable, it means you can estimate the (absolute) error of its output using our perturbation analysis from before, i.e.,
$$\left|\mathrm{alg}(x) - f(x)\right| \approx \left|f'(x)\right| \cdot \left|\Delta x\right|,$$or the corresponding relative error by suitable normalization, i.e.,
$$ \frac{\left|\mathrm{alg}(x) - f(x)\right|} {\left|f(x)\right|} \approx \frac{|\Delta x|} {|x|} \cdot \frac{\left|f'(x)\right| \cdot |x|} {\left|f(x)\right|} . $$
In [ ]:
def alg_sum (p): # p = p[0:n]
    s = 0.
    for p_i in p:
        s = s + p_i
    return s
We want to know to what extent this algorithm is "good." We will carry out this analysis by estimating the total round-off error of this algorithm,
$$\mathrm{alg\_sum}(\mathtt{p[:]}) - s_{n-1},$$where $s_{n-1} \equiv \sum_{i=0}^{n-1} p_i$ denotes the exact sum. We will assume that each element of the input array, p[i], is exactly $p_i$, and that the for loop enumerates the elements p[i] in increasing order, from i=0 to n-1.
The alg_sum program performs $n$ additions. Denote the numerical result of the $i$-th addition by $\hat{s}_i$, where $i \in [0, n)$. Since the accumulator variable s is initially 0, let's further suppose that the "zero-th" addition, which computes 0 + p[0], is performed exactly. In other words, $\hat{s}_0 = p_0 (1 + \delta_0) = p_0$, if we take the round-off error $\delta_0 = 0$.
Now consider the $i$-th addition in exact arithmetic, $s_i$, for each $i > 0$. When $\mbox{alg_sum}$ carries out this addition, it starts with the previously computed result, $\hat{s}_{i-1}$, adds the next input, $p_i$, and incurs a round-off error, $\delta_i$:
$$ \begin{array}{rcl} \hat{s}_{i} & = & (\hat{s}_{i-1} + p_i) (1 + \delta_i) \\ & = & \hat{s}_{i-1} (1 + \delta_i) + p_i (1 + \delta_i). \end{array} $$(expand)
$$ \begin{array}{rcl} \hat{s}_{i} & = & \left(\hat{s}_{i-2} (1 + \delta_{i-1}) + p_{i-1} (1 + \delta_{i-1})\right) (1 + \delta_i) + p_i (1 + \delta_i) \\ & = & \hat{s}_{i-2} (1 + \delta_{i-1})(1 + \delta_i) + p_{i-1} (1 + \delta_{i-1})(1 + \delta_i) + p_i (1 + \delta_i) \end{array} $$(expand one more time)
$$ \begin{array}{rcl} \hat{s}_{i} & = & \left[\hat{s}_{i-3} (1 + \delta_{i-2}) + p_{i-2} (1 + \delta_{i-2})\right] (1 + \delta_{i-1})(1 + \delta_i) + p_{i-1} (1 + \delta_{i-1})(1 + \delta_i) + p_i (1 + \delta_i) \\ & = & \hat{s}_{i-3} (1 + \delta_{i-2})(1 + \delta_{i-1})(1 + \delta_i) + p_{i-2} (1 + \delta_{i-2})(1 + \delta_{i-1})(1 + \delta_i) + p_{i-1} (1 + \delta_{i-1})(1 + \delta_i) + p_i (1 + \delta_i) \end{array} $$(pattern!)
$$ \begin{array}{rcl} \hat{s}_{n-1} & = & \sum_{i=0}^{n-1} p_i \cdot (1 + \delta_i) (1 + \delta_{i+1}) \cdots (1 + \delta_{n-1}). \end{array} $$(approx)
$$ \begin{array}{rcl} (1 + \delta_i) (1 + \delta_{i+1}) \cdots (1 + \delta_{n-1}) & = & 1 + \delta_i + \delta_{i+1} + \cdots + \delta_{n-1} + \sum_{k, l} \mathcal{O}\left(\delta_k \delta_l\right). \end{array} $$(ignore low order terms)
$$ \begin{array}{rcl} (1 + \delta_i) (1 + \delta_{i+1}) \cdots (1 + \delta_{n-1}) & \approx & 1 + \delta_i + \delta_{i+1} + \cdots + \delta_{n-1}. \end{array} $$(final answer)
$$ \begin{array}{rcl} \hat{s}_{n-1} & \approx & \sum_{i=0}^{n-1} p_i \cdot (1 + \delta_i + \delta_{i+1} + \cdots + \delta_{n-1}) \\ & = & \sum_{i=0}^{n-1} p_i \cdot \left(1 + \sum_{j=i}^{n-1} \delta_j\right). \end{array} $$So, this formula for $\hat{s}_{n-1}$ is our mathematical estimate of what the $\mathrm{alg\_sum}$ program produces.
To simplify the notation a little, define
$$\Delta_i \equiv \sum_{j=i}^{n-1} \delta_j,$$
so the computed sum becomes
$$\hat{s}_{n-1} \approx \sum_{i=0}^{n-1} p_i \cdot \left(1 + \Delta_i\right).$$
In the analyses below, we will use one more fact about $\Delta_i$, namely,
$$ \begin{array}{rcl} \left| \Delta_i \right| & = & \left| \sum_{j=i}^{n-1} \delta_j \right| \\ & \leq & \sum_{j=i}^{n-1} \left| \delta_j \right| \\ & \leq & \sum_{j=i}^{n-1} \epsilon \\ & = & (n-i) \cdot \epsilon \ \leq\ n \cdot \epsilon. \end{array} $$Next, let's apply this analysis in two ways: one to show when we should expect the algorithm to be backward stable, and another to derive a bound on the algorithm's relative error.
Analysis 1: Backward stability. The computed sum is
$$ \hat{s}_{n-1} \approx \sum_{i=0}^{n-1} \left(p_i + p_i \Delta_i\right). $$In other words, the algorithm computes the exact sum of a perturbed set of inputs. If each perturbation $p_i \Delta_i$ is small relative to $p_i$, then this algorithm for computing the sum is backward stable.
So when will $\Delta_i$ be small? Observe that the summand above may also be written as
$$p_i + p_i \Delta_i = p_i \cdot \left(1 + \Delta_i\right).$$That is, $\Delta_i$ is the relative perturbation of $p_i$, and by the fact above, each $|\Delta_i|$ is at most $n \cdot \epsilon$. In double-precision, $\epsilon_d = 2^{-52} \approx 10^{-16}$. So, if you are summing $n$ values, then you can use this algorithm to sum on the order of $n \sim \mathcal{O}(1\mbox{ trillion})$ double-precision values while keeping $n \cdot \epsilon_d \lesssim 1/1000$.
Analysis 2: A relative error bound. Our model of floating-point arithmetic says that $|\delta_i| \leq \epsilon$. Thus, you can bound the absolute error as follows:
$$ \begin{array}{rcl} \left| \hat{s}_{n-1} - s_{n-1} \right| & \approx & \left| \sum_{i=0}^{n-1} p_i \Delta_i \right| \\ & \leq & \sum_{i=0}^{n-1} \left| p_i \right| \cdot \left| \Delta_i \right| \\ & \leq & \sum_{i=0}^{n-1} \left| p_i \right| \cdot (n-i) \cdot \epsilon \\ & \leq & n \cdot \epsilon \cdot \sum_{i=0}^{n-1} \left| p_i \right|. \end{array} $$Treating $p$ as a vector, you can express this error more compactly using the vector 1-norm, i.e.,
$$ \left|\hat{s}_{n-1} - s_{n-1}\right| \lesssim n \cdot \epsilon \cdot \|p\|_1 .$$Finally, you can translate this absolute error into a relative one.
$$ \frac{\left|\hat{s}_{n-1} - s_{n-1}\right|} {\left|s_{n-1}\right|} \lesssim n \cdot \epsilon \cdot \frac{\|p\|_1} {\left|s_{n-1}\right|} = n \cdot \epsilon \cdot \sum_{i=0}^{n-1} \left| \frac{p_i} {s_{n-1}} \right| . $$While this bound can be small, it can also be quite conservative. (Why?)
In [ ]:
n = [10, 100, 1000, 10000, 100000, 1000000, 10000000]
x = [0.1] * max (n)
s = [1., 10., 100., 1000., 10000., 100000., 1000000.] # exact result
t = [alg_sum (x[0:n_i]) for n_i in n]
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
rel_err_computed = [abs (ti-si) / abs (si) for (si, ti) in zip (s, t)]
rel_err_bound = [float (ni) * (2.**(-52)) for ni in n]
# Plot the computed relative error (blue stars) against the error bound (red dashes)
plt.loglog (n, rel_err_computed, 'b*', n, rel_err_bound, 'r--')
Let $x$ and $y$ be two vectors of length $n$, and denote their dot product by $f(x, y) \equiv x^T y$.
Now suppose we store the values of $x$ and $y$ exactly in two Python arrays, x[0:n] and y[0:n]. Further suppose we compute their dot product using the program alg_dot(), defined below.
In [ ]:
def alg_dot (x, y):
    p = [xi*yi for (xi, yi) in zip (x, y)]
    s = alg_sum (p)
    return s
Derive absolute and relative error bounds for alg_dot(). Is alg_dot() backward stable?
Answer: Let $\hat{p}_i$ denote the computed value of the $i$-th product, which incurs a relative rounding error $\gamma_i$, i.e.,
$$\hat{p}_i = x_i y_i \left(1 + \gamma_i\right).$$
We then invoke alg_sum() to sum these rounded products, producing a computed dot product, $\hat{s}_{n-1}$. Therefore, you can invoke the corresponding error bound on these products:
$$ \begin{array}{rcl} \hat{s}_{n-1} & \approx & \sum_{i=0}^{n-1} \hat{p}_i \cdot \left(1 + \Delta_i\right) \\ & = & \sum_{i=0}^{n-1} x_i y_i \left(1 + \gamma_i\right) \cdot \left(1 + \Delta_i\right) \\ & = & \sum_i x_i y_i + \sum_i x_i y_i \left(\gamma_i + \Delta_i\right) + \sum_i x_i y_i \cdot \gamma_i \Delta_i \\ & \approx & \left( \sum_i x_i y_i \right) + \sum_i x_i y_i (\gamma_i + \Delta_i), \end{array} $$where in the last step we have dropped the $\gamma_i \Delta_i$ terms, which are small (second-order). This computation shows what the absolute error in the computed result is, namely, the absolute value of the last summation:
$$ \begin{array}{rcl} \left| \hat{s}_{n-1} - \sum_i x_i y_i \right| & \approx & \left| \sum_i x_i y_i (\gamma_i + \Delta_i) \right| \\ & \leq & \sum_i \left| x_i \right| \cdot \left| y_i \right| \cdot \left( \left| \gamma_i \right| + \left| \Delta_i \right| \right) \\ & \leq & (n+1) \cdot \epsilon \cdot |x|^T |y|, \end{array} $$where $|x|$ and $|y|$ refer to componentwise absolute values.
Relative error bounds. Just divide both sides by the absolute value of the exact dot product:
$$ \begin{array}{rcl} \frac{\left| \hat{s}_{n-1} - x^T y \right|} {\left| x^T y \right|} & \approx & (n+1) \epsilon \frac{|x|^T |y|}{|x^T y|}. \end{array} $$Backward stability. The following argument shows that alg_dot() is backward stable. Start with the expression for the computed dot product:
$$ \begin{array}{rcl} \hat{s}_{n-1} & \approx & \sum_{i=0}^{n-1} x_i y_i \cdot \left(1 + \gamma_i\right) \cdot \left(1 + \Delta_i\right) \\ & = & \sum_{i=0}^{n-1} x_i \left(1 + \gamma_i\right) \cdot y_i \left(1 + \Delta_i\right). \end{array} $$Let $\tilde{x}_i \equiv x_i \left(1 + \gamma_i\right)$ and $\tilde{y}_i \equiv y_i \left(1 + \Delta_i\right)$. Then the computed sum is the exact solution for perturbed inputs:
$$ \hat{s}_{n-1} \approx \sum_{i=0}^{n-1} \tilde{x}_i \tilde{y}_i = \tilde{x}^T \tilde{y}. $$Provided the $\gamma_i$ and $\Delta_i$ are small compared to 1, $\tilde{x}$ and $\tilde{y}$ will be small perturbations of the input vectors, $x$ and $y$, and therefore alg_dot() computes the exact solution to a slightly different problem, $\tilde{x}^T \tilde{y}$.
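As an empirical sanity check of these bounds, you can compare alg_dot() against an exact dot product computed in rational arithmetic. This is only a sketch: it reuses alg_dot() from above, uses Python's fractions module as the exact reference, and the names n_test, x_test, and y_test are just local to this check.
In [ ]:
import random
from fractions import Fraction

n_test = 10000
x_test = [random.uniform(-1.0, 1.0) for _ in range(n_test)]
y_test = [random.uniform(-1.0, 1.0) for _ in range(n_test)]

s_hat = alg_dot(x_test, y_test)
s_exact = float(sum(Fraction(xi) * Fraction(yi)
                    for (xi, yi) in zip(x_test, y_test)))

abs_err = abs(s_hat - s_exact)
abs_bound = (n_test + 1) * 2.**(-52) * sum(abs(xi) * abs(yi)
                                           for (xi, yi) in zip(x_test, y_test))
print("absolute error           ==", abs_err)
print("bound (n+1)*eps*|x|^T|y| ==", abs_bound)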
Having taken that detour, we can now return to our original motivating problem: solving an overdetermined system of linear equations, $Ax=b$ where the real-valued $m \times n$ matrix $A \in \mathbb{R}^{m \times n}$ has at least as many rows as columns ($m \geq n$). Let's further assume that $A$ has full rank ($\mathrm{rank}(A) = n$), i.e., the columns of $A$ are linearly independent.
Since the system is overdetermined, it will not, in general, have an exact solution. Therefore, we will need to compute a "best fit" approximate solution. We will look at a couple of different algorithms for this problem and then, using the analysis techniques discussed above, see how to decide which method to use.
You will need several facts from linear algebra, some of which appear as exercises, like this one:
Exercise. Let $x \in \mathbb{R}^n$ be a real-valued vector of length $n$. Show that $\|x\|_2^2 = x^T x$.
@YOUSE: Enter your answer(s) here.
Observe that the vector two-norm, $\|\cdot\|_2$, gives you a way to measure the "length" of a vector.
Exercise. Given two vectors, $x$ and $y$, show that the dot product is commutative, i.e., $x^T y = y^T x$.
Answer: $x^T y = \sum_i x_i y_i = \sum_i y_i x_i = y^T x.$
First, some notation. To measure the magnitudes of the perturbations, we will use vector and matrix norms. Assume that the norm of a vector, $\|x\|_2$, denotes the vector 2-norm, and that the norm of a matrix, $\|A\|_F$, denotes the matrix Frobenius norm. If you need a refresher on these definitions, see our linear algebra notes. The most important facts for the discussion below are the triangle inequality, $\|u + v\| \leq \|u\| + \|v\|$, and the consistency property, $\|A v\|_2 \leq \|A\|_F \cdot \|v\|_2$.
To simplify the notation a little, we will drop the "$2$" and "$F$" subscripts.
Next, suppose all of $A$, $b$, and the eventual solution $x$ undergo additive perturbations, denoted by $A + \Delta A$, $b + \Delta b$, and $x + \Delta x$, respectively. Then, subtracting the original system from the perturbed system, you would obtain the following.
$$ \begin{array}{rrcll} & (A + \Delta A)(x + \Delta x) & = & b + \Delta b & \\ - [& Ax & = & b & ] \\ \hline & \Delta A x + (A + \Delta A) \Delta x & = & \Delta b & \\ \end{array} $$Now look more closely at the perturbation, $\Delta x$, of the solution. Let $\hat{x} \equiv x + \Delta x$ be the perturbed solution. Then the above can be rewritten as,
$$\Delta x = A^{-1} \left(\Delta b - \Delta A \hat{x}\right),$$where we have assumed that $A$ is invertible. (That won't be true for our overdetermined system, but let's not worry about that for the moment.)
How large is $\Delta x$? Let's use a norm to measure it and bound it using
$$ \begin{array}{rcl} \|\Delta x\| & = & \|A^{-1} \left(\Delta b - \Delta A \hat{x}\right)\| \\ & \leq & \|A^{-1}\|\cdot\left(\|\Delta b\| + \|\Delta A\|\cdot\|\hat{x}\|\right). \end{array} $$You can rewrite this as follows:
$$ \begin{array}{rcl} \frac{\|\Delta x\|} {\|\hat{x}\|} & \leq & \|A^{-1}\| \cdot \|A\| \cdot \left( \frac{\|\Delta A\|} {\|A\|} + \frac{\|\Delta b\|} {\|A\| \cdot \|\hat{x}\|} \right). \end{array} $$This bound says that the relative change in the solution scales with the relative perturbations to $A$ and $b$, amplified by the product $\|A^{-1}\| \cdot \|A\|$. This factor is the linear-systems analogue of the condition number for evaluating the function $f(x)$! As such, we define
$$\kappa(A) \equiv \|A^{-1}\| \cdot \|A\|$$as the condition number of $A$ for solving linear systems.
In [ ]:
import numpy as np
In [ ]:
A = np.array([(1., 1000.),
              (2.**(-10) + 2.**(-11), 1.)])
print("A ==\n", A)
print("\ncond (A) == ", np.linalg.cond (A))
In [ ]:
Delta_A = np.array ([(0., 0.),
                     (-2.**(-11), 0.)])
B = A + Delta_A
print("B := A + dA ==\n", B)
print("\ncond (B) / cond (A) == ",
      np.linalg.cond (B) / np.linalg.cond (A))
In [ ]:
b = np.array([1., 1.])
x_A = np.linalg.solve (A, b)
print("x ~= A^(-1)*b == ", x_A)
x_B = np.linalg.solve (B, b)
print("x ~= B^(-1)*b == ", x_B)
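As a quick sanity check, you can compare the observed relative change in the solution against the bound derived above. Here $\Delta b = 0$, so the bound reduces to $\|A^{-1}\| \cdot \|\Delta A\|$. (This is only an illustrative check using the quantities computed in the preceding cells; it uses the vector 2-norm and matrix Frobenius norm, as in the discussion above.)
In [ ]:
# Observed relative change in the solution vs. the perturbation bound
lhs = np.linalg.norm(x_B - x_A) / np.linalg.norm(x_B)
rhs = np.linalg.norm(np.linalg.inv(A)) * np.linalg.norm(Delta_A)
print("Observed relative change in x:", lhs)
print("Upper bound from the analysis:", rhs)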
If $Ax=b$ is overdetermined, then there are more equations (rows of $Ax$) than unknowns (entries of $x$) and no solution in general. Therefore, we ask for an approximate solution $x$. How do we choose $x$?
One intuitive idea is to choose an $x$ such that the residual, $r = r(x) \equiv b - Ax$, is minimized in some way, such as measuring the length of $r$ using the vector two-norm:
$$ \begin{array}{rcl} \arg\!\min_{x} \|r(x)\|_2^2 & = & \arg\!\min_{x} \|b - Ax\|_2^2 \\ & = & \arg\!\min_{x} (b - Ax)^T(b - Ax) \\ & = & \arg\!\min_{x} \left\{ b^T b - 2 x^T A^T b + x^T A^T A x \right\}. \end{array} $$Gradients. To find the minimum $x$, we need to do the moral equivalent of taking a "vector derivative," setting it to 0, and then solving for $x$. The right mathematical tool is the gradient. Given a scalar function $f(x)$, where $x$ is a vector, the function's gradient, $\nabla_x f(x)$, is a vector whose $k$-th entry is the partial derivative of $f(x)$ with respect to $x_k$. That is,
$$ \nabla_x f(x) \equiv \left(\begin{array}{c} \frac{\partial f}{\partial x_0} \\ \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_{n-1}} \end{array}\right). $$Exercise. Let $x$ and $y$ be vectors, and let $M$ be a matrix. Verify the following identities related to the gradient:
$\nabla_x \left(x^T y\right) = y$
$\nabla_x \left(x^T M x\right) = \left(M + M^T\right) x$
@YOUSE: Enter your answer(s) here.
Exercise. Let $f(x) \equiv (b - Ax)^T(b - Ax)$, where $x \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $A \in \mathbb{R}^{m \times n}$. Show that
$$\nabla_x f(x) = 2 (A^T A x - A^T b).$$
@YOUSE: Enter your answer(s) here.
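If you want to convince yourself numerically before (or after) deriving this, here is a quick centered-difference spot-check of the gradient formula on a small random instance. (Purely illustrative; the names m_chk, n_chk, A_chk, b_chk, and x_chk are local to this check.)
In [ ]:
np.random.seed(0)
m_chk, n_chk = 7, 3
A_chk = np.random.randn(m_chk, n_chk)
b_chk = np.random.randn(m_chk)
x_chk = np.random.randn(n_chk)

f = lambda x: np.dot(b_chk - A_chk.dot(x), b_chk - A_chk.dot(x))
grad_formula = 2.0 * (A_chk.T.dot(A_chk).dot(x_chk) - A_chk.T.dot(b_chk))

# Centered-difference approximation of each partial derivative
h = 1e-6
grad_fd = np.array([(f(x_chk + h*e) - f(x_chk - h*e)) / (2.0*h)
                    for e in np.eye(n_chk)])
print("max abs difference:", np.max(np.abs(grad_formula - grad_fd)))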
Normal equations. In the previous exercise, $f(x)$ is precisely the objective function we wish to minimize by a suitable choice of $x$. The minimum occurs when $\nabla_x f(x) = 0$; per the exercises above, this $x$ is the solution to the normal equations,
$$A^T A x = A^T b.$$Exercise. You could solve this system directly, by first forming $C \leftarrow A^TA$ and $c \leftarrow A^T b$, and then solving $Cx=c$. But is this a good algorithm? (You may assume $C$ is invertible.) Why or why not?
@YOUSE: Enter your answer here.
QR factorization. In fact, the standard method to solve the linear least squares problem is to use a so-called QR factorization.
As it happens, every full-rank matrix $A \in \mathbb{R}^{m \times n}$ with $m \geq n$ may be written as the product $A = QR$, where $Q$ is an $m \times n$ matrix with orthonormal columns (i.e., $Q^T Q = I$) and $R$ is upper-triangular with positive diagonal entries ($r_{ii} > 0$).
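Before using it to solve the least squares problem, here is a quick look at what a QR factorization gives you, using np.linalg.qr on a small random matrix. (A caveat: NumPy does not promise positive diagonal entries in R, so its factorization may differ from the one described above by sign flips; the names A_qr, Q_qr, and R_qr are just for this illustration.)
In [ ]:
A_qr = np.random.randn(6, 3)       # a small random (almost surely full-rank) example
Q_qr, R_qr = np.linalg.qr(A_qr)    # "reduced" QR: Q is 6x3, R is 3x3

print("Q shape:", Q_qr.shape, " R shape:", R_qr.shape)
print("|| Q^T Q - I || ==", np.linalg.norm(Q_qr.T.dot(Q_qr) - np.eye(3)))
print("|| A - Q R ||   ==", np.linalg.norm(A_qr - Q_qr.dot(R_qr)))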
Exercise. Suppose you are given a QR decomposition, $A = QR$. Show how to compute the solution $x$ of the linear least squares problem.
@YOUSE: Enter your answer here.
Solving the normal equations is generally cheaper than computing a QR factorization. However, using QR is more numerically accurate: algorithms for computing a QR factorization are numerically stable, applying (multiplying by) an orthogonal matrix does not amplify errors, and the problem solved via QR has condition number $\kappa(A)$, whereas forming the normal equations squares it, since $\kappa(A^T A) = \kappa(A)^2$ (in the 2-norm).
In [ ]:
# This file implements the experiment in Lecture 19 of
# Trefethen and Bau, Numerical Linear Algebra, SIAM 1997.
#
# Python implementation originally by Da Kuang (2014)
import numpy as np
import scipy.linalg as lin
import matplotlib.pyplot as plt
%matplotlib inline
m = 100
n = 15
A = np.zeros ((m, n))
a = np.arange (0, m, dtype=np.float64)
a /= (m-1)
for i in range (n):
    A[:, i] = np.power (a, i)
print('Condition number of A:', np.linalg.cond (A))
In [ ]:
b = np.exp (np.sin (4*a)) # exp (sin (4*a))
b /= 2006.787453080206
plt.plot (a, b)
In [ ]:
result = np.linalg.lstsq (A, b)
x1 = result[0]
print('Last element of x1 (possibly from SVD):', x1[-1])
In [ ]:
Q, R = np.linalg.qr (A)
tmp = Q.T.dot (b)
x2 = np.linalg.solve (R, tmp)
print('Last element of x2 (from QR):', x2[-1])
In [ ]:
ATA = A.T.dot (A)
tmp = A.T.dot (b)
x3 = np.linalg.solve (ATA, tmp)
print('Last element of x3 (from normal equations):', x3[-1])
try:
    L = np.linalg.cholesky (ATA)
except np.linalg.LinAlgError as e:
    print('Cholesky factorization error:', e)
print('Condition number of A\'A:', np.linalg.cond (ATA))
In [ ]:
tmp = np.arange (100)
A = np.zeros ((100, 2))
delta_0 = np.random.rand (100) * 0.0001
delta_1 = np.random.rand (100) * 0.0001
A[:, 0] = tmp + delta_0
A[:, 1] = tmp + delta_1
ATA = A.T.dot(A)
print('cond (A):', np.linalg.cond (A))
print('cond (A^T*A):', np.linalg.cond (ATA))
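The relationship between these two numbers is no accident: for a full-rank matrix, and in the 2-norm that np.linalg.cond uses by default, $\kappa(A^T A) = \kappa(A)^2$ in exact arithmetic. You can compare the squared value directly (the floating-point results will agree only approximately):
In [ ]:
# Compare against the value of cond (A^T*A) printed above
print('cond (A)**2: ', np.linalg.cond (A)**2)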