CSE 6040, Fall 2015 [24]: "Online" regression

This notebook continues the linear regression problem from last time, but asks about a method that can estimate the regression coefficients when you only get to see the samples one at a time. We refer to such a fitting procedure as being "online" (or "incremental"), rather than "offline" (or "batched").


In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Scalability with the problem size

To start, here is some code to help generate synthetic problems of a certain size, namely, $m \times (d+1)$, where $m$ is the number of observations and $d$ the number of predictors. The $+1$ comes from our usual dummy coefficient for a non-zero intercept.


In [ ]:
def generate_model (d):
    """Returns a set of (random) d+1 linear model coefficients."""
    return np.random.rand (d+1, 1)

def generate_data (m, x, sigma=1.0/(2**0.5)):
    """
    Generates 'm' noisy observations for a linear model whose
    predictor (non-intercept) coefficients are given in 'x'.
    Decrease 'sigma' to decrease the amount of noise.
    """
    assert (type (x) is np.ndarray) and (x.ndim == 2) and (x.shape[1] == 1)
    n = len (x)
    A = np.random.rand (m, n)
    A[:, 0] = 1.0
    b = A.dot (x) + sigma*np.random.randn (m, 1)
    return (A, b)

def estimate_coeffs (A, b):
    """
    Solves Ax=b by a linear least squares method.
    """
    result = np.linalg.lstsq (A, b) # returns (solution, residuals, rank, singular values)
    x = result[0]
    return x

In [ ]:
def rel_diff (x, y, ord=2):
    """
    Computes ||x-y|| / ||y||. Uses 2-norm by default;
    override by setting 'ord'.
    """
    return np.linalg.norm (x - y, ord=ord) / np.linalg.norm (y, ord=ord)

In [ ]:
# Demo the above routines for a 2-D dataset.

m = 50
x_true = generate_model (1)
(A, b) = generate_data (m, x_true, sigma=0.1)

print A.shape
print x_true.shape
print b.shape

print "Condition number of the data matrix: ", np.linalg.cond (A)
print "True model coefficients:", x_true.T

x = estimate_coeffs (A, b)

print "Estimated model coefficients:", x.T
print "Relative error in the coefficients:", rel_diff (x, x_true)

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot (A[:, 1], b, 'b+') # Noisy observations
ax1.plot (A[:, 1], A.dot (x), 'r*') # Fit
ax1.plot (A[:, 1], A.dot (x_true), 'go') # True solution

Benchmark varying $m$

Let's benchmark the time to compute $x$ when the dimension $n$ of each point is fixed but the number $m$ of points varies.


In [ ]:
# Benchmark, as 'm' varies:

n = 32 # dimension
M = [100, 1000, 10000, 100000, 1000000]
times = [0.] * len (M)
for (i, m) in enumerate (M):
    x_true = generate_model (n)
    (A, b) = generate_data (m, x_true, sigma=0.1)
    t = %timeit -o estimate_coeffs (A, b)
    times[i] = t.best

Exercise. How does the running time scale with $m$?


In [ ]:
t_linear = [times[0]/M[0]*m for m in M]

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.loglog (M, times, 'bo')
ax1.loglog (M, t_linear, 'r--')

Exercise. Now fix the number $m$ of observations but vary the dimension $n$. How does time scale with $n$? Complete the benchmark code below to find out. In particular, given the array N[:], compute an array, times[:], such that times[i] is the running time for a problem of size m$\times$(N[i]+1).

Hint: You can basically copy and modify the preceding benchmark. Also, note that the code cell following the one immediately below will plot your results against $\mathcal{O}(n)$ and $\mathcal{O}(n^2)$.


In [ ]:
N = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 3072]
m = 5000
times = [0.] * len (N)

# @YOUSE: Implement a benchmark to compute the time,
# `times[i]`, to execute a problem of size `N[i]`.
for (i, n) in enumerate (N):
    pass
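
For reference, one possible implementation, which simply mirrors the preceding benchmark over $m$, might look like the cell below. (This is only a sketch; it is not necessarily the only reasonable answer.)


In [ ]:
# One possible approach (a sketch): vary the dimension 'n' while holding 'm' fixed.
for (i, n) in enumerate (N):
    x_true = generate_model (n)
    (A, b) = generate_data (m, x_true, sigma=0.1)
    t = %timeit -o estimate_coeffs (A, b)
    times[i] = t.best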

In [ ]:
t_linear = [times[0]/N[0]*n for n in N]
t_quadratic = [times[0]/N[0]/N[0]*n*n for n in N]

fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.loglog (N, times, 'bo')
ax1.loglog (N, t_linear, 'r--')
ax1.loglog (N, t_quadratic, 'g--')

An online algorithm

The empirical scaling appears to be pretty good, being roughly linear in $m$ or at worst quadratic in $n$. But there is still a downside in time and storage: each time there is a change in the data, you appear to need to form the data matrix all over again and recompute the solution from scratch, possibly touching the entire data set again!

This approach, which requires the full data, is often referred to as a batched or offline procedure.

This raises a natural question: is there a way to incrementally update the model coefficients whenever a new data point, or perhaps a small batch of new data points, arrives? Such a procedure would be considered incremental or online, rather than batched or offline.

Setup: Key assumptions and main goal

In the discussion that follows, assume that you only get to see the observations one-at-a-time. Let $(b_k, a_k^T)$ denote the current observation. (Relative to our previous notation, this tuple is just element $k$ of $b$ and row $k$ of $A$.)

Additionally, assume that, at the time the $k$-th observation arrives, you start with a current estimate of the parameters, $\hat{x}_k$, which is a vector. If for whatever reason you need to refer to element $i$ of that vector, use $\hat{x}_{i,k}$. You will then compute a new estimate, $\hat{x}_{k+1}$, using $\hat{x}_k$ and $(b_k, a_k^T)$. For the discussion below, further assume that you throw out $\hat{x}_k$ once you have $\hat{x}_{k+1}$.

As for your goal, recall that in the batch setting you start with all the observations, $(b, A)$. From this starting point, you may estimate the linear regression model's parameters, $x$, by solving $Ax=b$. In the online setting, you compute estimates one at a time; after seeing all $m$ observations in $A$, your goal is to have computed a final estimate $\hat{x}_m \approx x$.

An initial idea

Indeed, there is a technique from the signal processing literature that we can apply to the linear regression problem, known as the least mean square (LMS) algorithm. Before describing it, let's start with an initial idea.

Suppose that you have a current estimate of the parameters, $\hat{x}_k$, when you get a new sample, $(b_k, a_k^T)$. The error in your prediction will be,

$$b_k - a_k^T \hat{x}_k.$$

Ideally, this error would be zero. So, let's ask if there exists a correction, $\Delta_k$, such that

$$ \begin{array}{rrcl} & b_k - a_k^T (\hat{x}_k + \Delta_k) & = & 0 \\ \iff & b_k - a_k^T \hat{x}_k & = & a_k^T \Delta_k \end{array} $$

Then, you could compute a new estimate of the parameters by $\hat{x}_{k+1} = \hat{x}_k + \Delta_k$.

This idea has a major flaw, which we will discuss below. But before we do, please try the following exercise.

Exercise. Verify that the following choice of $\Delta_k$ would make the preceding equation true.

$$ \begin{array}{rcl} \Delta_k & = & \frac{a_k}{\|a_k\|_2^2} (b_k - a_k^T \hat{x}_k). \end{array} $$
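
(To check your work: substituting this choice into the condition above gives

$$ a_k^T \Delta_k = \frac{a_k^T a_k}{\|a_k\|_2^2} \left( b_k - a_k^T \hat{x}_k \right) = b_k - a_k^T \hat{x}_k, $$

since $a_k^T a_k = \|a_k\|_2^2$.)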

Refining (or rather, "hacking") the basic idea: The least mean square (LMS) procedure

The basic idea sketched above has at least one major flaw: the choice of $\Delta_k$ might allow you to correctly predict $b_k$ from $a_k$ and the new estimate $\hat{x}_{k+1} = \hat{x}_k + \Delta_k$, but there is no guarantee that this new estimate preserves the quality of the predictions made at all previous iterations!

There are a number of ways to deal with this problem, including carrying out the update with respect to some (or all) of the previous data. However, there is also a simpler "hack" that, although it might require some parameter tuning, can be made to work in practice.

That hack is as follows. Rather than using $\Delta_k$ as computed above, let's compute a different update that has a "fudge" factor, $\phi$:

$$ \begin{array}{rrcl} & \hat{x}_{k+1} & = & \hat{x}_k + \Delta_k \\ \mbox{where} & \Delta_k & = & \phi \cdot a_k (b_k - a_k^T \hat{x}_k). \end{array} $$

A big question is how to choose $\phi$. There is some analysis out there that can help. We will just state the results of this analysis without proof.

Let $\lambda_{\mathrm{max}}(A^TA)$ be the largest eigenvalue of $A^TA$. The result is that, as the number of samples $m \rightarrow \infty$, the estimates produced using any choice of $\phi$ satisfying the following condition will eventually converge to the best least-squares estimator of $x$, that is, the estimate of $x$ you would have gotten by solving the least squares problem using all of the data.

$$ 0 < \phi < \frac{2}{\lambda_{\mathrm{max}}(A^TA)}. $$

This condition is not very satisfying, because you cannot really know $\lambda_{\mathrm{max}}(A^TA)$ until you've seen all the data, whereas we would like to apply this procedure online as the data arrive. Nevertheless, in practice you can imagine hybrid schemes that, given a batch of data points, use the QR fitting procedure to get a starting estimate for $x$ as well as to estimate a value of $\phi$ to use for all future updates.

Summary of the LMS algorithm

To summarize, the algorithm is as follows (a small code sketch of the update appears right after this list):

  • Choose any initial guess, $\hat{x}_0$, such as $\hat{x}_0 \leftarrow 0$.
  • For each observation $(b_k, a_k^T)$, do the update:

    • $\hat{x}_{k+1} \leftarrow \hat{x}_k + \Delta_k$,

    where $\Delta_k = \phi \cdot a_k (b_k - a_k^T \hat{x}_k)$.
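
As a reference point, here is a direct transcription of this update into code. It is only a minimal sketch: the function name lms_update and its calling convention are our own choices (not part of the original notebook), and it assumes 'x' and 'a' are NumPy column vectors of the same shape while 'b' is a scalar.


In [ ]:
def lms_update (x, a, b, phi):
    """
    Performs a single LMS step and returns the new estimate,
    x + phi * a * (b - a^T x). Assumes 'x' and 'a' are column
    vectors of the same shape and 'b' is a scalar observation.
    """
    r = b - a.T.dot (x)    # prediction error (a 1x1 array)
    return x + phi*a*r     # corrected estimate

# Tiny usage example on made-up numbers ('phi' chosen arbitrarily):
x_demo = lms_update (np.zeros ((2, 1)), np.array ([[1.0], [2.0]]), 3.0, 0.1)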

Trying out the LMS idea

Now you should implement the LMS algorithm and see how it behaves.

To start, let's generate an initial 1-D problem (2 regression coefficients, a slope and intercept), and solve it using the batch procedure.


In [ ]:
m = 100000
d = 1
x_true = generate_model (d)
(A, b) = generate_data (m, x_true, sigma=0.1)

print "Condition number of the data matrix: ", np.linalg.cond (A)

x = estimate_coeffs (A, b)
e_rel = rel_diff (x, x_true)

print "Relative error:", e_rel

Recall that we need a value for $\phi$, whose admissible range has an upper limit of $2 / \lambda_{\mathrm{max}}(A^TA)$. Let's cheat by computing $\lambda_{\mathrm{max}}(A^TA)$ explicitly, even though in practice we would need to do something different.


In [ ]:
LAMBDA_MAX = max (np.linalg.eigvals (A.T.dot (A)))
print LAMBDA_MAX
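
As a quick sanity check (an aside, not required below): because $A^TA$ is symmetric positive semi-definite, $\lambda_{\mathrm{max}}(A^TA)$ equals the square of the largest singular value of $A$, so the following should print roughly the same number.


In [ ]:
# Sanity check: lambda_max(A^T A) should equal (largest singular value of A)^2.
print np.linalg.norm (A, 2)**2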

Exercise. Implement the online LMS algorithm in the code cell below where indicated. It should produce a final parameter estimate, x_lms, as a column vector.

In addition, the skeleton code below uses rel_diff() to record the relative difference between the estimate and the true vector, storing the $k$-th relative difference in rel_diffs[k]. Doing so will allow you to see the convergence behavior of the method.

Lastly, to help you out, we've defined a constant in terms of $\lambda_{\mathrm{max}}(A^TA)$ that you can use for $\phi$.

In practice, you would only maintain the current estimate, or maybe just a few recent estimates, rather than the full history. Since all we want to inspect later is the convergence behavior, the skeleton below records just the relative error of each successive estimate.


In [ ]:
PHI = 1.99 / LAMBDA_MAX # Fudge factor
rel_diffs = np.zeros ((m+1, 1))

x_k = np.zeros ((d+1, 1)) # initial guess: a column vector of zeros
for k in range (m):
    rel_diffs[k] = rel_diff (x_k, x_true)
    
    # @YOUSE: Implement the online LMS algorithm.
    # Use (b[k], A[k, :]) as the k-th observation.
    pass
    
x_lms = x_k
rel_diffs[m] = rel_diff (x_lms, x_true)

Let's compare the true coefficients against the estimates, both from the batch algorithm and the online algorithm.


In [ ]:
print x_true.T
print x.T
print x_lms.T

Let's also plot the relative differences, rel_diffs[k], between each estimate $\hat{x}_k$ and the true coefficients x_true, measured in the two-norm, to see whether the estimate is converging to the truth.


In [ ]:
plt.plot (range (len (rel_diffs)), rel_diffs)

Finally, if the dimension is d=1, let's go ahead and do a sanity-check regression fit plot.


In [ ]:
STEP = A.shape[0] / 500
if d == 1:
    fig = plt.figure()
    ax1 = fig.add_subplot(111)
    ax1.plot (A[::STEP, 1], b[::STEP], 'b+') # blue - data
    ax1.plot (A[::STEP, 1], A.dot (x_true)[::STEP], 'r*') # red - true
    ax1.plot (A[::STEP, 1], A.dot (x)[::STEP], 'go') # green - batch
    ax1.plot (A[::STEP, 1], A.dot (x_lms)[::STEP], 'mo') # magenta - pure LMS
else:
    print "Plot is multidimensional; I live in Flatland, so I don't do that."

Exercise. Try playing with the dimensionality, the number of points, the fudge factor $\phi$, and the initial guess, to see how (or perhaps even whether) the online algorithm converges.


In [ ]:

Exercise. We said previously that, in practice, you would probably do some sort of hybrid scheme that mixes full batch updates (possibly only initially) and incremental updates. Try out such a scheme and describe what you observe.


In [ ]:
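
One possible starting point, offered only as a sketch and not as the official solution: fit the first few observations with the batch least-squares procedure to get a warm-start estimate and a rough value of $\phi$, then process the remaining observations with LMS updates. The batch size m0, the linear extrapolation of $\lambda_{\mathrm{max}}(A^TA)$, and the conservative safety factor below are all illustrative choices.


In [ ]:
# A sketch of a hybrid scheme; 'm0', the scaling heuristic, and the factor 1.0
# (well below the theoretical limit of 2) are illustrative choices.
m0 = 1000                                   # size of the initial batch
A0, b0 = A[:m0, :], b[:m0]
x_k = estimate_coeffs (A0, b0)              # warm start: batch least squares on the first m0 points

# Extrapolate lambda_max(A^T A) from the batch: for random rows, A^T A grows
# roughly linearly in the number of rows, so scale the batch estimate by m/m0.
lambda_0 = max (np.linalg.eigvals (A0.T.dot (A0)))
phi_hyb = 1.0 / (lambda_0 * float (m) / m0)

# Process the remaining observations one at a time with LMS updates.
for k in range (m0, m):
    a_k = A[k:k+1, :].T                     # k-th row, as a column vector
    x_k = x_k + phi_hyb * a_k * (b[k] - a_k.T.dot (x_k))

print "Relative error (hybrid):", rel_diff (x_k, x_true)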