Linear Regression

Linear regression (which many of you have seen before) is a robust method for modeling continuous data. We're teaching it because it's used a lot in practice, and it's fairly simple to get your head around.

Briefly, let's set the situation up: suppose you have a list of coordinate pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, and you would like to predict the $y$ value corresponding to some new $x'$. Our model will look like $f(x) = mx + b$, and our prediction will simply be $f(x')$. Since this is real, messy data, the $x_i$ and $y_i$ are not going to follow the model exactly. There will be some error, and we call the errors the residuals: the differences $y_i - f(x_i)$.
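To make the notation concrete, here's a tiny sketch with made-up numbers showing the residuals for one candidate $m$ and $b$:

import numpy as np

# made-up data and a candidate line f(x) = m*x + b
xs = np.array([0., 1., 2., 3.])
ys = np.array([5.1, 6.9, 9.2, 10.8])
m, b = 2.0, 5.0

residuals = ys - (m*xs + b)   # observed y minus the model's prediction
print(residuals)              # roughly [ 0.1, -0.1,  0.2, -0.2]
print(np.sum(residuals**2))   # the sum-of-squares error we'll minimize below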

When to use linear regression?

Use linear regression when you have the following:

  1. The x values are fixed, known values (or at least have very low variance compared to the residuals).
  2. Reasonable expectation that the data follows a linear model (this is less restrictive than it sounds: since the $x$ values are known exactly by assumption (1), you can feed any function of $x$ into the regression as an input).
  3. The residuals are normally distributed with constant variance, and are independent.

Usually, the last assumption is the most difficult to satisfy: if your data spans a large range of $x$, the errors usually don't have constant variance (think about what a 10% measurement error looks like at 1 mm versus what it looks like at 100 m). Remember that variance (the square of the standard deviation) is an absolute quantity with units, unlike a z-score, so a relative error like "10%" means the variance grows with the scale of the measurement.
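To make that concrete, here's a small sketch (with made-up numbers) contrasting constant-variance noise with a relative 10% error whose spread grows with $x$:

import numpy as np

xs = np.linspace(1., 1000., 1000)

constant_noise = np.random.normal(scale=1.0, size=xs.size)  # same spread everywhere: assumption (3) holds
relative_noise = np.random.normal(scale=0.1*xs)             # "10% error": spread grows with x, violating (3)

print(constant_noise[:100].std(), constant_noise[-100:].std())
print(relative_noise[:100].std(), relative_noise[-100:].std())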

Ordinary Least Squares (OLS) Regression

The goal with OLS is to minimize the L2 error, or the sum-of-squares error. Consider a candidate $m$ and $b$. The sum-of-squares error is $$\sum_{i = 1}^n (mx_i + b - y_i)^2$$ To minimize this, we can take the derivative w.r.t. $b$ and the derivative w.r.t. $m$, and set both equal to 0. We get:

$$\sum_{i = 1}^n 2(mx_i + b - y_i) = 0$$$$\sum_{i = 1}^n 2x_i(mx_i + b - y_i) = 0$$

Remember that we know every value except for $m$ and $b$. This gives us, then, a system of 2 equations in 2 variables. Solving that system gives us the OLS solution:

$$m = \frac{\overline{xy} - \overline{x} \cdot \overline{y}}{\overline{x^2} - \overline{x}^2}$$$$b = \overline{y} - m\cdot\overline{x}$$

where we've used the notation $\overline{x}$ to mean the average of the $x_i$. Note in particular that the average $\overline{xy}$ means $(\sum_{i = 1}^n x_iy_i)/n$, not $\overline{x}\cdot\overline{y}$.
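For completeness, here's the quick algebra that takes you from the two equations above to these formulas: divide each equation by $2n$, then substitute. The first equation becomes $$m\overline{x} + b = \overline{y} \implies b = \overline{y} - m\cdot\overline{x}$$ and substituting that into the second gives $$m\overline{x^2} + (\overline{y} - m\overline{x})\cdot\overline{x} = \overline{xy} \implies m\left(\overline{x^2} - \overline{x}^2\right) = \overline{xy} - \overline{x}\cdot\overline{y}$$ which rearranges to the formula for $m$.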

Ok, let's implement that


In [10]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

def ols(xes, yes):
    """Do Ordinary Least Squares (OLS) regression on a set of x and y positions."""
    xbar = np.mean(xes)                                   # average of the x values
    print(xbar)
    ybar = np.mean(yes)                                   # average of the y values
    print(ybar)
    x2bar = np.mean([x*x for x in xes])                   # average of x^2
    xybar = np.mean([x*y for (x, y) in zip(xes, yes)])    # average of x*y
    m = (xybar - xbar*ybar) / (x2bar - xbar**2)
    b = ybar - m*xbar
    return m, b

x = np.arange(100)/100.
# generate y using a 'true' slope and y intercept, and then add uniform noise in [-0.05, 0.05)
y = (x*2.02 + 5.) + (np.random.rand(100) * 0.1 - 0.05)
plt.plot(x, y, 'ro')


Out[10]:
[<matplotlib.lines.Line2D at 0x104e8eb50>]

In [12]:
m, b = ols(x, y)
print(m, b)
plt.plot(x, y, 'ro')
plt.plot(x, m*x + b)


0.495
5.99983861901
(2.01894579070568, 5.0004604526139484)
Out[12]:
[<matplotlib.lines.Line2D at 0x10516e510>]

So that works!

We basically (to within a reasonable level of error) recovered the original slope and intercept of the data, so we have a good model for predicting future points. However, we'd like to know how good our model is, which means we need to validate it on data the model wasn't trained on. Luckily, there's a great framework for this built into sklearn:


In [20]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
from sklearn import linear_model

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)

# sklearn expects 2D inputs of shape (n_samples, n_features), so reshape the 1D x arrays
x_train = x_train.reshape(-1, 1)
x_test = x_test.reshape(-1, 1)

# let's use sklearn's nice linear regression API
regr = linear_model.LinearRegression()

# train the model using the training sets
regr.fit(x_train, y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)

# The mean squared error
print("Mean squared error: %.2f" % np.mean((regr.predict(x_test) - y_test) ** 2.0))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x_test, y_test))

# Plot outputs
plt.scatter(x_test, y_test, marker='o')
plt.plot(x_test, regr.predict(x_test), color='blue', linewidth=1)

plt.xticks(())
plt.yticks(())

plt.show()


('Coefficients: \n', array([ 2.02200671]))
Mean squared error: 0.00
Variance score: 1.00
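As an aside, what we did above is a single hold-out split rather than full cross-validation; if you want a more robust estimate of model quality, sklearn can score the model across several different splits. A minimal sketch (assuming a reasonably recent sklearn with model_selection):

from sklearn.model_selection import cross_val_score
from sklearn import linear_model

# fit and score on 5 different train/test splits; the score is R^2 for regression
scores = cross_val_score(linear_model.LinearRegression(), x.reshape(-1, 1), y, cv=5)
print(scores)
print(scores.mean())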

We used sklearn's API here to show you how prediction in linear-regression land works in general. Notice that we had to reshape the x arrays to get them to work; that's because, in general, each input isn't a single value but an array of features, so sklearn expects x to have shape (n_samples, n_features). A feature is any measurable quantity about a sample: for example, a feature ($f_1$) could be the number of eyes in the picture, or the sum of all the pixel values, etc.
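For instance, here's a hypothetical sketch of a two-feature input matrix (the feature names and numbers are made up), showing that with real feature vectors no reshape is needed:

import numpy as np
from sklearn import linear_model

# hypothetical features: f1 = number of eyes in the picture, f2 = scaled sum of pixel values
X = np.array([[2, 0.91],
              [2, 0.85],
              [0, 0.10],
              [1, 0.45]])
targets = np.array([1.0, 1.0, 0.0, 0.5])   # made-up target values

model = linear_model.LinearRegression()
model.fit(X, targets)      # X already has shape (n_samples, n_features)
print(model.coef_)         # one coefficient per feature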

More than linear regression?

Suppose we wanted to learn some nonlinear function of the data, like functions of the form

$$y = a_1x^2 + a_2\ln(x)$$

(this is a crazy function I've never seen in practice). Linear regression actually generalizes really powerfully here, because the model is still linear in the coefficients $a_1$ and $a_2$: we can just transform the input to get the features we're looking for (namely $x^2$ and $\ln(x)$) and regress on those. Let's try it in this example:


In [34]:
# add a tiny epsilon so we don't take log(0)
epsilon = 0.0000001

# build the transformed features [x^2, log(x)] for each sample
x_train_transformed = np.hstack([x_train**2, np.log(x_train + epsilon)])
x_test_transformed = np.hstack([x_test**2, np.log(x_test + epsilon)])

# let's use sklearn's nice linear regression API
regr2 = linear_model.LinearRegression()

# train the model using the training sets
regr2.fit(x_train_transformed, y_train)

# The coefficients
print('Coefficients: \n', regr2.coef_)

# The mean squared error
print("Mean squared error: %.2f" % np.mean((regr2.predict(x_test_transformed) - y_test) ** 2.0))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr2.score(x_test_transformed, y_test))

# Plot outputs (sort by x so the fitted curve is drawn left to right)
order = np.argsort(x_test.ravel())
plt.scatter(x_test, y_test, marker='o')
plt.plot(x_test.ravel()[order], regr2.predict(x_test_transformed)[order], marker='o')

plt.xticks(())
plt.yticks(())

plt.show()


('Coefficients: \n', array([ 1.76249762,  0.0514794 ]))
Mean squared error: 0.01
Variance score: 0.95
