Learning regression (which many of you have learned about before) is a robust method for modeling continuous data. We're teaching it because it's used a *lot* in practice, and it's fairly simple to get your head around.

Briefly, let's set the situation up: suppose you have a list of coordinate pairs $(x_1, y_1), (x_2, y_2), \ldots (x_n, y_n)$, and you would like to predict $y$ values corresponding to a particular $x'$. Our model will look like $f(x) = mx + b$, where our answer to the question will be $f(x')$ of course. Now, since this is real, messy data, of course all of the $x_i$ and $y_i$ are not going to follow the model. In reality, there will be some *error*, which we call the residuals (i.e., the difference between $f(x_i)$ and $y_i$).

Use linear regression when you have the following:

- The
`x`

values are fixed, known values (or at least have very low variance compared to the residuals). - Reasonable expectation that the data follows a linear model (this is usually the case, since by assumption (1) you can use as inputs any function of $x$ as a parameter to your regression).
- The residuals are normally distributed with constant variance, and are independent.

Usually, the last assumption is the most difficult to satisfy, since if you have data across a large $x$-span, then the errors not usually univariate (think about what a 10% measurement error at 1mm looks like and what it looks like at 100m). Remember that *variance*, (or the square of standard deviation) is a scalar quantity, unlike z-score.

The goal with OLS is to minimize the L2 error, or the sum of squares error. Consider a candidate $m$ and $b$. The sum of squares error is $$\sum_{i = 1}^n (mx_i + b - y_i)^2$$ To minimize this, we can take the derivative w.r.t. $m$ and the derivative w.r.t. $b$, and set them equal to 0. We get:

$$\sum_{i = 1}^n 2(mx_i + b - y_i) = 0$$$$\sum_{i = 1}^n 2x_i(mx_i + b - y_i) = 0$$Remember that we know every value except for $m$ and $b$. This gives us, then, a system of 2 equations in 2 variables. Solving that system gives us the OLS solution:

$$m = \frac{\overline{xy} - \overline{x} \cdot \overline{y}}{\overline{x^2} - \overline{x}^2}$$$$b = \overline{y} - m\cdot\overline{x}$$where we've used the notation $\overline{x}$ to mean the average of the $x_i$. Note in particular that the average $\overline{xy}$ means $(\sum_{i = 1}^n x_iy_i)/N$.

```
In [10]:
```%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
def ols(xes, yes):
"""Do Ordinary Least Squares (OLS) regression on a set of x and y positions"""
n = len(xes)
xbar = np.mean(xes)
print(xbar)
ybar = np.mean(yes)
print(ybar)
xvar = np.mean([x*x for x in xes])
xybar = np.mean([x*y for (x,y) in zip(xes, yes)])
m = (xybar - (xbar * ybar))/(xvar - xbar**2)
b = ybar - m*xbar
return m, b
x = np.arange(100)/100.
# generate y using a 'true' mean and y intercept, and then add noise
y = (x*2.02 + 5.) + (np.random.rand(100) * 0.1 - 0.05)
plt.plot(x, y, 'ro')

```
Out[10]:
```

```
In [12]:
```m, b = ols(x, y)
print(m, b)
plt.plot(x, y, 'ro')
plt.plot(x, m*x + b)

```
Out[12]:
```

We basically (to within a reasonable level of error) recovered the original slope and intercept of that data. We have a good model for predicting points in the future. However, we'd like to know **how** good our model is, which means we need to do cross validation. Luckily, there's a great framework for this built into `sklearn`

:

```
In [20]:
```from sklearn.cross_validation import train_test_split
from sklearn import linear_model
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)
# unfortunately, the shape of x_train and x_test isn't quite right
x_train = x_train.reshape(x_train.shape[0], 1)
x_test = x_test.reshape(x_test.shape[0], 1)
# let's use sklearn's nice linear regression API
regr = linear_model.LinearRegression()
# train the model using the training sets
regr.fit(x_train, y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f" % np.mean((regr.predict(x_test) - y_test) ** 2.0))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(x_test, y_test))
# Plot outputs
plt.scatter(x_test, y_test, marker='o')
plt.plot(x_test, regr.predict(x_test), color='blue', linewidth=1)
plt.xticks(())
plt.yticks(())
plt.show()

```
```

We used `sklearn`

's API here to teach you in general how prediction in linear regression land works. Notice that we had to *reshape* the x arrays in order to get them to work - that's because in general, x values aren't a single value, but instead an array of *features*, which are things that you look for. For example, a feature ($f_1$) could be the number of eyes in the picture, or the sum of all the pixel values, etc. Any measurable quantity can be a feature.

Suppose we wanted to learn some nonlinear function of the data, like functions of the form

$$y = a_1x^2 + a_2\ln(x)$$(this is a crazy function I've never seen in practice). Linear regression actually generalizes really powerfully, because we can just *transform* the input to get the features we're looking for. Let's try it in this example:

```
In [34]:
```# so we don't divide by 0!!!
epsilon = 0.0000001
x_train_transformed = np.array([[x**2, np.log(x + epsilon)] for x in x_train]).reshape(85, 2)
x_test_transformed = np.array([[x**2, np.log(x + epsilon)] for x in x_test]).reshape(15, 2)
# let's use sklearn's nice linear regression API
regr2 = linear_model.LinearRegression()
# train the model using the training sets
regr2.fit(x_train_transformed, y_train)
# The coefficients
print('Coefficients: \n', regr2.coef_)
# The mean squared error
print("Mean squared error: %.2f" % np.mean((regr2.predict(x_test_transformed) - y_test) ** 2.0))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr2.score(x_test_transformed, y_test))
# Plot outputs
plt.scatter(x_test, y_test, marker='o')
plt.plot(sorted(x_test), sorted(regr2.predict(x_test_transformed)), marker='o')
plt.xticks(())
plt.yticks(())
plt.show()

```
```

```
In [ ]:
```