This will be the last post in the series on Linear Regression. We may revisit it in other contexts, but it has served its purpose in establishing many fundamental ideas of Machine Learning. There are a couple more fundamentals that are well illustrated with linear regression, namely using non-linear feature variables in the model, measuring goodness of fit, and the problem of over-fitting.
For background information, the past 5 posts in this series are listed below.
These posts along with the current one were converted to html from Jupyter notebooks. The notebooks are available at https://github.com/dbkinghorn/blog-jupyter-notebooks
Up until now we have looked mostly at fitting our data to a straight line with a model function $h(x) = a_0 + a_1x$. In the last post we showed how the regression problem could be trivially extended to multiple features with the vectorized form of the equations. That gave us models like $h(x) = a_0 + a_1x_1 + a_2x_2 \dots$ Those models are planes or hyper-planes. They work well when the features in a data set closely follow lines. What if they don't? What if the data you are looking at is curved? You can use non-linear feature variable terms in your model function! This is true for both the single-variable and multi-variate cases.
The "Linear" in Linear Regression refers to the model parameters, $a$. The feature variables can be almost anything that makes sense for the data you are trying to fit your model to.
You can use terms like $x^2$, $\sqrt{x}$, $\exp{(-x^2)}$, $(1 - \exp{(-x)})^2$, ... As long as the parameters $a$ are just simple scalar multipliers, you can use linear regression.
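To make that concrete, here is a minimal sketch (with made-up numbers, not the example data used later in this post) showing that a model with $\sqrt{x}$ and $\exp{(-x^2)}$ terms is still solved exactly like a straight-line fit, using the same pseudo-inverse solution derived below.
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])   # made-up feature values
y = np.array([2.1, 2.9, 3.6, 4.1, 4.4, 4.7])   # made-up targets

# model h(x) = a0 + a1*sqrt(x) + a2*exp(-x**2), still linear in the parameters a
X = np.column_stack( (np.ones(x.size), np.sqrt(x), np.exp(-x**2)) )

a = np.dot( np.linalg.pinv(X), y )  # same closed-form solution as an ordinary linear fit
print('a =', a)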
Side Note: Non-Linear Regression is also very useful and you can use "least-squares" as a cost function but the problem is then non-linear optimization and there are extra difficulties to consider. I'll probably do a non-linear regression post sometime in the future. A good example of that would be fitting multi-core CPU parallel performance of some software package to Amdahl's Law. In the notation we have been using in these posts it would look like $h_a(x) = 1/( (1-a) + a/x)$ where $a$ is the "parallel fraction" and $x$ is the number of CPU cores. $h_a(x)$ is then the "speedup". Here are a couple of posts using that, LAMMPS Optimized for Intel on Quad Socket Xeon, and Xeon E5v3 All Core Turbo and Amdahl's Law.
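As a taste of what that non-linear fit might look like, here is a minimal sketch using scipy.optimize.curve_fit. This is not the linear-regression machinery from these posts, and the core counts and speedups below are hypothetical numbers made up for illustration.
import numpy as np
from scipy.optimize import curve_fit

def amdahl(x, a) :  # Amdahl's Law: speedup = 1/((1-a) + a/x), a = parallel fraction
    return 1.0/((1.0 - a) + a/x)

cores   = np.array([1, 2, 4, 8, 16, 32])             # hypothetical core counts
speedup = np.array([1.0, 1.9, 3.4, 5.6, 8.1, 10.2])  # hypothetical measured speedups

a_fit, a_cov = curve_fit(amdahl, cores, speedup, p0=[0.9])  # non-linear least-squares fit
print('parallel fraction a =', a_fit[0])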
One of the things we are trying to achieve with non-linear feature variables is getting a better fit to our data. What does "better fit" mean, and how can you measure it? If you have simple data and a simple model you may be able to make a plot and just look at how well your model is fitting the data. If you have many variables and a complicated model, visualization may be difficult. We want a number that tells us if we have made an improvement in our model. An obvious number to use would be the value of the cost function. It is made up of the sum of squared errors between the training data and the values predicted by the model. The linear-regression procedure gives parameters for the model that minimize that. If a non-linear feature variable is used to improve the model then the value of the cost function using the optimized parameters should decrease, i.e. less error. That is a reasonable number to look at.
There is a common value that is often used to evaluate the "goodness of fit" for a regression, $R^2$. It goes by various names, such as the coefficient of determination (or of correlation, or variation), but it's most common for people to just say "R squared".
$R^2$ uses the sum of squared errors like the cost function together with a term that is proportional to the variance of the training data. Below is a list of the equations we derived in the last post together with an expression for $R^2$ so you can see how everything relates. We will use some of these equations in the Python code for the examples.
$$ \bbox[25px,border:2px solid green]{ \begin{align} h_a(X) & = Xa \\ \\ J(a) & = \frac{1}{2m} (Xa - y)'(Xa - y) \\ \\ \nabla_aJ(a) & = \frac{1}{m} (X'Xa - X'y) \\ \\ a & = (X'X)^{-1}X'y \\ \\ R^2 & = 1 - \frac{(Xa - y)'(Xa - y)}{(y - \overline{y})'(y - \overline{y})} \end{align} }$$
$\overline{y}$ is the mean of the $y$ values. Values of $R^2$ close to 1 indicate a better fit of the model to the training data.
Let's do an example of linear regression with a non-linear feature variable model. This is data that I made up for an example, but it is typical of data you may see from some kinds of experiments or other measurements. It starts off with a steep decline and then levels off to a constant value.
It's time to write some code and make some plots so we'll load our usual Python modules into the notebook and write a few function definitions.
In [1]:
import pandas as pd # data handling
import numpy as np # numerical computing
import matplotlib.pyplot as plt # plotting core
import seaborn as sns # higher level plotting tools
%matplotlib inline
sns.set()
In [2]:
def h(X,a) :  # model h(X) = Xa
    h = np.dot(X,a)
    return h

def a_opt(X,y) :  # linear regression solution a = (X'X)^-1 X'y = pinv(X)y
    a_opt = np.dot( np.linalg.pinv(X), y)
    return a_opt

def J(a,X,y) :  # Cost function
    m = y.size
    J = 1/(2*m) * np.dot( ( h(X,a) - y ).T, ( h(X,a) - y ) )
    return J

def R2(a,X,y) :  # R squared
    y_mean = y.mean()
    R2 = 1 - ( np.dot( ( h(X,a) - y ).T, ( h(X,a) - y ) ) / np.dot( ( y - y_mean ).T, ( y - y_mean ) ) )
    return R2
Here's the example data and a plot,
In [3]:
x = np.array([1, 1.2, 1.4, 1.7, 2, 2.5, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([7, 6 , 5, 4, 3, 2, 1, .5, .4, .35, .3, .25 ,.2,.15])
plt.plot( x,y, "o", color='g', label='training data'); plt.legend();
In [4]:
X = np.column_stack( (np.ones((x.size,1)) , x )) # construct the augmented matrix X
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
xt = np.linspace(1,10,100)
ax.plot(xt, a[0] + a[1]*xt , label='h(x) = %.2f + %.2f x' %(a[0], a[1])) # plot the model function
ax.legend()
Out[4]:
In [5]:
print('a =', a) # print the values of a
print('J =', J(a,X,y)) # value of Cost function
print("R2 =", R2(a,X,y)) # R squared
In [6]:
X = np.column_stack( (np.ones((x.size,1)) , x**2 )) # construct the augmented matrix X
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
xt = np.linspace(1,10,100)
ax.plot(xt, a[0] + a[1]*xt**2 , label='h(x) = %.2f + %.2f x**2' %(a[0], a[1])) # plot the model function
ax.legend()
Out[6]:
In [7]:
print('a =', a) # print the values of a
print('J =', J(a,X,y)) # value of Cost function
print("R2 =", R2(a,X,y)) # R squared
In [8]:
X = np.column_stack( (np.ones((x.size,1)) , x, x**2 )) # construct the augmented matrix X
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
xt = np.linspace(1,11,100)
ax.plot(xt, a[0] + a[1]*xt + a[2]*xt**2 , label='h(x) = %.2f + %.2f x + %.2f x**2' %(a[0], a[1], a[2])) # plot the model function
ax.legend()
Out[8]:
In [9]:
print('a =', a) # print the values of a
print('J =', J(a,X,y)) # value of Cost function
print("R2 =", R2(a,X,y)) # R squared
In [10]:
X = np.column_stack( (np.ones((x.size,1)) , 1/x )) # construct the augmented matrix X
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
xt = np.linspace(1,11,100)
ax.plot(xt, a[0] + a[1]/xt , label='h(x) = %.2f + %.2f * 1/x' %(a[0], a[1])) # plot the model function
ax.legend()
Out[10]:
In [11]:
print('a =', a) # print the values of a
print('J =', J(a,X,y)) # value of Cost function
print("R2 =", R2(a,X,y)) # R squared
In [12]:
X = np.column_stack( (np.ones((x.size,1)) , np.exp(-x) )) # construct the augmented matrix X
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
xt = np.linspace(1,11,100)
ax.plot(xt, a[0] + a[1]*np.exp(-xt) , label='h(x) = %.2f + %.2f * exp(-x)' %(a[0], a[1])) # plot the model function
ax.legend()
Out[12]:
In [13]:
print('a =', a) # print the values of a
print('J =', J(a,X,y)) # value of Cost function
print("R2 =", R2(a,X,y)) # R squared
That is a very good fit and the model has the behavior that we want. The $R^2$ is also very close to 1, indicating that the fit to the training data is very good, and the cost, $J$, is near 0.
Another point: if your data has very complex behavior it may be difficult to come up with a suitable model. One solution to this problem is to try a very general model with lots of terms. The regression may be able to pick out the terms that are useful to reduce the error by making the parameters of important terms large and those of unimportant terms small. However, there is a very serious problem that can occur if you use a complex model with many terms: over-fitting!
Under-fitting is usually fairly obvious. If your model is inadequate to represent the data you are trying to fit, you can usually see that visually in a plot, or just observe that the value of the cost function, $J$, is high or $R^2$ is far from 1. Looking at how well or poorly your model predicts a result from a few data points in your training set will tell you right away if your model is inadequate. [It's better to have a separate test set, but we will get to that later.]
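If you do want to try a general polynomial model like the ones fit below, the design matrix can be built compactly. This is just a small sketch using NumPy's vander; the cells below stack the columns explicitly so each term stays visible.
import numpy as np

x = np.linspace(0, 10, 6)                      # the same kind of training x used below
degree = 5
X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x**2, ..., x**5
print(X.shape)                                 # X can then be passed to a_opt(X, y) as usual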
Over-fitting can be more subtle. You may have a model that fits your training data very well, maybe even perfectly, and still have a model that does a terrible job of predicting unknown values.
The goal of fitting model parameters to a training set is to produce a model that "generalizes" well for making predictions from data that is NOT in your training set. Over-fitting your model to your training set can produce a model that has very little predictive value.
Over-fitting is a serious problem in machine learning and we will discuss it in more detail and look at ways to avoid it in later posts. Right now I just want to introduce the problem in a way that is very clear and easy to visualize. To do this I will generate some training data from a known function and then show what can happen by over-fitting that data with a polynomial model function.
This will be illustrated with a series of regression fits to 6 points of training-set data that is selected from a known function, $x\sin(x/2)$. There will also be 6 points of data for a test-set that is taken from other places along the curve. The training and test sets also include a small amount of random "noise". We will calculate the cost function, $J$, and $R^2$ values to evaluate the goodness of fit. This goodness of fit will be checked against the training-set and the test-set.
There will be some good lessons to learn from this!
In [14]:
def f(x) :  # source of the data
    f = x*np.sin(x/2)
    return f

np.random.seed(22)
noise = np.random.randn(12)    # some random "normal" noise

x = np.linspace(0,10,6)        # x for training-set
y = f(x) + noise[0:6]          # y for training-set
xtest = np.linspace(-1,9,6)    # x for test-set
ytest = f(xtest) + noise[6:12] # y for test-set
Here is a plot of all the data,
In [15]:
fig, ax = plt.subplots()
ax.plot(x,y, 'o', color='g', label='training data' )
ax.plot(xtest,ytest, 'o', color='r',label='testset data')
xt = np.linspace(-2,11,100)
ax.plot(xt, f(xt), '.', color='c')
plt.legend()
Out[15]:
In [16]:
X = np.column_stack( (np.ones((x.size,1)) , x )) # construct the augmented matrix X
Xtest = np.column_stack( (np.ones((xtest.size,1)) , xtest )) # needed for error analysis
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
ax.plot(xtest,ytest, 'o', color='r', label='testset data')
xt = np.linspace(-1,10,100)
ax.plot(xt, f(xt), '.', color='c', label='source function')
ax.plot(xt, a[0] + a[1]*xt ,
label='h(x) = %.2f + %.2f*x'
%(a[0],a[1])) # plot the model function
ax.legend()
Out[16]:
In [17]:
print('a =', a) # print the values of a
print('training-set: ','J =', J(a,X,y), " R2 =", R2(a,X,y))
print('test-set: ','J =', J(a,Xtest,ytest), " R2 =", R2(a,Xtest,ytest))
In [18]:
X = np.column_stack( (np.ones((x.size,1)) , x, x**2 )) # construct the augmented matrix X
Xtest = np.column_stack( (np.ones((xtest.size,1)) , xtest, xtest**2 )) # needed for error analysis
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
ax.plot(xtest,ytest, 'o', color='r', label='testset data')
xt = np.linspace(-1,10,100)
ax.plot(xt, f(xt), '.', color='c', label='source function')
ax.plot(xt, a[0] + a[1]*xt + a[2]*xt**2 ,
label='h(x) = %.2f + %.2f*x + %.2f*x**2'
%(a[0],a[1],a[2])) # plot the model function
ax.legend()
Out[18]:
In [19]:
print('a =', a) # print the values of a
print('training-set: ','J =', J(a,X,y), " R2 =", R2(a,X,y))
print('test-set: ','J =', J(a,Xtest,ytest), " R2 =", R2(a,Xtest,ytest))
In [20]:
X = np.column_stack( (np.ones((x.size,1)) , x, x**2, x**3 )) # construct the augmented matrix X
Xtest = np.column_stack( (np.ones((xtest.size,1)) , xtest, xtest**2, xtest**3 )) # needed for error analysis
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
ax.plot(xtest,ytest, 'o', color='r', label='testset data')
xt = np.linspace(-1,10,100)
ax.plot(xt, f(xt), '.', color='c', label='source function')
ax.plot(xt, a[0] + a[1]*xt + a[2]*xt**2 + a[3]*xt**3,
label='h(x) = %.2f + %.2f*x + %.2f*x**2 + %.2f*x**3'
%(a[0],a[1],a[2],a[3])) # plot the model function
ax.legend()
Out[20]:
In [21]:
print('a =', a) # print the values of a
print('training-set: ','J =', J(a,X,y), " R2 =", R2(a,X,y))
print('test-set: ','J =', J(a,Xtest,ytest), " R2 =", R2(a,Xtest,ytest))
In [22]:
X = np.column_stack( (np.ones((x.size,1)) , x, x**2, x**3, x**4 )) # construct the augmented matrix X
Xtest = np.column_stack( (np.ones((xtest.size,1)) , xtest, xtest**2, xtest**3, xtest**4 )) # needed for error analysis
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
ax.plot(xtest,ytest, 'o', color='r', label='testset data')
xt = np.linspace(-1,10,100)
ax.plot(xt, f(xt), '.', color='c', label='source function')
ax.plot(xt, a[0] + a[1]*xt + a[2]*xt**2 + a[3]*xt**3 + a[4]*xt**4,
label='h(x) = %.2f + %.2f*x + %.2f*x**2 + %.2f*x**3 + %.2f*x**4'
%(a[0],a[1],a[2],a[3],a[4])) # plot the model function
ax.legend()
Out[22]:
In [23]:
print('a =', a) # print the values of a
print('training-set: ','J =', J(a,X,y), " R2 =", R2(a,X,y))
print('test-set: ','J =', J(a,Xtest,ytest), " R2 =", R2(a,Xtest,ytest))
In [24]:
X = np.column_stack( (np.ones((x.size,1)) , x, x**2, x**3, x**4, x**5 )) # construct the augmented matrix X
Xtest = np.column_stack( (np.ones((xtest.size,1)) , xtest, xtest**2, xtest**3, xtest**4, xtest**5 )) # needed for error analysis
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
ax.plot(xtest,ytest, 'o', color='r', label='testset data')
xt = np.linspace(-1,10,100)
ax.plot(xt, f(xt), '.', color='c', label='source function')
ax.plot(xt, a[0] + a[1]*xt + a[2]*xt**2 + a[3]*xt**3 + a[4]*xt**4 + a[5]*xt**5,
label='h(x) = %.2f + %.2f*x + %.2f*x**2 + %.2f*x**3 + %.2f*x**4 + %.2f*x**5'
%(a[0],a[1],a[2],a[3],a[4],a[5])) # plot the model function
ax.legend()
Out[24]:
In [25]:
print('a =', a) # print the values of a
print('training-set: ','J =', J(a,X,y), " R2 =", R2(a,X,y))
print('test-set: ','J =', J(a,Xtest,ytest), " R2 =", R2(a,Xtest,ytest))
Now the training-set fit is exact! The cost function $J$ is 0 and $R^2$ is 1. With six parameters and only six training points the polynomial can pass through every training point exactly. The test-set fit is still poor. Even though the model function fits the training-set perfectly, it is not a very good model overall. It's not as bad as it could be!
Let's add 3 more terms just for fun,
In [26]:
X = np.column_stack( (np.ones((x.size,1)) , x, x**2, x**3, x**4, x**5, x**6, x**7, x**8 )) # construct the augmented matrix X
Xtest = np.column_stack( (np.ones((xtest.size,1)) , xtest, xtest**2, xtest**3, xtest**4, xtest**5, xtest**6, xtest**7, xtest**8 )) # needed for error analysis
a = a_opt(X,y) # solve the regression
fig, ax = plt.subplots()
ax.plot( x,y, "o", color='g', label='training data') # plot the test data
ax.plot(xtest,ytest, 'o', color='r', label='testset data')
xt = np.linspace(-1,10,100)
ax.plot(xt, f(xt), '.', color='c', label='source function')
ax.plot(xt, a[0] + a[1]*xt + a[2]*xt**2 + a[3]*xt**3 + a[4]*xt**4 + a[5]*xt**5 + a[6]*xt**6 + a[7]*xt**7 + a[8]*xt**8,
label='h(x) = %.2f + %.2f*x + %.2f*x**2 + %.2f*x**3 + %.2f*x**4 + %.2f*x**5 + %.2f*x**6 + %.2f*x**7 + %.2f*x**8'
%(a[0],a[1],a[2],a[3],a[4],a[5],a[6],a[7],a[8])) # plot the model function
ax.legend()
Out[26]:
In [27]:
print('a =', a) # print the values of a
print('training-set: ','J =', J(a,X,y), " R2 =", R2(a,X,y))
print('test-set: ','J =', J(a,Xtest,ytest), " R2 =", R2(a,Xtest,ytest))
Garbage!
We have covered a lot of ground and presented many fundamental ideas and problems in Machine Learning in this series of posts. I hope you've enjoyed these posts as much as I have enjoyed writing them! I'll start in on other Machine Learning topics in the next posts.
Happy computing! --dbk