This is an interactive implementation of example 1.1 from Christopher Bishop's highly recommended book Pattern Recognition and Machine Learning.
In this example we generate a random dataset by sampling the function $\sin(2\pi x)$ and adding Gaussian noise, then perform a least squares polynomial regression fit to the randomly generated data with a user-definable polynomial degree. The hope in creating this notebook was to build a bit of intuition for anyone interested in learning about least squares regression, much in the way the book does, by letting you interactively tweak the parameters of the model.
This should help provide some intuition for what something like "over-fitting" looks like on a simple regression problem.
In [35]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
from ipywidgets import widgets
from IPython.display import display
from math import pi, sin
import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import Ridge
%matplotlib inline
In [15]:
def target(x):
    '''
    Function to generate target variables
    '''
    return sin(2 * pi * x) + np.random.normal(scale=0.3)

def example_data_generating_dist(size):
    '''
    Function to generate example data
    size = size of data set to generate
    '''
    data = []
    for i in range(size):
        x = np.random.uniform()
        y = target(x)
        data.append([x, y])
    arr = np.array(data)
    x = np.array(arr[:, 0])
    y = np.array(arr[:, 1])
    return x, y
def polyfit(x, y, degree):
    '''
    Fit a polynomial of the given degree to some data
    '''
    _coef = np.polyfit(x, y, degree)
    _poly = np.poly1d(_coef)
    return _poly
def graph_polyfit(degree, size):
    x, y = example_data_generating_dist(size)
    model = polyfit(x, y, degree)
    xp = np.linspace(-1, 1, 50)
    plt.ylim(y.min() - .2, y.max() + .2)
    plt.xlim(x.min() - .2, x.max() + .2)
    plt.plot(x, y, '.', xp, model(xp), '--')
    plt.show()
    return model
In [28]:
graph_underfit = interactive(graph_polyfit, degree=1, size=10)
graph_underfit
Just as we did with the underfit example above, we can also generate an example of what would be considered overfitting of our $\sin(2\pi x)$ function.
As you can see below, the 9th degree polynomial oscillates wildly between the data points in an effort to find a fit, ending up too "close" to the particular data the model sees.
In [29]:
graph_overfit = interactive(graph_polyfit, degree=9, size=10)
graph_overfit
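Borrowing from the book's discussion, another way to spot the overfitting is to look at the magnitude of the fitted coefficients: as the polynomial degree grows while the dataset stays small, the coefficients tend to blow up. Here is a quick sketch using the helpers defined above (the exact values will differ on every run since the data is random):
In [ ]:
# fit a 9th degree polynomial to just 10 noisy samples and inspect its coefficients;
# with so few points the coefficient magnitudes typically become very large
x, y = example_data_generating_dist(10)
wild_model = polyfit(x, y, 9)
print(wild_model.coeffs)  # np.poly1d stores coefficients from highest degree down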
In [31]:
graph_just_right = interactive(graph_polyfit, degree=3, size=10)
graph_just_right
There are multiple ways that we can deal with overfitting. The easiest, if you can manage it, is simply to get more data from the system you're trying to model. More examples help the model generalize rather than chase the noise in the particular points it has seen.
Remember that when dealing with a least squares fit, you're working to minimize the sum of the squares of the errors between our predictions $y(x_n, w)$ and their corresponding target values $t_n$ over all $N$ data points:
$$ E(w) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, w) - t_n\}^2$$
Intuitively, you can imagine that minimizing this sum over a larger number of data points allows more and more data to "push" and "pull" on the function approximating it, smoothing the function in the process.
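As a concrete illustration, here is a minimal sketch of computing this error for a fitted polynomial, reusing the helpers defined above (the function name sum_squares_error is just for illustration):
In [ ]:
def sum_squares_error(model, x, t):
    # E(w) = 1/2 * sum over n of (y(x_n, w) - t_n)^2
    predictions = model(x)
    return 0.5 * np.sum((predictions - t) ** 2)

x, t = example_data_generating_dist(10)
fitted = polyfit(x, t, 3)
print(sum_squares_error(fitted, x, t))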
In [33]:
graph_reg = interactive(graph_polyfit, degree=9, size=100)
graph_reg
You can see above that just by sampling 90 more data points from our mock function, the 9th degree polynomial already starts to smooth out quite a bit. If you grab the size slider and move it to the right, generating more sample data, you'll see that the recomputed polynomial modeling the data smooths out even more, giving us this regularization effect.
If you can't get your hands on more data from the system that produced it, then you may have to move into the realm of adding a penalty to your cost, or error, function $E(w)$.
A common penalization to add to our error function is the sum of the squares of all parameters, or coefficients, of the model.
$$ E(w) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, w) - t_n\}^2 + \frac{\lambda}{2}\|w\|^2$$
where $\|w\|^2 = w^Tw = w_0^2 + w_1^2 + w_2^2 + \dots + w_M^2$. The coefficient $\lambda$ governs the relative importance of the regularization term compared to the sum-of-squares error.
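For a polynomial model this penalized error has a closed-form minimizer, which is one way to see exactly how $\lambda$ enters the fit. Below is a rough sketch of solving the regularized normal equations directly with NumPy; the function name ridge_polyfit and the particular value of $\lambda$ are just for illustration:
In [ ]:
def ridge_polyfit(x, t, degree, lam):
    # design matrix with columns x^0, x^1, ..., x^degree
    Phi = np.vander(x, degree + 1, increasing=True)
    # the minimizer of E(w) = 1/2 ||Phi w - t||^2 + lambda/2 ||w||^2
    # is w = (lambda * I + Phi^T Phi)^(-1) Phi^T t
    A = lam * np.eye(degree + 1) + Phi.T.dot(Phi)
    w = np.linalg.solve(A, Phi.T.dot(t))
    return np.poly1d(w[::-1])  # np.poly1d expects the highest degree coefficient first

x, t = example_data_generating_dist(10)
smoothed = ridge_polyfit(x, t, 9, np.exp(-18))  # ln(lambda) = -18, as in the book's example
print(smoothed.coeffs)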
In [62]:
# generate some random data and target values
rr_X, rr_y = example_data_generating_dist(100)
In [60]:
rr_y
Out[60]:
In [79]:
clf = Ridge(alpha=1.0, solver='lsqr')
clf.fit(rr_X[:,np.newaxis], rr_y)
Out[79]:
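The alpha parameter of scikit-learn's Ridge plays the role of $\lambda$ above. Once the estimator is fitted, the learned weights can be inspected directly, which makes for a quick sanity check (your values will differ since the data is random):
In [ ]:
# slope and intercept of the plain linear ridge fit above
print(clf.coef_, clf.intercept_)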
In [80]:
plt.scatter(rr_X, rr_y)
Out[80]:
In [87]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
colors = ['teal', 'yellowgreen', 'gold']
lw = 2
# sort x so the fitted curve plots as a smooth line rather than a zig-zag
order = np.argsort(rr_X)
plt.scatter(rr_X, rr_y, color='navy', s=30, label="training points")
for count, degree in enumerate([2]):
    # expand x into polynomial features of the given degree before the ridge fit
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(rr_X[:, np.newaxis], rr_y)
    y_plot = model.predict(rr_X[order][:, np.newaxis])
    plt.plot(rr_X[order], y_plot, color=colors[count], linewidth=lw,
             label="degree %d" % degree)
plt.legend()
plt.show()
In [ ]: