In [ ]:
%%HTML
<style>
.container { width:100% }
</style>
In [ ]:
import numpy as np
import sklearn.linear_model as lm
In this notebook we discuss both polynomial regression and overfitting.
One possible cause of overfitting is correlation between features. Let us create a dataset with two features $x_1$ and $x_2$ that are more or less the same. Actually, $x_2$ is $x_1$ plus some random noise. The dependent variable $y$ is the square root of $x_1$.
In [ ]:
np.random.seed(42)
N = 20 # number of data points
X1 = np.array([k for k in range(N)])
X2 = np.array([k + 0.2 * (np.random.rand() - 0.5) for k in range(N)])
Y = np.sqrt(X1) # Y is the square root of X1
X1 = np.reshape(X1, (N, 1)) # turn X1 into an N x 1 matrix
X2 = np.reshape(X2, (N, 1)) # turn X2 into an N x 1 matrix
X = np.hstack([X1, X2]) # combine X1 and X2 into an N x 2 matrix
X
Let us plot the data. We will use colors to distinguish between $x_1$ and $x_2$: the pairs $(x_1, y)$ are plotted in blue, while the pairs $(x_2, y)$ are plotted in yellow.
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
plt.figure(figsize=(15, 12))
sns.set(style='darkgrid')
plt.title('A Regression Problem')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.xticks(np.arange(0.0, N, step=1.0))
plt.yticks(np.arange(0.0, np.sqrt(N) + 1, step=0.25))
plt.scatter(X1, Y, color='b')
plt.scatter(X2, Y, color='y')
We want to split the data into a training set and a test set.
The training set will be used to compute the parameters of our model, while the test set is only used to check the accuracy. SciKit-Learn provides the predefined function sklearn.model_selection.train_test_split that randomly splits data into a training set and a test set.
In [ ]:
from sklearn.model_selection import train_test_split
We will split the data at a ratio of $4:1$, i.e. $80\%$ of the data will be used for training, while the remaining $20\%$ is used to test the accuracy.
In [ ]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
X_test
In order to build a linear regression model, we use the module linear_model from SciKit-Learn, which we imported above as lm.
The function $\texttt{linear_regression}(\texttt{X_train}, \texttt{Y_train}, \texttt{X_test}, \texttt{Y_test})$ takes a feature matrix $\texttt{X_train}$ and a corresponding vector $\texttt{Y_train}$ and computes a linear regression model $M$ that best fits these data. Then, the explained variance of the model is computed both for the training set and for the test set.
In [ ]:
def linear_regression(X_train, Y_train, X_test, Y_test):
    M = lm.LinearRegression()                 # ordinary least-squares model
    M.fit(X_train, Y_train)                   # fit on the training data only
    train_score = M.score(X_train, Y_train)   # explained variance (R²) on the training set
    test_score  = M.score(X_test , Y_test)    # explained variance (R²) on the test set
    return M, train_score, test_score
We use the function linear_regression to build a model for our data. Note that this function uses only the training data to create the model. The test data are used only for evaluating the model.
In [ ]:
M, train_score, test_score = linear_regression(X_train, Y_train, X_test, Y_test)
train_score, test_score
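The score returned by M.score is the coefficient of determination $R^2$, i.e. the fraction of the variance of $y$ that the model explains. As a quick sanity check, here is a minimal sketch that recomputes this value by hand on the test set, using the names M, X_test, and Y_test defined above.
In [ ]:
# Recompute the explained variance (R²) on the test set by hand:
# R² = 1 - (residual sum of squares) / (total sum of squares).
Y_pred = M.predict(X_test)
ss_res = np.sum((Y_test - Y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((Y_test - np.mean(Y_test)) ** 2)  # total sum of squares
1 - ss_res / ss_tot, M.score(X_test, Y_test)      # the two values agree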
Notice that the explained variance is a lot worse on the test set. Let's plot the linear model. The coefficients are stored in M.intercept_ and M.coef_.
In [ ]:
ϑ0 = M.intercept_
ϑ1, ϑ2 = M.coef_
In [ ]:
plt.figure(figsize=(15, 10))
sns.set(style='darkgrid')
plt.title('A Regression Problem')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.xticks(np.arange(0.0, N + 1, step=1.0))
plt.yticks(np.arange(0.0, np.sqrt(N) + 1, step=0.25))
plt.scatter(X_train[:,0], Y_train, color='b')
plt.scatter(X_test [:,0], Y_test , color='g')
plt.plot([0, N], [ϑ0, ϑ0 + (ϑ1 + ϑ2) * N], c='r')
#plt.savefig('sqrt-linear.pdf')
In order to improve the explained variance of our model, we extend it with polynomial features, i.e. we add features like $x_1^2$ and $x_1\cdot x_2$ etc.
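For our two features, the quadratic expansion consists of the columns $x_1$, $x_2$, $x_1^2$, $x_1 \cdot x_2$, and $x_2^2$. The following minimal sketch builds these columns by hand, purely for illustration; the PolynomialFeatures transformer used below produces the same columns for the training and test sets.
In [ ]:
# Hand-rolled quadratic feature expansion of the N x 2 matrix X (illustration only).
x1, x2 = X[:, [0]], X[:, [1]]
X_quadratic_manual = np.hstack([x1, x2, x1**2, x1*x2, x2**2])
X_quadratic_manual[:3]   # first three rows; columns: x1, x2, x1², x1·x2, x2²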
In [ ]:
from sklearn.preprocessing import PolynomialFeatures
In [ ]:
quadratic = PolynomialFeatures(2, include_bias=False)
In [ ]:
X_train_quadratic = quadratic.fit_transform(X_train)
X_test_quadratic = quadratic.transform(X_test)    # reuse the expansion fitted on the training set
quadratic.get_feature_names_out(['x1', 'x2'])     # called get_feature_names in older scikit-learn versions
In [ ]:
X_test_quadratic
Let us fit this quadratic model.
In [ ]:
M, train_score, test_score = linear_regression(X_train_quadratic, Y_train, X_test_quadratic, Y_test)
train_score, test_score
The explained variance has increased on both the training set and the test set. Let us plot the model.
In [ ]:
ϑ0 = M.intercept_
ϑ1, ϑ2, ϑ3, ϑ4, ϑ5 = M.coef_
Plotting the regression curve starts to get tedious. Since $x_2 \approx x_1$, we evaluate the model along the diagonal $x_1 = x_2 = a$; there the quadratic features $x_1^2$, $x_1 \cdot x_2$, and $x_2^2$ all coincide with $a^2$, which is why the coefficients can be grouped as follows.
In [ ]:
a = np.arange(0.0, N+1, 0.01)
b = ϑ0 + (ϑ1 + ϑ2 ) * a + (ϑ3 + ϑ4 + ϑ5) * a**2
In [ ]:
plt.figure(figsize=(15, 8))
sns.set(style='darkgrid')
plt.title('A Regression Problem: Second Order Terms included')
plt.axvline(x=0.0, c='k')
plt.axhline(y=0.0, c='k')
plt.xlabel('x axis')
plt.ylabel('y axis')
plt.xticks(np.arange(0.0, N + 1, step=1.0))
plt.yticks(np.arange(0.0, np.sqrt(N) + 1, step=0.25))
plt.scatter(X_train[:,0], Y_train, color='b')
plt.scatter(X_test [:,0], Y_test , color='g')
plt.plot(a, b, c='r')
#plt.savefig('sqrt-quadratic.pdf')
Obviously, the quadratic curve is a much better match than the linear model. Let's try to use higher order polynomials.
$\texttt{polynomial}(n)$ creates, as a string, a polynomial in the variables a and b that contains all terms of the form $\Theta[k] \cdot a^i \cdot b^j$ where $i+j \leq n$.
In [ ]:
def polynomial(n):
    sum = ' Θ[0]'                                      # start with the intercept Θ[0]
    cnt = 0
    for k in range(1, n+1):                            # k is the total degree of the terms
        for i in range(0, k+1):
            cnt += 1
            sum += f' + Θ[{cnt}] * a**{k-i} * b**{i}'
        if k < n:
            sum += '\\\n'                              # line continuation for readability
    return sum
Let's check this out for $n=4$.
In [ ]:
print(polynomial(4))
The function $\texttt{polynomial_vector}(n, M)$ takes a number $n$ and a model $M$. It returns a pair of vectors that can be used to plot the nonlinear model.
In [ ]:
def polynomial_vector(n, M):
    Θ = [M.intercept_] + list(M.coef_)   # all parameters of the model, intercept first
    a = np.reshape(X1, (N, ))            # first feature as a flat vector
    b = np.reshape(X2, (N, ))            # second feature as a flat vector
    # x-coordinates: average of the two (almost identical) features;
    # y-coordinates: the polynomial string built by polynomial(n), evaluated at a and b
    return 0.5*(a + b), eval(polynomial(n))
The function $\texttt{plot_nth_degree_polynomial}(n)$ creates a polynomial regression model of degree $n$. It plots both the data and the polynomial model.
In [ ]:
def plot_nth_degree_polynomial(n):
    poly = PolynomialFeatures(n, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly  = poly.transform(X_test)
    M, s1, s2 = linear_regression(X_train_poly, Y_train, X_test_poly, Y_test)
    print('The explained variance on the training set is:', s1)
    print('The explained variance on the test set is:', s2)
    a, b = polynomial_vector(n, M)
    plt.figure(figsize=(15, 10))
    sns.set(style='darkgrid')
    plt.title('A Regression Problem')
    plt.axvline(x=0.0, c='k')
    plt.axhline(y=0.0, c='k')
    plt.xlabel('x axis')
    plt.ylabel('y axis')
    plt.xticks(np.arange(0.0, N + 1, step=1.0))
    plt.yticks(np.arange(0.0, 2*np.sqrt(N), step=0.25))
    plt.scatter(X_train[:,0], Y_train, color='b')
    plt.scatter(X_test [:,0], Y_test , color='g')
    plt.plot(a, b, c='r')
    #plt.savefig('sqrt-' + str(n) + '.pdf')
Let us test this for the polynomial regression model of degree $4$.
In [ ]:
plot_nth_degree_polynomial(4)
This seems to be working and the explained variance has improved. Let's try to use even higher order polynomials. Hopefully, we can get $100\%$ explained variance.
In [ ]:
plot_nth_degree_polynomial(6)
It turns out that we can reach $100\%$ explained variance, but only on the training set. The explained variance on the test set has decreased, and the curve is starting to get wiggly: the model overfits the training data. One remedy is regularization. Ridge regression adds the penalty term $\alpha \cdot \|\theta\|^2$ (excluding the intercept) to the least-squares objective, which discourages large coefficients; the parameter $\alpha$ controls the strength of the regularization. The function ridge_regression defined below is the analogue of linear_regression, but uses lm.Ridge instead of lm.LinearRegression.
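For intuition, here is a minimal sketch of the closed-form ridge solution on centred data; the name ridge_closed_form is ours for illustration, while lm.Ridge used below solves the same problem and additionally recovers the unpenalized intercept. For $\alpha > 0$ this should reproduce the coefficients of lm.Ridge up to numerical precision.
In [ ]:
# Closed-form ridge regression on centred data (illustration only):
# θ = (XᵀX + α·I)⁻¹ · Xᵀy, computed via a linear solve instead of an explicit inverse.
def ridge_closed_form(X, y, alpha):
    Xc = X - X.mean(axis=0)                        # centre the features
    yc = y - y.mean()                              # centre the target
    A  = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, Xc.T @ yc)           # coefficient vector θ (without intercept)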
In [ ]:
def ridge_regression(X_train, Y_train, X_test, Y_test, alpha):
    M = lm.Ridge(alpha, solver='svd')         # ridge regression with regularization strength alpha
    M.fit(X_train, Y_train)                   # fit on the training data only
    train_score = M.score(X_train, Y_train)   # explained variance (R²) on the training set
    test_score  = M.score(X_test , Y_test)    # explained variance (R²) on the test set
    return M, train_score, test_score
In [ ]:
def plot_nth_degree_polynomial_ridge(n, alpha):
    poly = PolynomialFeatures(n, include_bias=False)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly  = poly.transform(X_test)
    M, s1, s2 = ridge_regression(X_train_poly, Y_train, X_test_poly, Y_test, alpha)
    print('The explained variance on the training set is:', s1)
    print('The explained variance on the test set is:', s2)
    a, b = polynomial_vector(n, M)
    plt.figure(figsize=(15, 10))
    sns.set(style='darkgrid')
    plt.title('A Regression Problem')
    plt.axvline(x=0.0, c='k')
    plt.axhline(y=0.0, c='k')
    plt.xlabel('x axis')
    plt.ylabel('y axis')
    plt.xticks(np.arange(0.0, N + 1, step=1.0))
    plt.yticks(np.arange(0.0, 2*np.sqrt(N), step=0.25))
    plt.scatter(X_train[:,0], Y_train, color='b')
    plt.scatter(X_test [:,0], Y_test , color='g')
    plt.plot(a, b, c='r')
    #plt.savefig('sqrt-' + str(n) + 'ridge.pdf')
Let's try to use a polynomial of degree $6$, but without regularization.
In [ ]:
plot_nth_degree_polynomial_ridge(6, 0.0)
This looks like the model that we had found before. Let us try to add a bit of regularization.
In [ ]:
plot_nth_degree_polynomial_ridge(6, 0.05)
Now the model is much smoother and the explained variance has also increased considerably on the test set.
In [ ]:
plot_nth_degree_polynomial_ridge(6, 100000)
If we increase the regularization parameter too much, the model is no longer able to fit the data and the explained variance decreases.
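To see the shrinkage effect directly, we can compare the size of the coefficient vector and the scores for several values of $\alpha$; this is a small sketch that reuses the degree-$6$ features and the ridge_regression function defined above.
In [ ]:
# Compare how strongly ridge regression shrinks the coefficients as α grows.
poly = PolynomialFeatures(6, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly  = poly.transform(X_test)
for alpha in [0.0, 0.05, 1.0, 100000]:
    M6, s1, s2 = ridge_regression(X_train_poly, Y_train, X_test_poly, Y_test, alpha)
    print(f'α = {alpha:>8}: ‖θ‖ = {np.linalg.norm(M6.coef_):12.3f}, '
          f'train R² = {s1:.3f}, test R² = {s2:.3f}')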