The overfitting problem

We have seen what is linear regression, how to make models and algorithms for estimating the parameters of such models, how to measure the loss.

Now we see how to assess how well the considered method should perform in predicting new data, how to select amongst possible models to choose the best performing.
This leads directly to the bias-variance tradeoff, which is fundamental to machine learning.

Load in the data

The dataset is from house sales in King County, the region where the city of Seattle, WA is located.


In [1]:
import pandas as pd

In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 
              'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 
              'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 
              'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 
              'id':str, 'sqft_lot':int, 'view':int}

In [3]:
sales = pd.read_csv('../datasets/kc_house_data.csv', dtype=dtype_dict)

In [4]:
sales.head()


Out[4]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3.0 1.00 1180.0 5650 1 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340.0 5650.0
1 6414100192 20141209T000000 538000.0 3.0 2.25 2570.0 7242 2 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690.0 7639.0
2 5631500400 20150225T000000 180000.0 2.0 1.00 770.0 10000 1 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720.0 8062.0
3 2487200875 20141209T000000 604000.0 4.0 3.00 1960.0 5000 1 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360.0 5000.0
4 1954400510 20150218T000000 510000.0 3.0 2.00 1680.0 8080 1 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800.0 7503.0

5 rows × 21 columns


In [5]:
sales.shape


Out[5]:
(21613, 21)

In [6]:
m = sales.shape[0] # number of training examples
m


Out[6]:
21613

The dataset contains information (21 features, including the price) related to 21613 houses.
Our target variable (i.e., what we want to predict when a new house gets on sale) is the price.

Baseline: the simplest model

Now let's compute the loss in the case of the simplest model: a fixed price equal to the average of historic prices, independently on house size, rooms, location, ...


In [7]:
# Let's compute the mean of the House Prices in King County 
y = sales['price'] # extract the price column

In [8]:
avg_price = y.mean() # this is our baseline
print ("average price: ${:.0f} ".format(avg_price))


average price: $540088 

In [9]:
ExamplePrice = y[0]
ExamplePrice


Out[9]:
221900.0

The predictions are very easy to calculate, just the baseline value:


In [10]:
def get_baseline_predictions():
       # Simplest version: return the baseline as predicted values
    predicted_values = avg_price 
    return predicted_values

Example:


In [11]:
my_house_size = 2500
estimated_price = get_baseline_predictions()
print ("The estimated price for a house with {} squared feet is {:.0f}".format(my_house_size, estimated_price))


The estimated price for a house with 2500 squared feet is 540088

The estimated price for the example house will still be around 540K, wile the real value is around 222K. Quite an error!

Measures of loss

There are several way of implementing the loss, I use the squared error here.

$L = [y - f(X)]^2$


In [12]:
import numpy as np
def get_loss(yhat, target):
    """
    Arguments:
    yhat -- vector of size m (predicted labels)
    target -- vector of size m (true labels)
    
    Returns:
    loss -- the value of the L2 loss function
    """
      # compute the residuals (since we are squaring it doesn't matter 
      #   which order you subtract)
      # np.dot will square the residuals and add them up
    loss = np.dot((target - yhat), (target - yhat))

    return(loss)

To better see the value of the cost function we use also the RMSE, the Root Mean Square Deviation.
Basically the average of the losses, rooted.


In [13]:
baselineCost = get_loss(get_baseline_predictions(), y)

print ("Training Error for baseline RSS: {:.0f}".format(baselineCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(baselineCost/m)))


Training Error for baseline RSS: 2912916761921302
Average Training Error for baseline RMSE: 367119

As you can see, it is quite high error, especially related to the average selling price.

Now, we can look at how training error behaves as model complexity increases.

Learning a better but still simple model

Using a constant value, the average, is easy but does not make too much sense. Let's create a linear model with the house size as the feature. We expect that the price is dependent on the size: bigger house, more expensive.


In [14]:
from sklearn import linear_model

In [15]:
simple_model = linear_model.LinearRegression()

In [16]:
simple_features = sales[['sqft_living']] # input X: the house size

In [17]:
simple_model.fit(simple_features, y)


Out[17]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that we have fit the model we can extract the regression weights (coefficients) as follows:


In [18]:
simple_model_intercept = simple_model.intercept_
print (simple_model_intercept)


-43580.7430945

In [19]:
simple_model_weights = simple_model.coef_
print (simple_model_weights)


[ 280.6235679]

This means that our simple model to predict a house price y is (approximated):
$y = -43581 + 281x $
where x is the size in squared feet.
It is not anymore a horizontal line but a diagonal one, with a slope.

Making Predictions

Recall that once a model is built we can use the .predict() function to find the predicted values for data we pass. For example using the example model above:


In [20]:
training_predictions = simple_model.predict(simple_features)
print (training_predictions[0])


287555.067025

We are getting closer to the real value for the example house (recall, it's around 222K).

Compute the Training Error

Now that we can make predictions given the model, let's again compute the RSS and the RMSE.


In [21]:
# First get the predictions using the features subset
predictions = simple_model.predict(sales[['sqft_living']])
simpleCost = get_loss(predictions, y) 
print ("Training Error for baseline RSS: {:.0f}".format(simpleCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(simpleCost/m)))


Training Error for baseline RSS: 1477276362322490
Average Training Error for baseline RMSE: 261441

The simple model reduced greatly the training error.

Learning a multiple regression model

We can add more features to the model, for example the number of bedrooms and bathrooms.


In [22]:
more_features = sales[['sqft_living', 'bedrooms', 'bathrooms']] # input X

We can learn a multiple regression model predicting 'price' based on the above features on the data with the following code:


In [23]:
better_model = linear_model.LinearRegression()
better_model.fit(more_features, y)


Out[23]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that we have fitted the model we can extract the regression weights (coefficients) as follows:


In [24]:
betterModel_intercept = better_model.intercept_
print (betterModel_intercept)


74847.1408013

In [25]:
betterModel_weights = better_model.coef_
print (betterModel_weights)


[   309.39239013 -57860.8943206    7932.71222266]

The better model is therefore:
$y = 74847 + 309x1 - 57861x2 + 7933x3$
Note that the equation has now three variables: the size, the bedrooms and the bathrooms.

Making Predictions

Again we can use the .predict() function to find the predicted values for data we pass. For the model above:


In [26]:
better_predictions = better_model.predict(more_features)
print (better_predictions[0])


274280.190415

Again, a little bit closer to the real value (222K)

Compute the Training Error

Now that we can make predictions given the model, let's write a function to compute the RSS of the model.


In [27]:
predictions = better_model.predict(more_features)
betterCost = get_loss(predictions, y) 
print ("Training Error for baseline RSS: {:.0f}".format(betterCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(betterCost/m)))


Training Error for baseline RSS: 1436301587370045
Average Training Error for baseline RMSE: 257790

Only a slight improvement this time

Create some new features

Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms) but we can also consider transformations of existing features e.g. the log of the squarefeet or even "interaction" features such as the product of bedrooms and bathrooms.


In [28]:
from math import log

Next we create the following new features as column :

  • bedrooms_squared = bedrooms*bedrooms
  • bed_bath_rooms = bedrooms*bathrooms
  • log_sqft_living = log(sqft_living)
  • lat_plus_long = lat + long
  • more polynomial features: bedrooms ^ 4, bathrooms ^ 7, size ^ 3

In [29]:
sales['bedrooms_squared'] = sales['bedrooms'].apply(lambda x: x**2)

In [30]:
sales['bed_bath_rooms'] = sales['bedrooms'] * sales.bathrooms

In [31]:
sales['log_sqft_living'] = sales['sqft_living'].apply(lambda x: log(x))

In [32]:
sales['lat_plus_long'] = sales['lat'] + sales.long

In [33]:
sales['bedrooms_4'] = sales['bedrooms'].apply(lambda x: x**4)

In [34]:
sales['bathrooms_7'] = sales['bathrooms'].apply(lambda x: x**7)

In [35]:
sales['size_3'] = sales['sqft_living'].apply(lambda x: x**3)

In [36]:
sales.head()


Out[36]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... long sqft_living15 sqft_lot15 bedrooms_squared bed_bath_rooms log_sqft_living lat_plus_long bedrooms_4 bathrooms_7 size_3
0 7129300520 20141013T000000 221900.0 3.0 1.00 1180.0 5650 1 0 0 ... -122.257 1340.0 5650.0 9.0 3.00 7.073270 -74.7458 81.0 1.00000 1.643032e+09
1 6414100192 20141209T000000 538000.0 3.0 2.25 2570.0 7242 2 0 0 ... -122.319 1690.0 7639.0 9.0 6.75 7.851661 -74.5980 81.0 291.92926 1.697459e+10
2 5631500400 20150225T000000 180000.0 2.0 1.00 770.0 10000 1 0 0 ... -122.233 2720.0 8062.0 4.0 2.00 6.646391 -74.4951 16.0 1.00000 4.565330e+08
3 2487200875 20141209T000000 604000.0 4.0 3.00 1960.0 5000 1 0 0 ... -122.393 1360.0 5000.0 16.0 12.00 7.580700 -74.8722 256.0 2187.00000 7.529536e+09
4 1954400510 20150218T000000 510000.0 3.0 2.00 1680.0 8080 1 0 0 ... -122.045 1800.0 7503.0 9.0 6.00 7.426549 -74.4282 81.0 128.00000 4.741632e+09

5 rows × 28 columns

  • Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.
  • bedrooms times bathrooms gives what's called an "interaction" feature. It is large when both of them are large.
  • Taking the log of squarefeet has the effect of bringing large values closer together and spreading out small values.
  • Adding latitude to longitude is totally non-sensical but we will do it anyway (you'll see why)

Learning Multiple Models

Now we will learn the weights for five (nested) models for predicting house prices. The first model will have the fewest features, the second model will add more features and so on:


In [37]:
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'sqft_lot', 'floors']
model_2_features = model_1_features + ['log_sqft_living', 'bedrooms_squared', 'bed_bath_rooms']
model_3_features = model_2_features + ['lat_plus_long']
model_4_features = model_3_features + ['bedrooms_4', 'bathrooms_7']
model_5_features = model_4_features + ['size_3']

Now that we have the features, we learn the weights for the five different models for predicting target = 'price' using and look at the value of the weights/coefficients:


In [38]:
model_1 = linear_model.LinearRegression()
model_1.fit(sales[model_1_features], y)


Out[38]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [39]:
model_2 = linear_model.LinearRegression()
model_2.fit(sales[model_2_features], y)


Out[39]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [40]:
model_3 = linear_model.LinearRegression()
model_3.fit(sales[model_3_features], y)


Out[40]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [41]:
model_4 = linear_model.LinearRegression()
model_4.fit(sales[model_4_features], y)


Out[41]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [42]:
model_5 = linear_model.LinearRegression()
model_5.fit(sales[model_5_features], y)


Out[42]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [43]:
# You can examine/extract each model's coefficients, for example:
print (model_1.coef_)
print (model_2.coef_)


[  3.06930393e+02  -5.37365974e+04   1.92633985e+04   6.59919715e+05
  -3.15325117e+05   6.01533077e-02  -5.77002641e+03]
[  5.22290856e+02  -5.83083200e+03   9.13874631e+04   6.55377075e+05
  -2.79014318e+05  -4.38190098e-03  -1.64123192e+03  -5.43968799e+05
   7.65579830e+02  -1.49981031e+04]

Interesting: in the previous model the weight coefficient for the size lot was positive but now in the model_2 is negative.
This is an effect of adding the feature logging the size.

Comparing multiple models

Now that you've learned three models and extracted the model weights we want to evaluate which model is best.

We can use the loss function from earlier to compute the RSS on training data for each of the models.


In [44]:
# Compute the RSS for each of the models:
print (get_loss(model_1.predict(sales[model_1_features]), y))
print (get_loss(model_2.predict(sales[model_2_features]), y))
print (get_loss(model_3.predict(sales[model_3_features]), y))
print (get_loss(model_4.predict(sales[model_4_features]), y))
print (get_loss(model_5.predict(sales[model_5_features]), y))


1.19260931528e+15
1.11713142806e+15
1.11713142806e+15
1.1115656343e+15
1.1091157483e+15

model_5 has the lowest RSS on the training data.
The most complex model.

The test error

Training error decreases quite significantly with model complexity. This is quite intuitive, because the model was fit on the training points and then as we increase the model complexity, we are better able to fit the training data points.

A natural question is whether a training error is a good measure of predictive performance?

The issue is that the training error is overly optimistic and that's because the beta parameters were fit on the training data to minimise the residual sum of squares, which can often be related to the training error. So, in general, having small training error does not imply having good predictive performance.

This takes us to something called test error (or out-of-sample error): we hold out some houses from the data set and we're putting these into what's called a test set. And when we fit our models, we just fit our models on the training data set.
But then when we go to assess our performance of that model we look at these test houses in the test dataset and these are hopefully serving as a proxy of everything out there in the world.

Bottom line, the test error is a (noisy) approximation of the true error.

Split data into training and testing.

Let's see how can be applied to our example.
First we split the data into a training set and a testing set using a function from sklearn, the train_test_split(). We use a seed for reproducibility.


In [45]:
from sklearn.model_selection import train_test_split
train_data,test_data = train_test_split(sales, test_size=0.3, random_state=999)

In [46]:
train_data.head()


Out[46]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... long sqft_living15 sqft_lot15 bedrooms_squared bed_bath_rooms log_sqft_living lat_plus_long bedrooms_4 bathrooms_7 size_3
9318 4420600015 20141006T000000 571500.0 4.0 2.25 2810.0 25990 2 0 0 ... -122.293 2020.0 16140.0 16.0 9.0 7.940940 -74.9937 256.0 291.929260 2.218804e+10
15904 8691390980 20140902T000000 728000.0 4.0 2.50 3290.0 5951 2 0 0 ... -121.976 3240.0 6159.0 16.0 10.0 8.098643 -74.3761 256.0 610.351562 3.561129e+10
7599 3326059191 20140616T000000 410000.0 4.0 2.00 1580.0 9581 1 0 0 ... -122.165 1580.0 10552.0 16.0 8.0 7.365180 -74.4730 256.0 128.000000 3.944312e+09
14893 6817810190 20140701T000000 401000.0 3.0 2.00 1240.0 11172 1 0 0 ... -122.037 1330.0 14102.0 9.0 6.0 7.122867 -74.4006 81.0 128.000000 1.906624e+09
20508 7625703357 20150227T000000 394950.0 2.0 2.25 1300.0 2104 2 0 0 ... -122.388 1430.0 1850.0 4.0 4.5 7.170120 -74.8403 16.0 291.929260 2.197000e+09

5 rows × 28 columns


In [47]:
train_data.shape


Out[47]:
(15129, 28)

In [48]:
# test_data = pd.read_csv('kc_house_test_data.csv', dtype=dtype_dict)
test_data.head()


Out[48]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... long sqft_living15 sqft_lot15 bedrooms_squared bed_bath_rooms log_sqft_living lat_plus_long bedrooms_4 bathrooms_7 size_3
10033 4440400125 20140508T000000 228000.0 4.0 1.75 2000.0 6120 1 0 0 ... -122.258 1880.0 6120.0 16.0 7.00 7.600902 -74.7545 256.0 50.265076 8.000000e+09
20870 3575305485 20140829T000000 409000.0 3.0 2.50 1890.0 6500 2 0 0 ... -122.058 2340.0 7500.0 9.0 7.50 7.544332 -74.4355 81.0 610.351562 6.751269e+09
21323 9402800005 20141028T000000 1500000.0 3.0 3.50 3530.0 3610 2 0 0 ... -122.339 1780.0 3610.0 9.0 10.50 8.169053 -74.6533 81.0 6433.929688 4.398698e+10
12147 9407001600 20150423T000000 305000.0 3.0 1.75 1660.0 11500 1 0 0 ... -121.774 1250.0 11000.0 9.0 5.25 7.414573 -74.3267 81.0 50.265076 4.574296e+09
15145 8712100020 20150127T000000 600000.0 2.0 1.00 1290.0 4636 1 0 0 ... -122.301 1940.0 4635.0 4.0 2.00 7.162397 -74.6617 16.0 1.000000 2.146689e+09

5 rows × 28 columns


In [49]:
test_data.shape


Out[49]:
(6484, 28)

In this case the testing set will be the 30% (therefore the training set is 70% of the original data)


In [50]:
train_y = train_data.price # extract the price column
test_y = test_data.price

Retrain the models on training data only:


In [51]:
model_1.fit(train_data[model_1_features], train_y)


Out[51]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [52]:
model_2.fit(train_data[model_2_features], train_y)


Out[52]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [53]:
model_3.fit(train_data[model_3_features], train_y)


Out[53]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [54]:
model_4.fit(train_data[model_4_features], train_y)


Out[54]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [55]:
model_5.fit(train_data[model_5_features], train_y)


Out[55]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [56]:
# Compute the RSS on TRAINING data for each of the models
print (get_loss(model_1.predict(train_data[model_1_features]), train_y))
print (get_loss(model_2.predict(train_data[model_2_features]), train_y))
print (get_loss(model_3.predict(train_data[model_3_features]), train_y))
print (get_loss(model_4.predict(train_data[model_4_features]), train_y))
print (get_loss(model_5.predict(train_data[model_5_features]), train_y))


8.60021288255e+14
7.89347967237e+14
7.89347967237e+14
7.73197162481e+14
7.50061625701e+14

Now compute the RSS on TEST data for each of the models.


In [57]:
# Compute the RSS on TESTING data for each of the three models and record the values:
print (get_loss(model_1.predict(test_data[model_1_features]), test_y))
print (get_loss(model_2.predict(test_data[model_2_features]), test_y))
print (get_loss(model_3.predict(test_data[model_3_features]), test_y))
print (get_loss(model_4.predict(test_data[model_4_features]), test_y))
print (get_loss(model_5.predict(test_data[model_5_features]), test_y))


3.33694184144e+14
3.31182631706e+14
3.31182631706e+14
3.49204952637e+14
4.05978695042e+14

The most complex model has the lowest error on the training data, but since that model has a non-sensical feature, it performs less well on the test data.

Overfitting

When you have too many features in a model and the learned hypothesis fit the training set very well but fail to generalise to new data (predict prices on new houses) then this is called the overfitting problem.

Formally, a model, let's say Model1 with some parameters beta_1, overfits if exists another model - let's call it Model2, with estimated parameters beta_2 such that the training error of Model2 is less than the training error of Model1 but on the other hand, the true general error of Model2 is greater than the true error of Model1.

From the picture above you can see that the models prone to overfit are the ones that have small training error and high complexity. Therefore one simple way to avoid overfitting is to prefer simpler models and avoid complex models with many features.


In [ ]: