We have seen what linear regression is, how to build models and algorithms for estimating their parameters, and how to measure the loss.
Now we will see how to assess how well a given method can be expected to perform in predicting new data, and how to select the best-performing model among the candidates.
This leads directly to the bias-variance tradeoff, which is fundamental to machine learning.
In [1]:
import pandas as pd
In [2]:
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,
'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,
'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str,
'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int,
'id':str, 'sqft_lot':int, 'view':int}
In [3]:
sales = pd.read_csv('../datasets/kc_house_data.csv', dtype=dtype_dict)
In [4]:
sales.head()
Out[4]:
In [5]:
sales.shape
Out[5]:
In [6]:
m = sales.shape[0] # number of training examples
m
Out[6]:
The dataset contains information (21 features, including the price) related to 21613 houses.
Our target variable (i.e., what we want to predict when a new house gets on sale) is the price.
Now let's compute the loss for the simplest possible model: a fixed price equal to the average of the historical prices, independently of house size, rooms, location, and so on.
In [7]:
# Let's compute the mean of the House Prices in King County
y = sales['price'] # extract the price column
In [8]:
avg_price = y.mean() # this is our baseline
print ("average price: ${:.0f} ".format(avg_price))
In [9]:
ExamplePrice = y[0]
ExamplePrice
Out[9]:
The predictions are very easy to calculate, since they are just the baseline value:
In [10]:
def get_baseline_predictions():
    # Simplest version: return the baseline (average price) as the predicted value
    predicted_values = avg_price
    return predicted_values
Example:
In [11]:
my_house_size = 2500
estimated_price = get_baseline_predictions()
print ("The estimated price for a house with {} squared feet is {:.0f}".format(my_house_size, estimated_price))
The estimated price for the example house is still around 540K, while the real value is around 222K. Quite an error!
There are several ways of implementing the loss; I use the squared error here, summed over all the training examples, which gives the residual sum of squares (RSS):
$L = \sum_{i=1}^{m} [y_i - f(x_i)]^2$
In [12]:
import numpy as np

def get_loss(yhat, target):
    """
    Arguments:
    yhat -- vector of size m (predicted labels)
    target -- vector of size m (true labels)

    Returns:
    loss -- the value of the L2 loss function
    """
    # compute the residuals (since we are squaring it doesn't matter
    # which order you subtract)
    # np.dot will square the residuals and add them up
    loss = np.dot((target - yhat), (target - yhat))
    return loss
To put the value of the cost function on a more interpretable scale we also use the RMSE, the Root Mean Squared Error: the square root of the average squared error.
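For m examples with true prices $y_i$ and predictions $\hat{y}_i$, this is
$RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}$
which is exactly what np.sqrt(baselineCost/m) computes in the next cell.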
In [13]:
baselineCost = get_loss(get_baseline_predictions(), y)
print ("Training Error for baseline RSS: {:.0f}".format(baselineCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(baselineCost/m)))
As you can see, the error is quite high, especially when compared with the average selling price.
Now, we can look at how training error behaves as model complexity increases.
Using a constant value, the average, is easy but does not make much sense. Let's create a linear model with the house size as the only feature. We expect the price to depend on the size: the bigger the house, the more expensive it is.
In [14]:
from sklearn import linear_model
In [15]:
simple_model = linear_model.LinearRegression()
In [16]:
simple_features = sales[['sqft_living']] # input X: the house size
In [17]:
simple_model.fit(simple_features, y)
Out[17]:
Now that we have fit the model we can extract the regression weights (coefficients) as follows:
In [18]:
simple_model_intercept = simple_model.intercept_
print (simple_model_intercept)
In [19]:
simple_model_weights = simple_model.coef_
print (simple_model_weights)
This means that our simple model to predict a house price y is (approximately):
$y = -43581 + 281x$
where x is the size in square feet.
It is no longer a horizontal line but a sloped one.
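As a sanity check, we can reproduce the model's prediction by hand from the intercept and the slope. A minimal sketch, reusing my_house_size (the hypothetical 2500 square-foot house from above):
In [ ]:
# manual prediction: intercept + slope * size
manual_price = simple_model.intercept_ + simple_model.coef_[0] * my_house_size
print ("Manual estimate: {:.0f}".format(manual_price))
# this should match what simple_model.predict returns for the same input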
In [20]:
training_predictions = simple_model.predict(simple_features)
print (training_predictions[0])
We are getting closer to the real value for the example house (recall, it's around 222K).
Now that we can make predictions given the model, let's again compute the RSS and the RMSE.
In [21]:
# First get the predictions using the features subset
predictions = simple_model.predict(sales[['sqft_living']])
simpleCost = get_loss(predictions, y)
print ("Training Error for baseline RSS: {:.0f}".format(simpleCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(simpleCost/m)))
The simple model greatly reduced the training error.
We can add more features to the model, for example the number of bedrooms and bathrooms.
In [22]:
more_features = sales[['sqft_living', 'bedrooms', 'bathrooms']] # input X
We can learn a multiple regression model predicting 'price' based on the above features on the data with the following code:
In [23]:
better_model = linear_model.LinearRegression()
better_model.fit(more_features, y)
Out[23]:
Now that we have fitted the model we can extract the regression weights (coefficients) as follows:
In [24]:
betterModel_intercept = better_model.intercept_
print (betterModel_intercept)
In [25]:
betterModel_weights = better_model.coef_
print (betterModel_weights)
The better model is therefore:
$y = 74847 + 309x_1 - 57861x_2 + 7933x_3$
Note that the equation now has three variables: the size, the number of bedrooms and the number of bathrooms.
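With more than one feature, the prediction is the intercept plus the dot product of the coefficient vector and the feature vector. A minimal sketch (the 2500 sq ft, 4-bedroom, 2-bathroom house is a made-up example):
In [ ]:
# hypothetical house: 2500 sqft, 4 bedrooms, 2 bathrooms
example_features = np.array([2500, 4, 2])
manual_price = better_model.intercept_ + np.dot(better_model.coef_, example_features)
print ("Manual estimate: {:.0f}".format(manual_price))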
In [26]:
better_predictions = better_model.predict(more_features)
print (better_predictions[0])
Again, a little closer to the real value (222K).
Now that we can make predictions given the model, let's again compute its RSS and RMSE.
In [27]:
predictions = better_model.predict(more_features)
betterCost = get_loss(predictions, y)
print ("Training Error for baseline RSS: {:.0f}".format(betterCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(betterCost/m)))
Only a slight improvement this time.
Although we often think of multiple regression as including multiple different features (e.g. number of bedrooms, square feet and number of bathrooms), we can also consider transformations of existing features, e.g. the log of the square feet, or even "interaction" features such as the product of bedrooms and bathrooms.
In [28]:
from math import log
Next we create the following new features as columns:
In [29]:
sales['bedrooms_squared'] = sales['bedrooms'].apply(lambda x: x**2)
In [30]:
sales['bed_bath_rooms'] = sales['bedrooms'] * sales.bathrooms
In [31]:
sales['log_sqft_living'] = sales['sqft_living'].apply(lambda x: log(x))
In [32]:
sales['lat_plus_long'] = sales['lat'] + sales.long
In [33]:
sales['bedrooms_4'] = sales['bedrooms'].apply(lambda x: x**4)
In [34]:
sales['bathrooms_7'] = sales['bathrooms'].apply(lambda x: x**7)
In [35]:
sales['size_3'] = sales['sqft_living'].apply(lambda x: x**3)
In [36]:
sales.head()
Out[36]:
Now we will learn the weights for five (nested) models for predicting house prices. The first model will have the fewest features, the second model will add more features and so on:
In [37]:
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'sqft_lot', 'floors']
model_2_features = model_1_features + ['log_sqft_living', 'bedrooms_squared', 'bed_bath_rooms']
model_3_features = model_2_features + ['lat_plus_long']
model_4_features = model_3_features + ['bedrooms_4', 'bathrooms_7']
model_5_features = model_4_features + ['size_3']
Now that we have the features, we learn the weights for the five different models for predicting target = 'price' and look at the values of the weights/coefficients:
In [38]:
model_1 = linear_model.LinearRegression()
model_1.fit(sales[model_1_features], y)
Out[38]:
In [39]:
model_2 = linear_model.LinearRegression()
model_2.fit(sales[model_2_features], y)
Out[39]:
In [40]:
model_3 = linear_model.LinearRegression()
model_3.fit(sales[model_3_features], y)
Out[40]:
In [41]:
model_4 = linear_model.LinearRegression()
model_4.fit(sales[model_4_features], y)
Out[41]:
In [42]:
model_5 = linear_model.LinearRegression()
model_5.fit(sales[model_5_features], y)
Out[42]:
In [43]:
# You can examine/extract each model's coefficients, for example:
print (model_1.coef_)
print (model_2.coef_)
Interesting: in the previous model the weight coefficient for the lot size was positive, but in model_2 it is negative.
This is an effect of adding the log of the size as a new feature.
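One likely explanation is collinearity: the new log feature is strongly correlated with the original size feature, so the weights among correlated features get redistributed. A quick, minimal check of the correlations (using the columns already created above):
In [ ]:
# correlation between the original size features and the new log feature
print (sales[['sqft_living', 'sqft_lot', 'log_sqft_living']].corr())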
We can use the loss function from earlier to compute the RSS on training data for each of the models.
In [44]:
# Compute the RSS for each of the models:
print (get_loss(model_1.predict(sales[model_1_features]), y))
print (get_loss(model_2.predict(sales[model_2_features]), y))
print (get_loss(model_3.predict(sales[model_3_features]), y))
print (get_loss(model_4.predict(sales[model_4_features]), y))
print (get_loss(model_5.predict(sales[model_5_features]), y))
model_5, the most complex model, has the lowest RSS on the training data.
Training error decreases quite significantly with model complexity. This is quite intuitive: the model was fit on the training points, and as we increase the model complexity we are better able to fit those points.
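To visualise this trend, here is a minimal sketch (assuming matplotlib is installed) that plots the training RSS of the five nested models against their number of features:
In [ ]:
import matplotlib.pyplot as plt

models = [model_1, model_2, model_3, model_4, model_5]
feature_sets = [model_1_features, model_2_features, model_3_features,
                model_4_features, model_5_features]

# training RSS for each nested model (still fitted on the full dataset here)
train_rss = [get_loss(mod.predict(sales[f]), y) for mod, f in zip(models, feature_sets)]

plt.plot([len(f) for f in feature_sets], train_rss, marker='o')
plt.xlabel("number of features (model complexity)")
plt.ylabel("training RSS")
plt.show()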
A natural question is whether training error is a good measure of predictive performance.
The issue is that the training error is overly optimistic: the beta parameters were fit on the training data precisely to minimise the residual sum of squares, which is essentially the training error. So, in general, a small training error does not imply good predictive performance.
This takes us to the test error (or out-of-sample error): we hold out some houses from the dataset and put them into what is called a test set.
When we fit our models, we fit them on the training set only.
But when we assess the performance of a model, we look at the houses in the test set, which hopefully serve as a proxy for everything out there in the world.
Bottom line: the test error is a (noisy) approximation of the true error.
In [45]:
from sklearn.model_selection import train_test_split
train_data,test_data = train_test_split(sales, test_size=0.3, random_state=999)
In [46]:
train_data.head()
Out[46]:
In [47]:
train_data.shape
Out[47]:
In [48]:
# test_data = pd.read_csv('kc_house_test_data.csv', dtype=dtype_dict)
test_data.head()
Out[48]:
In [49]:
test_data.shape
Out[49]:
In this case the test set is 30% of the data (therefore the training set is the remaining 70% of the original data).
In [50]:
train_y = train_data.price # extract the price column
test_y = test_data.price
Retrain the models on training data only:
In [51]:
model_1.fit(train_data[model_1_features], train_y)
Out[51]:
In [52]:
model_2.fit(train_data[model_2_features], train_y)
Out[52]:
In [53]:
model_3.fit(train_data[model_3_features], train_y)
Out[53]:
In [54]:
model_4.fit(train_data[model_4_features], train_y)
Out[54]:
In [55]:
model_5.fit(train_data[model_5_features], train_y)
Out[55]:
In [56]:
# Compute the RSS on TRAINING data for each of the models
print (get_loss(model_1.predict(train_data[model_1_features]), train_y))
print (get_loss(model_2.predict(train_data[model_2_features]), train_y))
print (get_loss(model_3.predict(train_data[model_3_features]), train_y))
print (get_loss(model_4.predict(train_data[model_4_features]), train_y))
print (get_loss(model_5.predict(train_data[model_5_features]), train_y))
Now compute the RSS on TEST data for each of the models.
In [57]:
# Compute the RSS on TESTING data for each of the five models and record the values:
print (get_loss(model_1.predict(test_data[model_1_features]), test_y))
print (get_loss(model_2.predict(test_data[model_2_features]), test_y))
print (get_loss(model_3.predict(test_data[model_3_features]), test_y))
print (get_loss(model_4.predict(test_data[model_4_features]), test_y))
print (get_loss(model_5.predict(test_data[model_5_features]), test_y))
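To make the comparison easier to read, here is a minimal sketch that prints the RMSE on both the training and the test data for each of the five models, reusing the objects defined above:
In [ ]:
models = [model_1, model_2, model_3, model_4, model_5]
feature_sets = [model_1_features, model_2_features, model_3_features,
                model_4_features, model_5_features]

for i, (mod, feats) in enumerate(zip(models, feature_sets), start=1):
    train_rmse = np.sqrt(get_loss(mod.predict(train_data[feats]), train_y) / len(train_data))
    test_rmse = np.sqrt(get_loss(mod.predict(test_data[feats]), test_y) / len(test_data))
    print ("model_{}: train RMSE = {:.0f}, test RMSE = {:.0f}".format(i, train_rmse, test_rmse))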
The most complex model has the lowest error on the training data, but since that model includes nonsensical features, it performs less well on the test data.
When a model has too many features and the learned hypothesis fits the training set very well but fails to generalise to new data (e.g. predicting prices for new houses), this is called the overfitting problem.
Formally, a model, let's say Model1 with estimated parameters beta_1, overfits if there exists another model, let's call it Model2 with estimated parameters beta_2, such that the training error of Model1 is less than the training error of Model2 but, on the other hand, the true (generalisation) error of Model1 is greater than the true error of Model2.
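In symbols, writing $err_{train}$ and $err_{true}$ for the training and true errors, Model1 (with parameters $\beta_1$) overfits if
$\exists \, \beta_2 : \; err_{train}(\beta_1) < err_{train}(\beta_2) \;\; \text{and} \;\; err_{true}(\beta_1) > err_{true}(\beta_2)$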
As seen above, the models prone to overfitting are the ones that combine a small training error with high complexity. Therefore one simple way to avoid overfitting is to prefer simpler models and to avoid complex models with many features.