The overfitting problem

We have seen what is linear regression, how to make models and algorithms for estimating the parameters of such models, how to measure the loss.

Now we see how to assess how well the considered method should perform in predicting new data, how to select amongst possible models to choose the best performing.
This leads directly to the bias-variance tradeoff, which is fundamental to machine learning.

Load in the data

The dataset is from house sales in King County, the region where the city of Seattle, WA is located.



In [1]:

    
import pandas as pd



In [2]:

    
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float, 
              'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str, 
              'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 
              'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 
              'id':str, 'sqft_lot':int, 'view':int}



In [3]:

    
sales = pd.read_csv('../datasets/kc_house_data.csv', dtype=dtype_dict)



In [4]:

    
sales.head()









    Out[4]:






  
    
      
      id
      date
      price
      bedrooms
      bathrooms
      sqft_living
      sqft_lot
      floors
      waterfront
      view
      ...
      grade
      sqft_above
      sqft_basement
      yr_built
      yr_renovated
      zipcode
      lat
      long
      sqft_living15
      sqft_lot15
    
  
  
    
      0
      7129300520
      20141013T000000
      221900.0
      3.0
      1.00
      1180.0
      5650
      1
      0
      0
      ...
      7
      1180
      0
      1955
      0
      98178
      47.5112
      -122.257
      1340.0
      5650.0
    
    
      1
      6414100192
      20141209T000000
      538000.0
      3.0
      2.25
      2570.0
      7242
      2
      0
      0
      ...
      7
      2170
      400
      1951
      1991
      98125
      47.7210
      -122.319
      1690.0
      7639.0
    
    
      2
      5631500400
      20150225T000000
      180000.0
      2.0
      1.00
      770.0
      10000
      1
      0
      0
      ...
      6
      770
      0
      1933
      0
      98028
      47.7379
      -122.233
      2720.0
      8062.0
    
    
      3
      2487200875
      20141209T000000
      604000.0
      4.0
      3.00
      1960.0
      5000
      1
      0
      0
      ...
      7
      1050
      910
      1965
      0
      98136
      47.5208
      -122.393
      1360.0
      5000.0
    
    
      4
      1954400510
      20150218T000000
      510000.0
      3.0
      2.00
      1680.0
      8080
      1
      0
      0
      ...
      8
      1680
      0
      1987
      0
      98074
      47.6168
      -122.045
      1800.0
      7503.0
    
  

5 rows × 21 columns



In [5]:

    
sales.shape









    Out[5]:





(21613, 21)



In [6]:

    
m = sales.shape[0] # number of training examples
m









    Out[6]:





21613

The dataset contains information (21 features, including the price) related to 21613 houses.
Our target variable (i.e., what we want to predict when a new house gets on sale) is the price.

Baseline: the simplest model

Now let's compute the loss in the case of the simplest model: a fixed price equal to the average of historic prices, independently on house size, rooms, location, ...



In [7]:

    
# Let's compute the mean of the House Prices in King County 
y = sales['price'] # extract the price column



In [8]:

    
avg_price = y.mean() # this is our baseline
print ("average price: ${:.0f} ".format(avg_price))









    



average price: $540088



In [9]:

    
ExamplePrice = y[0]
ExamplePrice









    Out[9]:





221900.0

The predictions are very easy to calculate, just the baseline value:



In [10]:

    
def get_baseline_predictions():
       # Simplest version: return the baseline as predicted values
    predicted_values = avg_price 
    return predicted_values

Example:



In [11]:

    
my_house_size = 2500
estimated_price = get_baseline_predictions()
print ("The estimated price for a house with {} squared feet is {:.0f}".format(my_house_size, estimated_price))









    



The estimated price for a house with 2500 squared feet is 540088

The estimated price for the example house will still be around 540K, wile the real value is around 222K. Quite an error!

Measures of loss

There are several way of implementing the loss, I use the squared error here.

$L = [y - f(X)]^2$



In [12]:

    
import numpy as np
def get_loss(yhat, target):
    """
    Arguments:
    yhat -- vector of size m (predicted labels)
    target -- vector of size m (true labels)
    
    Returns:
    loss -- the value of the L2 loss function
    """
      # compute the residuals (since we are squaring it doesn't matter 
      #   which order you subtract)
      # np.dot will square the residuals and add them up
    loss = np.dot((target - yhat), (target - yhat))

    return(loss)

To better see the value of the cost function we use also the RMSE, the Root Mean Square Deviation.
Basically the average of the losses, rooted.



In [13]:

    
baselineCost = get_loss(get_baseline_predictions(), y)

print ("Training Error for baseline RSS: {:.0f}".format(baselineCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(baselineCost/m)))









    



Training Error for baseline RSS: 2912916761921302
Average Training Error for baseline RMSE: 367119

As you can see, it is quite high error, especially related to the average selling price.

Now, we can look at how training error behaves as model complexity increases.

Learning a better but still simple model

Using a constant value, the average, is easy but does not make too much sense. Let's create a linear model with the house size as the feature. We expect that the price is dependent on the size: bigger house, more expensive.



In [14]:

    
from sklearn import linear_model



In [15]:

    
simple_model = linear_model.LinearRegression()



In [16]:

    
simple_features = sales[['sqft_living']] # input X: the house size



In [17]:

    
simple_model.fit(simple_features, y)









    Out[17]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that we have fit the model we can extract the regression weights (coefficients) as follows:



In [18]:

    
simple_model_intercept = simple_model.intercept_
print (simple_model_intercept)









    



-43580.7430945



In [19]:

    
simple_model_weights = simple_model.coef_
print (simple_model_weights)









    



[ 280.6235679]

This means that our simple model to predict a house price y is (approximated):
$y = -43581 + 281x $
where x is the size in squared feet.
It is not anymore a horizontal line but a diagonal one, with a slope.

Making Predictions

Recall that once a model is built we can use the .predict() function to find the predicted values for data we pass. For example using the example model above:



In [20]:

    
training_predictions = simple_model.predict(simple_features)
print (training_predictions[0])









    



287555.067025

We are getting closer to the real value for the example house (recall, it's around 222K).

Compute the Training Error

Now that we can make predictions given the model, let's again compute the RSS and the RMSE.



In [21]:

    
# First get the predictions using the features subset
predictions = simple_model.predict(sales[['sqft_living']])
simpleCost = get_loss(predictions, y) 
print ("Training Error for baseline RSS: {:.0f}".format(simpleCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(simpleCost/m)))









    



Training Error for baseline RSS: 1477276362322490
Average Training Error for baseline RMSE: 261441

The simple model reduced greatly the training error.

Learning a multiple regression model

We can add more features to the model, for example the number of bedrooms and bathrooms.



In [22]:

    
more_features = sales[['sqft_living', 'bedrooms', 'bathrooms']] # input X

We can learn a multiple regression model predicting 'price' based on the above features on the data with the following code:



In [23]:

    
better_model = linear_model.LinearRegression()
better_model.fit(more_features, y)









    Out[23]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Now that we have fitted the model we can extract the regression weights (coefficients) as follows:



In [24]:

    
betterModel_intercept = better_model.intercept_
print (betterModel_intercept)









    



74847.1408013



In [25]:

    
betterModel_weights = better_model.coef_
print (betterModel_weights)









    



[   309.39239013 -57860.8943206    7932.71222266]

The better model is therefore:
$y = 74847 + 309x1 - 57861x2 + 7933x3$
Note that the equation has now three variables: the size, the bedrooms and the bathrooms.

Making Predictions

Again we can use the .predict() function to find the predicted values for data we pass. For the model above:



In [26]:

    
better_predictions = better_model.predict(more_features)
print (better_predictions[0])









    



274280.190415

Again, a little bit closer to the real value (222K)

Compute the Training Error

Now that we can make predictions given the model, let's write a function to compute the RSS of the model.



In [27]:

    
predictions = better_model.predict(more_features)
betterCost = get_loss(predictions, y) 
print ("Training Error for baseline RSS: {:.0f}".format(betterCost))
print ("Average Training Error for baseline RMSE: {:.0f}".format(np.sqrt(betterCost/m)))









    



Training Error for baseline RSS: 1436301587370045
Average Training Error for baseline RMSE: 257790

Only a slight improvement this time

Create some new features

Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, squarefeet, and # of bathrooms) but we can also consider transformations of existing features e.g. the log of the squarefeet or even "interaction" features such as the product of bedrooms and bathrooms.



In [28]:

    
from math import log

Next we create the following new features as column :

bedrooms_squared = bedrooms*bedrooms
bed_bath_rooms = bedrooms*bathrooms
log_sqft_living = log(sqft_living)
lat_plus_long = lat + long
more polynomial features: bedrooms ^ 4, bathrooms ^ 7, size ^ 3



In [29]:

    
sales['bedrooms_squared'] = sales['bedrooms'].apply(lambda x: x**2)



In [30]:

    
sales['bed_bath_rooms'] = sales['bedrooms'] * sales.bathrooms



In [31]:

    
sales['log_sqft_living'] = sales['sqft_living'].apply(lambda x: log(x))



In [32]:

    
sales['lat_plus_long'] = sales['lat'] + sales.long



In [33]:

    
sales['bedrooms_4'] = sales['bedrooms'].apply(lambda x: x**4)



In [34]:

    
sales['bathrooms_7'] = sales['bathrooms'].apply(lambda x: x**7)



In [35]:

    
sales['size_3'] = sales['sqft_living'].apply(lambda x: x**3)



In [36]:

    
sales.head()









    Out[36]:






  
    
      
      id
      date
      price
      bedrooms
      bathrooms
      sqft_living
      sqft_lot
      floors
      waterfront
      view
      ...
      long
      sqft_living15
      sqft_lot15
      bedrooms_squared
      bed_bath_rooms
      log_sqft_living
      lat_plus_long
      bedrooms_4
      bathrooms_7
      size_3
    
  
  
    
      0
      7129300520
      20141013T000000
      221900.0
      3.0
      1.00
      1180.0
      5650
      1
      0
      0
      ...
      -122.257
      1340.0
      5650.0
      9.0
      3.00
      7.073270
      -74.7458
      81.0
      1.00000
      1.643032e+09
    
    
      1
      6414100192
      20141209T000000
      538000.0
      3.0
      2.25
      2570.0
      7242
      2
      0
      0
      ...
      -122.319
      1690.0
      7639.0
      9.0
      6.75
      7.851661
      -74.5980
      81.0
      291.92926
      1.697459e+10
    
    
      2
      5631500400
      20150225T000000
      180000.0
      2.0
      1.00
      770.0
      10000
      1
      0
      0
      ...
      -122.233
      2720.0
      8062.0
      4.0
      2.00
      6.646391
      -74.4951
      16.0
      1.00000
      4.565330e+08
    
    
      3
      2487200875
      20141209T000000
      604000.0
      4.0
      3.00
      1960.0
      5000
      1
      0
      0
      ...
      -122.393
      1360.0
      5000.0
      16.0
      12.00
      7.580700
      -74.8722
      256.0
      2187.00000
      7.529536e+09
    
    
      4
      1954400510
      20150218T000000
      510000.0
      3.0
      2.00
      1680.0
      8080
      1
      0
      0
      ...
      -122.045
      1800.0
      7503.0
      9.0
      6.00
      7.426549
      -74.4282
      81.0
      128.00000
      4.741632e+09
    
  

5 rows × 28 columns

Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4) since 1^2 = 1 but 4^2 = 16. Consequently this feature will mostly affect houses with many bedrooms.
bedrooms times bathrooms gives what's called an "interaction" feature. It is large when both of them are large.
Taking the log of squarefeet has the effect of bringing large values closer together and spreading out small values.
Adding latitude to longitude is totally non-sensical but we will do it anyway (you'll see why)

Learning Multiple Models

Now we will learn the weights for five (nested) models for predicting house prices. The first model will have the fewest features, the second model will add more features and so on:



In [37]:

    
model_1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'sqft_lot', 'floors']
model_2_features = model_1_features + ['log_sqft_living', 'bedrooms_squared', 'bed_bath_rooms']
model_3_features = model_2_features + ['lat_plus_long']
model_4_features = model_3_features + ['bedrooms_4', 'bathrooms_7']
model_5_features = model_4_features + ['size_3']

Now that we have the features, we learn the weights for the five different models for predicting target = 'price' using and look at the value of the weights/coefficients:



In [38]:

    
model_1 = linear_model.LinearRegression()
model_1.fit(sales[model_1_features], y)









    Out[38]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [39]:

    
model_2 = linear_model.LinearRegression()
model_2.fit(sales[model_2_features], y)









    Out[39]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [40]:

    
model_3 = linear_model.LinearRegression()
model_3.fit(sales[model_3_features], y)









    Out[40]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [41]:

    
model_4 = linear_model.LinearRegression()
model_4.fit(sales[model_4_features], y)









    Out[41]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [42]:

    
model_5 = linear_model.LinearRegression()
model_5.fit(sales[model_5_features], y)









    Out[42]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [43]:

    
# You can examine/extract each model's coefficients, for example:
print (model_1.coef_)
print (model_2.coef_)









    



[  3.06930393e+02  -5.37365974e+04   1.92633985e+04   6.59919715e+05
  -3.15325117e+05   6.01533077e-02  -5.77002641e+03]
[  5.22290856e+02  -5.83083200e+03   9.13874631e+04   6.55377075e+05
  -2.79014318e+05  -4.38190098e-03  -1.64123192e+03  -5.43968799e+05
   7.65579830e+02  -1.49981031e+04]

Interesting: in the previous model the weight coefficient for the size lot was positive but now in the model_2 is negative.
This is an effect of adding the feature logging the size.

Comparing multiple models

Now that you've learned three models and extracted the model weights we want to evaluate which model is best.

We can use the loss function from earlier to compute the RSS on training data for each of the models.



In [44]:

    
# Compute the RSS for each of the models:
print (get_loss(model_1.predict(sales[model_1_features]), y))
print (get_loss(model_2.predict(sales[model_2_features]), y))
print (get_loss(model_3.predict(sales[model_3_features]), y))
print (get_loss(model_4.predict(sales[model_4_features]), y))
print (get_loss(model_5.predict(sales[model_5_features]), y))









    



1.19260931528e+15
1.11713142806e+15
1.11713142806e+15
1.1115656343e+15
1.1091157483e+15

model_5 has the lowest RSS on the training data.
The most complex model.

The test error

Training error decreases quite significantly with model complexity. This is quite intuitive, because the model was fit on the training points and then as we increase the model complexity, we are better able to fit the training data points.

A natural question is whether a training error is a good measure of predictive performance?

The issue is that the training error is overly optimistic and that's because the beta parameters were fit on the training data to minimise the residual sum of squares, which can often be related to the training error. So, in general, having small training error does not imply having good predictive performance.

This takes us to something called test error (or out-of-sample error): we hold out some houses from the data set and we're putting these into what's called a test set. And when we fit our models, we just fit our models on the training data set.
But then when we go to assess our performance of that model we look at these test houses in the test dataset and these are hopefully serving as a proxy of everything out there in the world.

Bottom line, the test error is a (noisy) approximation of the true error.

Split data into training and testing.

Let's see how can be applied to our example.
First we split the data into a training set and a testing set using a function from sklearn, the train_test_split(). We use a seed for reproducibility.



In [45]:

    
from sklearn.model_selection import train_test_split
train_data,test_data = train_test_split(sales, test_size=0.3, random_state=999)



In [46]:

    
train_data.head()









    Out[46]:






  
    
      
      id
      date
      price
      bedrooms
      bathrooms
      sqft_living
      sqft_lot
      floors
      waterfront
      view
      ...
      long
      sqft_living15
      sqft_lot15
      bedrooms_squared
      bed_bath_rooms
      log_sqft_living
      lat_plus_long
      bedrooms_4
      bathrooms_7
      size_3
    
  
  
    
      9318
      4420600015
      20141006T000000
      571500.0
      4.0
      2.25
      2810.0
      25990
      2
      0
      0
      ...
      -122.293
      2020.0
      16140.0
      16.0
      9.0
      7.940940
      -74.9937
      256.0
      291.929260
      2.218804e+10
    
    
      15904
      8691390980
      20140902T000000
      728000.0
      4.0
      2.50
      3290.0
      5951
      2
      0
      0
      ...
      -121.976
      3240.0
      6159.0
      16.0
      10.0
      8.098643
      -74.3761
      256.0
      610.351562
      3.561129e+10
    
    
      7599
      3326059191
      20140616T000000
      410000.0
      4.0
      2.00
      1580.0
      9581
      1
      0
      0
      ...
      -122.165
      1580.0
      10552.0
      16.0
      8.0
      7.365180
      -74.4730
      256.0
      128.000000
      3.944312e+09
    
    
      14893
      6817810190
      20140701T000000
      401000.0
      3.0
      2.00
      1240.0
      11172
      1
      0
      0
      ...
      -122.037
      1330.0
      14102.0
      9.0
      6.0
      7.122867
      -74.4006
      81.0
      128.000000
      1.906624e+09
    
    
      20508
      7625703357
      20150227T000000
      394950.0
      2.0
      2.25
      1300.0
      2104
      2
      0
      0
      ...
      -122.388
      1430.0
      1850.0
      4.0
      4.5
      7.170120
      -74.8403
      16.0
      291.929260
      2.197000e+09
    
  

5 rows × 28 columns



In [47]:

    
train_data.shape









    Out[47]:





(15129, 28)



In [48]:

    
# test_data = pd.read_csv('kc_house_test_data.csv', dtype=dtype_dict)
test_data.head()









    Out[48]:






  
    
      
      id
      date
      price
      bedrooms
      bathrooms
      sqft_living
      sqft_lot
      floors
      waterfront
      view
      ...
      long
      sqft_living15
      sqft_lot15
      bedrooms_squared
      bed_bath_rooms
      log_sqft_living
      lat_plus_long
      bedrooms_4
      bathrooms_7
      size_3
    
  
  
    
      10033
      4440400125
      20140508T000000
      228000.0
      4.0
      1.75
      2000.0
      6120
      1
      0
      0
      ...
      -122.258
      1880.0
      6120.0
      16.0
      7.00
      7.600902
      -74.7545
      256.0
      50.265076
      8.000000e+09
    
    
      20870
      3575305485
      20140829T000000
      409000.0
      3.0
      2.50
      1890.0
      6500
      2
      0
      0
      ...
      -122.058
      2340.0
      7500.0
      9.0
      7.50
      7.544332
      -74.4355
      81.0
      610.351562
      6.751269e+09
    
    
      21323
      9402800005
      20141028T000000
      1500000.0
      3.0
      3.50
      3530.0
      3610
      2
      0
      0
      ...
      -122.339
      1780.0
      3610.0
      9.0
      10.50
      8.169053
      -74.6533
      81.0
      6433.929688
      4.398698e+10
    
    
      12147
      9407001600
      20150423T000000
      305000.0
      3.0
      1.75
      1660.0
      11500
      1
      0
      0
      ...
      -121.774
      1250.0
      11000.0
      9.0
      5.25
      7.414573
      -74.3267
      81.0
      50.265076
      4.574296e+09
    
    
      15145
      8712100020
      20150127T000000
      600000.0
      2.0
      1.00
      1290.0
      4636
      1
      0
      0
      ...
      -122.301
      1940.0
      4635.0
      4.0
      2.00
      7.162397
      -74.6617
      16.0
      1.000000
      2.146689e+09
    
  

5 rows × 28 columns



In [49]:

    
test_data.shape









    Out[49]:





(6484, 28)

In this case the testing set will be the 30% (therefore the training set is 70% of the original data)



In [50]:

    
train_y = train_data.price # extract the price column
test_y = test_data.price

Retrain the models on training data only:



In [51]:

    
model_1.fit(train_data[model_1_features], train_y)









    Out[51]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [52]:

    
model_2.fit(train_data[model_2_features], train_y)









    Out[52]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [53]:

    
model_3.fit(train_data[model_3_features], train_y)









    Out[53]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [54]:

    
model_4.fit(train_data[model_4_features], train_y)









    Out[54]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [55]:

    
model_5.fit(train_data[model_5_features], train_y)









    Out[55]:





LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)



In [56]:

    
# Compute the RSS on TRAINING data for each of the models
print (get_loss(model_1.predict(train_data[model_1_features]), train_y))
print (get_loss(model_2.predict(train_data[model_2_features]), train_y))
print (get_loss(model_3.predict(train_data[model_3_features]), train_y))
print (get_loss(model_4.predict(train_data[model_4_features]), train_y))
print (get_loss(model_5.predict(train_data[model_5_features]), train_y))









    



8.60021288255e+14
7.89347967237e+14
7.89347967237e+14
7.73197162481e+14
7.50061625701e+14

Now compute the RSS on TEST data for each of the models.



In [57]:

    
# Compute the RSS on TESTING data for each of the three models and record the values:
print (get_loss(model_1.predict(test_data[model_1_features]), test_y))
print (get_loss(model_2.predict(test_data[model_2_features]), test_y))
print (get_loss(model_3.predict(test_data[model_3_features]), test_y))
print (get_loss(model_4.predict(test_data[model_4_features]), test_y))
print (get_loss(model_5.predict(test_data[model_5_features]), test_y))









    



3.33694184144e+14
3.31182631706e+14
3.31182631706e+14
3.49204952637e+14
4.05978695042e+14

The most complex model has the lowest error on the training data, but since that model has a non-sensical feature, it performs less well on the test data.

Overfitting

When you have too many features in a model and the learned hypothesis fit the training set very well but fail to generalise to new data (predict prices on new houses) then this is called the overfitting problem.

Formally, a model, let's say Model1 with some parameters beta_1, overfits if exists another model - let's call it Model2, with estimated parameters beta_2 such that the training error of Model2 is less than the training error of Model1 but on the other hand, the true general error of Model2 is greater than the true error of Model1.

From the picture above you can see that the models prone to overfit are the ones that have small training error and high complexity. Therefore one simple way to avoid overfitting is to prefer simpler models and avoid complex models with many features.



In [ ]:

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	grade	sqft_above	sqft_basement	yr_built	yr_renovated	zipcode	lat	long	sqft_living15	sqft_lot15
0	7129300520	20141013T000000	221900.0	3.0	1.00	1180.0	5650	1	...	7	1180	0	1955	0	98178	47.5112	-122.257	1340.0	5650.0
1	6414100192	20141209T000000	538000.0	3.0	2.25	2570.0	7242	2	...	7	2170	400	1951	1991	98125	47.7210	-122.319	1690.0	7639.0
2	5631500400	20150225T000000	180000.0	2.0	1.00	770.0	10000	1	...	6	770	0	1933	0	98028	47.7379	-122.233	2720.0	8062.0
3	2487200875	20141209T000000	604000.0	4.0	3.00	1960.0	5000	1	...	7	1050	910	1965	0	98136	47.5208	-122.393	1360.0	5000.0
4	1954400510	20150218T000000	510000.0	3.0	2.00	1680.0	8080	1	...	8	1680	0	1987	0	98074	47.6168	-122.045	1800.0	7503.0

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	long	sqft_living15	sqft_lot15	bedrooms_squared	bed_bath_rooms	log_sqft_living	lat_plus_long	bedrooms_4	bathrooms_7	size_3
9318	4420600015	20141006T000000	571500.0	4.0	2.25	2810.0	25990	2	...	-122.293	2020.0	16140.0	16.0	9.0	7.940940	-74.9937	256.0	291.929260	2.218804e+10
15904	8691390980	20140902T000000	728000.0	4.0	2.50	3290.0	5951	2	...	-121.976	3240.0	6159.0	16.0	10.0	8.098643	-74.3761	256.0	610.351562	3.561129e+10
7599	3326059191	20140616T000000	410000.0	4.0	2.00	1580.0	9581	1	...	-122.165	1580.0	10552.0	16.0	8.0	7.365180	-74.4730	256.0	128.000000	3.944312e+09
14893	6817810190	20140701T000000	401000.0	3.0	2.00	1240.0	11172	1	...	-122.037	1330.0	14102.0	9.0	6.0	7.122867	-74.4006	81.0	128.000000	1.906624e+09
20508	7625703357	20150227T000000	394950.0	2.0	2.25	1300.0	2104	2	...	-122.388	1430.0	1850.0	4.0	4.5	7.170120	-74.8403	16.0	291.929260	2.197000e+09

	id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	...	long	sqft_living15	sqft_lot15	bedrooms_squared	bed_bath_rooms	log_sqft_living	lat_plus_long	bedrooms_4	bathrooms_7	size_3
10033	4440400125	20140508T000000	228000.0	4.0	1.75	2000.0	6120	1	...	-122.258	1880.0	6120.0	16.0	7.00	7.600902	-74.7545	256.0	50.265076	8.000000e+09
20870	3575305485	20140829T000000	409000.0	3.0	2.50	1890.0	6500	2	...	-122.058	2340.0	7500.0	9.0	7.50	7.544332	-74.4355	81.0	610.351562	6.751269e+09
21323	9402800005	20141028T000000	1500000.0	3.0	3.50	3530.0	3610	2	...	-122.339	1780.0	3610.0	9.0	10.50	8.169053	-74.6533	81.0	6433.929688	4.398698e+10
12147	9407001600	20150423T000000	305000.0	3.0	1.75	1660.0	11500	1	...	-121.774	1250.0	11000.0	9.0	5.25	7.414573	-74.3267	81.0	50.265076	4.574296e+09
15145	8712100020	20150127T000000	600000.0	2.0	1.00	1290.0	4636	1	...	-122.301	1940.0	4635.0	4.0	2.00	7.162397	-74.6617	16.0	1.000000	2.146689e+09