Regression Week 1: Simple Linear Regression

In this notebook, we will use data on house sales in King County to predict house prices using simple (one input) linear regression. You will:

Write a function to compute the Simple Linear Regression weights using the closed form solution
Write a function to make predictions of the output given the input feature
Turn the regression around to predict the input given the output
Compare two different models for predicting house prices

Importing Libraries



In [98]:

    
import os
import zipfile
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
%matplotlib inline

Unzipping files with house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.



In [99]:

    
# Put the files in the current directory into a list
files_list = [f for f in os.listdir('.') if os.path.isfile(f)]



In [100]:

    
# Filenames of unzipped files
unzip_files = ['kc_house_train_data.csv','kc_house_test_data.csv', 'kc_house_data.csv']



In [101]:

    
# If an unzipped file is not in files_list, extract it from its .zip archive
for filename in unzip_files:
    if filename not in files_list:
        zip_file = filename + '.zip'
        # The with-block closes the archive once extraction finishes
        with zipfile.ZipFile(zip_file) as unzipping:
            unzipping.extractall()

Loading Sales data, Sales Training data, and Sales Test data



In [102]:

    
# Dictionary with the correct dtypes for the DataFrame columns
dtype_dict = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 
              'sqft_living15':float, 'grade':int, 'yr_renovated':int, 
              'price':float, 'bedrooms':float, 'zipcode':str, 'long':float, 
              'sqft_lot15':float, 'sqft_living':float, 'floors':str, 
              'condition':int, 'lat':float, 'date':str, 'sqft_basement':int, 
              'yr_built':int, 'id':str, 'sqft_lot':int, 'view':int}



In [103]:

    
sales = pd.read_csv('kc_house_data.csv', dtype = dtype_dict)
train_data = pd.read_csv('kc_house_train_data.csv', dtype = dtype_dict)
test_data = pd.read_csv('kc_house_test_data.csv', dtype = dtype_dict)



In [104]:

    
# Looking at head of training data DataFrame
train_data.head()









    Out[104]:

           id             date   price  bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode      lat      long  sqft_living15  sqft_lot15
0  7129300520  20141013T000000  221900         3       1.00         1180      5650       1           0     0  ...      7        1180              0      1955             0    98178  47.5112  -122.257           1340        5650
1  6414100192  20141209T000000  538000         3       2.25         2570      7242       2           0     0  ...      7        2170            400      1951          1991    98125  47.7210  -122.319           1690        7639
2  5631500400  20150225T000000  180000         2       1.00          770     10000       1           0     0  ...      6         770              0      1933             0    98028  47.7379  -122.233           2720        8062
3  2487200875  20141209T000000  604000         4       3.00         1960      5000       1           0     0  ...      7        1050            910      1965             0    98136  47.5208  -122.393           1360        5000
4  1954400510  20150218T000000  510000         3       2.00         1680      8080       1           0     0  ...      8        1680              0      1987             0    98074  47.6168  -122.045           1800        7503

5 rows × 21 columns

Build a generic simple linear regression function

We can use the closed form solution found from lecture to compute the slope and intercept for a simple linear regression on observations stored as numpy arrays: input_feature, output.
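
For reference, the closed-form estimates can be written out as follows. This is a sketch of the standard least-squares result; the symbols a (intercept), b (slope), x_i (input), y_i (output), and N (number of observations) are notation assumed here for illustration.

```latex
\hat{b} = \frac{\sum_{i=1}^{N} x_i y_i - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)\left(\sum_{i=1}^{N} y_i\right)}
               {\sum_{i=1}^{N} x_i^2 - \frac{1}{N}\left(\sum_{i=1}^{N} x_i\right)^2},
\qquad
\hat{a} = \frac{1}{N}\sum_{i=1}^{N} y_i - \hat{b}\,\frac{1}{N}\sum_{i=1}^{N} x_i
```

The slope is computed first because the intercept depends on it.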

Complete the following function to compute the simple linear regression slope and intercept:



In [105]:

    
def simple_linear_regression(input_feature, output):

    # Computing sums needed to calculate slope and intercept
    xi_sum = sum(input_feature)
    yi_sum = sum(output)
    yi_xi_sum = sum(input_feature*output)
    xi_squared_sum = sum(input_feature*input_feature)
    N = float(len(input_feature))

    # Values for slope and intercept
    slope = (yi_xi_sum - (xi_sum*yi_sum)/N)/(xi_squared_sum - (xi_sum*xi_sum)/N)
    intercept = yi_sum/N - slope*(xi_sum/N)

    return (intercept, slope)

We can test that our function works by passing it data for which we know the answer. In particular, we can generate a feature and then put the output exactly on a line, output = 1 + 1*input_feature, so that both the slope and the intercept should be 1.



In [106]:

    
test_feature = np.arange(5)
test_output = 1.0 + 1.0*np.arange(5)
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)

print("Intercept: " + str(test_intercept))
print("Slope: " + str(test_slope))









    



Intercept: 1.0
Slope: 1.0
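
As a further sanity check, the closed-form estimates can be compared against NumPy's least-squares polynomial fit on noisy data. The following is a minimal sketch and not part of the original assignment; the synthetic data, the random seed, and the names `closed_form_fit`, `x`, and `y` are assumptions made for illustration.

```python
import numpy as np

def closed_form_fit(input_feature, output):
    # Same closed-form formulas as simple_linear_regression above
    n = float(len(input_feature))
    x_sum = input_feature.sum()
    y_sum = output.sum()
    xy_sum = (input_feature * output).sum()
    xx_sum = (input_feature * input_feature).sum()
    slope = (xy_sum - x_sum * y_sum / n) / (xx_sum - x_sum * x_sum / n)
    intercept = y_sum / n - slope * (x_sum / n)
    return intercept, slope

# Hypothetical synthetic data: a line with intercept 3 and slope 2, plus noise
rng = np.random.RandomState(0)
x = rng.uniform(0.0, 10.0, size=200)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, size=200)

intercept, slope = closed_form_fit(x, y)

# np.polyfit returns coefficients from the highest degree to the lowest
polyfit_slope, polyfit_intercept = np.polyfit(x, y, 1)

print("closed form: intercept=%.6f slope=%.6f" % (intercept, slope))
print("np.polyfit : intercept=%.6f slope=%.6f" % (polyfit_intercept, polyfit_slope))
```

The two pairs of estimates should agree up to floating-point error, since both minimize the same sum of squared residuals.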

Now that we know it works, let's build a regression model for predicting price based on sqft_living. Remember that we train on train_data!



In [107]:

    
(sqft_intercept, sqft_slope) = simple_linear_regression(train_data['sqft_living'].values, train_data['price'].values)

print("Intercept: " + str(sqft_intercept))
print("Slope: " + str(sqft_slope))









    



Intercept: -47116.0790729
Slope: 281.95883963

Creating model with Square Feet feature for Visualization



In [108]:

    
# Grid of living areas spanning the plotted x-range (0 to 14000 sqft)
sqft_model = np.arange(0.0, 14000.0, 1.0, dtype=float)
house_price_sqft_model = sqft_intercept + sqft_slope*sqft_model



In [109]:

    
plt.figure(figsize=(8,6))
plt.plot(sales['sqft_living'],sales['price'],'.', label= 'House Price Data')
plt.plot(sqft_model, house_price_sqft_model, '-' , label= 'Linear Regression Model')
plt.legend(loc='upper left', fontsize=16)
plt.xlabel('Living Area (ft^2)', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.title('Simple Linear Regression with Living Area Feature', fontsize=16)
plt.axis([0.0, 14000.0, 0.0, 8000000.0])
plt.show()

Predicting Values

Now that we have the model parameters (intercept and slope), we can make predictions. Complete the following function to return the predicted output given the input_feature, slope and intercept:



In [110]:

    
def get_regression_predictions(input_feature, intercept, slope):
    
    predicted_values = intercept + slope*input_feature

    return predicted_values

Now that we can calculate a prediction given the slope and intercept, let's make one. Use the following to find the estimated price for a house with 2650 square feet according to the square feet model we estimated above.

Quiz Question: Using the slope and intercept of the square feet model estimated above, what is the predicted price for a house with 2650 sqft?



In [111]:

    
my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print("The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price))









    



The estimated price for a house with 2650 squarefeet is $700074.85
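
Because the prediction is plain arithmetic, the same function also accepts a NumPy array and returns one prediction per element. The sketch below restates the function and hard-codes the intercept and slope printed earlier in this notebook; the living areas in `example_sqft` are hypothetical values chosen for illustration.

```python
import numpy as np

# Coefficients as printed earlier in this notebook for the sqft_living model
sqft_intercept = -47116.0790729
sqft_slope = 281.95883963

def get_regression_predictions(input_feature, intercept, slope):
    # Works for a scalar or a NumPy array thanks to broadcasting
    return intercept + slope * input_feature

# Hypothetical living areas (square feet) chosen for illustration
example_sqft = np.array([1000.0, 2650.0, 4000.0])
example_prices = get_regression_predictions(example_sqft, sqft_intercept, sqft_slope)

for sqft, price in zip(example_sqft, example_prices):
    print("%d sqft -> $%.2f" % (sqft, price))
```

The middle entry reproduces the 2650 sqft prediction shown above.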

Residual Sum of Squares

Now that we have a model and can make predictions, let's evaluate our model using the Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals, and a residual is just a fancy word for the difference between the true output and the predicted output.
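
Written as a formula, with a for the intercept, b for the slope, and (x_i, y_i) for the i-th of N observations (notation assumed here for illustration), the definition above reads:

```latex
\mathrm{RSS}(a, b) = \sum_{i=1}^{N} \bigl(y_i - (a + b\,x_i)\bigr)^2
```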

Complete the following function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:



In [112]:

    
def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    
    # Vector of residuals for each observation i
    residual_vect = output - (intercept + slope*input_feature)

    # Squaring the residuals and adding them up
    RSS = sum(residual_vect*residual_vect)

    return RSS

Let's test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!



In [113]:

    
print(get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope))  # should be 0.0

0.0
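
A zero result alone does not show that non-zero residuals are squared and summed correctly. The following minimal sketch, which restates the function and uses hypothetical data, places each point exactly 1 above or below the line y = 1 + x, so the RSS evaluated at intercept 1 and slope 1 can be worked out by hand.

```python
import numpy as np

def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    residuals = output - (intercept + slope * input_feature)
    return (residuals * residuals).sum()

# Hypothetical data: points on the line y = 1 + x, shifted up or down by 1
feature = np.arange(4, dtype=float)
offsets = np.array([1.0, -1.0, 1.0, -1.0])
output = 1.0 + 1.0 * feature + offsets

# Each residual is +1 or -1, so the RSS is 4 * 1**2 = 4.0
rss = get_residual_sum_of_squares(feature, output, 1.0, 1.0)
print(rss)
```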

Now use your function to calculate the RSS on the training data for the square feet model calculated above.

Quiz Question: According to this function and the slope and intercept from the square feet model, what is the RSS for the simple linear regression using square feet to predict prices on TRAINING data?



In [114]:

    
rss_prices_on_sqft = get_residual_sum_of_squares(train_data['sqft_living'].values, train_data['price'].values, sqft_intercept, sqft_slope)
print('The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft))









    



The RSS of predicting Prices based on Square Feet is : 1.20191835418e+15

Predict the square feet given price

What if we want to predict the square feet given the price? Since we have the equation y = a + b*x, we can solve it for x: x = (y - a)/b. So if we have the intercept (a), the slope (b), and the price (y), we can solve for the estimated square feet (x).

Complete the following function to compute the inverse regression estimate, i.e. predict the input_feature given the output!



In [115]:

    
def inverse_regression_predictions(output, intercept, slope):
    
    estimated_feature = (output - intercept)/float(slope)

    return estimated_feature

Now that we have a function to compute the square feet given the price from our simple regression model, let's see how big we might expect a house that costs $800,000 to be.

Quiz Question: According to this function and the slope and intercept of the square feet model estimated above, what is the estimated square feet for a house costing $800,000?



In [116]:

    
my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print("The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet))









    



The estimated squarefeet for a house worth $800000.00 is 3004
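
One way to check the inverse function is a round trip: feeding the estimated square feet back into the forward prediction should recover the original price. The sketch below restates both functions and hard-codes the intercept and slope printed earlier in this notebook; the variable names `price`, `estimated_sqft`, and `recovered_price` are chosen here for illustration.

```python
# Coefficients as printed earlier in this notebook for the sqft_living model
sqft_intercept = -47116.0790729
sqft_slope = 281.95883963

def get_regression_predictions(input_feature, intercept, slope):
    return intercept + slope * input_feature

def inverse_regression_predictions(output, intercept, slope):
    return (output - intercept) / float(slope)

price = 800000.0
estimated_sqft = inverse_regression_predictions(price, sqft_intercept, sqft_slope)

# Forward prediction on the estimate should give back the price we started from
recovered_price = get_regression_predictions(estimated_sqft, sqft_intercept, sqft_slope)

print("estimated sqft: %.2f" % estimated_sqft)
print("recovered price: %.2f" % recovered_price)
```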

New Model: estimate prices from bedrooms

We have made one model for predicting house prices using square feet, but there are many other features in the sales DataFrame. Use your simple linear regression function to estimate the regression parameters for predicting price based on the number of bedrooms. Use the training data!



In [117]:

    
# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
(bedrm_intercept, bedrm_slope) = simple_linear_regression(train_data['bedrooms'].values, train_data['price'].values)

print("Intercept: " + str(bedrm_intercept))
print("Slope: " + str(bedrm_slope))









    



Intercept: 109473.177623
Slope: 127588.952934

Creating model with Bedrooms feature for Visualization



In [118]:

    
bedrooms_model = np.arange(0.0,35.0+0.1,0.1, dtype=float)
house_price_bedrooms_model = bedrm_intercept + bedrm_slope*bedrooms_model



In [119]:

    
plt.figure(figsize=(8,6))
plt.plot(sales['bedrooms'],sales['price'],'.', label= 'House Price Data')
plt.plot(bedrooms_model, house_price_bedrooms_model, '-' , label= 'Linear Regression Model')
plt.legend(loc='upper right', fontsize=16)
plt.xlabel('# Bedrooms', fontsize=16)
plt.ylabel('House Price ($)', fontsize=16)
plt.title('Simple Linear Regression with # Bedrooms Feature', fontsize=16)
plt.show()

Test your Linear Regression Algorithm

Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember that this data wasn't involved in learning the model). Compute the RSS from predicting prices using bedrooms and from predicting prices using square feet.

Quiz Question: Which model (square feet or bedrooms) has the lowest RSS on TEST data? Think about why this might be the case.



In [120]:

    
# Compute RSS when using bedrooms on TEST data:
rss_prices_on_test_bedrm = get_residual_sum_of_squares(test_data['bedrooms'].values, test_data['price'].values, bedrm_intercept, bedrm_slope)
print('The RSS of predicting Prices based Test Data on Bedrooms is : ' + str(rss_prices_on_test_bedrm))









    



The RSS of predicting Prices based Test Data on Bedrooms is : 4.9336458596e+14



In [121]:

    
# Compute RSS when using square feet on TEST data:
rss_prices_on_test_sqft = get_residual_sum_of_squares(test_data['sqft_living'].values, test_data['price'].values, sqft_intercept, sqft_slope)
print('The RSS of predicting Prices based Test Data on Square Feet is : ' + str(rss_prices_on_test_sqft))









    



The RSS of predicting Prices based Test Data on Square Feet is : 2.75402933618e+14

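The RSS on the test set is smaller for the square feet model (about 2.75e+14) than for the bedrooms model (about 4.93e+14), a ratio of roughly 1.8. By this measure, square feet is a better predictor of house price than the number of bedrooms.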


