In [7]:
# import
import graphlab as gl
import matplotlib.pyplot as plt
import numpy as np
In [2]:
gl.canvas.set_target('ipynb')
%matplotlib inline
In [4]:
# reading data
sales = gl.SFrame('data/kc_house_data.gl/')
sales.head(4)
Out[4]:
In [5]:
train_data,test_data = sales.random_split(.8,seed=0)
3. Although we often think of multiple regression as including multiple different features (e.g. # of bedrooms, square feet, and # of bathrooms), we can also consider transformations of existing variables, e.g. the log of the square feet, or even "interaction" variables such as the product of bedrooms and bathrooms. Add 4 new variables to both your train_data and test_data:
‘bedrooms_squared’ = ‘bedrooms’*‘bedrooms’
‘bed_bath_rooms’ = ‘bedrooms’*‘bathrooms’
‘log_sqft_living’ = log(‘sqft_living’)
‘lat_plus_long’ = ‘lat’ + ‘long’
Before we continue let’s explain these new variables:
Squaring bedrooms will increase the separation between few bedrooms (e.g. 1) and many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently, this variable will mostly affect houses with many bedrooms.
Bedrooms times bathrooms is what's called an "interaction" variable. It is large when both of them are large.
Taking the log of square feet has the effect of bringing large values closer together and spreading out small values.
Adding latitude to longitude is nonsensical, but we will do it anyway (you'll see why).
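To make the effect of these transformations concrete, here is a small sketch in plain Python. The bedroom counts, bathroom counts, and square footages are made-up values chosen for illustration, not rows from the house sales data.

```python
import math

# Hypothetical values, chosen only to illustrate the transformations.
bedrooms = [1, 2, 4]
sqft_living = [500, 1000, 4000, 8000]

# Squaring widens the gap between few and many bedrooms.
bedrooms_squared = [b * b for b in bedrooms]
gap_raw = bedrooms[-1] - bedrooms[0]
gap_squared = bedrooms_squared[-1] - bedrooms_squared[0]

# The interaction term is large only when both factors are large.
houses = [(1, 1.0), (4, 1.0), (4, 3.0)]  # (bedrooms, bathrooms)
bed_bath_rooms = [beds * baths for beds, baths in houses]

# The log compresses large values: doubling 500 -> 1000 and doubling
# 4000 -> 8000 add the same amount, log(2), on the log scale.
log_sqft = [math.log(s) for s in sqft_living]
small_step = log_sqft[1] - log_sqft[0]
large_step = log_sqft[3] - log_sqft[2]

print(gap_raw, gap_squared)
print(bed_bath_rooms)
print(small_step, large_step)
```

On the raw scale the two doublings differ by 500 and 4000 square feet, while on the log scale they are the same size, which is what "bringing large values closer together" means in practice.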
For those students not using SFrames: first download and import the training and testing data sets provided, then add the four new variables to both data sets (training and testing).
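One way to avoid writing each transformation twice is a small helper applied to both data sets. The sketch below uses plain Python dictionaries as stand-ins for rows; `add_features` and the sample rows are hypothetical and only mirror the column names used in this assignment.

```python
import math

def add_features(row):
    """Return a copy of row with the four derived variables added."""
    new_row = dict(row)
    new_row['bedrooms_squared'] = row['bedrooms'] * row['bedrooms']
    new_row['bed_bath_rooms'] = row['bedrooms'] * row['bathrooms']
    new_row['log_sqft_living'] = math.log(row['sqft_living'])
    new_row['lat_plus_long'] = row['lat'] + row['long']
    return new_row

# Made-up rows standing in for the training and testing sets.
train_rows = [{'bedrooms': 3, 'bathrooms': 2.0, 'sqft_living': 1800,
               'lat': 47.5, 'long': -122.25}]
test_rows = [{'bedrooms': 4, 'bathrooms': 2.5, 'sqft_living': 2600,
              'lat': 47.75, 'long': -122.0}]

# Apply the same helper to both sets so they stay consistent.
train_rows = [add_features(r) for r in train_rows]
test_rows = [add_features(r) for r in test_rows]

print(train_rows[0]['bedrooms_squared'], test_rows[0]['bed_bath_rooms'])
```

Keeping the feature definitions in one function means the training and testing sets cannot drift apart if a transformation is later changed.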
In [22]:
train_data['bedrooms_squared'] = train_data['bedrooms']*train_data['bedrooms']
train_data['bed_bath_rooms'] = train_data['bedrooms']*train_data['bathrooms']
train_data['log_sqft_living'] = np.log(train_data['sqft_living'])
train_data['lat_plus_long'] = train_data['lat']+train_data['long']
test_data['bedrooms_squared'] = test_data['bedrooms']*test_data['bedrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms']*test_data['bathrooms']
test_data['log_sqft_living'] = np.log(test_data['sqft_living'])
test_data['lat_plus_long'] = test_data['lat']+test_data['long']
In [23]:
train_data[:3]
Out[23]:
What are the mean (arithmetic average) values of your 4 new variables on the TEST data? (Round to 2 digits.)
In [38]:
print(np.average(test_data['bedrooms_squared']))
print(np.average(test_data['bed_bath_rooms']))
print(np.average(test_data['log_sqft_living']))
print(np.average(test_data['lat_plus_long']))
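The question asks for values rounded to 2 digits, which the cell above leaves to the reader. A minimal sketch of the rounding step, using made-up numbers rather than the actual test column:

```python
from statistics import mean

# Hypothetical column values, not taken from the test set.
values = [4, 9, 9]

# Arithmetic mean, then rounded to two decimal places.
rounded_mean = round(mean(values), 2)
print(rounded_mean)
```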
In [25]:
model1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long']
model2_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long', 'bed_bath_rooms']
model3_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long', 'bed_bath_rooms','bedrooms_squared', 'log_sqft_living','lat_plus_long']
In [27]:
model1 = gl.linear_regression.create(train_data, target='price', features=model1_features, validation_set = None)
model2 = gl.linear_regression.create(train_data, target='price', features=model2_features, validation_set = None)
model3 = gl.linear_regression.create(train_data, target='price', features=model3_features, validation_set = None)
What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 1?
In [29]:
model1.get('coefficients') # negative
Out[29]:
What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 2?
In [30]:
model2.get('coefficients') # positive
Out[30]:
Now, using your three estimated models, compute the RSS (Residual Sum of Squares) on the Training data.
In [31]:
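def calcRSS(model, features, output):
    # predictions of the model for the given feature columns
    predict = model.predict(features)
    # residuals: observed output minus prediction
    error = output - predict
    # sum of squared residuals
    rss = np.sum(np.square(error))
    return rss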
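As a sanity check on the definition, RSS is the sum over all observations of the squared difference between the observed output and the prediction. The following sketch computes it by hand on made-up numbers; `rss` is a hypothetical stand-alone helper in plain Python, not part of GraphLab Create.

```python
def rss(observed, predicted):
    """Residual sum of squares: sum of (observed - predicted) squared."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))

# Made-up prices and predictions, for illustration only.
observed = [300000.0, 450000.0, 520000.0]
predicted = [310000.0, 440000.0, 500000.0]

# Residuals are -10000, 10000 and 20000.
total = rss(observed, predicted)
print(total)
```

Because the residuals are squared, the single 20000 error contributes four times as much to the total as each 10000 error.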
In [32]:
calcRSS(model1, train_data[model1_features], train_data['price'])
Out[32]:
In [33]:
calcRSS(model2, train_data[model2_features], train_data['price'])
Out[33]:
In [34]:
calcRSS(model3, train_data[model3_features], train_data['price'])
Out[34]:
Now, using your three estimated models, compute the RSS on the Testing data.
In [35]:
calcRSS(model1, test_data[model1_features], test_data['price'])
Out[35]:
In [36]:
calcRSS(model2, test_data[model2_features], test_data['price'])
Out[36]:
In [37]:
calcRSS(model3, test_data[model3_features], test_data['price'])
Out[37]: