In [1]:
# import
import graphlab as gl
import matplotlib.pyplot as plt
import numpy as np

In [2]:
gl.canvas.set_target('ipynb')
%matplotlib inline

In [3]:
# reading data
sales = gl.SFrame('data/kc_house_data.gl/')
sales.head(4)


Out[3]:
id          date                       price     bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront
7129300520  2014-10-13 00:00:00+00:00  221900.0  3.0       1.0        1180.0       5650      1       0
6414100192  2014-12-09 00:00:00+00:00  538000.0  3.0       2.25       2570.0       7242      2       0
5631500400  2015-02-25 00:00:00+00:00  180000.0  2.0       1.0        770.0        10000     1       0
2487200875  2014-12-09 00:00:00+00:00  604000.0  4.0       3.0        1960.0       5000      1       0

view  condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat
0     3          7      1180        0              1955      0             98178    47.51123398
0     3          7      2170        400            1951      1991          98125    47.72102274
0     3          6      770         0              1933      0             98028    47.73792661
0     5          7      1050        910            1965      0             98136    47.52082

long           sqft_living15  sqft_lot15
-122.25677536  1340.0         5650.0
-122.3188624   1690.0         7639.0
-122.23319601  2720.0         8062.0
-122.39318505  1360.0         5000.0

[4 rows x 21 columns]

In [ ]:
train_data, test_data = sales.random_split(.8, seed=0)

Although we often think of multiple regression as including multiple different features (e.g. number of bedrooms, square feet, and number of bathrooms), we can also consider transformations of existing variables, e.g. the log of the square feet, or even "interaction" variables such as the product of bedrooms and bathrooms. Add 4 new variables to both your train_data and test_data:

‘bedrooms_squared’ = ‘bedrooms’*‘bedrooms’
‘bed_bath_rooms’ = ‘bedrooms’*‘bathrooms’
‘log_sqft_living’ = log(‘sqft_living’)
‘lat_plus_long’ = ‘lat’ + ‘long’
Before we continue, let's explain these new variables:

Squaring bedrooms will increase the separation between houses with few bedrooms (e.g. 1) and houses with many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
Bedrooms times bathrooms is what's called an "interaction" variable. It is large when both values are large.
Taking the log of square feet has the effect of bringing large values closer together and spreading out small values.
Adding latitude to longitude is nonsensical, but we will do it anyway (you'll see why). A quick numeric check of the squaring and log effects appears below.
For those students not using SFrames: first download and import the provided training and testing data sets, then add the four new variables to both data sets (training and testing).
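
As a quick sanity check of the squaring and log effects described above, the cell below runs both transformations on a few hypothetical values using plain numpy (not part of the original assignment):

In [ ]:
# Hypothetical values: squaring widens the gaps between larger bedroom counts,
# while log pulls large square-footage values closer together.
bedrooms = np.array([1, 2, 3, 4])
print(bedrooms ** 2)   # [ 1  4  9 16] -- the 3->4 gap (7) dwarfs the 1->2 gap (3)

sqft = np.array([500.0, 1000.0, 4000.0, 8000.0])
print(np.log(sqft))    # roughly [6.21 6.91 8.29 8.99] -- large values compressed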

In [ ]:
# Add the four engineered features to the training set
train_data['bedrooms_squared'] = train_data['bedrooms'] * train_data['bedrooms']
train_data['bed_bath_rooms'] = train_data['bedrooms'] * train_data['bathrooms']
train_data['log_sqft_living'] = np.log(train_data['sqft_living'])
train_data['lat_plus_long'] = train_data['lat'] + train_data['long']

# ... and the same four features to the test set
test_data['bedrooms_squared'] = test_data['bedrooms'] * test_data['bedrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms'] * test_data['bathrooms']
test_data['log_sqft_living'] = np.log(test_data['sqft_living'])
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']

In [ ]:
train_data[:3]

What are the mean (arithmetic average) values of your 4 new variables on the TEST data? (round to 2 digits)


In [ ]:
print(np.average(test_data['bedrooms_squared']))
print(np.average(test_data['bed_bath_rooms']))
print(np.average(test_data['log_sqft_living']))
print(np.average(test_data['lat_plus_long']))
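
The prints above show full precision; since the question asks for 2 digits, a minimal variant that rounds the same averages (same data, just wrapped in round()):

In [ ]:
# Same averages as above, rounded to 2 decimal places.
for name in ['bedrooms_squared', 'bed_bath_rooms', 'log_sqft_living', 'lat_plus_long']:
    print(name, round(np.average(test_data[name]), 2))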

In [ ]:
model1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long']
model2_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long', 'bed_bath_rooms']
model3_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long', 'bed_bath_rooms','bedrooms_squared', 'log_sqft_living','lat_plus_long']

In [ ]:
model1 = gl.linear_regression.create(train_data, target='price', features=model1_features, validation_set=None)
model2 = gl.linear_regression.create(train_data, target='price', features=model2_features, validation_set=None)
model3 = gl.linear_regression.create(train_data, target='price', features=model3_features, validation_set=None)

What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 1?


In [ ]:
model1.get('coefficients')  # negative

What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 2?


In [ ]:
model2.get('coefficients')  # positive
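
If you prefer to read the signs off programmatically rather than scanning the full coefficient tables, the coefficients SFrame (which has 'name' and 'value' columns in GraphLab Create) can be filtered down to the 'bathrooms' row; a minimal sketch:

In [ ]:
# Filter each model's coefficients SFrame to the 'bathrooms' row so the
# sign can be read directly from the 'value' column.
for label, m in [('Model 1', model1), ('Model 2', model2)]:
    coeffs = m.get('coefficients')
    row = coeffs[coeffs['name'] == 'bathrooms']
    print(label, row['value'][0])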

Now, using your three estimated models, compute the RSS (Residual Sum of Squares, i.e. the sum of squared prediction errors) on the Training data.


In [ ]:
def calcRSS(model, features, output):
    # Residual Sum of Squares: RSS = sum((actual - predicted)^2)
    predict = model.predict(features)   # predicted prices
    error = output - predict            # residuals
    rss = np.sum(np.square(error))
    return rss

In [ ]:
calcRSS(model1, train_data[model1_features], train_data['price'])

In [ ]:
calcRSS(model2, train_data[model2_features], train_data['price'])

In [ ]:
calcRSS(model3, train_data[model3_features], train_data['price'])

Now, using the same three estimated models, compute the RSS on the Testing data.


In [ ]:
calcRSS(model1, test_data[model1_features], test_data['price'])

In [ ]:
calcRSS(model2, test_data[model2_features], test_data['price'])

In [ ]:
calcRSS(model3, test_data[model3_features], test_data['price'])
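
As a convenience (not part of the original assignment), the three test-set RSS values can also be printed side by side for comparison:

In [ ]:
# Print the test RSS for all three models in one loop for easier comparison.
for label, m, feats in [('Model 1', model1, model1_features),
                        ('Model 2', model2, model2_features),
                        ('Model 3', model3, model3_features)]:
    print(label, calcRSS(m, test_data[feats], test_data['price']))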
