In [7]:
# imports
import graphlab as gl
import matplotlib.pyplot as plt
import numpy as np

In [2]:
gl.canvas.set_target('ipynb')
%matplotlib inline

In [4]:
# reading data
sales = gl.SFrame('data/kc_house_data.gl/')
sales.head(4)


Out[4]:
id          date                       price     bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront
7129300520  2014-10-13 00:00:00+00:00  221900.0  3.0       1.0        1180.0       5650      1       0
6414100192  2014-12-09 00:00:00+00:00  538000.0  3.0       2.25       2570.0       7242      2       0
5631500400  2015-02-25 00:00:00+00:00  180000.0  2.0       1.0        770.0        10000     1       0
2487200875  2014-12-09 00:00:00+00:00  604000.0  4.0       3.0        1960.0       5000      1       0

view  condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat
0     3          7      1180        0              1955      0             98178    47.51123398
0     3          7      2170        400            1951      1991          98125    47.72102274
0     3          6      770         0              1933      0             98028    47.73792661
0     5          7      1050        910            1965      0             98136    47.52082

long           sqft_living15  sqft_lot15
-122.25677536  1340.0         5650.0
-122.3188624   1690.0         7639.0
-122.23319601  2720.0         8062.0
-122.39318505  1360.0         5000.0

[4 rows x 21 columns]

In [5]:
# 80/20 train/test split; the fixed seed makes the split reproducible
train_data, test_data = sales.random_split(.8, seed=0)
3. Although we often think of multiple regression as including multiple different features (e.g., number of bedrooms, square feet, and number of bathrooms), we can also consider transformations of existing variables (e.g., the log of the square feet) or even "interaction" variables such as the product of bedrooms and bathrooms. Add 4 new variables to both your train_data and test_data:

'bedrooms_squared' = 'bedrooms' * 'bedrooms'
'bed_bath_rooms'   = 'bedrooms' * 'bathrooms'
'log_sqft_living'  = log('sqft_living')
'lat_plus_long'    = 'lat' + 'long'
Before we continue, let's explain these new variables:

- Squaring bedrooms increases the separation between few bedrooms (e.g., 1) and many bedrooms (e.g., 4), since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
- Bedrooms times bathrooms is what's called an "interaction" variable. It is large when both are large.
- Taking the log of square feet brings large values closer together and spreads out small values; for example, log(1000) ≈ 6.9 while log(4000) ≈ 8.3, so a 4x difference in area becomes a much smaller difference on the log scale.
- Adding latitude to longitude is nonsensical, but we will do it anyway (you'll see why).

For those students not using SFrames: first download and import the provided training and testing data sets, then add the four new variables to both data sets (training and testing), e.g. as in the pandas sketch below.
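For reference, here is a minimal sketch of the same feature engineering in pandas. The CSV file names below are assumptions; substitute the paths of whatever files you actually downloaded.

import numpy as np
import pandas as pd

# Assumed file names; replace with the data sets you downloaded
train_df = pd.read_csv('kc_house_train_data.csv')
test_df = pd.read_csv('kc_house_test_data.csv')

# Add the same four engineered features to both data sets
for df in (train_df, test_df):
    df['bedrooms_squared'] = df['bedrooms'] * df['bedrooms']
    df['bed_bath_rooms'] = df['bedrooms'] * df['bathrooms']
    df['log_sqft_living'] = np.log(df['sqft_living'])
    df['lat_plus_long'] = df['lat'] + df['long']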

In [22]:
# Add the four engineered features to the training set
train_data['bedrooms_squared'] = train_data['bedrooms'] * train_data['bedrooms']
train_data['bed_bath_rooms'] = train_data['bedrooms'] * train_data['bathrooms']
train_data['log_sqft_living'] = np.log(train_data['sqft_living'])
train_data['lat_plus_long'] = train_data['lat'] + train_data['long']

# ... and the same four features to the test set
test_data['bedrooms_squared'] = test_data['bedrooms'] * test_data['bedrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms'] * test_data['bathrooms']
test_data['log_sqft_living'] = np.log(test_data['sqft_living'])
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']

In [23]:
train_data[:3]


Out[23]:
id          date                       price     bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront
7129300520  2014-10-13 00:00:00+00:00  221900.0  3.0       1.0        1180.0       5650      1       0
6414100192  2014-12-09 00:00:00+00:00  538000.0  3.0       2.25       2570.0       7242      2       0
5631500400  2015-02-25 00:00:00+00:00  180000.0  2.0       1.0        770.0        10000     1       0

view  condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat
0     3          7      1180        0              1955      0             98178    47.51123398
0     3          7      2170        400            1951      1991          98125    47.72102274
0     3          6      770         0              1933      0             98028    47.73792661

long           sqft_living15  sqft_lot15  bedrooms_squared  bed_bath_rooms  log_sqft_living  lat_plus_long
-122.25677536  1340.0         5650.0     9.0               3.0             7.07326971746    -74.74554138
-122.3188624   1690.0         7639.0     9.0               6.75            7.85166117789    -74.59783966
-122.23319601  2720.0         8062.0     4.0               2.0             6.64639051485    -74.4952694

[3 rows x 25 columns]

What are the mean (arithmetic average) values of your 4 new variables on TEST data? (Round to 2 decimal places.)


In [38]:
print(np.average(test_data['bedrooms_squared']))
print(np.average(test_data['bed_bath_rooms']))
print(np.average(test_data['log_sqft_living']))
print(np.average(test_data['lat_plus_long']))


12.4466777016
7.50390163159
7.55027467965
-74.6533349722
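Rounded to 2 decimal places: bedrooms_squared = 12.45, bed_bath_rooms = 7.50, log_sqft_living = 7.55, lat_plus_long = -74.65.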

In [25]:
model1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model2_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms']
model3_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms', 'bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
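Note that the feature sets are nested: model 2 adds the interaction term 'bed_bath_rooms' to model 1, and model 3 adds 'bedrooms_squared', 'log_sqft_living', and 'lat_plus_long' on top of model 2.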

In [27]:
model1 = gl.linear_regression.create(train_data, target='price', features=model1_features, validation_set=None)
model2 = gl.linear_regression.create(train_data, target='price', features=model2_features, validation_set=None)
model3 = gl.linear_regression.create(train_data, target='price', features=model3_features, validation_set=None)


Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 5
Number of unpacked features : 5
Number of coefficients    : 6
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.018516     | 4074878.213096     | 236378.596455 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 7
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.031025     | 4014170.932927     | 235190.935428 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 9
Number of unpacked features : 9
Number of coefficients    : 10
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.026524     | 3193229.177894     | 228200.043155 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.
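Note how the training RMSE reported above drops as features are added: 236,379 (model 1) to 235,191 (model 2) to 228,200 (model 3).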

What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 1?


In [29]:
model1.get('coefficients')  # positive (see the 'bathrooms' row below)


Out[29]:
name         index  value           stderr
(intercept)  None   -56140675.7444  1649985.42028
sqft_living  None   310.263325778   3.18882960408
bedrooms     None   -59577.1160682  2487.27977322
bathrooms    None   13811.8405418   3593.54213297
lat          None   629865.789485   13120.7100323
long         None   -214790.285186  13284.2851607

[6 rows x 4 columns]

What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 2?


In [30]:
model2.get('coefficients')  # negative (see the 'bathrooms' row below)


Out[30]:
name            index  value           stderr
(intercept)     None   -54410676.1152  1650405.16541
sqft_living     None   304.449298057   3.20217535637
bedrooms        None   -116366.043231  4805.54966546
bathrooms       None   -77972.3305135  7565.05991091
lat             None   625433.834953   13058.3530972
long            None   -203958.60296   13268.1283711
bed_bath_rooms  None   26961.6249092   1956.36561555

[7 rows x 4 columns]
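Note the sign flip on 'bathrooms' between model 1 (positive) and model 2 (negative). Once the interaction term 'bed_bath_rooms' is included, it absorbs much of the joint effect of bedrooms and bathrooms, and its strong correlation with the original features changes their individual coefficients. Coefficients must therefore be interpreted in the context of the full feature set.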

Now, using your three estimated models, compute the RSS (Residual Sum of Squares) on the TRAINING data.
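As a reminder, for observed prices $y_i$ and model predictions $\hat{y}_i$ over $N$ houses:

$$\mathrm{RSS} = \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$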


In [31]:
def calcRSS(model, features, output):
    # Residual Sum of Squares: sum over all examples of the squared
    # difference between the observed value and the model's prediction
    predict = model.predict(features)
    error = output - predict
    rss = np.sum(np.square(error))
    return rss

In [32]:
calcRSS(model1, train_data[model1_features], train_data['price'])


Out[32]:
971328233543667.0
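Sanity check: on the training set, RSS = N x RMSE^2. With N = 17384 examples and model 1's reported training RMSE of 236378.6, we get 17384 * 236378.6^2 ≈ 9.713e14, matching the value above.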

In [33]:
calcRSS(model2, train_data[model2_features], train_data['price'])


Out[33]:
961592067855751.87

In [34]:
calcRSS(model3, train_data[model3_features], train_data['price'])


Out[34]:
905276314555407.0
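As expected, training RSS decreases from model 1 to model 3: the feature sets are nested, and adding features can never increase the training RSS of a least-squares fit.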

Now compute the RSS on the TEST data for the same three estimated models.


In [35]:
calcRSS(model1, test_data[model1_features], test_data['price'])


Out[35]:
226568089092795.47

In [36]:
calcRSS(model2, test_data[model2_features], test_data['price'])


Out[36]:
224368799993615.25

In [37]:
calcRSS(model3, test_data[model3_features], test_data['price'])


Out[37]:
251829318951767.87
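Model 3 has the lowest RSS on the training data but the highest on the test data: the extra features (including the nonsensical 'lat_plus_long') fit noise in the training set rather than signal. Of the three, model 2 generalizes best here, which is the point the 'lat_plus_long' variable was meant to illustrate.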
