In [7]:
# imports
import graphlab as gl
import matplotlib.pyplot as plt
import numpy as np

In [2]:
gl.canvas.set_target('ipynb')
%matplotlib inline

In [4]:
# reading data
sales = gl.SFrame('data/kc_house_data.gl/')
sales.head(4)


Out[4]:
id          date                       price     bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront
7129300520  2014-10-13 00:00:00+00:00  221900.0  3.0       1.0        1180.0       5650      1       0
6414100192  2014-12-09 00:00:00+00:00  538000.0  3.0       2.25       2570.0       7242      2       0
5631500400  2015-02-25 00:00:00+00:00  180000.0  2.0       1.0        770.0        10000     1       0
2487200875  2014-12-09 00:00:00+00:00  604000.0  4.0       3.0        1960.0       5000      1       0

view  condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat
0     3          7      1180        0              1955      0             98178    47.51123398
0     3          7      2170        400            1951      1991          98125    47.72102274
0     3          6      770         0              1933      0             98028    47.73792661
0     5          7      1050        910            1965      0             98136    47.52082

long           sqft_living15  sqft_lot15
-122.25677536  1340.0         5650.0
-122.3188624   1690.0         7639.0
-122.23319601  2720.0         8062.0
-122.39318505  1360.0         5000.0

[4 rows x 21 columns]

In [5]:
# 80/20 train/test split; the fixed seed makes the split reproducible
train_data, test_data = sales.random_split(.8, seed=0)
3. Although we often think of multiple regression as including multiple different features (e.g., number of bedrooms, square feet, and number of bathrooms), we can also consider transformations of existing variables (e.g., the log of the square feet) or even "interaction" variables such as the product of bedrooms and bathrooms. Add 4 new variables to both your train_data and test_data:

'bedrooms_squared' = 'bedrooms' * 'bedrooms'
'bed_bath_rooms'   = 'bedrooms' * 'bathrooms'
'log_sqft_living'  = log('sqft_living')
'lat_plus_long'    = 'lat' + 'long'
Before we continue, let's explain these new variables:

- Squaring bedrooms increases the separation between few bedrooms (e.g., 1) and many bedrooms (e.g., 4), since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
- Bedrooms times bathrooms is what's called an "interaction" variable. It is large when both are large.
- Taking the log of square feet brings large values closer together and spreads out small values; for example, log(1000) ≈ 6.9 while log(4000) ≈ 8.3, so a 4x difference in area becomes a much smaller difference on the log scale.
- Adding latitude to longitude is nonsensical, but we will do it anyway (you'll see why).

For those students not using SFrames: first download and import the provided training and testing data sets, then add the four new variables to both data sets (training and testing), e.g. as in the pandas sketch below.
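For reference, here is a minimal sketch of the same feature engineering in pandas. The CSV file names below are assumptions; substitute the paths of whatever files you actually downloaded.

import numpy as np
import pandas as pd

# Assumed file names; replace with the data sets you downloaded
train_df = pd.read_csv('kc_house_train_data.csv')
test_df = pd.read_csv('kc_house_test_data.csv')

# Add the same four engineered features to both data sets
for df in (train_df, test_df):
    df['bedrooms_squared'] = df['bedrooms'] * df['bedrooms']
    df['bed_bath_rooms'] = df['bedrooms'] * df['bathrooms']
    df['log_sqft_living'] = np.log(df['sqft_living'])
    df['lat_plus_long'] = df['lat'] + df['long']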

In [22]:
# Add the four engineered features to the training set
train_data['bedrooms_squared'] = train_data['bedrooms'] * train_data['bedrooms']
train_data['bed_bath_rooms'] = train_data['bedrooms'] * train_data['bathrooms']
train_data['log_sqft_living'] = np.log(train_data['sqft_living'])
train_data['lat_plus_long'] = train_data['lat'] + train_data['long']

# ... and the same four features to the test set
test_data['bedrooms_squared'] = test_data['bedrooms'] * test_data['bedrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms'] * test_data['bathrooms']
test_data['log_sqft_living'] = np.log(test_data['sqft_living'])
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']

In [23]:
train_data[:3]


Out[23]:
id          date                       price     bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront
7129300520  2014-10-13 00:00:00+00:00  221900.0  3.0       1.0        1180.0       5650      1       0
6414100192  2014-12-09 00:00:00+00:00  538000.0  3.0       2.25       2570.0       7242      2       0
5631500400  2015-02-25 00:00:00+00:00  180000.0  2.0       1.0        770.0        10000     1       0

view  condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat
0     3          7      1180        0              1955      0             98178    47.51123398
0     3          7      2170        400            1951      1991          98125    47.72102274
0     3          6      770         0              1933      0             98028    47.73792661

long           sqft_living15  sqft_lot15  bedrooms_squared  bed_bath_rooms  log_sqft_living  lat_plus_long
-122.25677536  1340.0         5650.0     9.0               3.0             7.07326971746    -74.74554138
-122.3188624   1690.0         7639.0     9.0               6.75            7.85166117789    -74.59783966
-122.23319601  2720.0         8062.0     4.0               2.0             6.64639051485    -74.4952694

[3 rows x 25 columns]

What are the mean (arithmetic average) values of your 4 new variables on TEST data? (Round to 2 decimal places.)


In [38]:
print(np.average(test_data['bedrooms_squared']))
print(np.average(test_data['bed_bath_rooms']))
print(np.average(test_data['log_sqft_living']))
print(np.average(test_data['lat_plus_long']))


12.4466777016
7.50390163159
7.55027467965
-74.6533349722
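Rounded to 2 decimal places: bedrooms_squared = 12.45, bed_bath_rooms = 7.50, log_sqft_living = 7.55, lat_plus_long = -74.65.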

In [25]:
model1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long']
model2_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms']
model3_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat', 'long', 'bed_bath_rooms', 'bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
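Note that the feature sets are nested: model 2 adds the interaction term 'bed_bath_rooms' to model 1, and model 3 adds 'bedrooms_squared', 'log_sqft_living', and 'lat_plus_long' on top of model 2.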

In [27]:
model1 = gl.linear_regression.create(train_data, target='price', features=model1_features, validation_set=None)
model2 = gl.linear_regression.create(train_data, target='price', features=model2_features, validation_set=None)
model3 = gl.linear_regression.create(train_data, target='price', features=model3_features, validation_set=None)


Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 5
Number of unpacked features : 5
Number of coefficients    : 6
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.018516     | 4074878.213096     | 236378.596455 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 7
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.031025     | 4014170.932927     | 235190.935428 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 9
Number of unpacked features : 9
Number of coefficients    : 10
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.026524     | 3193229.177894     | 228200.043155 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.
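Note how the training RMSE reported above drops as features are added: 236,379 (model 1) to 235,191 (model 2) to 228,200 (model 3).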

What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 1?


In [29]:
model1.get('coefficients')  # positive (see the 'bathrooms' row below)


Out[29]:
name         index  value           stderr
(intercept)  None   -56140675.7444  1649985.42028
sqft_living  None   310.263325778   3.18882960408
bedrooms     None   -59577.1160682  2487.27977322
bathrooms    None   13811.8405418   3593.54213297
lat          None   629865.789485   13120.7100323
long         None   -214790.285186  13284.2851607

[6 rows x 4 columns]

What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 2?


In [30]:
model2.get('coefficients')  # negative (see the 'bathrooms' row below)


Out[30]:
name            index  value           stderr
(intercept)     None   -54410676.1152  1650405.16541
sqft_living     None   304.449298057   3.20217535637
bedrooms        None   -116366.043231  4805.54966546
bathrooms       None   -77972.3305135  7565.05991091
lat             None   625433.834953   13058.3530972
long            None   -203958.60296   13268.1283711
bed_bath_rooms  None   26961.6249092   1956.36561555

[7 rows x 4 columns]
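Note the sign flip on 'bathrooms' between model 1 (positive) and model 2 (negative). Once the interaction term 'bed_bath_rooms' is included, it absorbs much of the joint effect of bedrooms and bathrooms, and its strong correlation with the original features changes their individual coefficients. Coefficients must therefore be interpreted in the context of the full feature set.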

Now, using your three estimated models, compute the RSS (Residual Sum of Squares) on the TRAINING data.
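As a reminder, for observed prices $y_i$ and model predictions $\hat{y}_i$ over $N$ houses:

$$\mathrm{RSS} = \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2$$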


In [31]:
def calcRSS(model, features, output):
    # Residual Sum of Squares: sum over all examples of the squared
    # difference between the observed value and the model's prediction
    predict = model.predict(features)
    error = output - predict
    rss = np.sum(np.square(error))
    return rss

In [32]:
calcRSS(model1, train_data[model1_features], train_data['price'])


Out[32]:
971328233543667.0
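Sanity check: on the training set, RSS = N x RMSE^2. With N = 17384 examples and model 1's reported training RMSE of 236378.6, we get 17384 * 236378.6^2 ≈ 9.713e14, matching the value above.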

In [33]:
calcRSS(model2, train_data[model2_features], train_data['price'])


Out[33]:
961592067855751.87

In [34]:
calcRSS(model3, train_data[model3_features], train_data['price'])


Out[34]:
905276314555407.0
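As expected, training RSS decreases from model 1 to model 3: the feature sets are nested, and adding features can never increase the training RSS of a least-squares fit.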

Now compute the RSS on the TEST data for the same three estimated models.


In [35]:
calcRSS(model1, test_data[model1_features], test_data['price'])


Out[35]:
226568089092795.47

In [36]:
calcRSS(model2, test_data[model2_features], test_data['price'])


Out[36]:
224368799993615.25

In [37]:
calcRSS(model3, test_data[model3_features], test_data['price'])


Out[37]:
251829318951767.87
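Model 3 has the lowest RSS on the training data but the highest on the test data: the extra features (including the nonsensical 'lat_plus_long') fit noise in the training set rather than signal. Of the three, model 2 generalizes best here, which is the point the 'lat_plus_long' variable was meant to illustrate.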
