In [1]:
# import
import graphlab as gl
import matplotlib.pyplot as plt
import numpy as np

In [2]:
gl.canvas.set_target('ipynb')
%matplotlib inline

In [3]:
# reading data
sales = gl.SFrame('data/kc_house_data.gl/')
sales.head(4)


Out[3]:
id          date                       price     bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront
7129300520  2014-10-13 00:00:00+00:00  221900.0  3.0       1.0        1180.0       5650      1       0
6414100192  2014-12-09 00:00:00+00:00  538000.0  3.0       2.25       2570.0       7242      2       0
5631500400  2015-02-25 00:00:00+00:00  180000.0  2.0       1.0        770.0        10000     1       0
2487200875  2014-12-09 00:00:00+00:00  604000.0  4.0       3.0        1960.0       5000      1       0

view  condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat
0     3          7      1180        0              1955      0             98178    47.51123398
0     3          7      2170        400            1951      1991          98125    47.72102274
0     3          6      770         0              1933      0             98028    47.73792661
0     5          7      1050        910            1965      0             98136    47.52082

long           sqft_living15  sqft_lot15
-122.25677536  1340.0         5650.0
-122.3188624   1690.0         7639.0
-122.23319601  2720.0         8062.0
-122.39318505  1360.0         5000.0

[4 rows x 21 columns]

In [ ]:
train_data, test_data = sales.random_split(.8, seed=0)

Although we often think of multiple regression as including multiple different features (e.g. number of bedrooms, square feet, and number of bathrooms), we can also consider transformations of existing variables, e.g. the log of the square feet, or even "interaction" variables such as the product of bedrooms and bathrooms. Add 4 new variables to both your train_data and test_data:

‘bedrooms_squared’ = ‘bedrooms’*‘bedrooms’
‘bed_bath_rooms’ = ‘bedrooms’*‘bathrooms’
‘log_sqft_living’ = log(‘sqft_living’)
‘lat_plus_long’ = ‘lat’ + ‘long’
Before we continue, let's explain these new variables:

Squaring bedrooms will increase the separation between houses with few bedrooms (e.g. 1) and houses with many bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
Bedrooms times bathrooms is what's called an "interaction" variable. It is large when both values are large.
Taking the log of square feet has the effect of bringing large values closer together and spreading out small values.
Adding latitude to longitude is nonsensical, but we will do it anyway (you'll see why). A quick numeric check of the squaring and log effects appears below.
For those students not using SFrames: first download and import the provided training and testing data sets, then add the four new variables to both data sets (training and testing).
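
As a quick sanity check of the squaring and log effects described above, the cell below runs both transformations on a few hypothetical values using plain numpy (not part of the original assignment):

In [ ]:
# Hypothetical values: squaring widens the gaps between larger bedroom counts,
# while log pulls large square-footage values closer together.
bedrooms = np.array([1, 2, 3, 4])
print(bedrooms ** 2)   # [ 1  4  9 16] -- the 3->4 gap (7) dwarfs the 1->2 gap (3)

sqft = np.array([500.0, 1000.0, 4000.0, 8000.0])
print(np.log(sqft))    # roughly [6.21 6.91 8.29 8.99] -- large values compressed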

In [ ]:
# Add the four engineered features to the training set
train_data['bedrooms_squared'] = train_data['bedrooms'] * train_data['bedrooms']
train_data['bed_bath_rooms'] = train_data['bedrooms'] * train_data['bathrooms']
train_data['log_sqft_living'] = np.log(train_data['sqft_living'])
train_data['lat_plus_long'] = train_data['lat'] + train_data['long']

# ... and the same four features to the test set
test_data['bedrooms_squared'] = test_data['bedrooms'] * test_data['bedrooms']
test_data['bed_bath_rooms'] = test_data['bedrooms'] * test_data['bathrooms']
test_data['log_sqft_living'] = np.log(test_data['sqft_living'])
test_data['lat_plus_long'] = test_data['lat'] + test_data['long']

In [ ]:
train_data[:3]

What are the mean (arithmetic average) values of your 4 new variables on the TEST data? (round to 2 digits)


In [ ]:
print(np.average(test_data['bedrooms_squared']))
print(np.average(test_data['bed_bath_rooms']))
print(np.average(test_data['log_sqft_living']))
print(np.average(test_data['lat_plus_long']))
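
The prints above show full precision; since the question asks for 2 digits, a minimal variant that rounds the same averages (same data, just wrapped in round()):

In [ ]:
# Same averages as above, rounded to 2 decimal places.
for name in ['bedrooms_squared', 'bed_bath_rooms', 'log_sqft_living', 'lat_plus_long']:
    print(name, round(np.average(test_data[name]), 2))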

In [ ]:
model1_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long']
model2_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long', 'bed_bath_rooms']
model3_features = ['sqft_living', 'bedrooms', 'bathrooms', 'lat','long', 'bed_bath_rooms','bedrooms_squared', 'log_sqft_living','lat_plus_long']

In [ ]:
model1 = gl.linear_regression.create(train_data, target='price', features=model1_features, validation_set=None)
model2 = gl.linear_regression.create(train_data, target='price', features=model2_features, validation_set=None)
model3 = gl.linear_regression.create(train_data, target='price', features=model3_features, validation_set=None)

What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 1?


In [ ]:
model1.get('coefficients')  # negative

What is the sign (positive or negative) for the coefficient/weight for ‘bathrooms’ in Model 2?


In [ ]:
model2.get('coefficients')  # positive
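
If you prefer to read the signs off programmatically rather than scanning the full coefficient tables, the coefficients SFrame (which has 'name' and 'value' columns in GraphLab Create) can be filtered down to the 'bathrooms' row; a minimal sketch:

In [ ]:
# Filter each model's coefficients SFrame to the 'bathrooms' row so the
# sign can be read directly from the 'value' column.
for label, m in [('Model 1', model1), ('Model 2', model2)]:
    coeffs = m.get('coefficients')
    row = coeffs[coeffs['name'] == 'bathrooms']
    print(label, row['value'][0])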

Now, using your three estimated models, compute the RSS (Residual Sum of Squares, i.e. the sum of squared prediction errors) on the Training data.


In [ ]:
def calcRSS(model, features, output):
    # Residual Sum of Squares: RSS = sum((actual - predicted)^2)
    predict = model.predict(features)   # predicted prices
    error = output - predict            # residuals
    rss = np.sum(np.square(error))
    return rss

In [ ]:
calcRSS(model1, train_data[model1_features], train_data['price'])

In [ ]:
calcRSS(model2, train_data[model2_features], train_data['price'])

In [ ]:
calcRSS(model3, train_data[model3_features], train_data['price'])

Now, using the same three estimated models, compute the RSS on the Testing data.


In [ ]:
calcRSS(model1, test_data[model1_features], test_data['price'])

In [ ]:
calcRSS(model2, test_data[model2_features], test_data['price'])

In [ ]:
calcRSS(model3, test_data[model3_features], test_data['price'])
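
As a convenience (not part of the original assignment), the three test-set RSS values can also be printed side by side for comparison:

In [ ]:
# Print the test RSS for all three models in one loop for easier comparison.
for label, m, feats in [('Model 1', model1, model1_features),
                        ('Model 2', model2, model2_features),
                        ('Model 3', model3, model3_features)]:
    print(label, calcRSS(m, test_data[feats], test_data['price']))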
