Load GraphLab Create



In [1]:

    
import graphlab

Basic settings



In [2]:

    
#limit number of worker processes to 4
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1474249782.log

This non-commercial license of GraphLab Create for academic use is assigned to sudhanshu.shekhar.iitd@gmail.com and will expire on September 18, 2017.



In [3]:

    
#set canvas to open inline
graphlab.canvas.set_target('ipynb')

Load House Sales Data



In [4]:

    
sales = graphlab.SFrame('home_data.gl/')

Assignment begins

1. Selection and summary statistics

In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price. Save this result to answer the quiz at the end.



In [5]:

    
highest_avg_price_zipcode = '98039'



In [7]:

    
sales_zipcode = sales[sales['zipcode'] == highest_avg_price_zipcode]



In [10]:

    
avg_price_highest_zipcode = sales_zipcode['price'].mean()



In [11]:

    
print avg_price_highest_zipcode
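
The zip code '98039' used above is carried over from the module notebook. As a rough sketch of how such a value could be found and then averaged, the plain-Python example below groups a few rows by zip code, picks the zip code with the highest mean price, and reports that mean. The rows and prices are made up for illustration, and no GraphLab call is involved.

```python
from collections import defaultdict

# Hypothetical rows: (zipcode, price). Not taken from home_data.gl.
rows = [
    ('98039', 2000000.0),
    ('98039', 1500000.0),
    ('98004', 1200000.0),
    ('98004', 1000000.0),
    ('98178', 300000.0),
]

# Collect the prices observed for each zip code.
prices_by_zipcode = defaultdict(list)
for zipcode, price in rows:
    prices_by_zipcode[zipcode].append(price)

# Mean price per zip code.
mean_price_by_zipcode = {
    zipcode: sum(prices) / len(prices)
    for zipcode, prices in prices_by_zipcode.items()
}

# Zip code with the highest mean price, and that mean.
top_zipcode = max(mean_price_by_zipcode, key=mean_price_by_zipcode.get)
top_mean_price = mean_price_by_zipcode[top_zipcode]
```

Note that the zip codes are compared as strings here, which matches the string literal '98039' used in the filter above.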

2. Filtering data

Using logical filters, first select the houses that have ‘sqft_living’ higher than 2000 sqft but no larger than 4000 sqft. What fraction of all the houses have ‘sqft_living’ in this range? Save this result to answer the quiz at the end.

Total number of houses



In [12]:

    
total_houses = sales.num_rows()



In [13]:

    
print total_houses

Houses with the above criteria



In [17]:

    
filtered_houses = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)]



In [18]:

    
print filtered_houses.num_rows()



In [23]:

    
# Same filter written as a row-wise lambda; a chained comparison is the idiomatic form for scalars
filtered_houses = sales[sales.apply(lambda x: 2000 < x['sqft_living'] <= 4000)]



In [24]:

    
print filtered_houses.num_rows()



In [27]:

    
total_filtered_houses = filtered_houses.num_rows()



In [28]:

    
print total_filtered_houses

Fraction of Houses



In [33]:

    
filtered_houses_fraction = total_filtered_houses / float(total_houses)



In [34]:

    
print filtered_houses_fraction

0.421875722945
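
The range filter and the fraction can also be illustrated without GraphLab. The sketch below uses a short, made-up list of living areas (not the sales data) to show where the two boundaries fall: 2000 is excluded and 4000 is included. It also shows why one operand is converted to float before dividing, which matters under Python 2.

```python
# Hypothetical living areas in square feet; not taken from home_data.gl.
sqft_living = [1180, 2570, 2000, 4000, 5420, 3000]

# Keep values strictly above 2000 and at most 4000.
in_range = [s for s in sqft_living if 2000 < s <= 4000]

# Convert to float first so the division is not truncated under Python 2.
fraction_in_range = len(in_range) / float(len(sqft_living))
```

With these six values, three fall in the range, so the fraction is one half.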

3. Building a regression model with several more features

Build the feature set



In [36]:

    
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition', # condition of house
    'grade', # measure of quality of construction
    'waterfront', # waterfront property
    'view', # type of view
    'sqft_above', # square feet above ground
    'sqft_basement', # square feet in basement
    'yr_built', # the year built
    'yr_renovated', # the year renovated
    'lat', 'long', # the lat-long of the parcel
    'sqft_living15', # average sq.ft. of 15 nearest neighbors
    'sqft_lot15', # average lot size of 15 nearest neighbors
]



In [37]:

    
print advanced_features

['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode', 'condition', 'grade', 'waterfront', 'view', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']



In [38]:

    
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

Create train and test data



In [39]:

    
train_data, test_data = sales.random_split(.8, seed=0)
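
Passing a fixed seed makes the split repeatable between runs. The sketch below illustrates that idea in plain Python. It is an assumption-laden stand-in, not GraphLab's actual implementation: the helper name split_rows is hypothetical, and it simply sends each row to the first list with the given probability.

```python
import random

def split_rows(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

# Hypothetical rows standing in for the sales data.
rows = list(range(100))
train_rows, test_rows = split_rows(rows, 0.8, seed=0)
```

Calling split_rows again with the same seed returns the same two lists, and every row lands in exactly one of them. The first list holds roughly, not exactly, 80% of the rows.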

Compute the RMSE

Compute the RMSE (root mean squared error) on test_data for the model that uses just my_features and for the one that uses advanced_features.
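
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below computes it in plain Python on made-up numbers. It also includes a max_error helper, assuming that this metric means the largest absolute difference; that reading of GraphLab's 'max_error' is an assumption, and both function names are chosen here for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

def max_error(actual, predicted):
    """Largest absolute difference (assumed meaning of 'max_error')."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Hypothetical prices and predictions, not model output.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

example_rmse = rmse(actual, predicted)            # sqrt((1 + 0 + 4) / 3)
example_max_error = max_error(actual, predicted)  # largest gap is 2.0
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.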



In [40]:

    
my_feature_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 6
Number of unpacked features : 6
Number of coefficients      : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 1.057301     | 3763208.270524     | 181908.848367 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.



In [41]:

    
print my_feature_model.evaluate(test_data)

{'max_error': 3486584.509381928, 'rmse': 179542.43331269105}



In [43]:

    
print test_data['price'].mean()

543054.042563



In [44]:

    
advanced_feature_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features, validation_set=None)

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 18
Number of unpacked features : 18
Number of coefficients      : 127
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.080134     | 3469012.450663     | 154580.940735 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.



In [45]:

    
print advanced_feature_model.evaluate(test_data)

{'max_error': 3556849.413848093, 'rmse': 156831.11680191013}

Difference in RMSE

What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features? Save this result to answer the quiz at the end.



In [47]:

    
print my_feature_model.evaluate(test_data)['rmse'] - advanced_feature_model.evaluate(test_data)['rmse']

22711.3165108
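
As a quick cross-check, the same subtraction can be redone from the two test RMSE values reported by evaluate() earlier in the notebook. The sketch below copies those two numbers in as constants and also expresses the difference relative to the my_features RMSE. The variable names are chosen here for illustration.

```python
# Test-set RMSE values copied from the evaluate() outputs in this notebook.
rmse_my_features = 179542.43331269105
rmse_advanced_features = 156831.11680191013

# Absolute difference, matching the value printed by the cell above.
rmse_difference = rmse_my_features - rmse_advanced_features

# The same difference as a share of the my_features RMSE.
relative_reduction = rmse_difference / rmse_my_features
```

A positive difference means the model trained on advanced_features has the lower test RMSE.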

That's all folks!


