Load GraphLab Create


In [1]:
import graphlab

Basic settings


In [2]:
#limit number of worker processes to 4
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1474249782.log
This non-commercial license of GraphLab Create for academic use is assigned to sudhanshu.shekhar.iitd@gmail.com and will expire on September 18, 2017.

In [3]:
#set canvas to open inline
graphlab.canvas.set_target('ipynb')

Load House Sales Data


In [4]:
sales = graphlab.SFrame('home_data.gl/')

Assignment begins

1. Selection and summary statistics

In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price. Save this result to answer the quiz at the end.


In [5]:
highest_avg_price_zipcode = '98039'

In [7]:
sales_zipcode = sales[sales['zipcode'] == highest_avg_price_zipcode]

In [10]:
avg_price_highest_zipcode = sales_zipcode['price'].mean()

In [11]:
print avg_price_highest_zipcode


2160606.6
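As a library-free illustration of the select-then-average step, here is a minimal sketch in plain Python. The rows and prices are invented toy values, not the house sales data, and `mean_price_for_zipcode` is a hypothetical helper name. The zip code is compared as a string, as in the cell above.

```python
# Toy rows standing in for the sales table (values invented for illustration).
toy_sales = [
    {'zipcode': '98039', 'price': 2000000.0},
    {'zipcode': '98039', 'price': 1000000.0},
    {'zipcode': '98004', 'price': 900000.0},
]


def mean_price_for_zipcode(rows, zipcode):
    """Average the price of the rows whose zipcode matches exactly."""
    prices = [row['price'] for row in rows if row['zipcode'] == zipcode]
    if not prices:
        raise ValueError('no rows for zipcode %r' % zipcode)
    return sum(prices) / len(prices)


toy_mean = mean_price_for_zipcode(toy_sales, '98039')
print(toy_mean)
```

Raising on an empty selection is a design choice for the sketch: a mistyped zip code then fails loudly instead of producing a misleading average.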

2. Filtering data

Using logical filters, first select the houses that have ‘sqft_living’ higher than 2000 sqft but no larger than 4000 sqft. What fraction of all houses have ‘sqft_living’ in this range? Save this result to answer the quiz at the end.

Total number of houses


In [12]:
total_houses = sales.num_rows()

In [13]:
print total_houses


21613

Houses with the above criteria


In [17]:
filtered_houses = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)]

In [18]:
print filtered_houses.num_rows()


9118

In [23]:
# Equivalent row-wise filter; `and` is the idiomatic operator for plain Python booleans
filtered_houses = sales[sales.apply(lambda x: x['sqft_living'] > 2000 and x['sqft_living'] <= 4000)]

In [24]:
print filtered_houses.num_rows()


9118

In [27]:
total_filtered_houses = filtered_houses.num_rows()

In [28]:
print total_filtered_houses


9118

Fraction of Houses


In [33]:
filtered_houses_fraction = total_filtered_houses / float(total_houses)

In [34]:
print filtered_houses_fraction


0.421875722945
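The boundary handling (strictly greater than 2000, at most 4000) and the float division are the two places where this step can go wrong, so a small plain-Python sketch may help. `fraction_in_range` and the toy square footages are hypothetical; the last line repeats the notebook's arithmetic with the two counts printed above.

```python
def fraction_in_range(values, low, high):
    """Fraction of values v with low < v <= high (exclusive low, inclusive high)."""
    values = list(values)
    if not values:
        raise ValueError('values must not be empty')
    matches = sum(1 for v in values if low < v <= high)
    # float() keeps the division fractional under Python 2 as well.
    return matches / float(len(values))


# Toy values chosen to sit on both sides of each boundary.
toy_sqft = [1500, 2000, 2001, 3200, 4000, 4001]
toy_fraction = fraction_in_range(toy_sqft, 2000, 4000)
print(toy_fraction)

# Same arithmetic as the notebook, using the counts 9118 and 21613.
notebook_fraction = 9118 / float(21613)
print(notebook_fraction)
```

In the toy list, 2000 is excluded and 4000 is included, which matches the `> 2000` and `<= 4000` conditions used in the filter.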

3. Building a regression model with several more features

Build the feature set


In [36]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

In [37]:
print advanced_features


['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode', 'condition', 'grade', 'waterfront', 'view', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

In [38]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

Create train and test data


In [39]:
train_data, test_data = sales.random_split(.8, seed=0)
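To show why the `seed` argument matters, here is a rough plain-Python sketch of a seeded random split. It is not GraphLab Create's implementation; `toy_random_split` is a hypothetical helper that assumes each row is sent to the first part independently with probability `fraction`, so the split sizes are only approximately 80/20.

```python
import random


def toy_random_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


rows = list(range(1000))
train_a, test_a = toy_random_split(rows, 0.8, seed=0)
train_b, test_b = toy_random_split(rows, 0.8, seed=0)

# The same seed gives the same split, and every row lands in exactly one part.
print(train_a == train_b)
print(len(train_a) + len(test_a) == len(rows))
```

Reusing the same seed is what lets both regression models below be trained and evaluated on identical rows, so their RMSE values are directly comparable.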

Compute the RMSE

Compute the RMSE (root mean squared error) on test_data for the model using just my_features and for the one using advanced_features.
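For reference, the two metrics reported by `evaluate` can be sketched in plain Python. This assumes the usual definitions (RMSE as the square root of the mean squared residual, max_error as the largest absolute residual); the function names and the toy prices are illustrative, not taken from the library.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError('inputs must be non-empty and of equal length')
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))


def max_error(actual, predicted):
    """Largest absolute difference between actual and predicted values."""
    return max(abs(a - p) for a, p in zip(actual, predicted))


# Toy prices and predictions (invented for illustration).
toy_actual = [300000.0, 450000.0, 500000.0]
toy_predicted = [310000.0, 430000.0, 500000.0]

toy_rmse = rmse(toy_actual, toy_predicted)
toy_max_error = max_error(toy_actual, toy_predicted)
print(toy_rmse)
print(toy_max_error)
```

Because the residuals are squared before averaging, a few large misses raise the RMSE more than many small ones, and the result is in the same units as the target (here, dollars).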


In [40]:
my_feature_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)


Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 1.057301     | 3763208.270524     | 181908.848367 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.


In [41]:
print my_feature_model.evaluate(test_data)


{'max_error': 3486584.509381928, 'rmse': 179542.43331269105}

In [43]:
print test_data['price'].mean()


543054.042563

In [44]:
advanced_feature_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features, validation_set=None)


Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 18
Number of unpacked features : 18
Number of coefficients    : 127
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.080134     | 3469012.450663     | 154580.940735 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.


In [45]:
print advanced_feature_model.evaluate(test_data)


{'max_error': 3556849.413848093, 'rmse': 156831.11680191013}

Difference in RMSE

What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features? Save this result to answer the quiz at the end.


In [47]:
print my_feature_model.evaluate(test_data)['rmse'] - advanced_feature_model.evaluate(test_data)['rmse']


22711.3165108
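To put the difference in context, the short sketch below redoes the subtraction from the values printed above and relates each RMSE to the mean test price. The numbers are copied from the notebook output; the variable names are illustrative.

```python
# Values copied from the evaluate() and mean() outputs above.
rmse_my_features = 179542.43331269105
rmse_advanced = 156831.11680191013
mean_test_price = 543054.042563

# Absolute and relative improvement from adding the extra features.
rmse_difference = rmse_my_features - rmse_advanced
relative_reduction = rmse_difference / rmse_my_features

# Each RMSE as a share of the average test-set price.
rmse_share_my = rmse_my_features / mean_test_price
rmse_share_advanced = rmse_advanced / mean_test_price

print(rmse_difference)
print(relative_reduction)
print(rmse_share_my)
print(rmse_share_advanced)
```

With these numbers the advanced feature set lowers the test RMSE by roughly 12.6 percent, yet the remaining error is still a sizeable share of the mean sale price, so neither linear model is very precise.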

That's all folks!

