Load GraphLab Create

In [1]:
import graphlab

Basic settings

In [2]:
#limit number of worker processes to 4
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1474249782.log
In [3]:
#set canvas to open inline
graphlab.canvas.set_target('ipynb')

Load House Sales Data

In [4]:
sales = graphlab.SFrame('home_data.gl/')

Assignment begins

1. Selection and summary statistics

In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price. Save this result to answer the quiz at the end.
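
As a GraphLab-independent illustration of the select-then-average pattern described above, here is a minimal plain-Python sketch. The toy rows, their prices, and the helper name mean_price_for_zipcode are made up for illustration and are not taken from home_data.gl. The zip code is compared as a string, matching how the notebook stores it.

```python
# Hypothetical toy rows; the real data lives in the SFrame loaded above.
toy_sales = [
    {'zipcode': '98039', 'price': 2000000.0},
    {'zipcode': '98039', 'price': 1000000.0},
    {'zipcode': '98004', 'price': 900000.0},
]

def mean_price_for_zipcode(rows, zipcode):
    """Average 'price' over the rows whose 'zipcode' matches exactly."""
    prices = [row['price'] for row in rows if row['zipcode'] == zipcode]
    if not prices:
        raise ValueError('no rows for zipcode %r' % zipcode)
    return sum(prices) / float(len(prices))

toy_mean = mean_price_for_zipcode(toy_sales, '98039')
print(toy_mean)  # 1500000.0 for the toy rows above
```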

In [5]:
highest_avg_price_zipcode = '98039'

In [7]:
sales_zipcode = sales[sales['zipcode'] == highest_avg_price_zipcode]

In [10]:
avg_price_highest_zipcode = sales_zipcode['price'].mean()

In [11]:
print avg_price_highest_zipcode


2. Filtering data

Using logical filters, first select the houses that have ‘sqft_living’ higher than 2000 sqft but no larger than 4000 sqft. What fraction of all houses have ‘sqft_living’ in this range? Save this result to answer the quiz at the end.
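
The range in the question is open at the lower end and closed at the upper end: 2000 < sqft_living <= 4000. The boundary handling can be checked on a small made-up list before running it on the real data. This is a sketch only; the values and the helper name fraction_in_range are hypothetical.

```python
def fraction_in_range(values, low, high):
    """Fraction of values v that satisfy low < v <= high."""
    if not values:
        raise ValueError('values must be non-empty')
    in_range = [v for v in values if low < v <= high]
    return len(in_range) / float(len(values))

# Hypothetical living areas, chosen to sit on and around both boundaries.
toy_sqft = [1500, 2000, 2001, 3000, 4000, 4001, 5200, 900]
toy_fraction = fraction_in_range(toy_sqft, 2000, 4000)
print(toy_fraction)  # 0.375: only 2001, 3000 and 4000 qualify
```

A house of exactly 2000 sqft is excluded and one of exactly 4000 sqft is included, which is what the `>` and `<=` comparisons in the notebook cells express.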

Total number of houses

In [12]:
total_houses = sales.num_rows()

In [13]:
print total_houses


Houses with the above criteria

In [17]:
filtered_houses = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)]

In [18]:
print filtered_houses.num_rows()


In [23]:
# Same filter as above, expressed row by row with apply
filtered_houses = sales[sales.apply(lambda x: 2000 < x['sqft_living'] <= 4000)]

In [24]:
print filtered_houses.num_rows()


In [27]:
total_filtered_houses = filtered_houses.num_rows()

In [28]:
print total_filtered_houses


Fraction of Houses

In [33]:
filtered_houses_fraction = total_filtered_houses / float(total_houses)

In [34]:
print filtered_houses_fraction


3. Building a regression model with several more features

Build the feature set

In [36]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

In [37]:
print advanced_features

['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode', 'condition', 'grade', 'waterfront', 'view', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

In [38]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

Create train and test data

In [39]:
train_data, test_data = sales.random_split(.8, seed=0)
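
Fixing the seed makes the split reproducible, so both models below are trained and evaluated on the same rows. The sketch below illustrates seeded splitting in general using Python's standard library. It is not GraphLab's implementation, and the helper name seeded_split is hypothetical.

```python
import random

def seeded_split(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the split
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

toy_rows = list(range(1000))
train_a, test_a = seeded_split(toy_rows, 0.8, seed=0)
train_b, test_b = seeded_split(toy_rows, 0.8, seed=0)

print(train_a == train_b)           # True: same seed, same split
print(len(train_a) + len(test_a))   # 1000: every row lands in exactly one part
```

With a per-row random assignment like this one, the first part holds roughly, not exactly, 80% of the rows.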

Compute the RMSE

Compute the RMSE (root mean squared error) on test_data for two models: one trained on just my_features and one trained on advanced_features.
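
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal plain-Python sketch of that definition follows. The prices are made up, and the helper name rmse is not part of GraphLab, whose evaluate() reports this metric directly.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError('inputs must be non-empty and of equal length')
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))

# Hypothetical prices: errors of -10000, +20000 and 0.
toy_actual = [300000.0, 450000.0, 500000.0]
toy_predicted = [310000.0, 430000.0, 500000.0]
toy_rmse = rmse(toy_actual, toy_predicted)
print(toy_rmse)
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, and the result is in the same units as the target (here, dollars).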

In [40]:
my_feature_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)

Linear regression:
Number of examples          : 17384
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 115
Starting Newton Method
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
| 1         | 2        | 1.057301     | 3763208.270524     | 181908.848367 |
SUCCESS: Optimal solution found.

In [41]:
print my_feature_model.evaluate(test_data)

{'max_error': 3486584.509381928, 'rmse': 179542.43331269105}

In [43]:
print test_data['price'].mean()


In [44]:
advanced_feature_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features, validation_set=None)

Linear regression:
Number of examples          : 17384
Number of features          : 18
Number of unpacked features : 18
Number of coefficients    : 127
Starting Newton Method
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
| 1         | 2        | 0.080134     | 3469012.450663     | 154580.940735 |
SUCCESS: Optimal solution found.

In [45]:
print advanced_feature_model.evaluate(test_data)

{'max_error': 3556849.413848093, 'rmse': 156831.11680191013}

Difference in RMSE

What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features? Save this result to answer the quiz at the end.

In [47]:
print my_feature_model.evaluate(test_data)['rmse'] - advanced_feature_model.evaluate(test_data)['rmse']
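
As a sanity check, the same subtraction can be done by hand from the two test-set RMSE values printed by the evaluate() calls earlier in this notebook. The relative reduction is an extra figure added here for context; the quiz question asks only for the difference.

```python
# Test-set RMSE values copied from the evaluate() outputs shown earlier.
rmse_my_features = 179542.43331269105
rmse_advanced_features = 156831.11680191013

# Positive difference means the advanced-features model has the lower error.
rmse_difference = rmse_my_features - rmse_advanced_features
relative_reduction = rmse_difference / rmse_my_features

print(round(rmse_difference, 2))     # roughly 22711.32
print(round(relative_reduction, 4))  # roughly 0.1265, i.e. about 12.6% lower
```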


That's all folks!

