``````

In [1]:

import graphlab

``````

## Basic settings

``````

In [2]:

#limit number of worker processes to 4
graphlab.set_runtime_config('GRAPHLAB_DEFAULT_NUM_PYLAMBDA_WORKERS', 4)

``````
``````

[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1474249782.log

``````
``````

In [3]:

#set canvas to open inline
graphlab.canvas.set_target('ipynb')

``````

``````

In [4]:

sales = graphlab.SFrame('home_data.gl/')

``````

# Assignment begins

## 1. Selection and summary statistics

In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price. Save this result to answer the quiz at the end.

``````

In [5]:

highest_avg_price_zipcode = '98039'

``````
``````

In [7]:

sales_zipcode = sales[sales['zipcode'] == highest_avg_price_zipcode]

``````
``````

In [10]:

avg_price_highest_zipcode = sales_zipcode['price'].mean()

``````
``````

In [11]:

print avg_price_highest_zipcode

``````
``````

2160606.6

``````
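The selection above amounts to a boolean row filter followed by a column mean. A minimal plain-Python sketch of those two steps follows; the toy rows, their prices, and the helper name `mean_price_for_zipcode` are illustrative assumptions, not values or functions from `home_data.gl` or GraphLab.

```python
# Toy rows standing in for the SFrame; the numbers are invented for illustration.
toy_sales = [
    {"zipcode": "98039", "price": 2000000.0},
    {"zipcode": "98039", "price": 1000000.0},
    {"zipcode": "98004", "price": 900000.0},
]


def mean_price_for_zipcode(rows, zipcode):
    """Average the 'price' of the rows whose 'zipcode' equals `zipcode`."""
    prices = [row["price"] for row in rows if row["zipcode"] == zipcode]
    if not prices:
        raise ValueError("no rows for zipcode %r" % zipcode)
    return sum(prices) / len(prices)


toy_mean = mean_price_for_zipcode(toy_sales, "98039")
```

Note that the zip code is compared as a string, matching the `'98039'` literal used in the cell above.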

## 2. Filtering data

Using logical filters, first select the houses that have ‘sqft_living’ greater than 2000 sqft but no larger than 4000 sqft. What fraction of all the houses have ‘sqft_living’ in this range? Save this result to answer the quiz at the end.

### Total number of houses

``````

In [12]:

total_houses = sales.num_rows()

``````
``````

In [13]:

print total_houses

``````
``````

21613

``````

### Houses with the above criteria

``````

In [17]:

filtered_houses = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)]

``````
``````

In [18]:

print filtered_houses.num_rows()

``````
``````

9118

``````
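The filter uses a strict lower bound and an inclusive upper bound, so a house with exactly 2000 sqft is excluded while one with exactly 4000 sqft is kept. A small sketch of the same element-wise mask in plain Python; the values and the helper name `in_range` are made up to probe the boundaries.

```python
def in_range(sqft):
    """True when 2000 < sqft <= 4000 (lower bound exclusive, upper inclusive)."""
    return 2000 < sqft <= 4000


toy_sqft = [1500, 2000, 2001, 4000, 4001]  # invented values around both bounds
mask = [in_range(s) for s in toy_sqft]     # one boolean per row
selected = [s for s, keep in zip(toy_sqft, mask) if keep]
```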
``````

In [23]:

filtered_houses = sales[sales.apply(lambda x: 2000 < x['sqft_living'] <= 4000)]

``````
``````

In [24]:

print filtered_houses.num_rows()

``````
``````

9118

``````
``````

In [27]:

total_filtered_houses = filtered_houses.num_rows()

``````
``````

In [28]:

print total_filtered_houses

``````
``````

9118

``````

### Fraction of Houses

``````

In [33]:

filtered_houses_fraction = total_filtered_houses / float(total_houses)

``````
``````

In [34]:

print filtered_houses_fraction

``````
``````

0.421875722945

``````
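The `float(...)` call matters because this notebook runs on Python 2, where `/` between two integers discards the fractional part. A short check using the two row counts reported above (the variable `floor_quotient` is only there to show what integer division would return):

```python
total_houses = 21613          # row count reported above
total_filtered_houses = 9118  # rows with 2000 < sqft_living <= 4000, reported above

# Converting one operand to float forces true division on Python 2 and 3 alike.
fraction = total_filtered_houses / float(total_houses)

# Floor division mimics what Python 2's plain `/` does with two ints.
floor_quotient = total_filtered_houses // total_houses
```

Here `fraction` is roughly 0.42, while `floor_quotient` is 0.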

## 3. Building a regression model with several more features

### Build the feature set

``````

In [36]:

advanced_features = [
'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
'condition', # condition of house
'grade', # measure of quality of construction
'waterfront', # waterfront property
'view', # type of view
'sqft_above', # square feet above ground
'sqft_basement', # square feet in basement
'yr_built', # the year built
'yr_renovated', # the year renovated
'lat', 'long', # the lat-long of the parcel
'sqft_living15', # average sq.ft. of 15 nearest neighbors
'sqft_lot15', # average lot size of 15 nearest neighbors
]

``````
``````

In [37]:

print advanced_features

``````
``````

['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode', 'condition', 'grade', 'waterfront', 'view', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

``````
``````

In [38]:

my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

``````

### Create train and test data

``````

In [39]:

train_data, test_data = sales.random_split(.8, seed=0)

``````
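`random_split(.8, seed=0)` places roughly 80% of the rows in the first SFrame and the rest in the second, and fixing the seed makes the split repeatable. The sketch below illustrates that idea in plain Python. It is not GraphLab's actual algorithm, and `split_rows` is a hypothetical helper introduced only for this example.

```python
import random


def split_rows(rows, fraction, seed):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # private generator so the split is repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second


toy_rows = list(range(1000))  # stand-in for the rows of `sales`
train_rows, test_rows = split_rows(toy_rows, 0.8, seed=0)
```

Because each row is assigned independently, the first part holds approximately, not exactly, 80% of the rows, which is consistent with the 17384 training examples out of 21613 reported in the model output.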

### Compute the RMSE

Compute the RMSE (root mean squared error) on test_data for the model using just my_features and for the one using advanced_features.
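RMSE is the square root of the mean of the squared differences between the actual and the predicted values. A minimal sketch of that definition in plain Python; the toy prices and the helper name `rmse` are assumptions for illustration, and GraphLab's `evaluate` computes the metric internally.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


toy_actual = [300000.0, 500000.0]     # invented sale prices
toy_predicted = [330000.0, 460000.0]  # invented predictions (errors of -30000 and +40000)
toy_rmse = rmse(toy_actual, toy_predicted)
```

Since the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.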

``````

In [40]:

my_feature_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)

``````
``````

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 1.057301     | 3763208.270524     | 181908.848367 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

``````
``````

In [41]:

print my_feature_model.evaluate(test_data)

``````
``````

{'max_error': 3486584.509381928, 'rmse': 179542.43331269105}

``````
``````

In [43]:

print test_data['price'].mean()

``````
``````

543054.042563

``````
``````

In [44]:

# fit the same model on advanced_features (model variable name assumed)
advanced_features_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features, validation_set=None)

``````
``````

Linear regression:
--------------------------------------------------------
Number of examples          : 17384
Number of features          : 18
Number of unpacked features : 18
Number of coefficients    : 127
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.080134     | 3469012.450663     | 154580.940735 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

``````
``````

In [45]:

print advanced_features_model.evaluate(test_data)

``````
``````

{'max_error': 3556849.413848093, 'rmse': 156831.11680191013}

``````

### Difference in RMSE

What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features? Save this result to answer the quiz at the end.

``````

In [47]:

print my_feature_model.evaluate(test_data)['rmse'] - advanced_features_model.evaluate(test_data)['rmse']

``````
``````

22711.3165108

``````
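As a quick sanity check, the difference can be recomputed from the two test-set RMSE values reported by `evaluate` above. The variable names below are illustrative, and `relative_reduction` is an extra quantity added here to put the difference in context.

```python
rmse_my_features = 179542.43331269105        # test RMSE reported for my_features
rmse_advanced_features = 156831.11680191013  # test RMSE reported for advanced_features

# Absolute improvement from adding the extra features.
rmse_difference = rmse_my_features - rmse_advanced_features

# The same improvement as a share of the simpler model's RMSE.
relative_reduction = rmse_difference / rmse_my_features
```

The absolute difference matches the value printed above, and it corresponds to a reduction of roughly 12.6% relative to the my_features model.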

# That's all folks!
