In [2]:
import graphlab

In [5]:
sales = graphlab.SFrame('home_data.gl/')

Q1


In [31]:
houses_1 = sales[sales['zipcode']=='98039']

In [32]:
houses_1['price'].mean()


Out[32]:
2160606.5999999996

Q2


In [33]:
houses_2 = sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] < 4000)]

In [36]:
len(houses_2) / float(len(sales))


Out[36]:
0.4215518437977143
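
The fraction above depends on whether the bounds are strict. The filter in In [33] uses `>` and `<`, so houses of exactly 2000 or 4000 sq ft are excluded. A minimal sketch in plain Python, using made-up areas rather than the `home_data.gl` values, shows how the choice of bound changes the result:

```python
# Hypothetical living areas in square feet (NOT taken from home_data.gl).
sqft_living = [1500, 2000, 2500, 3999, 4000, 5200]

# Strict bounds, as in the filter above: 2000 and 4000 themselves are excluded.
strict = [s for s in sqft_living if 2000 < s < 4000]

# Inclusive upper bound, shown only for comparison.
inclusive_upper = [s for s in sqft_living if 2000 < s <= 4000]

# float() keeps the division from truncating under Python 2.
strict_fraction = len(strict) / float(len(sqft_living))
inclusive_fraction = len(inclusive_upper) / float(len(sqft_living))

print(strict_fraction)
print(inclusive_fraction)
```

If the question intends "no larger than 4000", the upper comparison would need to be `<=`; which one applies is a matter of how the question is worded.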

Q3

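For reference, the baseline RMSE is about \$255,170, and it goes down to \$179,508 with more features.

With the 18 advanced features used below, the RMSE on the test data falls further, to about \$156,714.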
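
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below is plain Python with made-up prices; the `rmse` helper is hypothetical and is not part of GraphLab Create, which reports the value through `evaluate`.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))


# Hypothetical sale prices and model predictions (NOT from home_data.gl).
actual = [300000.0, 450000.0, 600000.0]
predicted = [310000.0, 430000.0, 640000.0]

example_rmse = rmse(actual, predicted)
print(example_rmse)
```

Because the errors are squared before averaging, a single large miss (here the \$40,000 one) weighs more heavily than several small ones.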


In [8]:
train_data, test_data = sales.random_split(.8, seed=0)

In [9]:
advanced_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode', 'condition', 'grade', 'waterfront', 'view', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'lat', 'long', 'sqft_living15', 'sqft_lot15']

In [10]:
advanced_features_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16510
PROGRESS: Number of features          : 18
PROGRESS: Number of unpacked features : 18
PROGRESS: Number of coefficients    : 127
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 1.046632     | 3485902.939104     | 1284169.317824       | 154910.036363 | 149644.352069   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

In [11]:
print advanced_features_model.evaluate(test_data)


{'max_error': 3553409.1030555945, 'rmse': 156714.09846260582}
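
To put the RMSE figures quoted in this notebook side by side, the relative reduction can be worked out with a few lines of arithmetic. The variable names and the `relative_reduction` helper are labels chosen here for illustration; the three numbers are the ones that appear in the notebook.

```python
# RMSE values quoted in this notebook, in dollars.
baseline_rmse = 255170.0
more_features_rmse = 179508.0
advanced_test_rmse = 156714.09846260582  # 'rmse' from evaluate(test_data)


def relative_reduction(before, after):
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


reduction_more = relative_reduction(baseline_rmse, more_features_rmse)
reduction_advanced = relative_reduction(baseline_rmse, advanced_test_rmse)

print(round(reduction_more, 3))
print(round(reduction_advanced, 3))
```

The comparison is only meaningful if all three values were measured on the same test split.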