In [2]:
#Predicting House Prices

In [11]:
import graphlab

In [6]:
#Load the House Sales Data

In [13]:
sales = graphlab.SFrame('home_data.gl')

In [14]:
sales


Out[14]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
7129300520 2014-10-13 00:00:00+00:00 221900 3 1 1180 5650 1 0
6414100192 2014-12-09 00:00:00+00:00 538000 3 2.25 2570 7242 2 0
5631500400 2015-02-25 00:00:00+00:00 180000 2 1 770 10000 1 0
2487200875 2014-12-09 00:00:00+00:00 604000 4 3 1960 5000 1 0
1954400510 2015-02-18 00:00:00+00:00 510000 3 2 1680 8080 1 0
7237550310 2014-05-12 00:00:00+00:00 1225000 4 4.5 5420 101930 1 0
1321400060 2014-06-27 00:00:00+00:00 257500 3 2.25 1715 6819 2 0
2008000270 2015-01-15 00:00:00+00:00 291850 3 1.5 1060 9711 1 0
2414600126 2015-04-15 00:00:00+00:00 229500 3 1 1780 7470 1 0
3793500160 2015-03-12 00:00:00+00:00 323000 3 2.5 1890 6560 2 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 7 1180 0 1955 0 98178 47.51123398
0 3 7 2170 400 1951 1991 98125 47.72102274
0 3 6 770 0 1933 0 98028 47.73792661
0 5 7 1050 910 1965 0 98136 47.52082
0 3 8 1680 0 1987 0 98074 47.61681228
0 3 11 3890 1530 2001 0 98053 47.65611835
0 3 7 1715 0 1995 0 98003 47.30972002
0 3 7 1060 0 1963 0 98198 47.40949984
0 3 7 1050 730 1960 0 98146 47.51229381
0 3 7 1890 0 2003 0 98038 47.36840673
long sqft_living15 sqft_lot15
-122.25677536 1340.0 5650.0
-122.3188624 1690.0 7639.0
-122.23319601 2720.0 8062.0
-122.39318505 1360.0 5000.0
-122.04490059 1800.0 7503.0
-122.00528655 4760.0 101930.0
-122.32704857 2238.0 6819.0
-122.31457273 1650.0 9711.0
-122.33659507 1780.0 8113.0
-122.0308176 2390.0 7570.0
[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [15]:
#Exploring the Data for Housing

In [16]:
sales.show(view = "Categorical")


[WARNING] View param Categorical is not recognized. You may want to use "Summary", "Plots", "Table" or a specific plot type such as "Scatter Plot". I am showing you the Summary view for now.
Canvas is accessible via web browser at the URL: http://localhost:57196/index.html
Opening Canvas in default web browser.

In [18]:
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x = "sqft_living", y = "price")



In [19]:
#Create a regression model



In [21]:
train_data, test_data = sales.random_split(.8, seed=20154)
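
As a rough illustration of what a seeded split does, the sketch below shuffles row indices with a fixed seed and cuts them at 80%. It uses only the standard library; `split_indices` is a hypothetical helper, not part of GraphLab Create, and it makes an exact cut, whereas `random_split` returns only approximately the requested fraction of rows.

```python
import random

def split_indices(n_rows, fraction, seed):
    """Hypothetical helper: shuffle row indices with a fixed seed, then cut."""
    rng = random.Random(seed)          # private generator, so the seed fully determines the order
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cut = int(n_rows * fraction)
    return indices[:cut], indices[cut:]

# 21613 is the row count reported for the sales SFrame
train_idx, test_idx = split_indices(21613, 0.8, seed=20154)
again_train, again_test = split_indices(21613, 0.8, seed=20154)

print(len(train_idx))
print(len(test_idx))
print(train_idx == again_train)        # same seed, same split
```

Fixing the seed is what makes the later RMSE numbers reproducible from run to run.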

In [22]:
#Build the regression model

In [23]:
sqft_model = graphlab.linear_regression.create(train_data, target='price', features = ['sqft_living'])


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


In [24]:
sqft_model


Out[24]:
Class                         : LinearRegression

Schema
------
Number of coefficients        : 2
Number of examples            : 16504
Number of feature columns     : 1
Number of unpacked features   : 1

Hyperparameters
---------------
L1 penalty                    : 0.0
L2 penalty                    : 0.01

Training Summary
----------------
Solver                        : auto
Solver iterations             : 1
Solver status                 : SUCCESS: Optimal solution found.
Training time (sec)           : 1.0206

Settings
--------
Residual sum of squares       : 1.09660506066e+15
Training RMSE                 : 257768.8983

Highest Positive Coefficients
-----------------------------
sqft_living                   : 275.7252

Lowest Negative Coefficients
----------------------------
(intercept)                   : -33799.5529
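
The residual sum of squares and the training RMSE in the summary are two views of the same quantity. Assuming the usual definition RMSE = sqrt(RSS / n), a quick check with the numbers printed above:

```python
import math

# Values copied from the model summary above
rss = 1.09660506066e+15   # residual sum of squares
n_examples = 16504        # number of training examples

training_rmse = math.sqrt(rss / n_examples)
print(round(training_rmse, 2))
```

The result agrees with the reported Training RMSE to within rounding of the printed values.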

In [25]:
#Evaluate the model

In [26]:
print test_data['price'].mean()


541862.773989

In [28]:
print sqft_model.evaluate(test_data)


{'max_error': 4191877.672792676, 'rmse': 278301.6262443602}
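
The two numbers returned by `evaluate` can be reproduced by hand. The sketch below assumes the standard definitions (root mean squared error, and the largest absolute error on any single row); the prices in it are made up purely to exercise the functions.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error over paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def max_error(actual, predicted):
    """Largest absolute difference between an actual and a predicted value."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

# Made-up prices, not taken from home_data.gl
actual = [200000.0, 350000.0, 500000.0]
predicted = [210000.0, 330000.0, 540000.0]

toy_rmse = rmse(actual, predicted)
toy_max = max_error(actual, predicted)
print(toy_rmse)
print(toy_max)
```

Because RMSE squares the errors before averaging, a few badly mispriced houses can dominate it, which is one reason to look at `max_error` alongside it.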

In [29]:
#Let's show what our predictions look like

In [38]:
import matplotlib.pyplot as plt
%matplotlib inline


In [41]:
plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')


Out[41]:
[<matplotlib.lines.Line2D at 0x3805c710>,
 <matplotlib.lines.Line2D at 0x3805c978>]

In [42]:
sqft_model.get('coefficients')


Out[42]:
name index value stderr
(intercept) None -33799.5528673 4963.18092561
sqft_living None 275.72516482 2.18252477396
[2 rows x 4 columns]
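
With a single feature, the model is just a straight line, so a prediction can be computed directly from the two coefficients in the table above. A minimal sketch, where `predict_price` is a hypothetical helper and 2000 sq ft is a made-up house size:

```python
# Coefficients copied from the table above
INTERCEPT = -33799.5528673
SLOPE = 275.72516482   # dollars per square foot of living area

def predict_price(sqft_living):
    """Evaluate the fitted line: price = intercept + slope * sqft_living."""
    return INTERCEPT + SLOPE * sqft_living

# 2000 sq ft is a made-up size chosen for illustration
print(predict_price(2000))

# Each additional square foot adds SLOPE dollars to the prediction
print(predict_price(2001) - predict_price(2000))
```

The negative intercept is only where the line crosses the axis; it says nothing meaningful about a house with zero living area.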

In [46]:
#Explore features in the data

In [48]:
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [49]:
sales[features].show()



In [50]:
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')



In [51]:
#Build a regression model with more features

In [53]:
my_features_model = graphlab.linear_regression.create(train_data, target = 'price', features = features)


Linear regression:
--------------------------------------------------------
Number of examples          : 16464
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 116
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 0.009012     | 3793297.844995     | 2618909.222689       | 176096.068151 | 201495.196343   |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.


In [55]:
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)


{'max_error': 4191877.672792676, 'rmse': 278301.6262443602}
{'max_error': 3469815.0820734077, 'rmse': 199706.103027552}
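
To put the two test RMSE values in perspective, the sketch below compares them with each other and with the mean test price printed earlier. All numbers are copied from the notebook outputs; nothing new is measured.

```python
# Values copied from the outputs above
mean_price = 541862.773989
rmse_sqft = 278301.6262443602
rmse_features = 199706.103027552

# Fractional drop in RMSE from adding the extra features
reduction = (rmse_sqft - rmse_features) / rmse_sqft

# RMSE expressed as a fraction of the average test price
relative_sqft = rmse_sqft / mean_price
relative_features = rmse_features / mean_price

print('RMSE reduction: %.1f%%' % (100 * reduction))
print('sqft_model RMSE / mean price: %.2f' % relative_sqft)
print('my_features_model RMSE / mean price: %.2f' % relative_features)
```

Even the better model has a typical error that is a sizeable fraction of the average price, so there is still a lot of variation in price that these features do not explain.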

In [57]:
#Apply learned models to predict prices of houses from the dataset

In [59]:
house1 = sales[sales['id'] == '5309101200']


In [63]:
house1['price']


Out[63]:
dtype: int
Rows: ?
[620000L, ... ]
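
The `Rows: ?` in the output reflects lazy evaluation: the filter has been described but not yet run over every row. A loose analogy in plain Python, using a generator; `lazy_filter` is a hypothetical helper, and the three rows are the first three ids and prices from the table at the top of the notebook.

```python
rows = [
    {'id': '7129300520', 'price': 221900},
    {'id': '6414100192', 'price': 538000},
    {'id': '5631500400', 'price': 180000},
]

def lazy_filter(rows, wanted_id):
    """Yield matching rows one at a time; nothing is scanned until iterated."""
    for row in rows:
        if row['id'] == wanted_id:   # ids are compared as strings, as in the notebook
            yield row

matches = lazy_filter(rows, '6414100192')   # no work done yet, length unknown
materialized = list(matches)                # forces the scan

print(len(materialized))
print(materialized[0]['price'])
```

This is only an analogy for the behaviour, not a description of how SFrame implements it.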


In [65]:
print sqft_model.predict(house1)


[627940.8427018974]

In [66]:
print my_features_model.predict(house1)


[708624.4931138828]

In [68]:
#Predict the price of house2

In [70]:
house2 = sales[sales['id'] == '1925069082']

In [71]:
house2


Out[71]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
1925069082 2015-05-11 00:00:00+00:00 2200000 5 4.25 4640 22703 2 1
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
4 5 8 2860 1780 1952 0 98052 47.63925783
long sqft_living15 sqft_lot15
-122.09722322 3140.0 14200.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [72]:
print sqft_model.predict(house2)
print my_features_model.predict(house2)


[1245565.211899782]
[1408277.6382105632]
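
Putting the predictions for both houses next to their actual sale prices makes the comparison concrete. The numbers below are copied from the notebook outputs; the dictionary layout is just one convenient way to hold them.

```python
# Actual prices and predictions copied from the outputs above
houses = {
    'house1': {'actual': 620000.0,
               'sqft_model': 627940.8427018974,
               'my_features_model': 708624.4931138828},
    'house2': {'actual': 2200000.0,
               'sqft_model': 1245565.211899782,
               'my_features_model': 1408277.6382105632},
}

errors = {}
for name in sorted(houses):
    row = houses[name]
    for model in ('sqft_model', 'my_features_model'):
        error = row[model] - row['actual']        # positive means over-prediction
        errors[(name, model)] = error
        pct = 100.0 * error / row['actual']
        print('%s %s: error %+.0f (%+.1f%%)' % (name, model, error, pct))
```

For house1 the one-feature model happens to land closer, while for house2 the larger model is closer but both fall well short. One possible reason is that house2 is a waterfront property, and `waterfront` is not among the features either model uses.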

In [73]:
my_features_model.coefficients


Out[73]:
name index value stderr
(intercept) None 98159.4765521 16392.3711753
bedrooms 3 41109.9874825 3586.29522016
bedrooms 2 59454.7544074 5665.20736852
bedrooms 5 -49193.8341442 5790.34377681
bedrooms 6 -120936.424068 13360.7867842
bedrooms 1 80305.2609204 15033.9008928
bedrooms 7 -311644.309772 35278.1495386
bedrooms 0 58777.8139835 81258.9608497
bedrooms 8 -280161.088204 57091.1131577
bedrooms 9 -577689.590371 103340.080518
[116 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.
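
The model reports 6 features but 116 coefficients, and the table lists a separate coefficient for each `bedrooms` value. That pattern is what one would expect if some columns are stored as strings and expanded into indicator (one-hot) variables. This is an inference from the output, not something the notebook states. The sketch below shows the idea on toy data; dropping the first level as a reference category is a common convention and an assumption here, and both helpers are hypothetical.

```python
def one_hot(value, levels):
    """Encode value as 0/1 indicators, one per level after the first.

    The first level acts as the reference category and gets no indicator.
    """
    return [1 if value == level else 0 for level in levels[1:]]

def count_coefficients(numeric_features, categorical_levels):
    """Intercept + one slope per numeric feature + (k - 1) per categorical feature."""
    total = 1 + len(numeric_features)
    for levels in categorical_levels.values():
        total += len(levels) - 1
    return total

# Toy levels, not the real distinct values in home_data.gl
toy_levels = {'bedrooms': ['1', '2', '3', '4'], 'floors': ['1', '2']}

encoded = one_hot('3', toy_levels['bedrooms'])
n_coefficients = count_coefficients(['sqft_living', 'sqft_lot'], toy_levels)
print(encoded)
print(n_coefficients)
```

Under this reading, a column with many distinct values, such as `zipcode`, would account for most of the coefficients, and each one is a price offset for that value relative to the reference level.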
