Fire up GraphLab Create


In [1]:
import graphlab


Load some house sales data


In [2]:
sales = graphlab.SFrame('home_data.gl/')


2016-05-10 07:23:46,772 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlab_server_1462879425.log

In [42]:
sales


Out[42]:
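+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| 7129300520 | 2014-10-13 00:00:00+00:00 | 221900  | 3        | 1         | 1180        | 5650     | 1      | 0          |
| 6414100192 | 2014-12-09 00:00:00+00:00 | 538000  | 3        | 2.25      | 2570        | 7242     | 2      | 0          |
| 5631500400 | 2015-02-25 00:00:00+00:00 | 180000  | 2        | 1         | 770         | 10000    | 1      | 0          |
| 2487200875 | 2014-12-09 00:00:00+00:00 | 604000  | 4        | 3         | 1960        | 5000     | 1      | 0          |
| 1954400510 | 2015-02-18 00:00:00+00:00 | 510000  | 3        | 2         | 1680        | 8080     | 1      | 0          |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000 | 4        | 4.5       | 5420        | 101930   | 1      | 0          |
| 1321400060 | 2014-06-27 00:00:00+00:00 | 257500  | 3        | 2.25      | 1715        | 6819     | 2      | 0          |
| 2008000270 | 2015-01-15 00:00:00+00:00 | 291850  | 3        | 1.5       | 1060        | 9711     | 1      | 0          |
| 2414600126 | 2015-04-15 00:00:00+00:00 | 229500  | 3        | 1         | 1780        | 7470     | 1      | 0          |
| 3793500160 | 2015-03-12 00:00:00+00:00 | 323000  | 3        | 2.5       | 1890        | 6560     | 2      | 0          |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 3         | 7     | 1180       | 0             | 1955     | 0            | 98178   | 47.51123398 |
| 0    | 3         | 7     | 2170       | 400           | 1951     | 1991         | 98125   | 47.72102274 |
| 0    | 3         | 6     | 770        | 0             | 1933     | 0            | 98028   | 47.73792661 |
| 0    | 5         | 7     | 1050       | 910           | 1965     | 0            | 98136   | 47.52082    |
| 0    | 3         | 8     | 1680       | 0             | 1987     | 0            | 98074   | 47.61681228 |
| 0    | 3         | 11    | 3890       | 1530          | 2001     | 0            | 98053   | 47.65611835 |
| 0    | 3         | 7     | 1715       | 0             | 1995     | 0            | 98003   | 47.30972002 |
| 0    | 3         | 7     | 1060       | 0             | 1963     | 0            | 98198   | 47.40949984 |
| 0    | 3         | 7     | 1050       | 730           | 1960     | 0            | 98146   | 47.51229381 |
| 0    | 3         | 7     | 1890       | 0             | 2003     | 0            | 98038   | 47.36840673 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.25677536 | 1340.0        | 5650.0     |
| -122.3188624  | 1690.0        | 7639.0     |
| -122.23319601 | 2720.0        | 8062.0     |
| -122.39318505 | 1360.0        | 5000.0     |
| -122.04490059 | 1800.0        | 7503.0     |
| -122.00528655 | 4760.0        | 101930.0   |
| -122.32704857 | 2238.0        | 6819.0     |
| -122.31457273 | 1650.0        | 9711.0     |
| -122.33659507 | 1780.0        | 8113.0     |
| -122.0308176  | 2390.0        | 7570.0     |
+---------------+---------------+------------+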
[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exploring the data for housing sales


In [4]:
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")


Create a simple regression model of sqft_living to price


In [43]:
seed = 0
train_data, test_data = sales.random_split(0.80, seed=seed)
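
As a rough illustration of what a seeded random split does, the sketch below assigns each row to the first partition with probability 0.8 using Python's standard random module. The function name random_split_rows is hypothetical, and this is not GraphLab Create's actual implementation. It only shows why fixing the seed makes the split reproducible and why the training share comes out approximately, not exactly, 80%.

```python
import random

def random_split_rows(rows, fraction, seed=None):
    """Split rows into two lists; each row lands in the first list with probability `fraction`."""
    rng = random.Random(seed)  # private generator so the seed fully determines the split
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for 1000 house records
train_rows, test_rows = random_split_rows(rows, 0.80, seed=0)
train_again, test_again = random_split_rows(rows, 0.80, seed=0)

print(len(train_rows) + len(test_rows))  # every row ends up in exactly one partition
print(train_rows == train_again)         # same seed, same split
```

Changing the seed gives a different partition, which is why the notebook pins seed = 0 before training.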

Build the regression model


In [44]:
sqft_model = graphlab.linear_regression.create(train_data, target="price", features=['sqft_living'])


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Linear regression:
--------------------------------------------------------
Number of examples          : 16564
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 0.008041     | 4340948.503521     | 1880558.747720       | 263629.929331 | 248705.885911   |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.
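
To make the fitting step less of a black box, here is a minimal sketch of one-feature least squares using the textbook closed form (slope = covariance / variance). The helper fit_simple_regression is hypothetical and unregularized. GraphLab Create reports using Newton's method and may apply regularization by default, so its coefficients need not match this sketch. The toy data are the ten rows from the SFrame head printed earlier, not the full training set.

```python
def fit_simple_regression(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# sqft_living and price for the ten rows shown in the SFrame head
sqft = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]
price = [221900, 538000, 180000, 604000, 510000,
         1225000, 257500, 291850, 229500, 323000]

intercept, slope = fit_simple_regression(sqft, price)
print(intercept, slope)
```

With only ten rows the estimates are noisy. The point is the mechanics, not the numbers.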

Evaluate the sqft_model


In [45]:
print(test_data['price'].mean())


543054.042563

In [46]:
print(sqft_model.evaluate(test_data))


{'max_error': 4136748.2240386726, 'rmse': 255204.8366606025}
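
The two metrics returned by evaluate can be sketched in a few lines, assuming max_error is the largest absolute prediction error and rmse is the square root of the mean squared error. The helper regression_metrics and the toy numbers are hypothetical. The last line puts the reported RMSE in context: it is roughly 47% of the mean test price printed above.

```python
import math

def regression_metrics(actual, predicted):
    """Return the max absolute error and root-mean-squared error as a dict."""
    errors = [p - a for a, p in zip(actual, predicted)]
    max_error = max(abs(e) for e in errors)
    rmse = math.sqrt(sum(e * e for e in errors) / float(len(errors)))
    return {'max_error': max_error, 'rmse': rmse}

# Toy numbers, not taken from the housing data
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
metrics = regression_metrics(actual, predicted)
print(metrics)

# Reported test RMSE relative to the mean test price
print(255204.8366606025 / 543054.042563)  # about 0.47
```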

Let's show what our predictions look like


In [47]:
import matplotlib
matplotlib.use('TkAgg')


/Users/rpetit/anaconda/envs/dato-env/lib/python2.7/site-packages/matplotlib/__init__.py:1350: UserWarning:  This call to matplotlib.use() has no effect
because the backend has already been chosen;
matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

  warnings.warn(_use_error_msg)

In [48]:
import matplotlib.pyplot as plt
%matplotlib inline

In [49]:
plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')


Out[49]:
[<matplotlib.lines.Line2D at 0x1138fd350>,
 <matplotlib.lines.Line2D at 0x1138fd510>]

In [50]:
sqft_model.get('coefficients')


Out[50]:
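+-------------+-------+----------------+---------------+
| name        | index | value          | stderr        |
+-------------+-------+----------------+---------------+
| (intercept) | None  | -48419.1665962 | 5046.66645797 |
| sqft_living | None  | 282.777648388  | 2.21593883288 |
+-------------+-------+----------------+---------------+
[2 rows x 4 columns]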
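
The coefficients define a straight line, price = intercept + slope * sqft_living, so a prediction can be reproduced by hand. The sketch below plugs in the values printed above for a 2,400 sq ft house. The function predict_price is a hypothetical helper, not part of GraphLab Create.

```python
# Coefficients copied from the sqft_model output above
intercept = -48419.1665962
slope = 282.777648388  # dollars per additional square foot of living space

def predict_price(sqft_living):
    """Apply the fitted line: price = intercept + slope * sqft_living."""
    return intercept + slope * sqft_living

prediction = predict_price(2400)
print(prediction)  # about 630247.19
```

This is close to, but not identical to, the 630883.29 printed for house1 further down. The cell numbers suggest that prediction came from an earlier fit (In [31] precedes In [44]), and each create call holds out a random 5 percent validation set, so the coefficients can plausibly shift slightly between runs.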

Exploring other features in the data


In [51]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [52]:
sales[my_features].show()



In [53]:
sales.show(view='BoxWhisker Plot', x="zipcode", y="price")


Build a regression model with more features


In [54]:
my_features_model = graphlab.linear_regression.create(train_data, target='price', features=my_features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Linear regression:
--------------------------------------------------------
Number of examples          : 16475
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 0.032765     | 3750307.450505     | 2618061.462384       | 181906.834370 | 200286.297748   |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.
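
The log reports 115 coefficients for only 6 features. One plausible explanation, assuming string-typed columns such as zipcode are treated as categorical, is that each distinct category gets its own indicator coefficient. The sketch below shows plain one-hot encoding on zipcodes from the SFrame head. The helper one_hot is hypothetical, and GraphLab Create's exact encoding (for example, whether one category is dropped as a baseline) is not shown in this notebook.

```python
def one_hot(values):
    """Map each value to a 0/1 indicator row, one column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Zipcodes of the first five rows in the SFrame head, kept as strings
zipcodes = ['98178', '98125', '98028', '98136', '98074']
categories, encoded = one_hot(zipcodes)

print(categories)
print(encoded[0])       # indicator row for '98178'
print(len(categories))  # one categorical feature became this many columns
```

With many distinct zipcodes in the full dataset, a single categorical feature can account for most of the coefficient count.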


In [22]:
print(my_features)


['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [55]:
print(sqft_model.evaluate(test_data))
print(my_features_model.evaluate(test_data))


{'max_error': 4136748.2240386726, 'rmse': 255204.8366606025}
{'max_error': 3520902.667312363, 'rmse': 178078.83450264347}

Apply learned models to predict prices of 3 houses


In [25]:
house1 = sales[sales['id']=='5309101200']

In [28]:
house1


Out[28]:
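+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| 5309101200 | 2014-06-05 00:00:00+00:00 | 620000  | 4        | 2.25      | 2400        | 5350     | 1.5    | 0          |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 4         | 7     | 1460       | 940           | 1929     | 0            | 98117   | 47.67632376 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.37010126 | 1250.0        | 4880.0     |
+---------------+---------------+------------+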
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [30]:
print(house1['price'])


[620000, ... ]

In [31]:
print(sqft_model.predict(house1))


[630883.2900356471]

In [33]:
print(my_features_model.predict(house1))


[715148.2500020908]

Prediction for a second, fancier house


In [34]:
house2 = sales[sales['id'] == '1925069082']

In [35]:
house2


Out[35]:
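+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| 1925069082 | 2015-05-11 00:00:00+00:00 | 2200000 | 5        | 4.25      | 4640        | 22703    | 2      | 1          |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 4    | 5         | 8     | 2860       | 1780          | 1952     | 0            | 98052   | 47.63925783 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.09722322 | 3140.0        | 14200.0    |
+---------------+---------------+------------+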
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [36]:
print(house2['price'])
print(sqft_model.predict(house2))
print(my_features_model.predict(house2))


[2200000, ... ]
[1258737.0726558375]
[1425371.4541757465]
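
A lower test RMSE does not guarantee a better prediction on every individual house. The sketch below computes signed percentage errors from the numbers printed above: the single-feature model is closer on house1 (about 1.8% versus 15.3% too high), while the six-feature model is closer on house2 (about 35.2% versus 42.8% too low). The helper percent_error is hypothetical.

```python
def percent_error(actual, predicted):
    """Signed error as a percentage of the actual price."""
    return 100.0 * (predicted - actual) / actual

# (actual price, sqft_model prediction, my_features_model prediction),
# copied from the cell outputs above
houses = {
    'house1': (620000, 630883.2900356471, 715148.2500020908),
    'house2': (2200000, 1258737.0726558375, 1425371.4541757465),
}

for name in sorted(houses):
    actual, sqft_pred, multi_pred = houses[name]
    sqft_err = round(percent_error(actual, sqft_pred), 1)
    multi_err = round(percent_error(actual, multi_pred), 1)
    print(name + ': sqft_model ' + str(sqft_err) + '%, my_features_model ' + str(multi_err) + '%')
```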

An even fancier house!


In [37]:
bill_gates = {
    'bedrooms':[8], 
    'bathrooms':[25], 
    'sqft_living':[50000], 
    'sqft_lot':[225000],
    'floors':[4], 
    'zipcode':['98039'], 
    'condition':[10], 
    'grade':[10],
    'waterfront':[1],
    'view':[4],
    'sqft_above':[37500],
    'sqft_basement':[12500],
    'yr_built':[1994],
    'yr_renovated':[2010],
    'lat':[47.627606],
    'long':[-122.242054],
    'sqft_living15':[5000],
    'sqft_lot15':[40000]
}

In [40]:
print(sqft_model.predict(graphlab.SFrame(bill_gates)))
print(my_features_model.predict(graphlab.SFrame(bill_gates)))


[13972776.170714691]
[13656671.856460731]

Assignment


In [61]:
# Select houses with 2000 <= sqft_living <= 4000 and report their share of all sales
x = sales[sales['sqft_living'] >= 2000]
y = x[x['sqft_living'] <= 4000]
print(sales.num_rows())
print(y.num_rows())
print(float(y.num_rows()) / sales.num_rows())


21613
9221
0.426641373248
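
The same range filter can be written as a single inclusive condition. The sketch below does this in plain Python on the ten sqft_living values from the SFrame head, then rechecks the full-dataset fraction from the counts printed above. The helper fraction_in_range is hypothetical. In SFrame code, the two filters should also be expressible as one mask joined with &.

```python
def fraction_in_range(values, low, high):
    """Share of values v with low <= v <= high (both bounds inclusive)."""
    selected = [v for v in values if low <= v <= high]
    return float(len(selected)) / len(values)

# sqft_living for the ten rows in the SFrame head
sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]
print(fraction_in_range(sqft_living, 2000, 4000))  # 0.1, only 2570 qualifies

# The counts printed above give the fraction for the full dataset
print(9221 / 21613.0)  # about 0.4266
```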

In [63]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

In [65]:
advanced_features_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Linear regression:
--------------------------------------------------------
Number of examples          : 16562
Number of features          : 18
Number of unpacked features : 18
Number of coefficients    : 127
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 0.068752     | 3447326.535057     | 5134584.970871       | 152171.456959 | 237832.194110   |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.


In [66]:
print(my_features_model.evaluate(test_data))
print(advanced_features_model.evaluate(test_data))


{'max_error': 3520902.667312363, 'rmse': 178078.83450264347}
{'max_error': 3536799.0297642453, 'rmse': 156688.4071902741}
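
To summarize how much each added set of features helped, the sketch below compares the test RMSE values reported in this notebook. The helper relative_reduction is hypothetical. On these numbers, moving from one feature to six cuts RMSE by about 30%, and moving to the advanced feature list cuts it by a further 12% or so, even though max_error rises slightly for the advanced model.

```python
# Test-set RMSE values copied from the evaluate() outputs in this notebook
rmse = [
    ('sqft_model', 255204.8366606025),
    ('my_features_model', 178078.83450264347),
    ('advanced_features_model', 156688.4071902741),
]

def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

for (prev_name, prev), (name, cur) in zip(rmse, rmse[1:]):
    pct = round(100 * relative_reduction(prev, cur), 1)
    print(name + ' vs ' + prev_name + ': ' + str(pct) + '% lower RMSE')

# Absolute RMSE difference between the last two models
print(rmse[1][1] - rmse[2][1])  # about 21390.43
```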