Fire up GraphLab Create



In [1]:

    
import graphlab









    



A newer version of GraphLab Create (v1.9) is available! Your current version is v1.8.5.

You can use pip to upgrade the graphlab-create package. For more information see https://dato.com/products/create/upgrade.

Load some house sales data



In [2]:

    
sales = graphlab.SFrame('home_data.gl/')









    



2016-05-10 07:23:46,772 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.8.5 started. Logging: /tmp/graphlab_server_1462879425.log






    






In [42]:

    
sales









    Out[42]:





    
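+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| 7129300520 | 2014-10-13 00:00:00+00:00 | 221900  | 3        | 1         | 1180        | 5650     | 1      | 0          |
| 6414100192 | 2014-12-09 00:00:00+00:00 | 538000  | 3        | 2.25      | 2570        | 7242     | 2      | 0          |
| 5631500400 | 2015-02-25 00:00:00+00:00 | 180000  | 2        | 1         | 770         | 10000    | 1      | 0          |
| 2487200875 | 2014-12-09 00:00:00+00:00 | 604000  | 4        | 3         | 1960        | 5000     | 1      | 0          |
| 1954400510 | 2015-02-18 00:00:00+00:00 | 510000  | 3        | 2         | 1680        | 8080     | 1      | 0          |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000 | 4        | 4.5       | 5420        | 101930   | 1      | 0          |
| 1321400060 | 2014-06-27 00:00:00+00:00 | 257500  | 3        | 2.25      | 1715        | 6819     | 2      | 0          |
| 2008000270 | 2015-01-15 00:00:00+00:00 | 291850  | 3        | 1.5       | 1060        | 9711     | 1      | 0          |
| 2414600126 | 2015-04-15 00:00:00+00:00 | 229500  | 3        | 1         | 1780        | 7470     | 1      | 0          |
| 3793500160 | 2015-03-12 00:00:00+00:00 | 323000  | 3        | 2.5       | 1890        | 6560     | 2      | 0          |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+

+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 3         | 7     | 1180       | 0             | 1955     | 0            | 98178   | 47.51123398 |
| 0    | 3         | 7     | 2170       | 400           | 1951     | 1991         | 98125   | 47.72102274 |
| 0    | 3         | 6     | 770        | 0             | 1933     | 0            | 98028   | 47.73792661 |
| 0    | 5         | 7     | 1050       | 910           | 1965     | 0            | 98136   | 47.52082    |
| 0    | 3         | 8     | 1680       | 0             | 1987     | 0            | 98074   | 47.61681228 |
| 0    | 3         | 11    | 3890       | 1530          | 2001     | 0            | 98053   | 47.65611835 |
| 0    | 3         | 7     | 1715       | 0             | 1995     | 0            | 98003   | 47.30972002 |
| 0    | 3         | 7     | 1060       | 0             | 1963     | 0            | 98198   | 47.40949984 |
| 0    | 3         | 7     | 1050       | 730           | 1960     | 0            | 98146   | 47.51229381 |
| 0    | 3         | 7     | 1890       | 0             | 2003     | 0            | 98038   | 47.36840673 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+

+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.25677536 | 1340.0        | 5650.0     |
| -122.3188624  | 1690.0        | 7639.0     |
| -122.23319601 | 2720.0        | 8062.0     |
| -122.39318505 | 1360.0        | 5000.0     |
| -122.04490059 | 1800.0        | 7503.0     |
| -122.00528655 | 4760.0        | 101930.0   |
| -122.32704857 | 2238.0        | 6819.0     |
| -122.31457273 | 1650.0        | 9711.0     |
| -122.33659507 | 1780.0        | 8113.0     |
| -122.0308176  | 2390.0        | 7570.0     |
+---------------+---------------+------------+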
    

[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exploring the data for housing sales



In [4]:

    
graphlab.canvas.set_target('ipynb')
sales.show(view="Scatter Plot", x="sqft_living", y="price")

Create a simple regression model of sqft_living to price



In [43]:

    
seed = 0
train_data, test_data = sales.random_split(0.80, seed=seed)
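
To make the role of `seed` concrete, here is a minimal pure-Python sketch of a seeded 80/20 split. It is not GraphLab's implementation: the helper name `random_split_sketch` is hypothetical, and it assumes each row is sent to the first part independently with probability `fraction`, which is why the resulting sizes are only approximately 80/20.

```python
import random

def random_split_sketch(rows, fraction, seed=None):
    """Hypothetical helper: send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of a table
train_rows, test_rows = random_split_sketch(rows, 0.80, seed=0)
train_again, test_again = random_split_sketch(rows, 0.80, seed=0)

# The same seed yields the same partition on every run.
print("train=%d test=%d" % (len(train_rows), len(test_rows)))
print("reproducible: %s" % (train_rows == train_again))
```

Fixing the seed matters here because every model below is compared on the same `test_data`.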

Build the regression model



In [44]:

    
sqft_model = graphlab.linear_regression.create(train_data, target="price", features=['sqft_living'])









    



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Linear regression:
--------------------------------------------------------
Number of examples          : 16564
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 0.008041     | 4340948.503521     | 1880558.747720       | 263629.929331 | 248705.885911   |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.
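
The log above reports that GraphLab used Newton's method. For a single feature the least-squares line can also be written in closed form, and the sketch below uses that closed form instead, as an illustration of what is being fitted rather than a reproduction of GraphLab's solver. The function name `fit_line` is hypothetical, and the data are only the ten rows printed in the table head, so the numbers will not match the model trained on the full training set.

```python
def fit_line(xs, ys):
    """Hypothetical helper: closed-form least squares for y = intercept + slope * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) divided by variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# sqft_living and price from the ten rows shown in the table head.
sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]
price = [221900, 538000, 180000, 604000, 510000,
         1225000, 257500, 291850, 229500, 323000]

intercept, slope = fit_line(sqft_living, price)
print("intercept=%.2f slope=%.2f" % (intercept, slope))
```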

Evaluate the sqft_model



In [45]:

    
print(test_data['price'].mean())









    



543054.042563



In [46]:

    
print(sqft_model.evaluate(test_data))









    



{'max_error': 4136748.2240386726, 'rmse': 255204.8366606025}
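
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly defined: RMSE as the square root of the mean squared residual, and max error as the largest absolute residual. This assumes `evaluate` follows those standard definitions; the helper names are hypothetical and the predictions are made up purely for illustration.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error: sqrt of the mean squared residual."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

def max_error(actual, predicted):
    """Largest absolute residual."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

actual = [221900.0, 538000.0, 180000.0]     # first three prices in the sales table
predicted = [250000.0, 500000.0, 200000.0]  # made-up predictions, illustration only

print("rmse=%.2f max_error=%.2f" % (rmse(actual, predicted), max_error(actual, predicted)))
```

Both metrics are in dollars, so an RMSE of roughly 255,000 is large relative to the mean test price of about 543,000 printed above.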

Let's show what our predictions look like



In [47]:

    
import matplotlib
matplotlib.use('TkAgg')









    



/Users/rpetit/anaconda/envs/dato-env/lib/python2.7/site-packages/matplotlib/__init__.py:1350: UserWarning:  This call to matplotlib.use() has no effect
because the backend has already been chosen;
matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
or matplotlib.backends is imported for the first time.

  warnings.warn(_use_error_msg)



In [48]:

    
import matplotlib.pyplot as plt
%matplotlib inline



In [49]:

    
plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')









    Out[49]:





[<matplotlib.lines.Line2D at 0x1138fd350>,
 <matplotlib.lines.Line2D at 0x1138fd510>]



In [50]:

    
sqft_model.get('coefficients')









    Out[50]:





    
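+-------------+-------+----------------+---------------+
| name        | index | value          | stderr        |
+-------------+-------+----------------+---------------+
| (intercept) | None  | -48419.1665962 | 5046.66645797 |
| sqft_living | None  | 282.777648388  | 2.21593883288 |
+-------------+-------+----------------+---------------+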
    

[2 rows x 4 columns]
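
The two coefficients define a straight line, so a prediction can be reproduced by hand. The sketch below plugs the values from the coefficient table into `intercept + slope * sqft_living`; the function name `predict_price` and the 2,000 sq. ft. input are hypothetical choices for illustration.

```python
intercept = -48419.1665962  # "(intercept)" value from the coefficient table
slope = 282.777648388       # "sqft_living" value from the coefficient table

def predict_price(sqft_living):
    """Hypothetical helper: evaluate the fitted line at a given living area."""
    return intercept + slope * sqft_living

print("%.2f" % predict_price(2000))
```

Read this way, the slope says each additional square foot of living space adds about $283 to the predicted price.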

Exploring other features in the data



In [51]:

    
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']



In [52]:

    
sales[my_features].show()



In [53]:

    
sales.show(view='BoxWhisker Plot', x="zipcode", y="price")

Build a regression model with more features



In [54]:

    
my_features_model = graphlab.linear_regression.create(train_data, target='price', features=my_features)









    



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Linear regression:
--------------------------------------------------------
Number of examples          : 16475
Number of features          : 6
Number of unpacked features : 6
Number of coefficients    : 115
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 0.032765     | 3750307.450505     | 2618061.462384       | 181906.834370 | 200286.297748   |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.



In [22]:

    
print(my_features)









    



['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']



In [55]:

    
print(sqft_model.evaluate(test_data))
print(my_features_model.evaluate(test_data))









    



{'max_error': 4136748.2240386726, 'rmse': 255204.8366606025}
{'max_error': 3520902.667312363, 'rmse': 178078.83450264347}

Apply learned models to predict prices of 3 houses



In [25]:

    
house1 = sales[sales['id']=='5309101200']



In [28]:

    
house1









    Out[28]:





    
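+------------+---------------------------+--------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price  | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+--------+----------+-----------+-------------+----------+--------+------------+
| 5309101200 | 2014-06-05 00:00:00+00:00 | 620000 | 4        | 2.25      | 2400        | 5350     | 1.5    | 0          |
+------------+---------------------------+--------+----------+-----------+-------------+----------+--------+------------+

+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 4         | 7     | 1460       | 940           | 1929     | 0            | 98117   | 47.67632376 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+

+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.37010126 | 1250.0        | 4880.0     |
+---------------+---------------+------------+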
    

[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.



In [30]:

    
print(house1['price'])









    



[620000, ... ]



In [31]:

    
print(sqft_model.predict(house1))









    



[630883.2900356471]



In [33]:

    
print(my_features_model.predict(house1))









    



[715148.2500020908]

Prediction for a second, fancier house



In [34]:

    
house2 = sales[sales['id'] == '1925069082']



In [35]:

    
house2









    Out[35]:





    
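+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| 1925069082 | 2015-05-11 00:00:00+00:00 | 2200000 | 5        | 4.25      | 4640        | 22703    | 2      | 1          |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+

+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 4    | 5         | 8     | 2860       | 1780          | 1952     | 0            | 98052   | 47.63925783 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+

+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.09722322 | 3140.0        | 14200.0    |
+---------------+---------------+------------+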
    

[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.



In [36]:

    
print(house2['price'])
print(sqft_model.predict(house2))
print(my_features_model.predict(house2))









    



[2200000, ... ]
[1258737.0726558375]
[1425371.4541757465]

An even fancier house!



In [37]:

    
bill_gates = {
    'bedrooms':[8], 
    'bathrooms':[25], 
    'sqft_living':[50000], 
    'sqft_lot':[225000],
    'floors':[4], 
    'zipcode':['98039'], 
    'condition':[10], 
    'grade':[10],
    'waterfront':[1],
    'view':[4],
    'sqft_above':[37500],
    'sqft_basement':[12500],
    'yr_built':[1994],
    'yr_renovated':[2010],
    'lat':[47.627606],
    'long':[-122.242054],
    'sqft_living15':[5000],
    'sqft_lot15':[40000]
}



In [40]:

    
print(sqft_model.predict(graphlab.SFrame(bill_gates)))
print(my_features_model.predict(graphlab.SFrame(bill_gates)))









    



[13972776.170714691]
[13656671.856460731]

Assignment



In [61]:

    
x = sales[sales['sqft_living'] >= 2000]
y = x[x['sqft_living'] <= 4000]
print(sales.num_rows())
print(y.num_rows())
print(float(y.num_rows()) / sales.num_rows())









    



21613
9221
0.426641373248



In [63]:

    
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]



In [65]:

    
advanced_features_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features)









    



PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Linear regression:
--------------------------------------------------------
Number of examples          : 16562
Number of features          : 18
Number of unpacked features : 18
Number of coefficients    : 127
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
| 1         | 2        | 0.068752     | 3447326.535057     | 5134584.970871       | 152171.456959 | 237832.194110   |
+-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
SUCCESS: Optimal solution found.



In [66]:

    
print(my_features_model.evaluate(test_data))
print(advanced_features_model.evaluate(test_data))









    



{'max_error': 3520902.667312363, 'rmse': 178078.83450264347}
{'max_error': 3536799.0297642453, 'rmse': 156688.4071902741}
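
To put the two results side by side, the sketch below computes the absolute and relative drop in test RMSE from the values printed above. The variable names are illustrative only.

```python
# Test-set RMSE values copied from the evaluate() output above.
rmse_my_features = 178078.83450264347
rmse_advanced = 156688.4071902741

difference = rmse_my_features - rmse_advanced  # absolute drop in RMSE, in dollars
relative = difference / rmse_my_features       # drop as a fraction of the smaller model's RMSE

print("difference=%.2f relative=%.4f" % (difference, relative))
```

The RMSE falls with the extra features, while the `max_error` values printed above are slightly higher for the advanced model (3536799 versus 3520902), so the worst single prediction did not improve.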
