In [1]:
import graphlab

In [2]:
sales = graphlab.SFrame('home_data.gl/')


[INFO] This non-commercial license of GraphLab Create is assigned to akshay.narayan@u.nus.edu and will expire on September 26, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-11252 - Server binary: /usr/local/lib/python2.7/dist-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1443971608.log
[INFO] GraphLab Server Version: 1.6.1

In [3]:
sales


Out[3]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
7129300520 2014-10-13 00:00:00+00:00 221900 3 1 1180 5650 1 0
6414100192 2014-12-09 00:00:00+00:00 538000 3 2.25 2570 7242 2 0
5631500400 2015-02-25 00:00:00+00:00 180000 2 1 770 10000 1 0
2487200875 2014-12-09 00:00:00+00:00 604000 4 3 1960 5000 1 0
1954400510 2015-02-18 00:00:00+00:00 510000 3 2 1680 8080 1 0
7237550310 2014-05-12 00:00:00+00:00 1225000 4 4.5 5420 101930 1 0
1321400060 2014-06-27 00:00:00+00:00 257500 3 2.25 1715 6819 2 0
2008000270 2015-01-15 00:00:00+00:00 291850 3 1.5 1060 9711 1 0
2414600126 2015-04-15 00:00:00+00:00 229500 3 1 1780 7470 1 0
3793500160 2015-03-12 00:00:00+00:00 323000 3 2.5 1890 6560 2 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 7 1180 0 1955 0 98178 47.51123398
0 3 7 2170 400 1951 1991 98125 47.72102274
0 3 6 770 0 1933 0 98028 47.73792661
0 5 7 1050 910 1965 0 98136 47.52082
0 3 8 1680 0 1987 0 98074 47.61681228
0 3 11 3890 1530 2001 0 98053 47.65611835
0 3 7 1715 0 1995 0 98003 47.30972002
0 3 7 1060 0 1963 0 98198 47.40949984
0 3 7 1050 730 1960 0 98146 47.51229381
0 3 7 1890 0 2003 0 98038 47.36840673
long sqft_living15 sqft_lot15
-122.25677536 1340.0 5650.0
-122.3188624 1690.0 7639.0
-122.23319601 2720.0 8062.0
-122.39318505 1360.0 5000.0
-122.04490059 1800.0 7503.0
-122.00528655 4760.0 101930.0
-122.32704857 2238.0 6819.0
-122.31457273 1650.0 9711.0
-122.33659507 1780.0 8113.0
-122.0308176 2390.0 7570.0
[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [4]:
graphlab.canvas.set_target('ipynb')

In [5]:
sales.show(view="Scatter Plot", x="sqft_living", y="price")


Simple regression model (one feature)

Splitting data


In [7]:
trainData, testData = sales.random_split(0.8, seed=0)
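To make the idea behind a seeded 80/20 split concrete, here is a minimal sketch in plain Python. It is not GraphLab's implementation: the helper name `random_split` and the per-row coin flip are assumptions chosen for illustration, and the integer list stands in for the rows of `sales`.

```python
import random

def random_split(rows, fraction, seed):
    """Send each row to the first part with probability `fraction` (illustrative only)."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # hypothetical stand-in for the rows of `sales`
train_rows, test_rows = random_split(rows, 0.8, seed=0)
print(len(train_rows) + len(test_rows))  # every row lands in exactly one part
```

Because the seed is fixed, rerunning the split yields the same partition, which keeps later train/test comparisons reproducible.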

Build the regression model


In [8]:
sqftPredModel = graphlab.linear_regression.create(trainData, target='price', features=['sqft_living'])


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16585
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 1
PROGRESS: Number of coefficients    : 2
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 1.014210     | 4330351.047582     | 1512518.620458       | 264412.029609 | 230522.848880   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
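With a single feature, the model has two coefficients (an intercept and a slope), and ordinary least squares has a closed-form solution. The sketch below shows that formula in plain Python. It is an illustration under stated assumptions, not GraphLab's solver: the log above reports Newton's method, GraphLab may apply its own defaults such as regularization, and the five data points are just the first rows of the table printed earlier, so the resulting numbers are not expected to match the trained model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# First five (sqft_living, price) pairs from the table printed above.
sqft = [1180, 2570, 770, 1960, 1680]
price = [221900, 538000, 180000, 604000, 510000]
intercept, slope = fit_line(sqft, price)
print((intercept, slope))
```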

Evaluate the model just built


In [9]:
print testData['price'].mean()


543054.042563

In [10]:
print sqftPredModel.evaluate(testData)


{'max_error': 4128370.163665438, 'rmse': 255227.8405161093}
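The two numbers returned by `evaluate` are the root mean squared error and the largest absolute error on the test set. An RMSE of about 255,000 is roughly 47 percent of the mean test price printed above, so the single-feature model is quite coarse. As a minimal sketch of how these metrics are defined, the plain-Python functions below compute them for a handful of made-up prices (the values are hypothetical and are not taken from the dataset).

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))

def max_error(actual, predicted):
    """Largest absolute deviation between actual and predicted values."""
    return max(abs(a - p) for a, p in zip(actual, predicted))

actual = [200000.0, 300000.0, 500000.0]     # hypothetical sale prices
predicted = [210000.0, 280000.0, 530000.0]  # hypothetical predictions
print(rmse(actual, predicted))
print(max_error(actual, predicted))
```

RMSE penalizes large misses more heavily than small ones, while the maximum error reports only the single worst prediction.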

Make predictions


In [12]:
import matplotlib.pyplot as plt

In [13]:
%matplotlib inline

In [14]:
plt.plot(testData['sqft_living'], testData['price'], '.',
        testData['sqft_living'], sqftPredModel.predict(testData), '-')


Out[14]:
[<matplotlib.lines.Line2D at 0x7f5611369d90>,
 <matplotlib.lines.Line2D at 0x7f5611369f50>]

In [15]:
sqftPredModel.get('coefficients')


Out[15]:
name index value
(intercept) None -50203.0609198
sqft_living None 283.805146335
[2 rows x 3 columns]

In [16]:
features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [17]:
features


Out[17]:
['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [19]:
sales[features].show()



In [63]:
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')


Adding more features


In [22]:
features


Out[22]:
['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [23]:
multiFeaturesModel = graphlab.linear_regression.create(trainData, target='price', features=features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16467
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 6
PROGRESS: Number of coefficients    : 115
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.027317     | 3756173.274715     | 2286721.275873       | 181142.426564 | 204029.819222   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

In [24]:
print sqftPredModel.evaluate(testData)
print multiFeaturesModel.evaluate(testData)


{'max_error': 4128370.163665438, 'rmse': 255227.8405161093}
{'max_error': 3550424.7814458236, 'rmse': 185249.76057867485}

In [25]:
h1 = sales[sales['id']=='5309101200']

In [26]:
h1


Out[26]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
5309101200 2014-06-05 00:00:00+00:00 620000 4 2.25 2400 5350 1.5 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 4 7 1460 940 1929 0 98117 47.67632376
long sqft_living15 sqft_lot15
-122.37010126 1250.0 4880.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.


In [27]:
h1['price']


Out[27]:
dtype: int
Rows: ?
[620000, ... ]

In [28]:
sqftPredModel.predict(h1)


Out[28]:
dtype: float
Rows: 1
[630929.2902845264]
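The single-feature prediction can be reproduced by hand from the coefficients printed in Out[15]: price is the intercept plus the slope times `sqft_living`. The sketch below plugs in the 2400 square feet of house `h1`; the function name `predict_price` is introduced here only for illustration.

```python
intercept = -50203.0609198  # '(intercept)' row of Out[15]
slope = 283.805146335       # 'sqft_living' row of Out[15]

def predict_price(sqft_living):
    """Price predicted by the one-feature linear model."""
    return intercept + slope * sqft_living

h1_prediction = predict_price(2400)  # h1 has sqft_living = 2400
print(h1_prediction)
```

The result agrees with Out[28] up to the rounding of the printed coefficients, which confirms that the model is a straight line in `sqft_living`.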

In [29]:
multiFeaturesModel.predict(h1)


Out[29]:
dtype: float
Rows: 1
[717319.3399641677]

In [30]:
h2 = sales[sales['id']=='1925069082']

In [31]:
h2


Out[31]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
1925069082 2015-05-11 00:00:00+00:00 2200000 5 4.25 4640 22703 2 1
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
4 5 8 2860 1780 1952 0 98052 47.63925783
long sqft_living15 sqft_lot15
-122.09722322 3140.0 14200.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [34]:
print sqftPredModel.predict(h2)


[1266652.8180751912]

In [35]:
print multiFeaturesModel.predict(h2)


[1440779.009390991]

In [36]:
h3 = {'bedrooms':[8], 
              'bathrooms':[25], 
              'sqft_living':[50000], 
              'sqft_lot':[225000],
              'floors':[4], 
              'zipcode':['98039'], 
              'condition':[10], 
              'grade':[10],
              'waterfront':[1],
              'view':[4],
              'sqft_above':[37500],
              'sqft_basement':[12500],
              'yr_built':[1994],
              'yr_renovated':[2010],
              'lat':[47.627606],
              'long':[-122.242054],
              'sqft_living15':[5000],
              'sqft_lot15':[40000]}

In [37]:
h3


Out[37]:
{'bathrooms': [25],
 'bedrooms': [8],
 'condition': [10],
 'floors': [4],
 'grade': [10],
 'lat': [47.627606],
 'long': [-122.242054],
 'sqft_above': [37500],
 'sqft_basement': [12500],
 'sqft_living': [50000],
 'sqft_living15': [5000],
 'sqft_lot': [225000],
 'sqft_lot15': [40000],
 'view': [4],
 'waterfront': [1],
 'yr_built': [1994],
 'yr_renovated': [2010],
 'zipcode': ['98039']}

In [40]:
print sqftPredModel.predict(graphlab.SFrame(h3))


[14140054.255836155]

In [41]:
print multiFeaturesModel.predict(graphlab.SFrame(h3))


[13600653.542879807]

In [64]:
sales[sales['zipcode']=='98039']


Out[64]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
3625049014 2014-08-29 00:00:00+00:00 2950000 4 3.5 4860 23885 2 0
2540700110 2015-02-12 00:00:00+00:00 1905000 4 3.5 4210 18564 2 0
3262300940 2014-11-07 00:00:00+00:00 875000 3 1 1220 8119 1 0
3262300940 2015-02-10 00:00:00+00:00 940000 3 1 1220 8119 1 0
6447300265 2014-10-14 00:00:00+00:00 4000000 4 5.5 7080 16573 2 0
2470100110 2014-08-04 00:00:00+00:00 5570000 5 5.75 9200 35069 2 0
2210500019 2015-03-24 00:00:00+00:00 937500 3 1 1320 8500 1 0
6447300345 2015-04-06 00:00:00+00:00 1160000 4 3 2680 15438 2 0
6447300225 2014-11-06 00:00:00+00:00 1880000 3 2.75 2620 17919 1 0
2525049148 2014-10-07 00:00:00+00:00 3418800 5 5 5450 20412 2 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 12 4860 0 1996 0 98039 47.61717049
0 3 11 4210 0 2001 0 98039 47.62060082
0 4 7 1220 0 1955 0 98039 47.63281908
0 4 7 1220 0 1955 0 98039 47.63281908
0 3 12 5760 1320 2008 0 98039 47.61512031
0 3 13 6200 3000 2001 0 98039 47.62888314
0 4 7 1320 0 1954 0 98039 47.61872888
2 3 8 2680 0 1902 1956 98039 47.61089438
1 4 9 2620 0 1949 0 98039 47.61435052
0 3 11 5450 0 2014 0 98039 47.62087993
long sqft_living15 sqft_lot15
-122.23040939 3580.0 16054.0
-122.2245047 3520.0 18564.0
-122.23554392 1910.0 8119.0
-122.23554392 1910.0 8119.0
-122.22420058 3140.0 15996.0
-122.23346379 3560.0 24345.0
-122.22643371 2790.0 10800.0
-122.22582388 4480.0 14406.0
-122.22772057 3400.0 14400.0
-122.23726918 3160.0 17825.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [65]:
sales[sales['zipcode']=='98039']['price'].mean()


Out[65]:
2160606.5999999996

In [46]:
mediumHouses = sales[(sales['sqft_living']>2000) & (sales['sqft_living']<=4000)]

In [47]:
mediumHouses


Out[47]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront
6414100192 2014-12-09 00:00:00+00:00 538000 3 2.25 2570 7242 2 0
1736800520 2015-04-03 00:00:00+00:00 662500 3 2.5 3560 9796 1 0
9297300055 2015-01-24 00:00:00+00:00 650000 4 3 2950 5000 2 0
2524049179 2014-08-26 00:00:00+00:00 2000000 3 2.75 3050 44867 1 0
7137970340 2014-07-03 00:00:00+00:00 285000 5 2.5 2270 6300 2 0
3814700200 2014-11-20 00:00:00+00:00 329000 3 2.25 2450 6500 2 0
1794500383 2014-06-26 00:00:00+00:00 937000 3 1.75 2450 2691 2 0
1873100390 2015-03-02 00:00:00+00:00 719000 4 2.5 2570 7173 2 0
8562750320 2014-11-10 00:00:00+00:00 580500 3 2.5 2320 3980 2 0
0461000390 2014-06-24 00:00:00+00:00 687500 4 1.75 2330 5000 1.5 0
view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat
0 3 7 2170 400 1951 1991 98125 47.72102274
0 3 8 1860 1700 1965 0 98007 47.60065993
3 3 9 1980 970 1979 0 98126 47.57136955
4 3 9 2330 720 1968 0 98040 47.53164379
0 3 8 2270 0 1995 0 98092 47.32658071
0 4 8 2450 0 1985 0 98030 47.37386303
0 3 8 1750 700 1915 0 98119 47.63855772
0 3 8 2570 0 2005 0 98052 47.70732168
0 3 8 2320 0 2003 0 98027 47.5391103
0 4 7 1510 820 1929 0 98117 47.68228235
long sqft_living15 sqft_lot15
-122.3188624 1690.0 7639.0
-122.14529566 2210.0 8925.0
-122.37541218 2140.0 4000.0
-122.23345881 4110.0 20336.0
-122.16892624 2240.0 7005.0
-122.17228981 2200.0 6865.0
-122.35985573 1760.0 3573.0
-122.11029785 2630.0 6026.0
-122.06971484 2580.0 3980.0
-122.36760203 1460.0 5000.0
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.
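The filter in In [46] combines two element-wise comparisons with `&`. The parentheses are required because `&` binds more tightly than `>` and `<=` in Python. As a rough plain-Python analogue (lists instead of SArrays, which is an assumption made for illustration), the sketch below builds the same boolean mask over the ten `sqft_living` values shown in Out[3].

```python
# sqft_living for the first ten rows of `sales`, copied from Out[3].
sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]

# One boolean per row: True when 2000 < sqft_living <= 4000.
mask = [(s > 2000) and (s <= 4000) for s in sqft_living]

# Keep only the rows where the mask is True.
medium = [s for s, keep in zip(sqft_living, mask) if keep]
print(medium)
```

Of these ten rows only the 2570 square foot house passes, and it is also the first row of `mediumHouses` above.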

In [48]:
len(sales)


Out[48]:
21613

In [49]:
len(mediumHouses)


Out[49]:
9118

In [52]:
float(len(mediumHouses))/len(sales)


Out[52]:
0.42187572294452413
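The `float(...)` call in In [52] matters because this notebook runs on Python 2, where `/` between two integers performs floor division. The short sketch below uses `//` to show the floor result explicitly (it behaves the same way in Python 2 and 3) next to the true fraction computed from the two lengths printed above.

```python
n_medium = 9118  # len(mediumHouses), from Out[49]
n_total = 21613  # len(sales), from Out[48]

floor_fraction = n_medium // n_total  # floor division discards the fractional part
fraction = float(n_medium) / n_total  # true division after converting to float

print(floor_fraction)
print(fraction)
```

So roughly 42 percent of the houses have between 2000 and 4000 square feet of living space.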

In [66]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [68]:
multiFeaturesModel = graphlab.linear_regression.create(trainData, target='price', features=my_features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16536
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 6
PROGRESS: Number of coefficients    : 115
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.025383     | 3772851.963892     | 1224435.785248       | 183330.193868 | 153944.426205   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

In [69]:
print multiFeaturesModel.evaluate(testData)


{'max_error': 3451729.6347030415, 'rmse': 179535.52660627596}

In [56]:
advanced_features = [
    'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
    'condition',      # condition of house
    'grade',          # measure of quality of construction
    'waterfront',     # waterfront property
    'view',           # type of view
    'sqft_above',     # square feet above ground
    'sqft_basement',  # square feet in basement
    'yr_built',       # the year built
    'yr_renovated',   # the year renovated
    'lat', 'long',    # the lat-long of the parcel
    'sqft_living15',  # average sq.ft. of 15 nearest neighbors
    'sqft_lot15',     # average lot size of 15 nearest neighbors
]

In [57]:
advanced_features


Out[57]:
['bedrooms',
 'bathrooms',
 'sqft_living',
 'sqft_lot',
 'floors',
 'zipcode',
 'condition',
 'grade',
 'waterfront',
 'view',
 'sqft_above',
 'sqft_basement',
 'yr_built',
 'yr_renovated',
 'lat',
 'long',
 'sqft_living15',
 'sqft_lot15']

In [70]:
advancedFeaturesModel = graphlab.linear_regression.create(trainData, target='price', features=advanced_features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16508
PROGRESS: Number of features          : 18
PROGRESS: Number of unpacked features : 18
PROGRESS: Number of coefficients    : 127
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.038403     | 3496066.572213     | 2355613.507824       | 153316.267676 | 179620.913771   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

In [71]:
advancedFeaturesModel.evaluate(testData)


Out[71]:
{'max_error': 3574138.908747242, 'rmse': 156942.87201988485}

In [73]:
multiFeaturesModel.evaluate(testData)['rmse'] - advancedFeaturesModel.evaluate(testData)['rmse']


Out[73]:
22592.654586391116
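To put the difference in context, the sketch below recomputes it from the test RMSE values printed above and expresses it relative to the six-feature model. The variable names are introduced here for illustration only.

```python
rmse_my_features = 179535.52660627596  # multiFeaturesModel on testData, In [69]
rmse_advanced = 156942.87201988485     # advancedFeaturesModel on testData, Out[71]

difference = rmse_my_features - rmse_advanced
relative_reduction = difference / rmse_my_features  # fraction of the six-feature RMSE

print(difference)
print(relative_reduction)
```

The additional features lower the test RMSE by roughly an eighth, a noticeably smaller gain than the step from one feature to six.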

In [ ]: