Use GraphLab Create


In [2]:
import graphlab

Read in house sales data


In [3]:
sales = graphlab.SFrame('home_data.gl/')


[INFO] This non-commercial license of GraphLab Create is assigned to sandipto.neogi@gmail.com and will expire on October 11, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-19454 - Server binary: /Users/sandiptoneogi/anaconda/envs/dato/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1444747691.log
[INFO] GraphLab Server Version: 1.6.1

In [4]:
sales


Out[4]:
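+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| 7129300520 | 2014-10-13 00:00:00+00:00 | 221900  | 3        | 1         | 1180        | 5650     | 1      | 0          |
| 6414100192 | 2014-12-09 00:00:00+00:00 | 538000  | 3        | 2.25      | 2570        | 7242     | 2      | 0          |
| 5631500400 | 2015-02-25 00:00:00+00:00 | 180000  | 2        | 1         | 770         | 10000    | 1      | 0          |
| 2487200875 | 2014-12-09 00:00:00+00:00 | 604000  | 4        | 3         | 1960        | 5000     | 1      | 0          |
| 1954400510 | 2015-02-18 00:00:00+00:00 | 510000  | 3        | 2         | 1680        | 8080     | 1      | 0          |
| 7237550310 | 2014-05-12 00:00:00+00:00 | 1225000 | 4        | 4.5       | 5420        | 101930   | 1      | 0          |
| 1321400060 | 2014-06-27 00:00:00+00:00 | 257500  | 3        | 2.25      | 1715        | 6819     | 2      | 0          |
| 2008000270 | 2015-01-15 00:00:00+00:00 | 291850  | 3        | 1.5       | 1060        | 9711     | 1      | 0          |
| 2414600126 | 2015-04-15 00:00:00+00:00 | 229500  | 3        | 1         | 1780        | 7470     | 1      | 0          |
| 3793500160 | 2015-03-12 00:00:00+00:00 | 323000  | 3        | 2.5       | 1890        | 6560     | 2      | 0          |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 3         | 7     | 1180       | 0             | 1955     | 0            | 98178   | 47.51123398 |
| 0    | 3         | 7     | 2170       | 400           | 1951     | 1991         | 98125   | 47.72102274 |
| 0    | 3         | 6     | 770        | 0             | 1933     | 0            | 98028   | 47.73792661 |
| 0    | 5         | 7     | 1050       | 910           | 1965     | 0            | 98136   | 47.52082    |
| 0    | 3         | 8     | 1680       | 0             | 1987     | 0            | 98074   | 47.61681228 |
| 0    | 3         | 11    | 3890       | 1530          | 2001     | 0            | 98053   | 47.65611835 |
| 0    | 3         | 7     | 1715       | 0             | 1995     | 0            | 98003   | 47.30972002 |
| 0    | 3         | 7     | 1060       | 0             | 1963     | 0            | 98198   | 47.40949984 |
| 0    | 3         | 7     | 1050       | 730           | 1960     | 0            | 98146   | 47.51229381 |
| 0    | 3         | 7     | 1890       | 0             | 2003     | 0            | 98038   | 47.36840673 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.25677536 | 1340.0        | 5650.0     |
| -122.3188624  | 1690.0        | 7639.0     |
| -122.23319601 | 2720.0        | 8062.0     |
| -122.39318505 | 1360.0        | 5000.0     |
| -122.04490059 | 1800.0        | 7503.0     |
| -122.00528655 | 4760.0        | 101930.0   |
| -122.32704857 | 2238.0        | 6819.0     |
| -122.31457273 | 1650.0        | 9711.0     |
| -122.33659507 | 1780.0        | 8113.0     |
| -122.0308176  | 2390.0        | 7570.0     |
+---------------+---------------+------------+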
[21613 rows x 21 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exploratory analysis


In [5]:
graphlab.canvas.set_target('ipynb')
sales.show(view='Scatter Plot', x='sqft_living', y='price')


Create Linear Regression Model (Sq.Ft. - Price)


In [6]:
train_data, test_data = sales.random_split(0.8, seed=0)
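
For intuition, a seeded random split can be sketched in plain Python. This is an illustration of the idea only, not GraphLab Create's implementation: the helper name `random_split_rows` is hypothetical, and it assumes each row is sent to the first part independently with probability `fraction`, so the resulting sizes are approximately, not exactly, 80/20.

```python
import random

def random_split_rows(rows, fraction, seed=None):
    """Split rows into two lists; each row lands in the first list with probability `fraction`."""
    rng = random.Random(seed)  # private generator: the same seed reproduces the same split
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

rows = list(range(1000))  # stand-in for the rows of a table
train, test = random_split_rows(rows, 0.8, seed=0)
print(len(train) + len(test) == len(rows))  # every row ends up in exactly one part
```

Fixing the seed matters for comparing models later: both models are then trained and evaluated on the same rows.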

In [7]:
sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'])


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16531
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 1
PROGRESS: Number of coefficients    : 2
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 1.009459     | 4350129.081628     | 2293728.873714       | 262034.993068 | 279970.927753   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
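
With a single feature the model has two coefficients, an intercept and a slope, which matches "Number of coefficients : 2" in the log above. As a sketch of what is being estimated, ordinary least squares with one feature has a closed-form solution. The helper `fit_line` is hypothetical and the sample is just the first five rows printed in Out[4]. GraphLab Create reports using Newton's method and may apply a small regularization penalty by default, so this illustrates the objective rather than the library's solver.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return intercept, slope

# sqft_living and price for the first five rows shown in Out[4]
sqft = [1180, 2570, 770, 1960, 1680]
price = [221900, 538000, 180000, 604000, 510000]
intercept, slope = fit_line(sqft, price)
print("intercept = %.2f, slope = %.2f" % (intercept, slope))
```

Five rows are far too few to reproduce the coefficients of the full model; the point is only the shape of the computation.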

Evaluate Model


In [8]:
print test_data['price'].mean()


543054.042563

In [9]:
print sqft_model.evaluate(test_data)


{'max_error': 4144024.4258035296, 'rmse': 255189.49466209492}
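
The two reported metrics can be sketched as follows, assuming the usual definitions: `rmse` is the square root of the mean squared difference between predicted and actual values, and `max_error` is the largest absolute difference. The helper `evaluate_predictions` is hypothetical, and the inputs are toy values chosen for easy arithmetic, not taken from the dataset.

```python
import math

def evaluate_predictions(actual, predicted):
    """Return max_error and rmse for paired actual/predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / float(len(errors)))
    max_error = max(abs(e) for e in errors)
    return {'max_error': max_error, 'rmse': rmse}

# Toy values (not from the house data): errors are +10, -10 and +30
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
metrics = evaluate_predictions(actual, predicted)
print(metrics)
```

Because errors are squared before averaging, a few very large misses (such as the multi-million-dollar `max_error` above) pull the RMSE up noticeably.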

Show Predictions


In [10]:
import matplotlib.pyplot as plt
%matplotlib inline

In [11]:
plt.plot(test_data['sqft_living'], test_data['price'], '.',
         test_data['sqft_living'], sqft_model.predict(test_data), '-')


Out[11]:
[<matplotlib.lines.Line2D at 0x112acff50>,
 <matplotlib.lines.Line2D at 0x1116af150>]

In [12]:
sqft_model.get('coefficients')


Out[12]:
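+-------------+-------+----------------+
| name        | index | value          |
+-------------+-------+----------------+
| (intercept) | None  | -46975.7933464 |
| sqft_living | None  | 281.895992674  |
+-------------+-------+----------------+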
[2 rows x 3 columns]

Explore other features


In [13]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [14]:
sales[my_features].show()



In [15]:
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')


Create Linear Regression Model (my_features - Price)


In [16]:
my_features_model = graphlab.linear_regression.create(train_data, target='price', features=my_features)


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 16533
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 6
PROGRESS: Number of coefficients    : 114
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+
PROGRESS: | 1         | 2        | 0.036866     | 3758579.419276     | 5366092.210277       | 179896.279421 | 255344.687864   |
PROGRESS: +-----------+----------+--------------+--------------------+----------------------+---------------+-----------------+

In [17]:
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)


{'max_error': 4144024.4258035296, 'rmse': 255189.49466209492}
{'max_error': 3493708.2697989633, 'rmse': 179727.5233993494}

Predict prices


In [18]:
house1 = sales[sales['id']=='5309101200']

In [19]:
house1


Out[19]:
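+------------+---------------------------+--------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price  | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+--------+----------+-----------+-------------+----------+--------+------------+
| 5309101200 | 2014-06-05 00:00:00+00:00 | 620000 | 4        | 2.25      | 2400        | 5350     | 1.5    | 0          |
+------------+---------------------------+--------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 0    | 4         | 7     | 1460       | 940           | 1929     | 0            | 98117   | 47.67632376 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.37010126 | 1250.0        | 4880.0     |
+---------------+---------------+------------+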
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [20]:
print house1['price']


[620000, ... ]

In [21]:
print sqft_model.predict(house1)


[629574.5890704993]
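
For a one-feature linear model, the prediction is intercept + slope × sqft_living. As a check, the sketch below plugs the coefficients shown in Out[12] and this house's living area from Out[19] (2400 sq ft) into that formula. The variable names are illustrative, and because the displayed coefficients are rounded, the result should agree with the value printed above only up to a tiny rounding difference.

```python
# Coefficients copied from Out[12] (rounded as displayed there)
intercept = -46975.7933464
slope = 281.895992674  # predicted dollars per additional square foot of living area

sqft_living = 2400  # house 5309101200, from Out[19]
predicted_price = intercept + slope * sqft_living
print(predicted_price)
```

The same arithmetic with 4640 sq ft gives the `sqft_model` prediction for the second house examined later in the notebook.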

In [22]:
print my_features_model.predict(house1)


[722104.5002635692]

Second prediction


In [23]:
house2 = sales[sales['id']=='1925069082']

In [24]:
house2


Out[24]:
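+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| id         | date                      | price   | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
| 1925069082 | 2015-05-11 00:00:00+00:00 | 2200000 | 5        | 4.25      | 4640        | 22703    | 2      | 1          |
+------------+---------------------------+---------+----------+-----------+-------------+----------+--------+------------+
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat         |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
| 4    | 5         | 8     | 2860       | 1780          | 1952     | 0            | 98052   | 47.63925783 |
+------+-----------+-------+------------+---------------+----------+--------------+---------+-------------+
+---------------+---------------+------------+
| long          | sqft_living15 | sqft_lot15 |
+---------------+---------------+------------+
| -122.09722322 | 3140.0        | 14200.0    |
+---------------+---------------+------------+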
[? rows x 21 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

In [25]:
print house2['price']


[2200000, ... ]

In [26]:
print sqft_model.predict(house2)


[1261021.6126595747]

In [27]:
print my_features_model.predict(house2)


[1430814.1247004007]

Assignment


In [28]:
houses_98039 = sales[sales['zipcode']=='98039']

In [30]:
houses_98039['price'].mean()


Out[30]:
2160606.5999999996

In [76]:
houses_2000_4000 = sales[sales['sqft_living'] > 2000]
houses_2000_4000 = houses_2000_4000[houses_2000_4000['sqft_living'] <= 4000]

In [79]:
a = len(houses_2000_4000)
b = len(sales)
print a, b
print a/float(b)


9118 21613
0.421875722945
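
The two filtering steps amount to one combined condition, 2000 < sqft_living <= 4000. The sketch below applies that condition to the ten sqft_living values printed in Out[4], using a plain Python list rather than an SFrame; the variable names are illustrative.

```python
# sqft_living for the ten rows printed in Out[4]
sqft_living = [1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890]

# Keep values strictly above 2000 and at most 4000, matching the two filters above
in_range = [s for s in sqft_living if 2000 < s <= 4000]

# float() keeps the division from truncating to an integer under Python 2
fraction = len(in_range) / float(len(sqft_living))
print(fraction)
```

Only one of these ten rows (2570 sq ft) falls in the range, so this small sample is not representative of the full dataset's fraction.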

In [46]:
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']

In [47]:
advanced_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
'condition', # condition of house
'grade', # measure of quality of construction
'waterfront', # waterfront property
'view', # type of view
'sqft_above', # square feet above ground
'sqft_basement', # square feet in basement
'yr_built', # the year built
'yr_renovated', # the year renovated
'lat', 'long', # the lat-long of the parcel
'sqft_living15', # average sq.ft. of 15 nearest neighbors
'sqft_lot15', # average lot size of 15 nearest neighbors
]

In [63]:
train_data, test_data = sales.random_split(0.8, seed=0)

In [64]:
my_features_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 17384
PROGRESS: Number of features          : 6
PROGRESS: Number of unpacked features : 6
PROGRESS: Number of coefficients    : 115
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.036365     | 3763208.270523     | 181908.848367 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+

In [65]:
advanced_features_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features, validation_set=None)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 17384
PROGRESS: Number of features          : 18
PROGRESS: Number of unpacked features : 18
PROGRESS: Number of coefficients    : 127
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 0.051791     | 3469012.450686     | 154580.940736 |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+

In [66]:
print my_features_model.evaluate(test_data)
print advanced_features_model.evaluate(test_data)


{'max_error': 3486584.509381705, 'rmse': 179542.4333126903}
{'max_error': 3556849.413858208, 'rmse': 156831.1168021901}

In [67]:
print my_features_model.evaluate(test_data)['rmse'] - advanced_features_model.evaluate(test_data)['rmse']


22711.3165105
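
To put this difference in context, it can be expressed relative to the simpler model's RMSE. A small sketch using the two RMSE values printed by In [66]; the variable names are illustrative.

```python
# Test-set RMSE values copied from the output of In [66]
rmse_my_features = 179542.4333126903
rmse_advanced = 156831.1168021901

difference = rmse_my_features - rmse_advanced
relative_reduction = difference / rmse_my_features
print("RMSE falls by %.2f (%.1f%% of the simpler model's RMSE)"
      % (difference, 100 * relative_reduction))
```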
