In [1]:
import graphlab as gl

Load datasets


In [2]:
oulu_sales = gl.SFrame.read_csv('datasets/oulu_housing_postcode.csv', delimiter=';')


[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1495261568.log
Finished parsing file /Users/Jin/projects/capstoneProj/regression_FinPrices/datasets/oulu_housing_postcode.csv
Parsing completed. Parsed 100 lines in 0.073536 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[int,str,int,int,str,float,float,str,str,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
Unable to parse line "26981;Höyrymyllyntie 8 A 7, Toppilansalmi Oulu;3135;2018;http://www.etuovi.com/kohde/d79239?sc=M1042147289&pos=29312;178700;57;2h+kk+s;Kerrostalo;FI"
Unable to parse line "3729;Kraaselintie 3, Rajahauta Oulu;1273;1986;http://www.etuovi.com/kohde/2204747?sc=M1042147289&pos=3992;134900;106;4h,k,khh,kph,s,wc,2vh.;Paritalo;FI"
Unable to parse line "13715;Höyrymyllyntie 8 A 10, Toppilansalmi Oulu;3301;2018;http://www.etuovi.com/kohde/e27299?sc=M1042147289&pos=14863;275600;83.5;4h+kt+s;Kerrostalo;FI"
Unable to parse line "26992;Höyrymyllyntie 8 A 16, Toppilansalmi Oulu;3558;2018;http://www.etuovi.com/kohde/d79228?sc=M1042147289&pos=29324;204600;57.5;3h+kk+s;Kerrostalo;FI"
Unable to parse line "27175;Höyrymyllyntie 8 A 4, Toppilansalmi Oulu;3109;2018;http://www.etuovi.com/kohde/d78797?sc=M1042147289&pos=29545;259600;83.5;4h+k+s;Kerrostalo;FI"
Unable to parse line "4619;Fokkatie 2, Toppilansaari Oulu;2199;2005;http://www.etuovi.com/kohde/9844469?sc=M1042147289&pos=4920;155000;70.5;3h, k, s;Kerrostalo;FI"
Unable to parse line "4680;Kraaselintie 47, Rajahauta Oulu;1160;1975;http://www.etuovi.com/kohde/7730282?sc=M1042147289&pos=4993;138000;119;4h, k, kph, s;Omakotitalo;FI"
Unable to parse line "15674;Pitkänmöljäntie 29 B31, Toppilansaari Oulu;4041;2017;http://www.etuovi.com/kohde/c34668?sc=M1042147289&pos=16907;278800;69;3h+k+s;Kerrostalo;FI"
Unable to parse line "30843;Koitelinkoskentie 1157, Huttukylä Oulu;1768;2007;http://www.etuovi.com/kohde/d66966?sc=M1042147289&pos=33692;244000;138;5h+k+khh+kh/wc+s+erill.wc+et+tk+tekn.tila;Omakotitalo;FI"
Unable to parse line "16709;Höyrymyllyntie 8 A 13, Toppilansalmi Oulu;4150;2018;http://www.etuovi.com/kohde/e22898?sc=M1042147289&pos=17965;107900;26;1h+kk;Kerrostalo;FI"
Read 465 lines. Lines per second: 27256.7
16 lines failed to parse correctly
Finished parsing file /Users/Jin/projects/capstoneProj/regression_FinPrices/datasets/oulu_housing_postcode.csv
Parsing completed. Parsed 1381 lines in 0.017645 secs.
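
The failed lines quoted in the log all end in `FI`, while the type inferred for the last column (postcode) is `int`, so a type mismatch in that column looks like the likely cause of the parse failures. Below is a minimal sketch of that check using only the standard `csv` module. The two sample lines are copied from the log above, and `rows_with_non_numeric_postcode` is a hypothetical helper name, not part of GraphLab.

```python
import csv

# Two of the lines reported as "Unable to parse" in the log above.
failed_lines = [
    "3729;Kraaselintie 3, Rajahauta Oulu;1273;1986;http://www.etuovi.com/kohde/2204747?sc=M1042147289&pos=3992;134900;106;4h,k,khh,kph,s,wc,2vh.;Paritalo;FI",
    "4619;Fokkatie 2, Toppilansaari Oulu;2199;2005;http://www.etuovi.com/kohde/9844469?sc=M1042147289&pos=4920;155000;70.5;3h, k, s;Kerrostalo;FI",
]

def rows_with_non_numeric_postcode(lines, delimiter=';'):
    """Return parsed rows whose last field (postcode) is not all digits."""
    reader = csv.reader(lines, delimiter=delimiter)
    return [row for row in reader if not row[-1].strip().isdigit()]

bad_rows = rows_with_non_numeric_postcode(failed_lines)
for row in bad_rows:
    # Print the listing id and the offending postcode value.
    print("%s -> postcode field %r" % (row[0], row[-1]))
```

If those rows matter, the log's own suggestion applies: pass a corrected type list to `read_csv` through `column_type_hints` (for example `str` for the last column), at the cost of a string-typed postcode column.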

In [3]:
oulu_sales.head()


Out[3]:
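+----+-----------------------------------------------+-------------+-----------+
| id | address                                       | pricePerSqm | yearBuilt |
+----+-----------------------------------------------+-------------+-----------+
| 10 | Tepontie 5A, Kaakkuri Oulu ...                | 2246        | 2011      |
| 24 | Runkotie 5 B, Oulunsalo Oulu ...              | 1653        | 1999      |
| 42 | Höyrymyllyntie 8 A 23, Toppilansalmi Oulu ... | 3405        | 2018      |
| 45 | Kauppalinnankatu 1 J75, Linnanmaa Oulu ...    | 3258        | 2017      |
| 48 | Valjakkotie 1, Heikkilänkangas Oulu ...       | 274         | 2000      |
| 56 | Höyhenkuja 1, Kaakkuri Oulu ...               | 272         | 2001      |
| 66 | Koskitie 39 E 44, Tuira Oulu ...              | 4336        | 2016      |
| 68 | Asemakyläntie 135, Haukipudas Oulu ...        | 1845        | 2006      |
| 78 | Pakkahuoneenkatu 21, Keskusta Oulu ...        | 3293        | 2008      |
| 80 | Kontiotie 5-7, Välivainio Oulu ...            | 1309        | 1963      |
+----+-----------------------------------------------+-------------+-----------+
+----------------------------------------------------+-----------+------+
| sourceLink                                         | salePrice | size |
+----------------------------------------------------+-----------+------+
| http://www.etuovi.com/kohde/e46243?sc=M1042147 ... | 99962.0   | 44.5 |
| http://www.etuovi.com/kohde/7724940?sc=M104214 ... | 83500.0   | 50.5 |
| http://www.etuovi.com/kohde/e44988?sc=M1042147 ... | 194100.0  | 57.0 |
| http://www.etuovi.com/kohde/e44987?sc=M1042147 ... | 105900.0  | 32.5 |
| http://www.etuovi.com/kohde/e44977?sc=M1042147 ... | 15095.56  | 55.0 |
| http://www.etuovi.com/kohde/e44972?sc=M1042147 ... | 22836.99  | 84.0 |
| http://www.etuovi.com/kohde/e46226?sc=M1042147 ... | 424900.0  | 98.0 |
| http://www.etuovi.com/kohde/e44946?sc=M1042147 ... | 179000.0  | 97.0 |
| http://www.etuovi.com/kohde/1194825?sc=M104214 ... | 298000.0  | 90.5 |
| http://www.etuovi.com/kohde/7733206?sc=M104214 ... | 72000.0   | 55.0 |
+----------------------------------------------------+-----------+------+
+-------------------------------------------------+-------------+----------+
| roomInfo                                        | houseType   | postcode |
+-------------------------------------------------+-------------+----------+
| 2h + kk + sauna + p                             | Luhtitalo   | 90420    |
| 2h, k, s                                        | Rivitalo    | 90460    |
| 2h+ kk + s                                      | Kerrostalo  | 90520    |
| 1h+kk+alk+parveke                               | Kerrostalo  | 90570    |
| 2H+K+S                                          | Rivitalo    | 90310    |
| 4H+K+S                                          | Kerrostalo  | 90420    |
| 4h+k+s                                          | Kerrostalo  | 90500    |
| 2-3h, k, kph, s                                 | Omakotitalo | 90840    |
| 3h, k, rt, khh/vh, kph, s, wc, parv+varasto ... | Kerrostalo  | 90100    |
| 2h, k                                           | Kerrostalo  | 90530    |
+-------------------------------------------------+-------------+----------+
[10 rows x 10 columns]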

Chosen features = size, yearBuilt, postcode


In [4]:
feature1 = ['postcode']
feature2 = ['yearBuilt', 'postcode']
feature3 = ['size', 'yearBuilt', 'postcode']

Split the data into training and test sets (80/20, seed=0)


In [5]:
train_data, test_data = oulu_sales.random_split(0.8, seed=0)
print "all:", len(oulu_sales)
print "train:", len(train_data)
print "test:", len(test_data)


all: 1381
train: 1096
test: 285
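
The split is not exactly 80/20: 1096 of 1381 rows is about 79.4%. That is consistent with each row being assigned to the training set independently with probability 0.8, so the sizes are only approximately proportional. The sketch below illustrates that idea with the standard `random` module. It is a stand-in written under that assumption, not GraphLab's actual implementation, and `bernoulli_split` is a hypothetical name.

```python
import random

def bernoulli_split(rows, fraction, seed=0):
    """Send each row to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second

rows = list(range(1381))  # stand-in for the 1381 parsed rows
train, test = bernoulli_split(rows, 0.8, seed=0)

print("train: %d, test: %d" % (len(train), len(test)))
print("fraction observed in the notebook: %.3f" % (1096 / 1381.0))
```

The exact counts from this sketch will differ from 1096 and 285, because the random number generator is not the one GraphLab uses.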

In [6]:
size_model = gl.linear_regression.create(train_data, target='salePrice', features=['size'], validation_set=None)


Linear regression:
--------------------------------------------------------
Number of examples          : 1096
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 1.005559     | 557626.087568      | 85995.136537  |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

In [7]:
size_model.evaluate(test_data)



Out[7]:
{'max_error': 513109.5113596902, 'rmse': 90307.00349348203}

In [8]:
print test_data['pricePerSqm'].mean()


2302.49473684

In [9]:
size_model.get('coefficients')


Out[9]:
name index value stderr
(intercept) None 76219.9038385 4907.88977574
size None 1045.24028266 50.6888120576
[2 rows x 4 columns]
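
Read as an equation, these coefficients give salePrice ≈ 76219.90 + 1045.24 × size, so under this model each additional square metre adds roughly 1045 to the predicted price. A small sketch that applies the fitted line by hand follows. The coefficients are copied from Out[9]; the helper name `predict_from_size` is hypothetical.

```python
INTERCEPT = 76219.9038385   # "(intercept)" row of Out[9]
SLOPE = 1045.24028266       # "size" row of Out[9]

def predict_from_size(size_sqm):
    """Predicted sale price for a home of `size_sqm` square metres."""
    return INTERCEPT + SLOPE * size_sqm

# Sizes taken from the first three rows shown in Out[3].
for size in (44.5, 50.5, 57.0):
    print("%5.1f m2 -> %.2f" % (size, predict_from_size(size)))
```

The large intercept means that even a very small home is predicted at well above 76000, which is one reason a single-feature line fits the cheapest listings poorly.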

Visualise size_model on test data


In [10]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(test_data['size'], test_data['salePrice'],',',
        test_data['size'], size_model.predict(test_data), '-')
plt.title('Size vs Price')
# 1st series (pixel markers) = real prices on the test data
# 2nd series (solid line) = prices predicted by size_model on the test data


Out[10]:
<matplotlib.text.Text at 0x11c85e8d0>

In [11]:
gl.canvas.set_target('ipynb')
oulu_sales.show(view='BoxWhisker Plot', x='postcode', y='salePrice')


Graphlab Img on BoxWhisker Plot


In [12]:
oulu_sales.show(view='Scatter Plot', x='size', y='salePrice')


Graphlab img on Scatter Plot



In [13]:
f1_model = gl.linear_regression.create(train_data, target='salePrice', features=feature1, validation_set=None)
f2_model = gl.linear_regression.create(train_data, target='salePrice', features=feature2, validation_set=None)
f3_model = gl.linear_regression.create(train_data, target='salePrice', features=feature3, validation_set=None)


Linear regression:
--------------------------------------------------------
Number of examples          : 1096
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.001470     | 937989.344298      | 100435.942394 |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

Linear regression:
--------------------------------------------------------
Number of examples          : 1096
Number of features          : 2
Number of unpacked features : 2
Number of coefficients    : 3
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.001842     | 960668.627360      | 97145.231693  |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.

Linear regression:
--------------------------------------------------------
Number of examples          : 1096
Number of features          : 3
Number of unpacked features : 3
Number of coefficients    : 4
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.002450     | 597277.862578      | 71437.369798  |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.


In [14]:
print f1_model.evaluate(test_data)
print f2_model.evaluate(test_data)
print f3_model.evaluate(test_data)


{'max_error': 593855.0786543759, 'rmse': 111112.35792317627}
{'max_error': 613116.7512256526, 'rmse': 110500.79952024212}
{'max_error': 569085.9688430987, 'rmse': 76241.73102230058}
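
To make the comparison concrete, the test RMSE values can be set against the training RMSE reported while fitting. The sketch below only re-uses numbers printed above, rounded to two decimals; the dictionary layout and variable names are illustrative.

```python
# model name -> (training RMSE from the fitting log, test RMSE from evaluate())
rmse = {
    "size_model": (85995.14, 90307.00),
    "f1_model": (100435.94, 111112.36),
    "f2_model": (97145.23, 110500.80),
    "f3_model": (71437.37, 76241.73),
}

# Model with the lowest test RMSE, and f1_model as the reference point.
best = min(rmse, key=lambda name: rmse[name][1])
baseline_test = rmse["f1_model"][1]

for name in sorted(rmse, key=lambda n: rmse[n][1]):
    train, test = rmse[name]
    print("%-10s train=%10.2f test=%10.2f test/f1=%.3f"
          % (name, train, test, test / baseline_test))
print("lowest test RMSE: " + best)
```

Every model scores worse on the test set than on the training set. Adding yearBuilt to postcode (f2_model) barely changes the test RMSE, while adding size (f3_model) is what brings it down noticeably.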

In [15]:
f1_model.get('coefficients')


Out[15]:
name index value stderr
(intercept) None 2772776.89594 1041518.93232
postcode None -28.857811874 11.5124669825
[2 rows x 4 columns]


In [16]:
f2_model.get('coefficients')


Out[16]:
name index value stderr
(intercept) None -249376.689947 1109896.34746
yearBuilt None 1158.37160507 142.274577041
postcode None -21.0040259648 11.3184756917
[3 rows x 4 columns]


In [17]:
f3_model.get('coefficients')


Out[17]:
name index value stderr
(intercept) None 1050311.91483 829863.943798
size None 1305.22778967 44.9557289965
yearBuilt None 1826.13863253 107.04028059
postcode None -51.2849278443 8.60539978599
[4 rows x 4 columns]


In [18]:
oulu_sales['pricePerSqm'].mean()


Out[18]:
2284.7914554670533

In [19]:
test_data.tail()


Out[19]:
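+-------+----------------------------------------------+-------------+-----------+
| id    | address                                      | pricePerSqm | yearBuilt |
+-------+----------------------------------------------+-------------+-----------+
| 40288 | Aleksanterinkatu 71 A 48, Hollihaka Oulu ... | 4981        | 2016      |
| 40528 | Koskitie 16 B 37, Tuira Oulu ...             | 4763        | 2014      |
| 40870 | Höyrymyllyntie 18, Toppilansalmi Oulu ...    | 3608        | 2015      |
| 41026 | Vipukuja 9 ja 11, Ritaharju Oulu ...         | 2571        | 2016      |
| 41302 | Ratamotie 42, Pateniemi Oulu ...             | 1250        | 2016      |
| 41418 | Ratamotie 42, Pateniemi Oulu ...             | 1250        | 2016      |
| 42470 | Solistinkatu 1, Karjasilta Oulu ...          | 3219        | 2008      |
| 42514 | Kivirinne 14 A, Maikkula/Kontionkangas ...   | 1649        | 1987      |
| 42562 | Siilotie 23 B5, Toppilansalmi Oulu ...       | 3598        | 2014      |
| 42908 | Aleksanterinkatu 71 B, Hollihaka Oulu ...    | 2929        | 2016      |
+-------+----------------------------------------------+-------------+-----------+
+----------------------------------------------------+-----------+-------+
| sourceLink                                         | salePrice | size  |
+----------------------------------------------------+-----------+-------+
| http://www.etuovi.com/kohde/9540323?sc=M104214 ... | 129500.0  | 26.0  |
| http://www.etuovi.com/kohde/904235?sc=M1042147 ... | 755000.0  | 158.5 |
| http://www.etuovi.com/kohde/9482342?sc=M104214 ... | 142500.0  | 39.5  |
| http://www.etuovi.com/kohde/c34348?sc=M1042147 ... | 378000.0  | 147.0 |
| http://www.etuovi.com/kohde/c27848?sc=M1042147 ... | 57500.0   | 46.0  |
| http://www.etuovi.com/kohde/c26444?sc=M1042147 ... | 51250.0   | 41.0  |
| http://www.etuovi.com/kohde/1157063?sc=M104214 ... | 235000.0  | 73.0  |
| http://www.etuovi.com/kohde/b76677?sc=M1042147 ... | 249000.0  | 151.0 |
| http://www.etuovi.com/kohde/523566?sc=M1042147 ... | 147500.0  | 41.0  |
| http://www.etuovi.com/kohde/7638246?sc=M104214 ... | 205000.0  | 70.0  |
+----------------------------------------------------+-----------+-------+
+-----------------------------------------------------+-------------+----------+
| roomInfo                                            | houseType   | postcode |
+-----------------------------------------------------+-------------+----------+
| 1h+kk                                               | Kerrostalo  | 90120    |
| 5h+k+s                                              | Kerrostalo  | 90500    |
| 1h+kt+s                                             | Kerrostalo  | 90520    |
| 3 tai 4 mh + oh,k,rh + khh + WC, S + WC, SU, SA ... | Omakotitalo | 90540    |
| Talliosake                                          | Rivitalo    | 90800    |
| Talliosake                                          | Rivitalo    | 90800    |
| 3h, kk, s                                           | Kerrostalo  | 90140    |
| 4mh, oh, k, khh, s                                  | Omakotitalo | 90240    |
| 1h,k,s,parvi                                        | Kerrostalo  | 90520    |
| 3h+kk+vh+erillinen wc+kph+s ...                     | Kerrostalo  | 90120    |
+-----------------------------------------------------+-------------+----------+
[10 rows x 10 columns]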

Predict using f3_model (lowest RMSE on the test data)


In [20]:
house1 = oulu_sales[oulu_sales['id']==156]
house2 = oulu_sales[oulu_sales['id']==1123]
house3 = oulu_sales[oulu_sales['id']==2091]
house4 = oulu_sales[oulu_sales['id']==42908]
house5 = oulu_sales[oulu_sales['id']==40870]

In [21]:
print house1
print house2
print house3
print house4
print house5


+-----+-------------------------------+-------------+-----------+
|  id |            address            | pricePerSqm | yearBuilt |
+-----+-------------------------------+-------------+-----------+
| 156 | Purjehtijantie 8 A 13, Kos... |     1198    |    1976   |
+-----+-------------------------------+-------------+-----------+
+-------------------------------+-----------+------+-----------------------+
|           sourceLink          | salePrice | size |        roomInfo       |
+-------------------------------+-----------+------+-----------------------+
| http://www.etuovi.com/kohd... |  74900.0  | 62.5 | 2h+k+kph+aula+parveke |
+-------------------------------+-----------+------+-----------------------+
+------------+----------+
| houseType  | postcode |
+------------+----------+
| Kerrostalo |  90560   |
+------------+----------+
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
+------+---------------------------+-------------+-----------+
|  id  |          address          | pricePerSqm | yearBuilt |
+------+---------------------------+-------------+-----------+
| 1123 | Ratamopolku 6, Jääli Oulu |     2238    |    2017   |
+------+---------------------------+-------------+-----------+
+-------------------------------+-----------+-------+
|           sourceLink          | salePrice |  size |
+-------------------------------+-----------+-------+
| http://www.etuovi.com/kohd... |  235000.0 | 105.0 |
+-------------------------------+-----------+-------+
+-------------------------------+-------------+----------+
|            roomInfo           |  houseType  | postcode |
+-------------------------------+-------------+----------+
| 3 mh + oh+k+ruok+khh+ph+s+... | Omakotitalo |  90940   |
+-------------------------------+-------------+----------+
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
+------+-----------------------------+-------------+-----------+
|  id  |           address           | pricePerSqm | yearBuilt |
+------+-----------------------------+-------------+-----------+
| 2091 | Nummikatu 42, Keskusta Oulu |     2529    |    1923   |
+------+-----------------------------+-------------+-----------+
+-------------------------------+-----------+------+
|           sourceLink          | salePrice | size |
+-------------------------------+-----------+------+
| http://www.etuovi.com/kohd... |  129000.0 | 51.0 |
+-------------------------------+-----------+------+
+-------------------------------+------------+----------+
|            roomInfo           | houseType  | postcode |
+-------------------------------+------------+----------+
| tupakeittiö, makuuhuone, k... | Kerrostalo |  90100   |
+-------------------------------+------------+----------+
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
+-------+-------------------------------+-------------+-----------+
|   id  |            address            | pricePerSqm | yearBuilt |
+-------+-------------------------------+-------------+-----------+
| 42908 | Aleksanterinkatu 71 B, Hol... |     2929    |    2016   |
+-------+-------------------------------+-------------+-----------+
+-------------------------------+-----------+------+-----------------------------+
|           sourceLink          | salePrice | size |           roomInfo          |
+-------------------------------+-----------+------+-----------------------------+
| http://www.etuovi.com/kohd... |  205000.0 | 70.0 | 3h+kk+vh+erillinen wc+kph+s |
+-------------------------------+-----------+------+-----------------------------+
+------------+----------+
| houseType  | postcode |
+------------+----------+
| Kerrostalo |  90120   |
+------------+----------+
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
+-------+-------------------------------+-------------+-----------+
|   id  |            address            | pricePerSqm | yearBuilt |
+-------+-------------------------------+-------------+-----------+
| 40870 | Höyrymyllyntie 18, Toppila... |     3608    |    2015   |
+-------+-------------------------------+-------------+-----------+
+-------------------------------+-----------+------+----------+------------+
|           sourceLink          | salePrice | size | roomInfo | houseType  |
+-------------------------------+-----------+------+----------+------------+
| http://www.etuovi.com/kohd... |  142500.0 | 39.5 | 1h+kt+s  | Kerrostalo |
+-------------------------------+-----------+------+----------+------------+
+----------+
| postcode |
+----------+
|  90520   |
+----------+
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.

In [22]:
print "Prediction:", f3_model.predict(house1)
print "Real Price:", house1['salePrice']
print f3_model.evaluate(house1)
print "--------------------"

print "Prediction:", f3_model.predict(house2)
print "Real Price:",house2['salePrice']
print f3_model.evaluate(house2)
print "--------------------"

print "Prediction:", f3_model.predict(house3)
print "Real Price:",house3['salePrice']
print f3_model.evaluate(house3)
print "--------------------"

print "Prediction:", f3_model.predict(house4)
print "Real Price:",house4['salePrice']
print f3_model.evaluate(house4)
print "--------------------"

print "Prediction:", f3_model.predict(house5)
print "Real Price:",house5['salePrice']
print f3_model.evaluate(house5)


Prediction: [95975.52398547437]
Real Price: [74900.0]
{'max_error': 21075.52398547437, 'rmse': 21075.52398547437}
--------------------
Prediction: [206831.11639939062]
Real Price: [235000.0]
{'max_error': 28168.883600609377, 'rmse': 28168.883600609377}
--------------------
Prediction: [7771.123688472435]
Real Price: [129000.0]
{'max_error': 121228.87631152757, 'rmse': 121228.87631152757}
--------------------
Prediction: [201375.64596075192]
Real Price: [205000.0]
{'max_error': 3624.354039248079, 'rmse': 3624.354039248079}
--------------------
Prediction: [139226.08860558644]
Real Price: [142500.0]
{'max_error': 3273.9113944135606, 'rmse': 3273.9113944135606}
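
These predictions can be reproduced by hand from the f3_model coefficients in Out[17]. Because each evaluate call scores a single row, rmse and max_error both reduce to the absolute error of that row, which is why the two values are identical in every dictionary above. The sketch below is plain Python with values copied from the printed outputs; `predict_f3` and `rmse` are hypothetical helper names. The results agree with the printed predictions up to a small difference, presumably because the coefficients are displayed with limited precision.

```python
import math

# Coefficients copied from Out[17].
COEF = {"intercept": 1050311.91483, "size": 1305.22778967,
        "yearBuilt": 1826.13863253, "postcode": -51.2849278443}

# id -> (size, yearBuilt, postcode, salePrice), copied from the printed rows.
houses = {
    156: (62.5, 1976, 90560, 74900.0),
    1123: (105.0, 2017, 90940, 235000.0),
    2091: (51.0, 1923, 90100, 129000.0),
    42908: (70.0, 2016, 90120, 205000.0),
    40870: (39.5, 2015, 90520, 142500.0),
}

def predict_f3(size, year_built, postcode):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return (COEF["intercept"] + COEF["size"] * size
            + COEF["yearBuilt"] * year_built + COEF["postcode"] * postcode)

def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

errors = []
for house_id, (size, year, postcode, price) in sorted(houses.items()):
    pred = predict_f3(size, year, postcode)
    errors.append(pred - price)
    print("id %5d: predicted %.2f, real %.2f" % (house_id, pred, price))

# Pooling the five rows gives one score per metric instead of five.
print("rmse over these rows: %.2f" % rmse(errors))
print("max_error over these rows: %.2f" % max(abs(e) for e in errors))
```

The outlier is id 2091, built in 1923: the positive yearBuilt coefficient pushes the prediction for an old building in the city centre far below its real price.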

In [23]:
year_model = gl.linear_regression.create(train_data, target='salePrice', features=['yearBuilt'], validation_set=None)


Linear regression:
--------------------------------------------------------
Number of examples          : 1096
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
--------------------------------------------------------
+-----------+----------+--------------+--------------------+---------------+
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
+-----------+----------+--------------+--------------------+---------------+
| 1         | 2        | 0.001001     | 961479.349424      | 97617.608732  |
+-----------+----------+--------------+--------------------+---------------+
SUCCESS: Optimal solution found.


In [24]:
year_model.evaluate(test_data)


Out[24]:
{'max_error': 614380.7340733628, 'rmse': 111439.54540016587}

In [25]:
year_model.get('coefficients')


Out[25]:
name index value stderr
(intercept) None -2235582.77069 280699.709377
yearBuilt None 1201.46934274 140.652276036
[2 rows x 4 columns]


In [26]:
plt.plot(test_data['yearBuilt'], test_data['salePrice'], '.',
        test_data['yearBuilt'], year_model.predict(test_data), '-')


Out[26]:
[<matplotlib.lines.Line2D at 0x11cbec490>,
 <matplotlib.lines.Line2D at 0x11cbec590>]