In [1]:
import graphlab as gl

Load datasets

In [2]:
oulu_sales = gl.SFrame.read_csv('datasets/oulu_housing_postcode.csv', delimiter=';')

This non-commercial license of GraphLab Create for academic use is assigned to and will expire on November 08, 2017.
[INFO] graphlab.cython.cy_server: GraphLab Create v2.1 started. Logging: /tmp/graphlab_server_1495261568.log
Finished parsing file /Users/Jin/projects/capstoneProj/regression_FinPrices/datasets/oulu_housing_postcode.csv
Parsing completed. Parsed 100 lines in 0.073536 secs.
Inferred types from first 100 line(s) of file as 
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
Unable to parse line "26981;Höyrymyllyntie 8 A 7, Toppilansalmi Oulu;3135;2018;;178700;57;2h+kk+s;Kerrostalo;FI"
Unable to parse line "3729;Kraaselintie 3, Rajahauta Oulu;1273;1986;;134900;106;4h,k,khh,kph,s,wc,2vh.;Paritalo;FI"
Unable to parse line "13715;Höyrymyllyntie 8 A 10, Toppilansalmi Oulu;3301;2018;;275600;83.5;4h+kt+s;Kerrostalo;FI"
Unable to parse line "26992;Höyrymyllyntie 8 A 16, Toppilansalmi Oulu;3558;2018;;204600;57.5;3h+kk+s;Kerrostalo;FI"
Unable to parse line "27175;Höyrymyllyntie 8 A 4, Toppilansalmi Oulu;3109;2018;;259600;83.5;4h+k+s;Kerrostalo;FI"
Unable to parse line "4619;Fokkatie 2, Toppilansaari Oulu;2199;2005;;155000;70.5;3h, k, s;Kerrostalo;FI"
Unable to parse line "4680;Kraaselintie 47, Rajahauta Oulu;1160;1975;;138000;119;4h, k, kph, s;Omakotitalo;FI"
Unable to parse line "15674;Pitkänmöljäntie 29 B31, Toppilansaari Oulu;4041;2017;;278800;69;3h+k+s;Kerrostalo;FI"
Unable to parse line "30843;Koitelinkoskentie 1157, Huttukylä Oulu;1768;2007;;244000;138;5h+k+khh+kh/wc+s+erill.wc+et+tk+tekn.tila;Omakotitalo;FI"
Unable to parse line "16709;Höyrymyllyntie 8 A 13, Toppilansalmi Oulu;4150;2018;;107900;26;1h+kk;Kerrostalo;FI"
Read 465 lines. Lines per second: 27256.7
16 lines failed to parse correctly
Finished parsing file /Users/Jin/projects/capstoneProj/regression_FinPrices/datasets/oulu_housing_postcode.csv
Parsing completed. Parsed 1381 lines in 0.017645 secs.
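Every rejected line shown in the log above ends in "FI" in the position where the column list of this notebook has postcode. One plausible cause, assumed here rather than confirmed by the log, is that postcode was inferred as an integer column and these rows carry a country code instead. The sketch below uses only the standard library; the helper names split_record and is_integer are made up for illustration.

```python
# Column order as it appears in the SFrame previews of this notebook.
COLUMNS = ['id', 'address', 'pricePerSqm', 'yearBuilt', 'sourceLink',
           'salePrice', 'size', 'roomInfo', 'houseType', 'postcode']

# One of the rejected lines from the log above (an ASCII-only one).
bad_line = ("4619;Fokkatie 2, Toppilansaari Oulu;2199;2005;;155000;70.5;"
            "3h, k, s;Kerrostalo;FI")

def split_record(line, delimiter=';'):
    """Hypothetical helper: pair each field with its column name."""
    return dict(zip(COLUMNS, line.split(delimiter)))

def is_integer(text):
    """Return True if the text parses as an int."""
    try:
        int(text)
        return True
    except ValueError:
        return False

record = split_record(bad_line)
print(record['postcode'])              # FI
print(is_integer(record['postcode']))  # False
```

If that assumption holds, such rows either need a cleaned postcode or have to be dropped before the column can be used as a numeric feature.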

In [3]:
oulu_sales.head()

| id | address | pricePerSqm | yearBuilt | sourceLink | salePrice | size | roomInfo | houseType | postcode |
| 10 | Tepontie 5A, Kaakkuri Oulu ... | 2246 | 2011 | de/e46243?sc=M1042147 ... | 99962.0 | 44.5 | 2h + kk + sauna + p | Luhtitalo | 90420 |
| 24 | Runkotie 5 B, Oulunsalo Oulu ... | 1653 | 1999 | de/7724940?sc=M104214 ... | 83500.0 | 50.5 | 2h, k, s | Rivitalo | 90460 |
| 42 | Höyrymyllyntie 8 A 23, Toppilansalmi Oulu ... | 3405 | 2018 | de/e44988?sc=M1042147 ... | 194100.0 | 57.0 | 2h+ kk + s | Kerrostalo | 90520 |
| 45 | Kauppalinnankatu 1 J75, Linnanmaa Oulu ... | 3258 | 2017 | de/e44987?sc=M1042147 ... | 105900.0 | 32.5 | 1h+kk+alk+parveke | Kerrostalo | 90570 |
| 48 | Valjakkotie 1, Heikkilänkangas Oulu ... | 274 | 2000 | de/e44977?sc=M1042147 ... | 15095.56 | 55.0 | 2H+K+S | Rivitalo | 90310 |
| 56 | Höyhenkuja 1, Kaakkuri Oulu ... | 272 | 2001 | de/e44972?sc=M1042147 ... | 22836.99 | 84.0 | 4H+K+S | Kerrostalo | 90420 |
| 66 | Koskitie 39 E 44, Tuira Oulu ... | 4336 | 2016 | de/e46226?sc=M1042147 ... | 424900.0 | 98.0 | 4h+k+s | Kerrostalo | 90500 |
| 68 | Asemakyläntie 135, Haukipudas Oulu ... | 1845 | 2006 | de/e44946?sc=M1042147 ... | 179000.0 | 97.0 | 2-3h, k, kph, s | Omakotitalo | 90840 |
| 78 | Pakkahuoneenkatu 21, Keskusta Oulu ... | 3293 | 2008 | de/1194825?sc=M104214 ... | 298000.0 | 90.5 | 3h, k, rt, khh/vh, kph, s, wc, parv+varasto ... | Kerrostalo | 90100 |
| 80 | Kontiotie 5-7, Välivainio Oulu ... | 1309 | 1963 | de/7733206?sc=M104214 ... | 72000.0 | 55.0 | 2h, k | Kerrostalo | 90530 |
[10 rows x 10 columns]

Chosen features = size, yearBuilt, postcode

In [4]:
feature1 = ['postcode']
feature2 = ['yearBuilt', 'postcode']
feature3 = ['size', 'yearBuilt', 'postcode']

Split data -> train, test

In [5]:
train_data, test_data = oulu_sales.random_split(0.8, seed=0)
print "all:", len(oulu_sales)
print "train:", len(train_data)
print "test:", len(test_data)

all: 1381
train: 1096
test: 285
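The split above is 1096 / 285 rather than an exact 80 / 20, which is what one would expect if each row is assigned at random. The sketch below illustrates that idea in plain Python, assuming an independent per-row draw; it is not GraphLab's implementation, and the function name seeded_split is made up for illustration.

```python
import random

def seeded_split(rows, fraction, seed=0):
    """Hypothetical helper: send each row to the first list with
    probability `fraction`, using a seeded generator for repeatability."""
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        if rng.random() < fraction:
            first.append(row)
        else:
            second.append(row)
    return first, second

# Stand-in row ids, same count as oulu_sales after parsing.
rows = list(range(1381))
train_rows, test_rows = seeded_split(rows, 0.8, seed=0)

# The two parts always cover every row exactly once.
print(len(train_rows) + len(test_rows))  # 1381
```

Fixing the seed makes the split reproducible, so the models trained later in the notebook always see the same training rows.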

In [6]:
size_model = gl.linear_regression.create(train_data, target='salePrice', features=['size'], validation_set=None)

Linear regression:
Number of examples          : 1096
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
| 1         | 2        | 1.005559     | 557626.087568      | 85995.136537  |
SUCCESS: Optimal solution found.

In [7]:
print size_model.evaluate(test_data)

{'max_error': 513109.5113596902, 'rmse': 90307.00349348203}

In [8]:
print test_data['pricePerSqm'].mean()


In [9]:
size_model.get('coefficients')

name index value stderr
(intercept) None 76219.9038385 4907.88977574
size None 1045.24028266 50.6888120576
[2 rows x 4 columns]

Visualise size_model on test data

In [10]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(test_data['size'], test_data['salePrice'],',',
        test_data['size'], size_model.predict(test_data), '-')
plt.title('Size vs Price')
# 1st line = Real prices on test data
# 2nd line = Predicted prices from Model on test data

<matplotlib.text.Text at 0x11c85e8d0>

In [11]:
gl.canvas.set_target('ipynb')
# The SFrame name was lost in the export; oulu_sales is assumed here.
oulu_sales.show(view='BoxWhisker Plot', x='postcode', y='salePrice')

Graphlab Img on BoxWhisker Plot

In [12]:
# The SFrame name was lost in the export; oulu_sales is assumed here.
oulu_sales.show(view='Scatter Plot', x='size', y='salePrice')

Graphlab img on Scatter Plot

In [13]:
f1_model = gl.linear_regression.create(train_data, target='salePrice', features=feature1, validation_set=None)
f2_model = gl.linear_regression.create(train_data, target='salePrice', features=feature2, validation_set=None)
f3_model = gl.linear_regression.create(train_data, target='salePrice', features=feature3, validation_set=None)

Linear regression:
Number of examples          : 1096
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
| 1         | 2        | 0.001470     | 937989.344298      | 100435.942394 |
SUCCESS: Optimal solution found.

Linear regression:
Number of examples          : 1096
Number of features          : 2
Number of unpacked features : 2
Number of coefficients    : 3
Starting Newton Method
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
| 1         | 2        | 0.001842     | 960668.627360      | 97145.231693  |
SUCCESS: Optimal solution found.

Linear regression:
Number of examples          : 1096
Number of features          : 3
Number of unpacked features : 3
Number of coefficients    : 4
Starting Newton Method
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
| 1         | 2        | 0.002450     | 597277.862578      | 71437.369798  |
SUCCESS: Optimal solution found.

In [14]:
print f1_model.evaluate(test_data)
print f2_model.evaluate(test_data)
print f3_model.evaluate(test_data)

{'max_error': 593855.0786543759, 'rmse': 111112.35792317627}
{'max_error': 613116.7512256526, 'rmse': 110500.79952024212}
{'max_error': 569085.9688430987, 'rmse': 76241.73102230058}
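The three test-set results above can also be compared programmatically instead of by eye. This is a minimal sketch that hard-codes the metrics printed in the output; the dictionary name results is made up for illustration.

```python
# Test-set metrics copied from the evaluate() output above.
results = {
    'f1_model': {'max_error': 593855.0786543759, 'rmse': 111112.35792317627},
    'f2_model': {'max_error': 613116.7512256526, 'rmse': 110500.79952024212},
    'f3_model': {'max_error': 569085.9688430987, 'rmse': 76241.73102230058},
}

# Pick the model whose test RMSE is smallest.
best = min(results, key=lambda name: results[name]['rmse'])
print(best)  # f3_model
```

Adding yearBuilt to postcode barely changes the test RMSE, while adding size lowers it by roughly a third.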

In [15]:
f1_model.get('coefficients')

name index value stderr
(intercept) None 2772776.89594 1041518.93232
postcode None -28.857811874 11.5124669825
[2 rows x 4 columns]

In [16]:
f2_model.get('coefficients')

name index value stderr
(intercept) None -249376.689947 1109896.34746
yearBuilt None 1158.37160507 142.274577041
postcode None -21.0040259648 11.3184756917
[3 rows x 4 columns]

In [17]:
f3_model.get('coefficients')

name index value stderr
(intercept) None 1050311.91483 829863.943798
size None 1305.22778967 44.9557289965
yearBuilt None 1826.13863253 107.04028059
postcode None -51.2849278443 8.60539978599
[4 rows x 4 columns]
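The f3_model coefficients above define a plain linear formula, so a prediction can be reproduced by hand. The sketch below hard-codes the coefficients as printed (that is, rounded) and applies them to two listings whose size, yearBuilt and postcode appear in this notebook (ids 42908 and 2091). The helper name predict_price is made up for illustration and is not part of GraphLab.

```python
# Coefficients copied from the f3_model table above (as printed, i.e. rounded).
INTERCEPT = 1050311.91483
COEFFICIENTS = {
    'size': 1305.22778967,
    'yearBuilt': 1826.13863253,
    'postcode': -51.2849278443,
}

def predict_price(row):
    """Hypothetical helper: intercept plus the weighted sum of the features."""
    return INTERCEPT + sum(COEFFICIENTS[name] * row[name]
                           for name in COEFFICIENTS)

# Feature values as listed in this notebook for ids 42908 and 2091.
house_42908 = {'size': 70.0, 'yearBuilt': 2016, 'postcode': 90120}
house_2091 = {'size': 51.0, 'yearBuilt': 1923, 'postcode': 90100}

print(predict_price(house_42908))  # close to 201375.6
print(predict_price(house_2091))   # close to 7771.1
```

The second value shows how sensitive the formula is to yearBuilt: a flat built in 1923 is priced far below its listed 129000.0, because the model treats every feature, including the numeric postcode, as a straight-line effect.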

In [18]:


In [19]:
oulu_sales.tail()

| id | address | pricePerSqm | yearBuilt | sourceLink | salePrice | size | roomInfo | houseType | postcode |
| 40288 | Aleksanterinkatu 71 A 48, Hollihaka Oulu ... | 4981 | 2016 | de/9540323?sc=M104214 ... | 129500.0 | 26.0 | 1h+kk | Kerrostalo | 90120 |
| 40528 | Koskitie 16 B 37, Tuira Oulu ... | 4763 | 2014 | de/904235?sc=M1042147 ... | 755000.0 | 158.5 | 5h+k+s | Kerrostalo | 90500 |
| 40870 | Höyrymyllyntie 18, Toppilansalmi Oulu ... | 3608 | 2015 | de/9482342?sc=M104214 ... | 142500.0 | 39.5 | 1h+kt+s | Kerrostalo | 90520 |
| 41026 | Vipukuja 9 ja 11, Ritaharju Oulu ... | 2571 | 2016 | de/c34348?sc=M1042147 ... | 378000.0 | 147.0 | 3 tai 4 mh + oh,k,rh + khh + WC, S + WC, SU, SA ... | Omakotitalo | 90540 |
| 41302 | Ratamotie 42, Pateniemi Oulu ... | 1250 | 2016 | de/c27848?sc=M1042147 ... | 57500.0 | 46.0 | Talliosake | Rivitalo | 90800 |
| 41418 | Ratamotie 42, Pateniemi Oulu ... | 1250 | 2016 | de/c26444?sc=M1042147 ... | 51250.0 | 41.0 | Talliosake | Rivitalo | 90800 |
| 42470 | Solistinkatu 1, Karjasilta Oulu ... | 3219 | 2008 | de/1157063?sc=M104214 ... | 235000.0 | 73.0 | 3h, kk, s | Kerrostalo | 90140 |
| 42514 | Kivirinne 14 A, Maikkula/Kontionkangas ... | 1649 | 1987 | de/b76677?sc=M1042147 ... | 249000.0 | 151.0 | 4mh, oh, k, khh, s | Omakotitalo | 90240 |
| 42562 | Siilotie 23 B5, Toppilansalmi Oulu ... | 3598 | 2014 | de/523566?sc=M1042147 ... | 147500.0 | 41.0 | 1h,k,s,parvi | Kerrostalo | 90520 |
| 42908 | Aleksanterinkatu 71 B, Hollihaka Oulu ... | 2929 | 2016 | de/7638246?sc=M104214 ... | 205000.0 | 70.0 | wc+kph+s ... | Kerrostalo | 90120 |
[10 rows x 10 columns]

Predict using f3_model (lowest rmse)

In [20]:
house1 = oulu_sales[oulu_sales['id']==156]
house2 = oulu_sales[oulu_sales['id']==1123]
house3 = oulu_sales[oulu_sales['id']==2091]
house4 = oulu_sales[oulu_sales['id']==42908]
house5 = oulu_sales[oulu_sales['id']==40870]

In [21]:
print house1
print house2
print house3
print house4
print house5

|  id |            address            | pricePerSqm | yearBuilt |
| 156 | Purjehtijantie 8 A 13, Kos... |     1198    |    1976   |
|           sourceLink          | salePrice | size |        roomInfo       |
| |  74900.0  | 62.5 | 2h+k+kph+aula+parveke |
| houseType  | postcode |
| Kerrostalo |  90560   |
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
|  id  |          address          | pricePerSqm | yearBuilt |
| 1123 | Ratamopolku 6, Jääli Oulu |     2238    |    2017   |
|           sourceLink          | salePrice |  size |
| |  235000.0 | 105.0 |
|            roomInfo           |  houseType  | postcode |
| 3 mh + oh+k+ruok+khh+ph+s+... | Omakotitalo |  90940   |
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
|  id  |           address           | pricePerSqm | yearBuilt |
| 2091 | Nummikatu 42, Keskusta Oulu |     2529    |    1923   |
|           sourceLink          | salePrice | size |
| |  129000.0 | 51.0 |
|            roomInfo           | houseType  | postcode |
| tupakeittiö, makuuhuone, k... | Kerrostalo |  90100   |
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
|   id  |            address            | pricePerSqm | yearBuilt |
| 42908 | Aleksanterinkatu 71 B, Hol... |     2929    |    2016   |
|           sourceLink          | salePrice | size |           roomInfo          |
| |  205000.0 | 70.0 | 3h+kk+vh+erillinen wc+kph+s |
| houseType  | postcode |
| Kerrostalo |  90120   |
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.
|   id  |            address            | pricePerSqm | yearBuilt |
| 40870 | Höyrymyllyntie 18, Toppila... |     3608    |    2015   |
|           sourceLink          | salePrice | size | roomInfo | houseType  |
| |  142500.0 | 39.5 | 1h+kt+s  | Kerrostalo |
| postcode |
|  90520   |
[? rows x 10 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use sf.materialize() to force materialization.

In [22]:
print "Prediction:", f3_model.predict(house1)
print "Real Price:", house1['salePrice']
print f3_model.evaluate(house1)
print "--------------------"

print "Prediction:", f3_model.predict(house2)
print "Real Price:",house2['salePrice']
print f3_model.evaluate(house2)
print "--------------------"

print "Prediction:", f3_model.predict(house3)
print "Real Price:",house3['salePrice']
print f3_model.evaluate(house3)
print "--------------------"

print "Prediction:", f3_model.predict(house4)
print "Real Price:",house4['salePrice']
print f3_model.evaluate(house4)
print "--------------------"

print "Prediction:", f3_model.predict(house5)
print "Real Price:",house5['salePrice']
print f3_model.evaluate(house5)

Prediction: [95975.52398547437]
Real Price: [74900.0]
{'max_error': 21075.52398547437, 'rmse': 21075.52398547437}
Prediction: [206831.11639939062]
Real Price: [235000.0]
{'max_error': 28168.883600609377, 'rmse': 28168.883600609377}
Prediction: [7771.123688472435]
Real Price: [129000.0]
{'max_error': 121228.87631152757, 'rmse': 121228.87631152757}
Prediction: [201375.64596075192]
Real Price: [205000.0]
{'max_error': 3624.354039248079, 'rmse': 3624.354039248079}
Prediction: [139226.08860558644]
Real Price: [142500.0]
{'max_error': 3273.9113944135606, 'rmse': 3273.9113944135606}
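In every result above, max_error and rmse are identical. That follows from evaluating on a single row: with one error term, both metrics reduce to the absolute difference between prediction and real price. The sketch below recomputes the metrics from the printed numbers, assuming the usual definitions (root of the mean squared error, and largest absolute error); the function names are made up for illustration.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error over paired predictions and targets."""
    errors = [p - t for p, t in zip(predictions, targets)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def max_error(predictions, targets):
    """Largest absolute difference between prediction and target."""
    return max(abs(p - t) for p, t in zip(predictions, targets))

# Values copied from the output above, in the order house1 .. house5.
predicted = [95975.52398547437, 206831.11639939062, 7771.123688472435,
             201375.64596075192, 139226.08860558644]
actual = [74900.0, 235000.0, 129000.0, 205000.0, 142500.0]

# Single row (house1): both metrics are about 21075.52.
print(rmse(predicted[:1], actual[:1]))
print(max_error(predicted[:1], actual[:1]))

# All five rows together: the two metrics now differ.
print(rmse(predicted, actual))
print(max_error(predicted, actual))
```

Evaluating the five houses as one set gives a more useful summary than five single-row calls, because the RMSE then averages over several errors instead of echoing one.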

In [23]:
year_model = gl.linear_regression.create(train_data, target='salePrice', features=['yearBuilt'], validation_set=None)

Linear regression:
Number of examples          : 1096
Number of features          : 1
Number of unpacked features : 1
Number of coefficients    : 2
Starting Newton Method
| Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
| 1         | 2        | 0.001001     | 961479.349424      | 97617.608732  |
SUCCESS: Optimal solution found.

In [24]:
print year_model.evaluate(test_data)

{'max_error': 614380.7340733628, 'rmse': 111439.54540016587}

In [25]:
year_model.get('coefficients')

name index value stderr
(intercept) None -2235582.77069 280699.709377
yearBuilt None 1201.46934274 140.652276036
[2 rows x 4 columns]

In [26]:
plt.plot(test_data['yearBuilt'], test_data['salePrice'], '.',
        test_data['yearBuilt'], year_model.predict(test_data), '-')

[<matplotlib.lines.Line2D at 0x11cbec490>,
 <matplotlib.lines.Line2D at 0x11cbec590>]