This is my first attempt at creating a model using sklearn algorithms.

The algorithms I am most familiar with are Decision Trees and Random Forests, so that's where we'll start.


In [1]:
# start with imports
import numpy as np
import pandas as pd
from pandas import DataFrame
import json
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


/Users/mac28/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Load the data from our JSON file.

The data is stored as a dictionary of dictionaries in the JSON file. We store it that way because it's easy to add data to the existing master data file. Also, I haven't figured out how to get it into a database yet.
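
Roughly, the file looks something like this (a sketch using the first two listings from the table below; the real entries have more fields per listing, and 'bath' can also be the string 'shared' or 'split'):

# rough sketch of the structure in MasterApartmentData.json:
# each Craigslist listing ID maps to a dict of that listing's fields
{
    "5399866740": {"bed": 1, "bath": 1, "feet": 750, "price": 1400},
    "5401772970": {"bed": 1, "bath": 1, "feet": 659, "price": 1350}
}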


In [2]:
with open('/Users/mac28/CLCrawler/MasterApartmentData.json') as f:
    my_dict = json.load(f)
dframe = DataFrame(my_dict)

dframe = dframe.T
dframe



Clean up the data a bit

Right now 'shared' and 'split' are included in the number of bathrooms. If I were to convert those to a number, I would count a shared/split bathroom as half a bathroom, or 0.5.


In [3]:
dframe.bath = dframe.bath.replace('shared',0.5)
dframe.bath = dframe.bath.replace('split',0.5)

Get rid of null values

I haven't figured out the best way to clean this up yet. For now, I'm going to drop any rows that have a null value, though I recognize that this is not a good analysis practice. We ended up dropping 2,014 data points, which is a little less than 16% of the data.

😬

Also, there were some CRAZY outliers, and this analysis is focused on finding a model for apartments for the 99% of us who can't afford crazy extravagant apartments.
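
If you want to see where the nulls are coming from before dropping anything, something like this should do it (just a sketch, not run here):

# count missing values per column, and how many rows dropna() would remove (sketch)
print(dframe[['bath', 'bed', 'feet', 'price']].isnull().sum())
print(dframe[['bath', 'bed', 'feet', 'price']].isnull().any(axis=1).sum())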


In [4]:
df = dframe[dframe.price < 10000][['bath','bed','feet','price']].dropna()
df


Out[4]:
bath bed feet price
5399866740 1 1 750 1400
5401772970 1 1 659 1350
5402607488 1 2 936 1995
5402822514 1 1 624 1495
5402918870 2.5 3 1684 1800
5403011764 1 1 750 1340
5403019783 1 1 640 1095
5403320258 1.5 3 1200 1725
5404034182 2 2 1010 1995
5404362542 1 2 850 1395
5404431092 1 1 700 1195
5404439790 1 2 900 1395
5404442485 1 1 700 1195
5404447075 1 2 850 1395
5404478114 1 2 800 1295
5404512932 1 2 850 1395
5404543909 1 2 825 1395
5404549721 1 1 650 1150
5404650486 1 2 1000 1695
5404727169 1.5 3 1589 2795
5404855534 2 2 875 1695
5405638400 1 1 298 1235
5405717413 1.5 2 1172 1295
5407108078 1 1 881 2100
5408267602 2 2 890 995
5408650423 2 3 3000 2350
5408986289 1 0 349 1200
5409002928 2.5 4 2300 2750
5409038533 1 1 875 1395
5409045966 2 2 1133 2975
... ... ... ... ...
5499135395 1 1 748 1770
5499135755 1 1 763 1095
5499136900 1 2 870 1407
5499137516 1 1 670 1425
5499137982 1 2 968 1685
5499138633 1 1 800 2200
5499139129 1 1 550 1050
5499141339 1 0 340 895
5499145987 1 0 400 945
5499146335 1 1 450 875
5499150151 1 1 505 1390
5499151706 2 2 955 2359
5499152039 1.5 2 1017 1420
5499153859 2 2 950 1145
5499155508 2 2 970 2201
5499159726 2 2 932 2295
5499161775 2 2 936 2166
5499164874 1 0 440 1325
5499167381 2.5 2 1249 1599
5499167989 1 0 578 1536
5499169838 2.5 3 1537 1649
5499171887 2.5 3 1537 1699
5499177030 2 3 1045 3097
5499184093 1 1 647 1552
5499189883 1 1 610 1445
5499193325 1 1 640 1565
5499195408 1 1 580 1153
5499195916 1 1 842 1635
5499196858 2 2 1053 1795
5499207257 1 1 491 1664

12210 rows × 4 columns


In [5]:
df.describe()


Out[5]:
bath bed feet price
count 12210 12210 12210 12210
unique 11 7 1090 1350
top 1 1 700 995
freq 8085 4539 308 299

In [6]:
sns.distplot(df.price)


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x114e58bd0>

Let's simplify our data

I have a hunch that bedrooms, bathrooms, and square footage have the greatest effect on housing price. Before we get too complicated, let's see how accurate we can be with just this simple set of data.


In [7]:
features = df[['bath','bed','feet']].values
price = df[['price']].values

Split data into Training and Testing Data


In [8]:
from sklearn.cross_validation import train_test_split
features_train, features_test, price_train, price_test = train_test_split(features, price, test_size=0.1, random_state=42)

In [9]:
from sklearn import tree
from sklearn.metrics import r2_score

In [10]:
clf = tree.DecisionTreeRegressor()
clf = clf.fit(features_train, price_train)

In [11]:
pred = clf.predict(features_test)
pred = np.array([[item] for item in pred])
print pred
print price_test


[[  995.8452381 ]
 [ 1308.14      ]
 [ 1995.        ]
 ..., 
 [ 1817.87878788]
 [ 1396.54761905]
 [ 2339.25      ]]
[[965]
 [1880]
 [1995]
 ..., 
 [1751]
 [1450]
 [2439]]

In [12]:
# note: scikit-learn's convention is r2_score(y_true, y_pred), so strictly
# speaking the arguments here are reversed, which gives a slightly different
# number than the standard R^2
r2_score(pred, price_test)


Out[12]:
0.66314278725193798

66%! Woot!

Wait, is that even good? I think that for the most part, it's pretty bad, but for our first run through, with super simple data, I'm willing to go with it.
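
For context, one quick sanity check on an R^2 score is to compare it against a dumb baseline that always predicts the mean training price; that baseline should land right around 0 (it can even dip slightly negative on the test set). A rough sketch, not run here:

# baseline that always predicts the mean training price (sketch, not run here)
from sklearn.dummy import DummyRegressor

baseline = DummyRegressor(strategy='mean')
baseline = baseline.fit(features_train, price_train)
baseline_pred = baseline.predict(features_test)
print(r2_score(price_test, baseline_pred))  # should be close to 0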

What does it look like?


In [13]:
plt.scatter(pred,price_test)


Out[13]:
<matplotlib.collections.PathCollection at 0x119c22150>

Ok, so we can see that there's at least a relationship, which we already knew from the R^2 score. Visually it looks like there is more variation in the prices, and that we're better at predicting on the higher end, but it could very well be that we just have WAY more data on the lower end. Remember our plot from before? We see a similar thing going on.


In [26]:
sns.distplot(df.price)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1167a2910>
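
Another quick way to probe that would be to plot the residuals (actual minus predicted) against the predicted price; a sketch, not run here:

# residuals against predicted price (sketch, not run here)
residuals = price_test - pred
plt.scatter(pred, residuals)
plt.axhline(0)
plt.xlabel('predicted price')
plt.ylabel('actual - predicted')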

Ok, so we tried decision trees. Let's try decision trees on steroids. Random Forest!


In [14]:
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
reg = reg.fit(features_train, price_train)


/Users/mac28/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  app.launch_new_instance()
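
That warning is just sklearn asking for the target as a flat 1-d array instead of a column vector; passing price_train.ravel() should silence it. A sketch, not run here:

# same fit, with the target flattened to shape (n_samples,) to avoid the warning
reg = reg.fit(features_train, price_train.ravel())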

In [15]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])

In [16]:
r2_score(forest_pred, price_test)


Out[16]:
0.67041595546408894

Hmmm, it's not actually that much higher in accuracy. Maybe we're overfitting the data? Maybe not enough features? Since the dataset is relatively small, let's try upping the number of "trees" that we use. We'll go from the default of 10 up to 100.


In [17]:
reg = RandomForestRegressor(n_estimators = 100)
reg = reg.fit(features_train, price_train)


/Users/mac28/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app

In [18]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])

In [19]:
r2_score(forest_pred, price_test)


Out[19]:
0.67658076791856303

Still no difference. OK, I can take a hint. Let's look into over-fitting.

In [20]:
reg = RandomForestRegressor(min_samples_split = 20)
reg = reg.fit(features_train, price_train)


/Users/mac28/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app

In [21]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])

In [22]:
r2_score(forest_pred, price_test)


Out[22]:
0.58103431167805031

Shoot, we got worse. Feel free to play with it; we get better at predicting as min_samples_split goes down.
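
If you do want to play with it, a quick loop over a few values makes the trend easy to see; a sketch, not run here:

# compare test-set scores for a few values of min_samples_split (sketch, not run here)
for split in [2, 5, 10, 20, 50]:
    reg = RandomForestRegressor(min_samples_split=split)
    reg = reg.fit(features_train, price_train.ravel())
    score = r2_score(price_test, reg.predict(features_test))
    print('min_samples_split=%d: %.3f' % (split, score))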

Let's try a more complicated set of data


In [23]:
df2 = dframe[dframe.price < 10000][['bath','bed','feet','dog','cat','content','getphotos', 'hasmap', 'price',]].dropna()
df2


Out[23]:
bath bed feet dog cat content getphotos hasmap price
5399866740 1 1 750 0 0 754 8 0 1400
5401772970 1 1 659 1 1 2632 7 1 1350
5402607488 1 2 936 0 0 2259 12 1 1995
5402822514 1 1 624 0 0 1110 16 1 1495
5402918870 2.5 3 1684 0 0 1318 22 1 1800
5403011764 1 1 750 1 1 1649 14 1 1340
5403019783 1 1 640 1 1 1324 5 1 1095
5403320258 1.5 3 1200 0 0 1598 17 1 1725
5404034182 2 2 1010 1 1 4880 19 1 1995
5404362542 1 2 850 1 1 1662 8 1 1395
5404431092 1 1 700 1 1 1877 14 1 1195
5404439790 1 2 900 1 1 1860 14 1 1395
5404442485 1 1 700 1 1 1435 11 1 1195
5404447075 1 2 850 1 1 2603 24 1 1395
5404478114 1 2 800 1 1 2375 17 1 1295
5404512932 1 2 850 1 1 2564 23 1 1395
5404543909 1 2 825 0 0 2626 24 1 1395
5404549721 1 1 650 0 0 722 4 1 1150
5404650486 1 2 1000 0 0 3193 18 1 1695
5404727169 1.5 3 1589 0 0 2625 15 1 2795
5404855534 2 2 875 0 0 2465 6 1 1695
5405638400 1 1 298 1 1 2163 16 1 1235
5405717413 1.5 2 1172 0 0 1088 8 1 1295
5407108078 1 1 881 0 0 2319 21 1 2100
5408267602 2 2 890 0 0 992 0 1 995
5408650423 2 3 3000 0 0 1008 16 1 2350
5408986289 1 0 349 0 0 704 15 1 1200
5409002928 2.5 4 2300 0 0 1531 16 1 2750
5409038533 1 1 875 0 0 1741 17 1 1395
5409045966 2 2 1133 1 1 1750 0 1 2975
... ... ... ... ... ... ... ... ... ...
5499135395 1 1 748 1 1 3184 16 1 1770
5499135755 1 1 763 0 0 1478 8 1 1095
5499136900 1 2 870 1 1 3328 0 1 1407
5499137516 1 1 670 1 1 2762 15 1 1425
5499137982 1 2 968 1 1 1544 8 1 1685
5499138633 1 1 800 1 0 2419 9 1 2200
5499139129 1 1 550 0 0 1777 8 1 1050
5499141339 1 0 340 0 0 1381 7 1 895
5499145987 1 0 400 0 0 1381 7 1 945
5499146335 1 1 450 0 0 772 4 1 875
5499150151 1 1 505 1 1 2456 14 1 1390
5499151706 2 2 955 1 1 2515 10 1 2359
5499152039 1.5 2 1017 0 0 2356 9 1 1420
5499153859 2 2 950 1 1 1444 9 0 1145
5499155508 2 2 970 1 1 1552 19 1 2201
5499159726 2 2 932 1 1 1978 8 1 2295
5499161775 2 2 936 1 1 2278 8 1 2166
5499164874 1 0 440 1 1 1490 16 1 1325
5499167381 2.5 2 1249 1 1 2299 21 1 1599
5499167989 1 0 578 1 1 2106 23 1 1536
5499169838 2.5 3 1537 1 1 2266 20 1 1649
5499171887 2.5 3 1537 1 1 2266 20 1 1699
5499177030 2 3 1045 1 1 2563 13 1 3097
5499184093 1 1 647 1 1 1418 24 1 1552
5499189883 1 1 610 1 1 1561 22 1 1445
5499193325 1 1 640 1 1 2474 12 1 1565
5499195408 1 1 580 1 1 1695 8 1 1153
5499195916 1 1 842 1 1 1300 17 1 1635
5499196858 2 2 1053 1 1 1285 10 1 1795
5499207257 1 1 491 1 1 2951 14 1 1664

12210 rows × 9 columns


In [24]:
features = df2[['bath','bed','feet','dog','cat','content','getphotos', 'hasmap']].values
price = df2[['price']].values

In [25]:
features_train, features_test, price_train, price_test = train_test_split(features, price, test_size=0.1, random_state=42)

In [26]:
clf = tree.DecisionTreeRegressor()
clf = clf.fit(features_train, price_train)

In [27]:
pred = clf.predict(features_test)
pred = np.array([[item] for item in pred])

In [28]:
r2_score(pred, price_test)


Out[28]:
0.66876661699345319

Ahhhhh! We're just getting worse!


In [29]:
plt.scatter(pred,price_test)


Out[29]:
<matplotlib.collections.PathCollection at 0x119b32a50>

Let's try the Random Forest again on this bigger set of features.


In [40]:
reg = RandomForestRegressor(n_estimators=50)
reg = reg.fit(features_train, price_train)
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])
r2_score(forest_pred, price_test)


/Users/mac28/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app
Out[40]:
0.77622254289033721

In [31]:
plt.scatter(forest_pred, price_test)


Out[31]:
<matplotlib.collections.PathCollection at 0x119f35c50>

Woo! Almost 78%!!! Well, now we've reached my limit. I'm not really sure how to progress from here. Will return after learning a bit more!

Note #1 to Riley: Next time, look into another regressor? See if there's one that's inherently better at this kind of thing.

Note #2 to Riley: Find a regressor that can take both continuous and discrete features as inputs; today we treated everything as continuous.

Note #3 to Riley: Clearly geography has a huge impact on housing prices. (i.e. downtown is way more expensive than the boonies.) Figure out the best way to model the effect that geography has, and then make that "multiplier" a feature in your model?
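
One possible way to attack Note #3, assuming a future crawl captures something like a 'neighborhood' string for each listing (that column does not exist in the current data, so this is purely a sketch):

# hypothetical: one-hot encode a 'neighborhood' column so the tree/forest
# models can learn a separate price level per area ('neighborhood' is not
# in the current data; this just sketches the idea)
hood_dummies = pd.get_dummies(df2['neighborhood'], prefix='hood')
features = pd.concat([df2[['bath', 'bed', 'feet']], hood_dummies], axis=1).values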


In [ ]: