This is my first attempt at creating a model using sklearn algorithms.

The algorithms I am most familiar with are Decision Trees and Random Forests, so that's where we'll start.


In [1]:
# start with imports
import numpy as np
import pandas as pd
from pandas import DataFrame
import json
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


/Users/mac28/anaconda/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))

Load the data from our JSON file.

The data is stored as a dictionary of dictionaries in the JSON file. We store it that way because it's easy to add data to the existing master data file. Also, I haven't figured out how to get it into a database yet.
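
Roughly, the file looks something like this (a sketch using the first two listings from the table below; the real entries have more fields per listing, and 'bath' can also be the string 'shared' or 'split'):

# rough sketch of the structure in MasterApartmentData.json:
# each Craigslist listing ID maps to a dict of that listing's fields
{
    "5399866740": {"bed": 1, "bath": 1, "feet": 750, "price": 1400},
    "5401772970": {"bed": 1, "bath": 1, "feet": 659, "price": 1350}
}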


In [2]:
with open('/Users/mac28/CLCrawler/MasterApartmentData.json') as f:
    my_dict = json.load(f)
dframe = DataFrame(my_dict)

dframe = dframe.T
dframe



Clean up the data a bit

Right now 'shared' and 'split' are included in the number of bathrooms. If I were to convert those to a number, I would count a shared/split bathroom as half a bathroom, or 0.5.


In [3]:
dframe.bath = dframe.bath.replace('shared',0.5)
dframe.bath = dframe.bath.replace('split',0.5)

Get rid of null values

I haven't figured out the best way to clean this up yet. For now, I'm going to drop any rows that have a null value, though I recognize that this is not a good analysis practice. We ended up dropping 2,014 data points, which is a little less than 16% of the data.

😬

Also, there were some CRAZY outliers, and this analysis is focused on finding a model for apartments for the 99% of us who can't afford crazy extravagant apartments.
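
If you want to see where the nulls are coming from before dropping anything, something like this should do it (just a sketch, not run here):

# count missing values per column, and how many rows dropna() would remove (sketch)
print(dframe[['bath', 'bed', 'feet', 'price']].isnull().sum())
print(dframe[['bath', 'bed', 'feet', 'price']].isnull().any(axis=1).sum())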


In [4]:
df = dframe[dframe.price < 10000][['bath','bed','feet','price']].dropna()
df


Out[4]:
bath bed feet price
5399866740 1 1 750 1400
5401772970 1 1 659 1350
5402607488 1 2 936 1995
5402822514 1 1 624 1495
5402918870 2.5 3 1684 1800
5403011764 1 1 750 1340
5403019783 1 1 640 1095
5403320258 1.5 3 1200 1725
5404034182 2 2 1010 1995
5404362542 1 2 850 1395
5404431092 1 1 700 1195
5404439790 1 2 900 1395
5404442485 1 1 700 1195
5404447075 1 2 850 1395
5404478114 1 2 800 1295
5404512932 1 2 850 1395
5404543909 1 2 825 1395
5404549721 1 1 650 1150
5404650486 1 2 1000 1695
5404727169 1.5 3 1589 2795
5404855534 2 2 875 1695
5405638400 1 1 298 1235
5405717413 1.5 2 1172 1295
5407108078 1 1 881 2100
5408267602 2 2 890 995
5408650423 2 3 3000 2350
5408986289 1 0 349 1200
5409002928 2.5 4 2300 2750
5409038533 1 1 875 1395
5409045966 2 2 1133 2975
... ... ... ... ...
5499135395 1 1 748 1770
5499135755 1 1 763 1095
5499136900 1 2 870 1407
5499137516 1 1 670 1425
5499137982 1 2 968 1685
5499138633 1 1 800 2200
5499139129 1 1 550 1050
5499141339 1 0 340 895
5499145987 1 0 400 945
5499146335 1 1 450 875
5499150151 1 1 505 1390
5499151706 2 2 955 2359
5499152039 1.5 2 1017 1420
5499153859 2 2 950 1145
5499155508 2 2 970 2201
5499159726 2 2 932 2295
5499161775 2 2 936 2166
5499164874 1 0 440 1325
5499167381 2.5 2 1249 1599
5499167989 1 0 578 1536
5499169838 2.5 3 1537 1649
5499171887 2.5 3 1537 1699
5499177030 2 3 1045 3097
5499184093 1 1 647 1552
5499189883 1 1 610 1445
5499193325 1 1 640 1565
5499195408 1 1 580 1153
5499195916 1 1 842 1635
5499196858 2 2 1053 1795
5499207257 1 1 491 1664

12210 rows × 4 columns


In [5]:
df.describe()


Out[5]:
bath bed feet price
count 12210 12210 12210 12210
unique 11 7 1090 1350
top 1 1 700 995
freq 8085 4539 308 299

In [6]:
sns.distplot(df.price)


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x114e58bd0>

Let's simplify our data

I have a hunch that bedrooms, bathrooms, and square footage have the greatest effect on housing price. Before we get too complicated, let's see how accurate we can be with just this simple set of data.


In [7]:
features = df[['bath','bed','feet']].values
price = df[['price']].values

Split data into Training and Testing Data


In [8]:
from sklearn.cross_validation import train_test_split
features_train, features_test, price_train, price_test = train_test_split(features, price, test_size=0.1, random_state=42)

In [9]:
from sklearn import tree
from sklearn.metrics import r2_score

In [10]:
clf = tree.DecisionTreeRegressor()
clf = clf.fit(features_train, price_train)

In [11]:
pred = clf.predict(features_test)
pred = np.array([[item] for item in pred])
print pred
print price_test


[[  995.8452381 ]
 [ 1308.14      ]
 [ 1995.        ]
 ..., 
 [ 1817.87878788]
 [ 1396.54761905]
 [ 2339.25      ]]
[[965]
 [1880]
 [1995]
 ..., 
 [1751]
 [1450]
 [2439]]

In [12]:
# note: scikit-learn's convention is r2_score(y_true, y_pred), so strictly
# speaking the arguments here are reversed, which gives a slightly different
# number than the standard R^2
r2_score(pred, price_test)


Out[12]:
0.66314278725193798

66%! Woot!

Wait, is that even good? I think that for the most part, it's pretty bad, but for our first run through, with super simple data, I'm willing to go with it.
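
For context, one quick sanity check on an R^2 score is to compare it against a dumb baseline that always predicts the mean training price; that baseline should land right around 0 (it can even dip slightly negative on the test set). A rough sketch, not run here:

# baseline that always predicts the mean training price (sketch, not run here)
from sklearn.dummy import DummyRegressor

baseline = DummyRegressor(strategy='mean')
baseline = baseline.fit(features_train, price_train)
baseline_pred = baseline.predict(features_test)
print(r2_score(price_test, baseline_pred))  # should be close to 0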

What does it look like?


In [13]:
plt.scatter(pred,price_test)


Out[13]:
<matplotlib.collections.PathCollection at 0x119c22150>

Ok, so we can see that there's at least a relationship, which we already knew from the R^2 score. Visually it looks like there is more variation in the prices, and that we're better at predicting on the higher end, but it could very well be that we just have WAY more data on the lower end. Remember our plot from before? We see a similar thing going on.


In [26]:
sns.distplot(df.price)


Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x1167a2910>
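
Another quick way to probe that would be to plot the residuals (actual minus predicted) against the predicted price; a sketch, not run here:

# residuals against predicted price (sketch, not run here)
residuals = price_test - pred
plt.scatter(pred, residuals)
plt.axhline(0)
plt.xlabel('predicted price')
plt.ylabel('actual - predicted')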

Ok, so we tried decision trees. Let's try decision trees on steroids. Random Forest!


In [14]:
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
reg = reg.fit(features_train, price_train)


/Users/mac28/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:3: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  app.launch_new_instance()
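
That warning is just sklearn asking for the target as a flat 1-d array instead of a column vector; passing price_train.ravel() should silence it. A sketch, not run here:

# same fit, with the target flattened to shape (n_samples,) to avoid the warning
reg = reg.fit(features_train, price_train.ravel())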

In [15]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])

In [16]:
r2_score(forest_pred, price_test)


Out[16]:
0.67041595546408894

Hmmm, it's not actually that much higher in accuracy. Maybe we're overfitting the data? Maybe not enough features? Since the dataset is relatively small, let's try upping the number of "trees" that we use. We'll go from the default of 10 up to 100.


In [17]:
reg = RandomForestRegressor(n_estimators = 100)
reg = reg.fit(features_train, price_train)


/Users/mac28/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app

In [18]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])

In [19]:
r2_score(forest_pred, price_test)


Out[19]:
0.67658076791856303

Still no difference. OK, I can take a hint. Let's look into over-fitting.

In [20]:
reg = RandomForestRegressor(min_samples_split = 20)
reg = reg.fit(features_train, price_train)


/Users/mac28/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app

In [21]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])

In [22]:
r2_score(forest_pred, price_test)


Out[22]:
0.58103431167805031

Shoot, we got worse. Feel free to play with it; we get better at predicting as min_samples_split goes down.
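
If you do want to play with it, a quick loop over a few values makes the trend easy to see; a sketch, not run here:

# compare test-set scores for a few values of min_samples_split (sketch, not run here)
for split in [2, 5, 10, 20, 50]:
    reg = RandomForestRegressor(min_samples_split=split)
    reg = reg.fit(features_train, price_train.ravel())
    score = r2_score(price_test, reg.predict(features_test))
    print('min_samples_split=%d: %.3f' % (split, score))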

Let's try a more complicated set of data


In [23]:
df2 = dframe[dframe.price < 10000][['bath','bed','feet','dog','cat','content','getphotos', 'hasmap', 'price',]].dropna()
df2


Out[23]:
bath bed feet dog cat content getphotos hasmap price
5399866740 1 1 750 0 0 754 8 0 1400
5401772970 1 1 659 1 1 2632 7 1 1350
5402607488 1 2 936 0 0 2259 12 1 1995
5402822514 1 1 624 0 0 1110 16 1 1495
5402918870 2.5 3 1684 0 0 1318 22 1 1800
5403011764 1 1 750 1 1 1649 14 1 1340
5403019783 1 1 640 1 1 1324 5 1 1095
5403320258 1.5 3 1200 0 0 1598 17 1 1725
5404034182 2 2 1010 1 1 4880 19 1 1995
5404362542 1 2 850 1 1 1662 8 1 1395
5404431092 1 1 700 1 1 1877 14 1 1195
5404439790 1 2 900 1 1 1860 14 1 1395
5404442485 1 1 700 1 1 1435 11 1 1195
5404447075 1 2 850 1 1 2603 24 1 1395
5404478114 1 2 800 1 1 2375 17 1 1295
5404512932 1 2 850 1 1 2564 23 1 1395
5404543909 1 2 825 0 0 2626 24 1 1395
5404549721 1 1 650 0 0 722 4 1 1150
5404650486 1 2 1000 0 0 3193 18 1 1695
5404727169 1.5 3 1589 0 0 2625 15 1 2795
5404855534 2 2 875 0 0 2465 6 1 1695
5405638400 1 1 298 1 1 2163 16 1 1235
5405717413 1.5 2 1172 0 0 1088 8 1 1295
5407108078 1 1 881 0 0 2319 21 1 2100
5408267602 2 2 890 0 0 992 0 1 995
5408650423 2 3 3000 0 0 1008 16 1 2350
5408986289 1 0 349 0 0 704 15 1 1200
5409002928 2.5 4 2300 0 0 1531 16 1 2750
5409038533 1 1 875 0 0 1741 17 1 1395
5409045966 2 2 1133 1 1 1750 0 1 2975
... ... ... ... ... ... ... ... ... ...
5499135395 1 1 748 1 1 3184 16 1 1770
5499135755 1 1 763 0 0 1478 8 1 1095
5499136900 1 2 870 1 1 3328 0 1 1407
5499137516 1 1 670 1 1 2762 15 1 1425
5499137982 1 2 968 1 1 1544 8 1 1685
5499138633 1 1 800 1 0 2419 9 1 2200
5499139129 1 1 550 0 0 1777 8 1 1050
5499141339 1 0 340 0 0 1381 7 1 895
5499145987 1 0 400 0 0 1381 7 1 945
5499146335 1 1 450 0 0 772 4 1 875
5499150151 1 1 505 1 1 2456 14 1 1390
5499151706 2 2 955 1 1 2515 10 1 2359
5499152039 1.5 2 1017 0 0 2356 9 1 1420
5499153859 2 2 950 1 1 1444 9 0 1145
5499155508 2 2 970 1 1 1552 19 1 2201
5499159726 2 2 932 1 1 1978 8 1 2295
5499161775 2 2 936 1 1 2278 8 1 2166
5499164874 1 0 440 1 1 1490 16 1 1325
5499167381 2.5 2 1249 1 1 2299 21 1 1599
5499167989 1 0 578 1 1 2106 23 1 1536
5499169838 2.5 3 1537 1 1 2266 20 1 1649
5499171887 2.5 3 1537 1 1 2266 20 1 1699
5499177030 2 3 1045 1 1 2563 13 1 3097
5499184093 1 1 647 1 1 1418 24 1 1552
5499189883 1 1 610 1 1 1561 22 1 1445
5499193325 1 1 640 1 1 2474 12 1 1565
5499195408 1 1 580 1 1 1695 8 1 1153
5499195916 1 1 842 1 1 1300 17 1 1635
5499196858 2 2 1053 1 1 1285 10 1 1795
5499207257 1 1 491 1 1 2951 14 1 1664

12210 rows × 9 columns


In [24]:
features = df2[['bath','bed','feet','dog','cat','content','getphotos', 'hasmap']].values
price = df2[['price']].values

In [25]:
features_train, features_test, price_train, price_test = train_test_split(features, price, test_size=0.1, random_state=42)

In [26]:
clf = tree.DecisionTreeRegressor()
clf = clf.fit(features_train, price_train)

In [27]:
pred = clf.predict(features_test)
pred = np.array([[item] for item in pred])

In [28]:
r2_score(pred, price_test)


Out[28]:
0.66876661699345319

Ahhhhh! We're just getting worse!


In [29]:
plt.scatter(pred,price_test)


Out[29]:
<matplotlib.collections.PathCollection at 0x119b32a50>

Let's try the Random Forest again on this bigger set of features.


In [40]:
reg = RandomForestRegressor(n_estimators=50)
reg = reg.fit(features_train, price_train)
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])
r2_score(forest_pred, price_test)


/Users/mac28/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:2: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  from ipykernel import kernelapp as app
Out[40]:
0.77622254289033721

In [31]:
plt.scatter(forest_pred, price_test)


Out[31]:
<matplotlib.collections.PathCollection at 0x119f35c50>

Woo! Almost 78%!!! Well, now we've reached my limit. I'm not really sure how to progress from here. Will return after learning a bit more!

Note #1 to Riley: Next time, look into another regressor? See if there's one that's inherently better at this kind of thing.

Note #2 to Riley: Find a regressor that can take both continuous and discrete features as inputs; today we treated everything as continuous.

Note #3 to Riley: Clearly geography has a huge impact on housing prices. (i.e. downtown is way more expensive than the boonies.) Figure out the best way to model the effect that geography has, and then make that "multiplier" a feature in your model?
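
One possible way to attack Note #3, assuming a future crawl captures something like a 'neighborhood' string for each listing (that column does not exist in the current data, so this is purely a sketch):

# hypothetical: one-hot encode a 'neighborhood' column so the tree/forest
# models can learn a separate price level per area ('neighborhood' is not
# in the current data; this just sketches the idea)
hood_dummies = pd.get_dummies(df2['neighborhood'], prefix='hood')
features = pd.concat([df2[['bath', 'bed', 'feet']], hood_dummies], axis=1).values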


In [ ]: