In [1]:
# start with imports
import numpy as np
import pandas as pd
from pandas import DataFrame
import json
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
with open('/Users/mac28/CLCrawler/MasterApartmentData.json') as f:
my_dict = json.load(f)
dframe = DataFrame(my_dict)
dframe = dframe.T
dframe
In [3]:
dframe.bath = dframe.bath.replace('shared',0.5)
dframe.bath = dframe.bath.replace('split',0.5)
I haven't figured out the best way to clean this up yet. For now I'm going to drop any rows that have a null value, though I recognize that this is not a good analysis practice. We ended up dropping 2,014 data points, which is a little less than 16% of the data.
😬
Also, there were some CRAZY outliers, and this analysis is focused on finding a model for apartments for the 99% of us who can't afford crazy extravagant apartments.
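As a quick sanity check on those numbers, here's a sketch (assuming the dframe loaded above) that counts how many rows the null-drop and the $10,000 price cutoff actually remove:
In [ ]:
# Sketch: count how many rows the cleaning steps below will remove (assumes dframe from above)
subset = dframe[['bath', 'bed', 'feet', 'price']]
n_null = subset.isnull().any(axis=1).sum()            # rows with at least one missing value
n_outlier = (subset.dropna().price >= 10000).sum()    # rows at or above the $10,000 cutoff
print(n_null, n_outlier, round(float(n_null) / len(subset) * 100, 1))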
In [4]:
df = dframe[dframe.price < 10000][['bath','bed','feet','price']].dropna()
df
Out[4]:
In [5]:
df.describe()
Out[5]:
In [6]:
sns.distplot(df.price)
Out[6]:
In [7]:
features = df[['bath','bed','feet']].values
price = df[['price']].values
In [8]:
from sklearn.model_selection import train_test_split
features_train, features_test, price_train, price_test = train_test_split(features, price, test_size=0.1, random_state=42)
In [9]:
from sklearn import tree
from sklearn.metrics import r2_score
In [10]:
clf = tree.DecisionTreeRegressor()
clf = clf.fit(features_train, price_train)
In [11]:
pred = clf.predict(features_test)
pred = np.array([[item] for item in pred])
print(pred)
print(price_test)
In [12]:
r2_score(price_test, pred)   # r2_score expects (y_true, y_pred)
Out[12]:
In [13]:
plt.scatter(pred,price_test)
Out[13]:
Ok, so we can see that there's at least a relationship, which we already knew from the R^2 score. Visually it looks like there is more variation in the prices, and that we're better at predicting on the higher end, but it could very well be that we just have WAY more data on the lower end. Remember our plot from before? We see a similar thing going on.
In [26]:
sns.distplot(df.price)
Out[26]:
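To make that prediction-vs-actual comparison a little easier to read, here's a sketch (reusing pred and price_test from above) that redraws the scatter with a y = x reference line; a perfect prediction would sit right on the line:
In [ ]:
# Sketch: predicted vs. actual price with an identity line for reference
plt.scatter(pred, price_test)
lims = [0, price_test.max()]
plt.plot(lims, lims, 'k--')    # points on this dashed line are perfect predictions
plt.xlabel('predicted price')
plt.ylabel('actual price')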
In [14]:
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor()
reg = reg.fit(features_train, price_train)
In [15]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])
In [16]:
r2_score(price_test, forest_pred)
Out[16]:
In [17]:
reg = RandomForestRegressor(n_estimators = 100)
reg = reg.fit(features_train, price_train)
In [18]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])
In [19]:
r2_score(price_test, forest_pred)
Out[19]:
In [20]:
reg = RandomForestRegressor(min_samples_split = 20)
reg = reg.fit(features_train, price_train)
In [21]:
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])
In [22]:
r2_score(price_test, forest_pred)
Out[22]:
Shoot, we got worse. Feel free to play with it; we get better at predicting as min_samples_split goes down.
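To see that trend directly, here's a sketch that sweeps a few values of min_samples_split and prints the test-set R^2 for each (reusing the train/test split from above):
In [ ]:
# Sketch: compare test-set R^2 across several min_samples_split values
for split in [2, 5, 10, 20, 50]:
    reg = RandomForestRegressor(min_samples_split=split, random_state=42)
    reg = reg.fit(features_train, price_train.ravel())   # .ravel() gives the 1-D target sklearn expects
    print(split, round(r2_score(price_test, reg.predict(features_test)), 3))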
In [23]:
df2 = dframe[dframe.price < 10000][['bath','bed','feet','dog','cat','content','getphotos','hasmap','price']].dropna()
df2
Out[23]:
In [24]:
features = df2[['bath','bed','feet','dog','cat','content','getphotos', 'hasmap']].values
price = df2[['price']].values
In [25]:
features_train, features_test, price_train, price_test = train_test_split(features, price, test_size=0.1, random_state=42)
In [26]:
clf = tree.DecisionTreeRegressor()
clf = clf.fit(features_train, price_train)
In [27]:
pred = clf.predict(features_test)
pred = np.array([[item] for item in pred])
In [28]:
r2_score(price_test, pred)
Out[28]:
Ahhhhh! We're just getting worse!
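Before giving up on the extra columns, one quick diagnostic (a sketch, assuming the feature order from In [24] and the tree fit above) is to look at the tree's feature_importances_ and see which columns it's actually splitting on:
In [ ]:
# Sketch: how much does the fitted tree rely on each feature?
feature_names = ['bath', 'bed', 'feet', 'dog', 'cat', 'content', 'getphotos', 'hasmap']
for name, importance in zip(feature_names, clf.feature_importances_):
    print(name, round(importance, 3))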
In [29]:
plt.scatter(pred,price_test)
Out[29]:
Let's try a random forest on these features instead.
In [40]:
reg = RandomForestRegressor(n_estimators=50)
reg = reg.fit(features_train, price_train)
forest_pred = reg.predict(features_test)
forest_pred = np.array([[item] for item in forest_pred])
r2_score(price_test, forest_pred)
Out[40]:
In [31]:
plt.scatter(forest_pred, price_test)
Out[31]:
Woo! Broke 80%!!! Well, now we've reached my limit. I'm not really sure how to progress from here. Will return after learning a bit more!
Note #1 to Riley: Next time, look into another regressor? See if there's one that's inherently better at this kind of thing (see the sketch after these notes).
Note #2 to Riley: Find a regressor that can take both continuous and discrete features as inputs; today we only worked with continuous ones.
Note #3 to Riley: Clearly geography has a huge impact on housing prices (i.e., downtown is way more expensive than the boonies). Figure out the best way to model the effect that geography has, and then make that "multiplier" a feature in your model?
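For Note #1, here's one candidate to try next time. This is just a sketch under the same train/test split; GradientBoostingRegressor is a guess at a regressor worth testing, not something evaluated here:
In [ ]:
# Sketch for Note #1: try a gradient-boosted regressor on the same split
from sklearn.ensemble import GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr = gbr.fit(features_train, price_train.ravel())   # .ravel() gives the 1-D target sklearn expects
print(round(r2_score(price_test, gbr.predict(features_test)), 3))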
In [ ]: