This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price.
Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode.
The site Redfin graciously allows users to download a CSV of results they are viewing. Unfortunately, they limit this extract to 500 listings, but you can still use it to try the same approach on your own using the download link shown in the figure below.
In [4]:
from IPython.display import Image
Image(filename='redfin.png', width=300)
Out[4]:
In [7]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.api as sm
import math
In [8]:
file = 'redfin_2016-02-06-20-27-26_results.csv'
df = pd.read_csv(file)
In [9]:
df['IS SHORT SALE'] = df['IS SHORT SALE'].fillna(0).astype(int)
df['LAST SALE DATE'] = pd.to_datetime(df['LAST SALE DATE'])
df['LAST SALE MONTH'] = df['LAST SALE DATE'].dt.strftime('%Y-%m')
df['intercept'] = 1
In [10]:
df.columns
Out[10]:
In [11]:
# Overall trend of prices for the input by month
h_mean = df.groupby(['LAST SALE MONTH'])['LAST SALE PRICE'].mean()
h_q1 = df.groupby(['LAST SALE MONTH'])['LAST SALE PRICE'].quantile(.25)
h_q2 = df.groupby(['LAST SALE MONTH'])['LAST SALE PRICE'].quantile(.5)
h_q3 = df.groupby(['LAST SALE MONTH'])['LAST SALE PRICE'].quantile(.75)
historical = pd.concat([h_q1, h_q2, h_q3, h_mean], axis=1)
historical.columns = ['q1', 'median', 'q3', 'mean']
historical = historical.reset_index()
plt.figure(figsize=(15,5))
#plt.plot(historical.index, historical['mean'])
plt.plot(historical.index, historical['q1'], label='25th percentile')
plt.plot(historical.index, historical['median'], label='median')
plt.plot(historical.index, historical['q3'], label='75th percentile')
plt.xticks(historical.index, historical['LAST SALE MONTH'], rotation=90)
plt.legend(loc=4)
plt.title('Historical monthly prices')
plt.show()
In [12]:
# RATIO OF LAST SALE PRICE TO LIST PRICE (i.e. margin of negotiation)
delta = df['LAST SALE PRICE'] / df['LIST PRICE']
delta.dropna(inplace=True)
n, bins, patches = plt.hist(delta, 50, density=True)
plt.gca().xaxis.grid(False)
In [13]:
# RATIO OF SQFT TO LOT SIZE (just for fun)
delta = df['SQFT'] / df['LOT SIZE']
delta.dropna(inplace=True)
n, bins, patches = plt.hist(delta, 10, density=True)
plt.gca().xaxis.grid(False)
In [14]:
df['YEAR BUILT'].hist()
plt.title('Year Built')
plt.show()
In [16]:
ht = df.groupby(['HOME TYPE'])['HOME TYPE'].count()
ht.sort_values(inplace=True)
ht = pd.DataFrame(ht)
ht.columns=['count']
ht['count'] = ht['count'] / sum(ht['count'])
ht.reset_index(inplace=True)
plt.barh(ht.index, ht['count'])
plt.yticks(ht.index, ht['HOME TYPE'])
plt.gca().yaxis.grid(False)
plt.title('Breakdown of home types')
plt.show()
In [17]:
plt.figure(figsize=(7,7))
gb = df.groupby(['BEDS'])['LAST SALE PRICE'].median()
plt.scatter(gb.index, gb, color='green', s=400, alpha=1)
gb = df.groupby(['BEDS'])['LAST SALE PRICE'].quantile(.25)
plt.scatter(gb.index, gb, color='green', s=300, alpha=0.15)
gb = df.groupby(['BEDS'])['LAST SALE PRICE'].quantile(.75)
plt.scatter(gb.index, gb, color='green', s=300, alpha=0.15)
plt.scatter(df['BEDS'], df['LAST SALE PRICE'], alpha=0.1, s=40)
plt.xlabel('Beds')
plt.ylabel('Sale price')
plt.show()
In [18]:
fig = plt.figure(figsize=(14,7))
ax1 = fig.add_subplot(121)
df['fBATHS'] = np.floor(df['BATHS'])  # np.floor tolerates NaN, unlike math.floor
gb = df.groupby(['BATHS'])['LAST SALE PRICE'].median()
ax1.scatter(gb.index, gb, color='green', s=400, alpha=1)
ax1.scatter(df['BATHS'], df['LAST SALE PRICE'], alpha=0.1, s=40)
ax1.set_title('Baths vs. sale price')
ax1.set_xlabel('Baths')
ax1.set_ylabel('Sale price')
ax2 = fig.add_subplot(122)
gb = df.groupby(['fBATHS'])['LAST SALE PRICE'].median()
ax2.scatter(gb.index, gb, color='green', s=400, alpha=1)
ax2.scatter(df['fBATHS'], df['LAST SALE PRICE'], alpha=0.1, s=40)
ax2.set_title('Round all bathrooms down')
ax2.set_xlabel('Baths')
plt.show()
In [19]:
f = sm.OLS(df['LAST SALE PRICE'], df[['intercept', 'SQFT']], missing='drop').fit()
f.summary()
Out[19]:
In [20]:
plt.scatter(df['SQFT'], df['LAST SALE PRICE'])
pm = 3000
plt.plot([0, pm], [f.predict([1,0])[0], f.predict([1, pm])[0]])
print(f.rsquared)
plt.ylabel('last sale price')
plt.xlabel('sqft')
plt.xlim(0,4500)
plt.show()
For a new problem, I typically get an initial fit just to see what I'm working with. There's some obvious feature extraction and cleanup that still needs to be done, but the results will be viewed with distrust anyway, so let's just see what happens when we throw the data at the wall. This is (arguably) a sloppy practice. I personally choose to do it because even when my problem is ill-formed and my solution suffers from GIGO, I often learn a bit from the diagnostics. So let's take a skeptical look at an effortless model...
In [21]:
cols = ['intercept', 'BEDS', 'BATHS', 'SQFT', 'LOT SIZE', 'YEAR BUILT', 'PARKING SPOTS', 'IS SHORT SALE']
y = df['LAST SALE PRICE']
X = sm.add_constant(df[cols])
est = sm.OLS(y, X, missing='drop').fit()
est.summary()
Out[21]:
The first stat we check (and good fodder for a future episode) is the R^2. My model explains about half of the variance, but for this limited dataset, that's weak. My coefficients are clearly garbage. Why would any of them be negative, except perhaps "is short sale"? You should be skeptical of this model. Let's make one more small attempt here and leave the real work for future posts.
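For reference, R^2 is just one minus the ratio of residual variation to total variation. A toy sketch with made-up numbers (unrelated to the model above):

```python
import numpy as np

# R^2 = 1 - SS_res / SS_tot
y = np.array([100.0, 150.0, 200.0, 250.0])      # observed values
y_hat = np.array([110.0, 140.0, 210.0, 240.0])  # model predictions

ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

A value near 1 means the predictions track the observations; a value near 0 means the model does no better than always guessing the mean.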
In [23]:
# Convert to dummy variables, especially for baths which have a strange non-linear relationship.
df['Beds0'] = (df['BEDS'] == 0).astype(int)
df['Beds1'] = (df['BEDS'] == 1).astype(int)
df['Beds2'] = (df['BEDS'] == 2).astype(int)
df['Beds+'] = (df['BEDS'] >= 3).astype(int)
df['Baths0'] = (df['BATHS'] == 0).astype(int)
df['Baths1'] = (df['BATHS'] == 1).astype(int)
df['Baths2'] = (df['BATHS'] == 2).astype(int)
df['Baths+'] = (df['BATHS'] >= 3).astype(int)
In [24]:
# Drop one dummy per group (Beds0 and Baths0 serve as the reference categories);
# including every level alongside the intercept would be perfectly collinear.
cols = ['intercept', 'Beds1', 'Beds2', 'Beds+', 'Baths1', 'Baths2', 'Baths+', 'SQFT', 'LOT SIZE', 'YEAR BUILT', 'PARKING SPOTS', 'IS SHORT SALE']
y = df['LAST SALE PRICE']
X = sm.add_constant(df[cols])
est = sm.OLS(y, X, missing='drop').fit()
est.summary()
Out[24]:
OK, my fit is better, but not by much. I see some extremely small coefficients, so if we like this feature set (that's still to be debated), we should apply L1-regularized (lasso) regression to refocus on the features that have predictive power. More on all that in future episodes and blog posts!