This notebook will develop a predictive model for rent prices in the Bay Area using rental listings from Craigslist.
Data: The dataset used here is Craigslist listings merged with Census and point of interest data. It includes all unique listings in the 'apartments/housing' section of Craigslist from 11/13/2016 - 3/17/2017.
Features include:
The data has been cleaned and filtered, but features have not yet been carefully selected.
Goal: Develop a model to predict rent as accurately as possible.
In [1]:
import numpy as np
import pandas as pd
import os
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In [2]:
DATA_DIR = os.path.join('..','data','urbansim')
In [22]:
# this particular dataset is a subset of all the features available
infile = 'ba_block_small.csv'
df = pd.read_csv(os.path.join(DATA_DIR,infile))
df = df.ix[:,1:] # exclude first column, which was empty
print(df.shape)
df.head()
Out[22]:
In [4]:
# Let's look at some summary stats for key features
# property characteristics
df[['rent','sqft','rent_sqft','bedrooms','bathrooms']].describe()
Out[4]:
In [5]:
# census features
census_features = ['bgpop', 'bgacres', 'bgjobs', 'bgmedkids', 'bgmedhhs',
'bgmedinc', 'proprent', 'propwhite','propblack', 'propasian', 'pumahhden', 'prop1per', 'prop2per', 'bgmedagehd',
'pct1per', 'pct2per', 'pctrent','pctblack', 'pctwhite', 'pctasian', 'bgpopden', 'bgjobden']
df[census_features].describe()
Out[5]:
In [6]:
# accessibility features
access_features = ['lowinc1500m', 'highinc1500m', 'lnjobs5000m','lnjobs30km', 'lnpop400m', 'lnpop800m', 'lnjobs800m','lntcpuw3000m',
'pumajobden', 'lnjobs40km', 'lnret3000m', 'lnfire3000m', 'lnserv3000m','highlowinc1500m']
df[access_features].describe()
Out[6]:
In [7]:
def log_var(x):
"""Return log of x, but NaN if zero."""
if x==0:
return np.nan
else:
return np.log(x)
# add ln rent
df['lnrent'] = df.rent.apply(log_var)
In [8]:
#df.columns
In [9]:
df.rent_sqft.mean()
df.sqft.mean()*.35/df.rent.mean()
Out[9]:
Inspecting the columns, some are obviously not useful or redundant. E.g., 'median_income' is just the median income for the entire region.
Define the columns we'll use...
In [10]:
cols_to_use = ['rent','rent_sqft','lnrent', 'sqft','bedrooms','bathrooms','longitude', 'latitude','bgpop','bgjobs', 'bgmedkids', 'bgmedhhs','bgmedinc', 'proprent', 'lowinc1500m', 'highinc1500m', 'lnjobs5000m',
'lnjobs30km', 'lnpop400m', 'lnpop800m', 'lnjobs800m', 'propwhite','propblack', 'propasian', 'pumahhden', 'lnbasic3000m', 'lntcpuw3000m',
'pumajobden', 'lnjobs40km', 'lnret3000m', 'lnfire3000m', 'lnserv3000m','prop1per', 'prop2per', 'bgmedagehd', 'puma1', 'puma2', 'puma3',
'puma4', 'northsf', 'pct1per', 'pct2per', 'pctrent','pctblack', 'pctwhite', 'pctasian', 'y17jan', 'y17feb', 'y17mar',
'bgpopden', 'bgjobden', 'highlowinc1500m']
x_cols = ['sqft','bedrooms','bathrooms','longitude', 'latitude','bgpop','bgjobs', 'bgmedkids', 'bgmedhhs',
'bgmedinc', 'proprent', 'lowinc1500m', 'highinc1500m', 'lnjobs5000m','lnjobs30km', 'lnpop400m', 'lnpop800m', 'lnjobs800m', 'propwhite',
'propblack', 'propasian', 'pumahhden', 'lnbasic3000m', 'lntcpuw3000m','pumajobden', 'lnjobs40km', 'lnret3000m', 'lnfire3000m', 'lnserv3000m',
'prop1per', 'prop2per', 'bgmedagehd', 'puma1', 'puma2', 'puma3','puma4', 'northsf', 'pct1per', 'pct2per', 'pctrent',
'pctblack', 'pctwhite', 'pctasian', 'y17jan', 'y17feb', 'y17mar','bgpopden', 'bgjobden', 'highlowinc1500m']
y_col = 'lnrent'
print(len(x_cols))
In [11]:
df = df[cols_to_use]
print('total rows:',len(df))
df_notnull = df.dropna(how='any')
print('excluding NAs:',len(df_notnull))
df = df_notnull
About half of rows have bathrooms missing. Turns out this is because bathrooms was only added to the scraper at the end of December. Is it better to exclude the bathrooms feature and use all the data? Or better to include the bathrooms feature and only use data collected beginning in January? I tried it both ways and the model is more accurate (and more interesting) when we include the bathrooms feature. Plus, all the data we collect from now on will have bathrooms.
In [12]:
plot_rows = math.ceil(len(cols_to_use)/2)
f, axes = plt.subplots(plot_rows,2, figsize=(8,35))
sns.despine(left=True)
for i,col in enumerate(cols_to_use):
row_position = math.floor(i/2)
col_position = i%2
sns.distplot(df_notnull[col], ax=axes[row_position, col_position],kde=False)
axes[row_position, col_position].set_title('{}'.format(col))
plt.tight_layout()
plt.show()
In [13]:
sns.distplot(df['lnrent'])
Out[13]:
In [14]:
# $/sqft as a function of sqft
#plt.scatter(df.sqft, df.rent_sqft)
#plt.ylabel('rent per sq ft')
#plt.xlabel('sqft')
#plt.show()
(Second iteration)
In the first iteration of models, there appeared to be an interaction effect with # bedrooms and # bathrooms; i.e., the number of bathrooms per bedroom seems to matter. E.g., a property with 4 bedrooms is probably not very desirable if it only has 1 bathroom, but might be much more desirable if it has 4 bedrooms and 3 bathrooms. So let's try adding a feature for bathrooms per bedroom. (Turns out this improved the linear model but not GB.. as expected.)
In [15]:
def bath_var(row):
"""Make bathrooms/bedroom variable. """
# Avoid 0 in demoninator. When bedrooms = 0, it's probably a studio. so for practical purposes, br=1
if row['bedrooms']==0:
br = 1
else:
br = row['bedrooms']
return row['bathrooms']/br
# add a variable bath_bed (bathrooms per bedroom)
df['bath_bed'] = df.apply(bath_var, axis=1)
cols_to_use.append('bath_bed')
x_cols.append('bath_bed')
It also appears the relationship between rent and sqft is not linear. Let's add a sqft2 term
In [16]:
df['sqft2'] = df['sqft']**2
cols_to_use.append('sqft2')
x_cols.append('sqft2')
In [31]:
from sklearn import linear_model, model_selection
In [36]:
print('target variable:',y_col)
X_train, X_test, y_train, y_test = model_selection.train_test_split(df[x_cols],df[y_col], test_size = .3, random_state = 201)
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
Out[36]:
In [ ]:
# Intercept
print('Intercept:', regr.intercept_)
# The coefficients
print('Coefficients:')
pd.Series(regr.coef_, index=x_cols)
These coefficients aren't very useful without the standard errors. If I were interested in building a model for interpretation, I would use the statsmodels package, which has more functionality for statistical tests.
In [41]:
from sklearn.metrics import r2_score
# See mean square error, using test data
print("Mean squared error: %.2f" % np.mean((regr.predict(X_test) - y_test) ** 2))
print("RMSE:", np.sqrt(np.mean((regr.predict(X_test) - y_test) ** 2)))
# Explained variance score: 1 is perfect prediction.
print('Variance score: %.2f' % regr.score(X_test, y_test))
# R2
print('R2:',r2_score(y_test, regr.predict(X_test)))
Scores for full dataset w/o bathrooms feature
Scores w/ bathrooms feature, dropping missing values - slightly better
Scores w/ bath_bed and sqft^2 features added
In [42]:
# It's a good idea to look at the residuals to make sure we don't have any gross violations of OLS
# Plot predicted values vs. observed
plt.scatter(regr.predict(X_train),y_train, color='blue',s=1, alpha=.5)
plt.show()
In [43]:
# plot residuals vs predicted values
plt.scatter(regr.predict(X_train), regr.predict(X_train)- y_train, color='blue',s=1, alpha=.5)
plt.scatter(regr.predict(X_test), regr.predict(X_test)- y_test, color='green',s=1, alpha=.5)
plt.show()
In [44]:
print("Training set. Mean squared error: %.5f" % np.mean((regr.predict(X_train) - y_train) ** 2), '| Variance score: %.5f' % regr.score(X_train, y_train))
print("Test set. Mean squared error: %.5f" % np.mean((regr.predict(X_test) - y_test) ** 2), '| Variance score: %.5f' % regr.score(X_test, y_test))
Does not look like overfitting. (If there were, we'd probably have a much smaller error on training set and a larger error on test set.') We do have a problem with colinearity and if we were using this model for interpretation we'd have to do better feature selection...
In [17]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
In [18]:
def RMSE(y_actual, y_predicted):
return np.sqrt(mean_squared_error(y_actual, y_predicted))
def cross_val_gb(X,y,cv_method='kfold',k=5, **params):
"""Estimate gradient boosting regressor using cross validation.
Args:
X (DataFrame): features data
y (Series): target data
cv_method (str): how to split the data ('kfold' (default) or 'timeseries')
k (int): number of folds (default=5)
**params: keyword arguments for regressor
Returns:
float: mean error (RMSE) across all training/test sets.
"""
if cv_method == 'kfold':
kf = KFold(n_splits=k, shuffle=True, random_state=2012016) # use random seed for reproducibility.
E = np.ones(k) # this array will hold the errors.
i=0
for train, test in kf.split(X, y):
train_data_x = X.iloc[train]
train_data_y = y.iloc[train]
test_data_x = X.iloc[test]
test_data_y = y.iloc[test]
# n_estimators is number of trees to build.
grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
grad_boost.fit(train_data_x,train_data_y)
predict_y=grad_boost.predict(test_data_x)
E[i] = RMSE(test_data_y, predict_y)
i+=1
return np.mean(E)
In [17]:
df_X = df[x_cols]
df_y = df[y_col]
In [18]:
Out[18]:
Tune the parameters for GB
In [20]:
params = {'n_estimators':100,
'learning_rate':0.1,
'max_depth':1,
'min_samples_leaf':4
}
grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
grad_boost.fit(df_X,df_y)
cross_val_gb(df_X,df_y, **params)
Out[20]:
In [193]:
param_grid = {'learning_rate':[.5 ,.1, .05],
'max_depth':[2,4,6,12],
'min_samples_leaf': [5,9,17],
'max_features': [1, .3, .1]
}
est= GradientBoostingRegressor(n_estimators = 500)
gs_cv = GridSearchCV(est,param_grid).fit(df_X,df_y)
In [194]:
print(gs_cv.best_params_)
print(gs_cv.best_score_)
In [189]:
param_grid = {'learning_rate':[5,.2,.1],
'max_depth':[6,8,12],
'min_samples_leaf': [17,25],
'max_features': [.3]
}
#est= GradientBoostingRegressor(n_estimators = 100)
#gs_cv = GridSearchCV(est,param_grid).fit(df_X,df_y)
In [99]:
print(gs_cv.best_params_)
print(gs_cv.best_score_)
In [196]:
# best parameters
params = {'n_estimators':500,
'learning_rate':0.1,
'max_depth':4,
'min_samples_leaf':9,
'max_features':.3
}
grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
grad_boost.fit(df_X,df_y)
cross_val_gb(df_X,df_y, **params, k=3)
Out[196]:
Best RMSE when using no bathrooms: 0.1308920
Best RMSE with bathrooms feature: 0.132212
In [197]:
# plot the importances
gb_o = pd.DataFrame({'features':x_cols,'importance':grad_boost.feature_importances_})
gb_o= gb_o.sort_values(by='importance',ascending=False)
plt.figure(1,figsize=(12, 6))
plt.xticks(range(len(gb_o)), gb_o.features,rotation=45)
plt.plot(range(len(gb_o)),gb_o.importance,"o")
plt.title('Feature importances')
plt.show()
In [198]:
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.ensemble.partial_dependence import partial_dependence
In [23]:
#for i,col in enumerate(df_X.columns):
# print(i,col)
In [202]:
features = [0,1,2,3,4, 13, 46,5,11]
names = df_X.columns
fig, axs = plot_partial_dependence(grad_boost, df_X, features,feature_names=names, grid_resolution=50, figsize = (10,8))
fig.suptitle('Partial dependence of rental price features')
plt.subplots_adjust(top=0.9) # tight_layout causes overlap with suptitle
plt.show()
In [204]:
features = [(0,2),(1,2),(3,4),(13,2)]
names = df_X.columns
fig, axs = plot_partial_dependence(grad_boost, df_X, features,feature_names=names, grid_resolution=50, figsize = (9,6))
fig.suptitle('Partial dependence of rental price features')
plt.subplots_adjust(top=0.9) # tight_layout causes overlap with suptitle
plt.show()
In [21]:
y_col = 'rent_sqft'
df_y = df[y_col]
x_cols = cols_to_use[3:]
df_X = df[x_cols]
sns.distplot(df['rent_sqft'])
Out[21]:
In [52]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(df[x_cols],df[y_col], test_size = .3, random_state = 201)
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
Out[52]:
In [54]:
# Intercept
#print('Intercept:', regr.intercept_)
# The coefficients
#print('Coefficients:')
#pd.Series(regr.coef_, index=x_cols)
In [56]:
# See mean square error, using test data
print("Mean squared error: %.2f" % np.mean((regr.predict(X_test) - y_test) ** 2))
print("RMSE:", np.sqrt(np.mean((regr.predict(X_test) - y_test) ** 2)))
# Explained variance score: 1 is perfect prediction.
print('Variance score: %.2f' % regr.score(X_test, y_test))
print('R2:',r2_score(y_test, regr.predict(X_test)))
Since the rent_sqft has a different scale than lnrent, we can't directly compare the RMSE, but we can compare variance score. This variance (0.66) is a bit lower than the linear model before (0.69), which is what we'd expect when we take the most important covariate and put it into the target variable. Still not a bad score here. Not great either, though.
Added sqft as feature, with rent_sqft still as target:
Added bath/bedrooms feature
Added sqft^2 feature
In [57]:
# Plot predicted values vs. observed
plt.scatter(regr.predict(X_train),y_train, color='blue',s=1, alpha=.5)
plt.show()
In [58]:
# plot residuals vs predicted values
plt.scatter(regr.predict(X_train), regr.predict(X_train)- y_train, color='blue',s=1, alpha=.5)
plt.scatter(regr.predict(X_test), regr.predict(X_test)- y_test, color='green',s=1, alpha=.5)
plt.show()
In [22]:
# because GB can model nonlinear relationships and interaction effects, we don't need sqft^2 or bed_bath as features.
x_cols = ['sqft','bedrooms','bathrooms','longitude', 'latitude','bgpop','bgjobs', 'bgmedkids', 'bgmedhhs',
'bgmedinc', 'proprent', 'lowinc1500m', 'highinc1500m', 'lnjobs5000m','lnjobs30km', 'lnpop400m', 'lnpop800m', 'lnjobs800m', 'propwhite',
'propblack', 'propasian', 'pumahhden', 'lnbasic3000m', 'lntcpuw3000m','pumajobden', 'lnjobs40km', 'lnret3000m', 'lnfire3000m', 'lnserv3000m',
'prop1per', 'prop2per', 'bgmedagehd', 'puma1', 'puma2', 'puma3','puma4', 'northsf', 'pct1per', 'pct2per', 'pctrent',
'pctblack', 'pctwhite', 'pctasian', 'y17jan', 'y17feb', 'y17mar','bgpopden', 'bgjobden', 'highlowinc1500m']
df_X = df[x_cols]
y_col = 'rent_sqft'
df_y = df[y_col]
df_X.shape
Out[22]:
In [23]:
params = {'n_estimators':100,
'learning_rate':0.1,
'max_depth':1,
'min_samples_leaf':4
}
#grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
#grad_boost.fit(df_X,df_y)
#cross_val_gb(df_X,df_y, **params)
Without parameter tuning, error is about the same as linear model. RMSE: 0.5947
Adding sqft as covariate : RMSE: 0.5258
Tune the parameters
In [24]:
# find the optimal parameters
param_grid = {'learning_rate':[.5 ,.1, .05],
'max_depth':[2,4,6,12],
'min_samples_leaf': [5,9,17],
'max_features': [1, .3, .1]
}
#est= GradientBoostingRegressor(n_estimators = 500)
#gs_cv = GridSearchCV(est,param_grid).fit(df_X,df_y)
In [25]:
print(gs_cv.best_params_)
print(gs_cv.best_score_)
In [26]:
# run with best paramters
params = {'n_estimators':500,
'learning_rate':0.05,
'max_depth':2,
'min_samples_leaf':5,
'max_features':.3
}
#grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
#grad_boost.fit(df_X,df_y)
#cross_val_gb(df_X,df_y, **params, k=3)
In [27]:
# best parameters, using 1000 estimators
params = {'n_estimators':1000,
'learning_rate':0.05,
'max_depth':2,
'min_samples_leaf':5,
'max_features':.3
}
grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
grad_boost.fit(df_X,df_y)
cross_val_gb(df_X,df_y, **params, k=3)
Out[27]:
In [28]:
import pickle
# save as pickle for web app
fname = 'fitted_gb.p'
with open(os.path.join(DATA_DIR,fname), 'wb') as pfile:
pickle.dump(grad_boost, pfile)
In [35]:
# to load the pickled object:
fname = 'fitted_gb.p'
with open(os.path.join(DATA_DIR,fname), 'rb') as pfile:
grad_boost2 = pickle.load(pfile)
In [32]:
# save median/mean feature values to use as defaults in the web app
df_means = df_X.mean()
fname = 'data_averages.p'
with open(os.path.join(DATA_DIR, fname), 'wb') as pfile:
pickle.dump(df_means, pfile)
In [36]:
# to load the pickled object:
fname = 'data_averages.p'
with open(os.path.join(DATA_DIR,fname), 'rb') as pfile:
df_means2 = pickle.load(pfile)
print(df_means2.head())
In [42]:
# testing prediction
y_predicted = grad_boost2.predict(df_means2)
print(y_predicted)
print(df_y.mean())
without sqft as covariate
with sqft as covariate
In [81]:
# plot the importances
gb_o = pd.DataFrame({'features':x_cols,'importance':grad_boost.feature_importances_})
gb_o= gb_o.sort_values(by='importance',ascending=False)
plt.figure(1,figsize=(12, 6))
plt.xticks(range(len(gb_o)), gb_o.features,rotation=45)
plt.plot(range(len(gb_o)),gb_o.importance,"o")
plt.title('Feature importances')
plt.show()
In [82]:
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.ensemble.partial_dependence import partial_dependence
In [83]:
# choose features for partial dependent plots
#for i,col in enumerate(df_X.columns):
# print(i,col)
In [99]:
features = [0,1,2,3,4,8,14,16,12,13,24]
names = df_X.columns
fig, axs = plot_partial_dependence(grad_boost, df_X, features,feature_names=names, grid_resolution=50, figsize = (10,10))
fig.suptitle('Partial dependence of rental price features')
plt.subplots_adjust(top=0.9) # tight_layout causes overlap with suptitle
plt.show()
In [100]:
features = [(0,1),(0,2),(1,2),(0,3),(0,9),(3,4)]
names = df_X.columns
fig, axs = plot_partial_dependence(grad_boost, df_X, features,feature_names=names, grid_resolution=50, figsize = (9,6))
fig.suptitle('Partial dependence of rental price features')
plt.subplots_adjust(top=0.9) # tight_layout causes overlap with suptitle
plt.show()
Looks like bathrooms per bedroom might be a good feature to add. Tried this; it made the linear model better but the gradient boosting worse. Maybe because the GB already takes interactiion effects into account.
In [255]:
y_pred = grad_boost.predict(df_X)
# plot predicted vs. actual
plt.scatter(y_pred,df_y, color='blue',s=1, alpha=.5)
plt.show()
In [256]:
# plot errors vs. predicted
plt.scatter(y_pred, y_pred-df_y, color='blue',s=1,alpha=.5 )
plt.show()
In [257]:
sns.distplot(y_pred-df_y)
Out[257]:
In [260]:
from matplotlib.colors import Normalize
# map the errors.
x = df_X['longitude']
y = df_X['latitude']
z = y_pred-df_y
norm = Normalize(vmin=-1,vmax=1) # zoom in on middle of error range
plt.figure(figsize=(6,9))
plt.scatter(x,y, c=z, cmap='jet',s=8,alpha=.5,edgecolors='face', norm=norm)
#plt.xlim(-122.6,-122)
#plt.ylim(37.6,38)
plt.colorbar()
plt.show()
In [261]:
# it's hard to see a pattern in the above, but looks like errors are larger in the central city areas.
# only show large errors, to make them more clear.
# map the errors.
err_cutoff = .8
print('# obs where abs(error)>{}:'.format(err_cutoff),len(z[abs(z)>err_cutoff]))
z_large_err = z[abs(z)>err_cutoff]
x_large_err = x.ix[z_large_err.index,]
y_large_err = y.ix[z_large_err.index,]
plt.figure(figsize=(6,9))
plt.scatter(x_large_err,y_large_err, c=z_large_err, cmap='jet',s=8,alpha=.5,edgecolors='face')
plt.colorbar()
plt.show()
In [262]:
plt.figure(figsize=(6,6))
plt.scatter(x_large_err,y_large_err, c=z_large_err, cmap='jet',s=8,alpha=.5,edgecolors='face')
plt.xlim(-122.5,-122.2)
plt.ylim(37.7,37.9)
plt.colorbar()
plt.show()
I can't find a clear spatial pattern with the errors. Can try with real map...
In [31]:
# this particular dataset is a subset of all the features available
infile = 'ba_block_small.csv'
df = pd.read_csv(os.path.join(DATA_DIR,infile), dtype={'fips_block':str})
df = df.ix[:,1:] # exclude first column, which was empty
print(df.shape)
df.head()
Out[31]:
In [37]:
fips_list = df['fips_block'].unique()
len(fips_list)
Out[37]:
In [45]:
cols_for_db = ['pid','fips_block','latitude','longitude']
df[cols_for_db].to_csv('/Users/lisarayle/rent_predictor/local_setup_files/data/ba_data_temp.csv', index=False)
In [42]:
# map the errors.
x = df['longitude']
y = df['latitude']
plt.figure(figsize=(6,9))
plt.scatter(x,y,s=1,alpha=.5)
#plt.xlim(-122.6,-122)
#plt.ylim(37.6,38)
plt.show()
In [ ]: