Modeling rent prices for the SF Bay Area

This notebook will develop a predictive model for rent prices in the Bay Area using rental listings from Craigslist.

Data: The dataset used here consists of Craigslist listings merged with Census and point-of-interest data. It includes all unique listings in the 'apartments/housing' section of Craigslist from 11/13/2016 to 3/17/2017.

Features include:

  • property characteristics (rent, sq ft, # bedrooms, # bathrooms, coordinates)
  • neighborhood characteristics at the Census block and block group levels (e.g., med hh income, race)
  • jobs accessibility (e.g., jobs within a given radius)
  • date listed on Craigslist

The data has been cleaned and filtered, but features have not yet been carefully selected.

Goal: Develop a model to predict rent as accurately as possible.


In [1]:
import numpy as np
import pandas as pd
import os
import math

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
DATA_DIR = os.path.join('..','data','urbansim')

Load and prepare data


In [22]:
# this particular dataset is a subset of all the features available
infile = 'ba_block_small.csv'
df = pd.read_csv(os.path.join(DATA_DIR,infile))
df = df.iloc[:, 1:]  # exclude first column, which was empty
print(df.shape)
df.head()


(16560, 70)
Out[22]:
pid date region neighborhood rent bedrooms sqft rent_sqft longitude latitude ... pctrent pctblack pctwhite pctasian y17jan y17feb y17mar bgpopden bgjobden highlowinc1500m
0 5873897352 2016-11-13 sfbay dublin / pleasanton / livermore 2250.0 1.0 687.0 3.275109 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727
1 5875136804 2016-11-14 sfbay dublin / pleasanton / livermore 1995.0 1.0 687.0 2.903930 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727
2 5869197992 2016-11-15 sfbay dublin / pleasanton / livermore 2395.0 1.0 817.0 2.931457 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727
3 5885759159 2016-11-20 sfbay dublin / pleasanton / livermore 2250.0 1.0 687.0 3.275109 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727
4 5886665547 2016-11-21 sfbay dublin / pleasanton / livermore 1995.0 1.0 629.0 3.171701 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727

5 rows × 70 columns

What kind of features do we have?


In [4]:
# Let's look at some summary stats for key features
# property characteristics
df[['rent','sqft','rent_sqft','bedrooms','bathrooms']].describe()


Out[4]:
rent sqft rent_sqft bedrooms bathrooms
count 16560.000000 16560.000000 16560.000000 16560.000000 8253.000000
mean 2994.327597 978.960870 3.214822 1.617754 1.481764
std 1056.350152 356.909985 1.002537 0.823385 0.564599
min 705.000000 180.000000 0.452830 0.000000 0.000000
25% 2299.000000 734.000000 2.568345 1.000000 1.000000
50% 2783.000000 946.000000 3.067740 2.000000 1.000000
75% 3495.000000 1150.000000 3.676923 2.000000 2.000000
max 12500.000000 4645.000000 9.750000 6.000000 4.000000

In [5]:
# census features
census_features = ['bgpop', 'bgacres', 'bgjobs', 'bgmedkids', 'bgmedhhs',
       'bgmedinc', 'proprent', 'propwhite','propblack', 'propasian', 'pumahhden', 'prop1per', 'prop2per', 'bgmedagehd',
    'pct1per', 'pct2per', 'pctrent','pctblack', 'pctwhite', 'pctasian', 'bgpopden', 'bgjobden']
df[census_features].describe()


Out[5]:
bgpop bgacres bgjobs bgmedkids bgmedhhs bgmedinc proprent propwhite propblack propasian ... prop2per bgmedagehd pct1per pct2per pctrent pctblack pctwhite pctasian bgpopden bgjobden
count 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 ... 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16559.000000 16559.000000
mean 2278.621558 1677.654083 2619.841486 0.046135 2.121377 86422.977959 0.593939 0.436069 0.035347 0.220206 ... 0.336906 45.766697 31.276702 33.690634 59.393853 3.534688 43.606874 22.020556 17.852474 14.368919
std 1692.329312 7599.460446 3897.735448 0.209784 0.623896 29934.893921 0.234371 0.310481 0.096435 0.234738 ... 0.092543 7.242487 12.433276 9.254301 23.437098 9.643516 31.048116 23.473798 17.947365 29.362617
min 0.000000 0.000000 0.000000 0.000000 1.000000 21300.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 33.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.003252
25% 1311.000000 78.938239 247.000000 0.000000 2.000000 63050.000000 0.450000 0.000000 0.000000 0.000000 ... 0.271248 40.000000 21.786834 27.124774 45.000000 0.000000 0.000000 0.000000 6.577624 1.708691
50% 1746.000000 150.049915 748.000000 0.000000 2.000000 84000.000000 0.603254 0.492701 0.000000 0.155556 ... 0.338614 46.500000 30.257009 33.861386 60.325407 0.000000 49.270073 15.555556 13.071077 5.760017
75% 2409.000000 326.435557 3085.000000 0.000000 2.000000 104000.000000 0.789809 0.675676 0.037037 0.380952 ... 0.405316 52.000000 39.347768 40.531561 78.980892 3.703704 67.567568 38.095238 23.933577 16.003817
max 10491.000000 42198.002649 19110.000000 1.000000 4.000000 203000.000000 1.000000 1.000000 0.903226 1.000000 ... 0.701342 64.000000 81.132075 70.134228 100.000000 90.322581 100.000000 100.000000 125.664465 517.608658

8 rows × 22 columns


In [6]:
# accessibility features
access_features = ['lowinc1500m', 'highinc1500m', 'lnjobs5000m','lnjobs30km', 'lnpop400m', 'lnpop800m', 'lnjobs800m','lntcpuw3000m',
       'pumajobden', 'lnjobs40km', 'lnret3000m', 'lnfire3000m', 'lnserv3000m','highlowinc1500m']
df[access_features].describe()


Out[6]:
lowinc1500m highinc1500m lnjobs5000m lnjobs30km lnpop400m lnpop800m lnjobs800m lntcpuw3000m pumajobden lnjobs40km lnret3000m lnfire3000m lnserv3000m highlowinc1500m
count 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000 16560.000000
mean 5.677033 6.074045 9.956677 12.693569 6.640413 7.507952 6.162258 5.626036 11.706303 13.053952 7.106154 5.885792 8.170593 1.117190
std 1.386705 1.305157 1.265675 0.802371 1.669962 1.748253 1.823210 1.675362 22.086445 0.681933 1.703126 1.556564 1.280461 0.570569
min 0.100059 0.627523 4.230915 10.523860 0.000000 0.000000 0.000000 0.000000 0.068592 10.822515 0.000000 0.406504 3.202994 0.633968
25% 5.313185 5.740918 9.374096 12.428492 6.383507 7.027315 5.185174 4.509885 1.295310 13.007757 6.380528 5.047726 7.619956 0.975782
50% 5.830030 6.299926 10.000783 13.065360 7.015712 7.921173 6.351109 5.814377 3.821264 13.357129 7.332416 5.707664 8.017110 1.058767
75% 6.413313 6.709478 10.415682 13.215977 7.596392 8.391403 7.335358 6.623755 10.404358 13.445881 7.971928 6.544980 8.590465 1.183156
max 8.997189 8.520856 12.594025 13.490251 8.947025 10.003016 9.829233 9.110520 82.100619 13.672037 10.514658 10.297959 11.515756 22.039128

In [7]:
def log_var(x):
    """Return log of x, but NaN if zero."""
    if x==0:
        return np.nan
    else:
        return np.log(x)

# add ln rent
df['lnrent'] = df.rent.apply(log_var)
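
An equivalent vectorized alternative (a minimal sketch; np.log of NaN stays NaN, so it matches log_var above):


In [ ]:
# vectorized alternative to the apply() above: replace 0 with NaN, then take the log
df['lnrent'] = np.log(df['rent'].replace(0, np.nan))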

In [8]:
#df.columns

In [9]:
df.rent_sqft.mean()
df.sqft.mean()*.35/df.rent.mean()


Out[9]:
0.1144284629159422

Inspecting the columns, some are clearly redundant or not useful. E.g., 'median_income' is just the median income for the entire region, so it's the same for every listing.
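
A quick way to flag candidates for dropping is to look for columns that take only a single value across all listings (a minimal sketch on the raw frame):


In [ ]:
# columns with a single unique value carry no information for prediction
n_unique = df.apply(pd.Series.nunique)
print(n_unique[n_unique <= 1])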

Define the columns we'll use...


In [10]:
cols_to_use = ['rent','rent_sqft','lnrent', 'sqft','bedrooms','bathrooms','longitude', 'latitude','bgpop','bgjobs', 'bgmedkids', 'bgmedhhs','bgmedinc', 'proprent', 'lowinc1500m', 'highinc1500m', 'lnjobs5000m',
       'lnjobs30km', 'lnpop400m', 'lnpop800m', 'lnjobs800m', 'propwhite','propblack', 'propasian', 'pumahhden', 'lnbasic3000m', 'lntcpuw3000m',
       'pumajobden', 'lnjobs40km', 'lnret3000m', 'lnfire3000m', 'lnserv3000m','prop1per', 'prop2per', 'bgmedagehd', 'puma1', 'puma2', 'puma3',
       'puma4', 'northsf', 'pct1per', 'pct2per', 'pctrent','pctblack', 'pctwhite', 'pctasian', 'y17jan', 'y17feb', 'y17mar',
       'bgpopden', 'bgjobden', 'highlowinc1500m']

x_cols = ['sqft','bedrooms','bathrooms','longitude', 'latitude','bgpop','bgjobs', 'bgmedkids', 'bgmedhhs',
       'bgmedinc', 'proprent', 'lowinc1500m', 'highinc1500m', 'lnjobs5000m','lnjobs30km', 'lnpop400m', 'lnpop800m', 'lnjobs800m', 'propwhite',
       'propblack', 'propasian', 'pumahhden', 'lnbasic3000m', 'lntcpuw3000m','pumajobden', 'lnjobs40km', 'lnret3000m', 'lnfire3000m', 'lnserv3000m',
       'prop1per', 'prop2per', 'bgmedagehd', 'puma1', 'puma2', 'puma3','puma4', 'northsf', 'pct1per', 'pct2per', 'pctrent',
       'pctblack', 'pctwhite', 'pctasian', 'y17jan', 'y17feb', 'y17mar','bgpopden', 'bgjobden', 'highlowinc1500m']
y_col = 'lnrent'
print(len(x_cols))


49

Missing data?


In [11]:
df = df[cols_to_use]
print('total rows:',len(df))
df_notnull = df.dropna(how='any')
print('excluding NAs:',len(df_notnull))
df = df_notnull


total rows: 16560
excluding NAs: 8253

About half of the rows are missing the bathrooms value. It turns out this is because bathrooms was only added to the scraper at the end of December. Is it better to exclude the bathrooms feature and use all the data, or to include it and use only data collected from January onward? I tried it both ways, and the model is more accurate (and more interesting) when we include the bathrooms feature. Plus, all the data we collect from now on will have bathrooms.
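
To see which columns drive the dropped rows, a count of missing values per column is enough; a minimal sketch (it would need to run just before the dropna() call above, while df still contains the NaNs):


In [ ]:
# count of missing values per column, largest first (run this before the dropna above)
missing = df[cols_to_use].isnull().sum()
print(missing[missing > 0].sort_values(ascending=False))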

Variable distributions

Look at distributions to get a better sense of what the variables mean and to look for outliers.


In [12]:
plot_rows = math.ceil(len(cols_to_use)/2)

f, axes = plt.subplots(plot_rows,2, figsize=(8,35))
sns.despine(left=True)

for i,col in enumerate(cols_to_use):
    row_position = math.floor(i/2)
    col_position = i%2
    sns.distplot(df_notnull[col], ax=axes[row_position, col_position],kde=False)
    axes[row_position, col_position].set_title('{}'.format(col)) 

plt.tight_layout()
plt.show()



In [13]:
sns.distplot(df['lnrent'])


Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x112353240>

In [14]:
# $/sqft as a function of sqft 
#plt.scatter(df.sqft, df.rent_sqft)
#plt.ylabel('rent per sq ft')
#plt.xlabel('sqft')
#plt.show()

Create some additional features that might help

(Second iteration)

In the first iteration of models, there appeared to be an interaction effect between # bedrooms and # bathrooms; i.e., the number of bathrooms per bedroom seems to matter. E.g., a property with 4 bedrooms is probably not very desirable if it has only 1 bathroom, but much more desirable if it has 3. So let's try adding a feature for bathrooms per bedroom. (This turned out to improve the linear model but not GB, as expected.)


In [15]:
def bath_var(row):
    """Make bathrooms-per-bedroom variable."""
    # Avoid 0 in the denominator. When bedrooms = 0 it's probably a studio,
    # so for practical purposes treat it as br = 1.
    if row['bedrooms'] == 0:
        br = 1
    else:
        br = row['bedrooms']
    return row['bathrooms'] / br

# add a variable bath_bed (bathrooms per bedroom)
df['bath_bed'] = df.apply(bath_var, axis=1)
cols_to_use.append('bath_bed')
x_cols.append('bath_bed')

It also appears the relationship between rent and sqft is not linear, so let's add a squared term, sqft2.


In [16]:
df['sqft2'] = df['sqft']**2
cols_to_use.append('sqft2')
x_cols.append('sqft2')

Linear model

Try a linear model first, at the very least to use as a baseline.


In [31]:
from sklearn import linear_model, model_selection

In [36]:
print('target variable:',y_col)
X_train, X_test, y_train, y_test = model_selection.train_test_split(df[x_cols],df[y_col], test_size = .3, random_state = 201)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)


target variable: lnrent
Out[36]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [ ]:
# Intercept
print('Intercept:', regr.intercept_)

# The coefficients
print('Coefficients:')
pd.Series(regr.coef_, index=x_cols)

These coefficients aren't very useful without the standard errors. If I were interested in building a model for interpretation, I would use the statsmodels package, which has more functionality for statistical tests.
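
For reference, a minimal statsmodels sketch (not run here) that would report standard errors and p-values for the same training data:


In [ ]:
# statsmodels OLS reports standard errors, t-statistics, and p-values
import statsmodels.api as sm

X_sm = sm.add_constant(X_train)   # add the intercept explicitly
ols_results = sm.OLS(y_train, X_sm).fit()
print(ols_results.summary())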


In [41]:
from sklearn.metrics import r2_score
# See mean square error, using test data
print("Mean squared error: %.2f" % np.mean((regr.predict(X_test) - y_test) ** 2))
print("RMSE:", np.sqrt(np.mean((regr.predict(X_test) - y_test) ** 2)))
# Explained variance score: 1 is perfect prediction. 
print('Variance score: %.2f' % regr.score(X_test, y_test))
# R2
print('R2:',r2_score(y_test, regr.predict(X_test)))


Mean squared error: 0.03
RMSE: 0.175778312162
Variance score: 0.71
R2: 0.712132308323

Scores for full dataset w/o bathrooms feature

  • Mean squared error: 0.03
  • RMSE: 0.186277629605
  • Variance score: 0.68

Scores w/ bathrooms feature, dropping missing values - slightly better

  • Mean squared error: 0.03
  • RMSE: 0.181372180936
  • Variance score: 0.69

Scores w/ bath_bed and sqft^2 features added

  • Mean squared error: 0.03
  • RMSE: 0.175778312162
  • Variance score: 0.71

In [42]:
# It's a good idea to look at the residuals to make sure there are no gross violations of OLS assumptions
# Plot predicted values vs. observed
plt.scatter(regr.predict(X_train),y_train, color='blue',s=1, alpha=.5)
plt.show()



In [43]:
# plot residuals vs predicted values
plt.scatter(regr.predict(X_train), regr.predict(X_train)- y_train, color='blue',s=1, alpha=.5)
plt.scatter(regr.predict(X_test), regr.predict(X_test)- y_test, color='green',s=1, alpha=.5)
plt.show()



In [44]:
print("Training set. Mean squared error: %.5f" % np.mean((regr.predict(X_train) - y_train) ** 2), '| Variance score: %.5f' % regr.score(X_train, y_train))
print("Test set. Mean squared error: %.5f" % np.mean((regr.predict(X_test) - y_test) ** 2), '| Variance score: %.5f' % regr.score(X_test, y_test))


Training set. Mean squared error: 0.02863 | Variance score: 0.72516
Test set. Mean squared error: 0.03090 | Variance score: 0.71213

Does not look like overfitting. (If there were overfitting, we'd probably see a much smaller error on the training set and a larger error on the test set.) We do have a problem with collinearity, though, and if we were using this model for interpretation we'd have to do better feature selection...
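
One quick collinearity diagnostic is the variance inflation factor for each feature; a minimal sketch using statsmodels (not run here):


In [ ]:
# variance inflation factors; values well above ~10 indicate strong collinearity
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = pd.Series([variance_inflation_factor(X_train.values, i)
                 for i in range(X_train.shape[1])], index=X_train.columns)
print(vif.sort_values(ascending=False).head(10))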

Gradient boosting


In [17]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

In [18]:
def RMSE(y_actual, y_predicted):
    return np.sqrt(mean_squared_error(y_actual, y_predicted))

def cross_val_gb(X,y,cv_method='kfold',k=5, **params):
    """Estimate gradient boosting regressor using cross validation.
    
    Args: 
        X (DataFrame): features data
        y (Series): target data
        cv_method (str): how to split the data (only 'kfold' is currently implemented)
        k (int): number of folds (default=5)
        **params: keyword arguments for regressor
    Returns: 
        float: mean error (RMSE) across all training/test sets.
    """
    if cv_method == 'kfold':
        kf = KFold(n_splits=k, shuffle=True, random_state=2012016)  # fixed seed for reproducibility
    else:
        raise ValueError("cv_method '{}' is not implemented; use 'kfold'".format(cv_method))

    E = np.ones(k)   # this array will hold the errors. 
    i=0
    for train, test in kf.split(X, y): 
        train_data_x = X.iloc[train]
        train_data_y = y.iloc[train]    
        test_data_x = X.iloc[test]
        test_data_y = y.iloc[test]

        # n_estimators is number of trees to build. 
        grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
        grad_boost.fit(train_data_x,train_data_y)
        predict_y=grad_boost.predict(test_data_x)
        E[i] = RMSE(test_data_y, predict_y)
        i+=1
    return np.mean(E)
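
For comparison, scikit-learn's built-in cross_val_score can do essentially the same thing; a minimal sketch (not run here), using the same KFold settings:


In [ ]:
# roughly equivalent cross-validation with sklearn's built-in helper;
# the scorer returns negative MSE, so negate it before taking the square root
from sklearn.model_selection import cross_val_score

def cross_val_gb_sklearn(X, y, k=5, **params):
    gb = GradientBoostingRegressor(loss='ls', criterion='mse', **params)
    kf = KFold(n_splits=k, shuffle=True, random_state=2012016)
    neg_mse = cross_val_score(gb, X, y, cv=kf, scoring='neg_mean_squared_error')
    return np.mean(np.sqrt(-neg_mse))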

In [17]:
df_X = df[x_cols]
df_y = df[y_col]

In [18]:
df_X.columns


Out[18]:
Index(['sqft', 'bedrooms', 'bathrooms', 'longitude', 'latitude', 'bgpop',
       'bgjobs', 'bgmedkids', 'bgmedhhs', 'bgmedinc', 'proprent',
       'lowinc1500m', 'highinc1500m', 'lnjobs5000m', 'lnjobs30km', 'lnpop400m',
       'lnpop800m', 'lnjobs800m', 'propwhite', 'propblack', 'propasian',
       'pumahhden', 'lnbasic3000m', 'lntcpuw3000m', 'pumajobden', 'lnjobs40km',
       'lnret3000m', 'lnfire3000m', 'lnserv3000m', 'prop1per', 'prop2per',
       'bgmedagehd', 'puma1', 'puma2', 'puma3', 'puma4', 'northsf', 'pct1per',
       'pct2per', 'pctrent', 'pctblack', 'pctwhite', 'pctasian', 'y17jan',
       'y17feb', 'y17mar', 'bgpopden', 'bgjobden', 'highlowinc1500m',
       'bath_bed', 'sqft2'],
      dtype='object')

Tune the parameters for GB


In [20]:
params = {'n_estimators':100,
          'learning_rate':0.1,
          'max_depth':1,
          'min_samples_leaf':4
          }
grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
grad_boost.fit(df_X,df_y)
cross_val_gb(df_X,df_y, **params)


Out[20]:
0.17768555628211918

In [193]:
param_grid = {'learning_rate':[.5 ,.1, .05],
              'max_depth':[2,4,6,12],
              'min_samples_leaf': [5,9,17],
              'max_features': [1, .3, .1]
              }

est= GradientBoostingRegressor(n_estimators = 500)
gs_cv = GridSearchCV(est,param_grid).fit(df_X,df_y)

In [194]:
print(gs_cv.best_params_)
print(gs_cv.best_score_)


{'max_depth': 2, 'learning_rate': 0.05, 'min_samples_leaf': 17, 'max_features': 0.3}
0.612986032534
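
It can also help to look beyond the single best combination; a minimal sketch (not run here) for inspecting the top grid-search results:


In [ ]:
# top parameter combinations from the grid search, best first
cv_results = pd.DataFrame(gs_cv.cv_results_)
cols = ['mean_test_score', 'std_test_score', 'params']
print(cv_results.sort_values('mean_test_score', ascending=False)[cols].head(10))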

In [189]:
param_grid = {'learning_rate':[5,.2,.1],
              'max_depth':[6,8,12],
              'min_samples_leaf': [17,25],
              'max_features': [.3]
              }

#est= GradientBoostingRegressor(n_estimators = 100)
#gs_cv = GridSearchCV(est,param_grid).fit(df_X,df_y)

In [99]:
print(gs_cv.best_params_)
print(gs_cv.best_score_)


{'max_depth': 12, 'learning_rate': 0.1, 'min_samples_leaf': 17, 'max_features': 0.3}
0.539942208017

In [196]:
# best parameters
params = {'n_estimators':500,
          'learning_rate':0.1,
          'max_depth':4,
          'min_samples_leaf':9,
          'max_features':.3
          }
grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
grad_boost.fit(df_X,df_y)
cross_val_gb(df_X,df_y, **params, k=3)


Out[196]:
0.13212563068130101

Best RMSE without the bathrooms feature: 0.1308920

Best RMSE with the bathrooms feature: 0.132212


In [197]:
# plot the importances
gb_o = pd.DataFrame({'features':x_cols,'importance':grad_boost.feature_importances_})
gb_o= gb_o.sort_values(by='importance',ascending=False)


plt.figure(1,figsize=(12, 6))
plt.xticks(range(len(gb_o)), gb_o.features,rotation=45)
plt.plot(range(len(gb_o)),gb_o.importance,"o")
plt.title('Feature importances')
plt.show()



In [198]:
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.ensemble.partial_dependence import partial_dependence

In [23]:
#for i,col in enumerate(df_X.columns): 
#    print(i,col)

In [202]:
features = [0,1,2,3,4, 13, 46,5,11]
names = df_X.columns
fig, axs = plot_partial_dependence(grad_boost, df_X, features,feature_names=names, grid_resolution=50, figsize = (10,8))
fig.suptitle('Partial dependence of rental price features')
plt.subplots_adjust(top=0.9)  # tight_layout causes overlap with suptitle
plt.show()



In [204]:
features = [(0,2),(1,2),(3,4),(13,2)]
names = df_X.columns
fig, axs = plot_partial_dependence(grad_boost, df_X, features,feature_names=names, grid_resolution=50, figsize = (9,6))
fig.suptitle('Partial dependence of rental price features')
plt.subplots_adjust(top=0.9)  # tight_layout causes overlap with suptitle
plt.show()


//anaconda/lib/python3.5/site-packages/matplotlib/contour.py:380: RuntimeWarning: invalid value encountered in true_divide
  dist = np.add.reduce(([(abs(s)[i] / L[i]) for i in range(xsize)]), -1)

Using $/sqft as target variable

What if we use price per square foot as the target variable? It arguably makes more sense conceptually.


In [21]:
y_col = 'rent_sqft'
df_y = df[y_col]

x_cols = cols_to_use[3:]
df_X = df[x_cols]
sns.distplot(df['rent_sqft'])


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1181c53c8>

The downside is that rent_sqft is not as close to normally distributed as lnrent. It's not that bad, though.

Linear model with $/sqft as target


In [52]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(df[x_cols],df[y_col], test_size = .3, random_state = 201)

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)


Out[52]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [54]:
# Intercept
#print('Intercept:', regr.intercept_)

# The coefficients
#print('Coefficients:')
#pd.Series(regr.coef_, index=x_cols)

In [56]:
# See mean square error, using test data
print("Mean squared error: %.2f" % np.mean((regr.predict(X_test) - y_test) ** 2))
print("RMSE:", np.sqrt(np.mean((regr.predict(X_test) - y_test) ** 2)))
# Explained variance score: 1 is perfect prediction. 
print('Variance score: %.2f' % regr.score(X_test, y_test))
print('R2:',r2_score(y_test, regr.predict(X_test)))


Mean squared error: 0.28
RMSE: 0.526323190891
Variance score: 0.73
R2: 0.732591555545

Since rent_sqft has a different scale than lnrent, we can't directly compare the RMSE, but we can compare variance scores. In the first iteration (before sqft and the engineered features were added), this model's variance score (0.66) was a bit lower than that of the lnrent linear model (0.69), which is what we'd expect when we take the most important covariate and fold it into the target variable. Still not a bad score, though not great either; the feature additions summarized below brought it up to 0.73. (A sketch for comparing the two targets on a common dollar scale follows the score summaries.)

Added sqft as feature, with rent_sqft still as target:

  • Mean squared error: 0.32
  • RMSE: 0.56438373549
  • Variance score: 0.69

Added bath/bedrooms feature:

  • Mean squared error: 0.31
  • RMSE: 0.559876928818
  • Variance score: 0.70

Added sqft^2 feature:

  • Mean squared error: 0.28
  • RMSE: 0.526323190891
  • Variance score: 0.73
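
To put the two target choices on a common scale, the predictions can be converted back to dollars: exponentiate the lnrent predictions, and multiply the rent_sqft predictions by sqft. A minimal sketch (not run here); the names regr_ln, regr_sqft, and y_test_rent are hypothetical and assume both models were fit on the same train/test split:


In [ ]:
# hypothetical comparison in dollars: regr_ln predicts lnrent, regr_sqft predicts
# rent_sqft, and y_test_rent holds the actual rents for the same test rows
rent_from_ln = np.exp(regr_ln.predict(X_test))                # back-transform log rent
rent_from_sqft = regr_sqft.predict(X_test) * X_test['sqft']   # $/sqft times sqft
print('RMSE ($), lnrent model: %.0f' % np.sqrt(np.mean((rent_from_ln - y_test_rent) ** 2)))
print('RMSE ($), rent_sqft model: %.0f' % np.sqrt(np.mean((rent_from_sqft - y_test_rent) ** 2)))

Note that simply exponentiating log predictions tends to under-predict mean rent (retransformation bias), so this comparison would slightly disadvantage the lnrent model unless a smearing correction is applied.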

In [57]:
# Plot predicted values vs. observed
plt.scatter(regr.predict(X_train),y_train, color='blue',s=1, alpha=.5)
plt.show()



In [58]:
# plot residuals vs predicted values
plt.scatter(regr.predict(X_train), regr.predict(X_train)- y_train, color='blue',s=1, alpha=.5)
plt.scatter(regr.predict(X_test), regr.predict(X_test)- y_test, color='green',s=1, alpha=.5)
plt.show()


Gradient boosting with $/sqft as target


In [22]:
# because GB can model nonlinear relationships and interaction effects, we don't need sqft^2 or bath_bed as features. 
x_cols = ['sqft','bedrooms','bathrooms','longitude', 'latitude','bgpop','bgjobs', 'bgmedkids', 'bgmedhhs',
       'bgmedinc', 'proprent', 'lowinc1500m', 'highinc1500m', 'lnjobs5000m','lnjobs30km', 'lnpop400m', 'lnpop800m', 'lnjobs800m', 'propwhite',
       'propblack', 'propasian', 'pumahhden', 'lnbasic3000m', 'lntcpuw3000m','pumajobden', 'lnjobs40km', 'lnret3000m', 'lnfire3000m', 'lnserv3000m',
       'prop1per', 'prop2per', 'bgmedagehd', 'puma1', 'puma2', 'puma3','puma4', 'northsf', 'pct1per', 'pct2per', 'pctrent',
       'pctblack', 'pctwhite', 'pctasian', 'y17jan', 'y17feb', 'y17mar','bgpopden', 'bgjobden', 'highlowinc1500m']

df_X = df[x_cols]
y_col = 'rent_sqft'
df_y = df[y_col]
df_X.shape


Out[22]:
(8253, 49)

In [23]:
params = {'n_estimators':100,
          'learning_rate':0.1,
          'max_depth':1,
          'min_samples_leaf':4
          }
#grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
#grad_boost.fit(df_X,df_y)
#cross_val_gb(df_X,df_y, **params)

Without parameter tuning, the error is about the same as the linear model. RMSE: 0.5947

Adding sqft as a covariate: RMSE: 0.5258

Tune the parameters


In [24]:
# find the optimal parameters
param_grid = {'learning_rate':[.5 ,.1, .05],
              'max_depth':[2,4,6,12],
              'min_samples_leaf': [5,9,17],
              'max_features': [1, .3, .1]
              }

#est= GradientBoostingRegressor(n_estimators = 500)
#gs_cv = GridSearchCV(est,param_grid).fit(df_X,df_y)

In [25]:
print(gs_cv.best_params_)
print(gs_cv.best_score_)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-25-905164be9736> in <module>()
----> 1 print(gs_cv.best_params_)
      2 print(gs_cv.best_score_)

NameError: name 'gs_cv' is not defined

In [26]:
# run with best parameters
params = {'n_estimators':500,
          'learning_rate':0.05,
          'max_depth':2,
          'min_samples_leaf':5,
          'max_features':.3
          }
#grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
#grad_boost.fit(df_X,df_y)
#cross_val_gb(df_X,df_y, **params, k=3)

In [27]:
# best parameters, using 1000 estimators
params = {'n_estimators':1000,
          'learning_rate':0.05,
          'max_depth':2,
          'min_samples_leaf':5,
          'max_features':.3
          }
grad_boost = GradientBoostingRegressor(loss='ls',criterion='mse', **params)
grad_boost.fit(df_X,df_y)
cross_val_gb(df_X,df_y, **params, k=3)


Out[27]:
0.40539199896246719

In [28]:
import pickle
# save as pickle for web app
fname = 'fitted_gb.p'
with open(os.path.join(DATA_DIR,fname), 'wb') as pfile:
    pickle.dump(grad_boost, pfile)
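
As a design note, joblib is usually a better fit than plain pickle for scikit-learn models because it handles the large numpy arrays inside the estimator more efficiently. A minimal sketch (not run here; the .joblib filename is arbitrary):


In [ ]:
# joblib serializes the numpy arrays inside sklearn estimators more efficiently
from sklearn.externals import joblib

joblib.dump(grad_boost, os.path.join(DATA_DIR, 'fitted_gb.joblib'))
grad_boost_jl = joblib.load(os.path.join(DATA_DIR, 'fitted_gb.joblib'))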

In [35]:
# to load the pickled object: 
fname = 'fitted_gb.p'
with open(os.path.join(DATA_DIR,fname), 'rb') as pfile:
    grad_boost2 = pickle.load(pfile)

In [32]:
# save mean feature values to use as defaults in the web app
df_means = df_X.mean()
fname = 'data_averages.p'
with open(os.path.join(DATA_DIR, fname), 'wb') as pfile: 
    pickle.dump(df_means, pfile)

In [36]:
# to load the pickled object: 
fname = 'data_averages.p'
with open(os.path.join(DATA_DIR,fname), 'rb') as pfile:
    df_means2 = pickle.load(pfile)
print(df_means2.head())


sqft         969.239428
bedrooms       1.620380
bathrooms      1.481764
longitude   -122.166756
latitude      37.628084
dtype: float64

In [42]:
# testing prediction; predict() expects a 2-d array, so reshape the means Series into a single row
y_predicted = grad_boost2.predict(df_means2.values.reshape(1, -1))
print(y_predicted)
print(df_y.mean())


[ 3.39677989]
3.2427675728367986

without sqft as covariate

  • RMSE with 500 estimators: 0.4708
  • RMSE with 1000 estimators: 0.4658
  • linear model: 0.5965

with sqft as covariate

  • RMSE with 500 estimators: 0.3503
  • RMSE with 1000 estimators: 0.3508
  • linear model: 0.5644

In [81]:
# plot the importances
gb_o = pd.DataFrame({'features':x_cols,'importance':grad_boost.feature_importances_})
gb_o= gb_o.sort_values(by='importance',ascending=False)


plt.figure(1,figsize=(12, 6))
plt.xticks(range(len(gb_o)), gb_o.features,rotation=45)
plt.plot(range(len(gb_o)),gb_o.importance,"o")
plt.title('Feature importances')
plt.show()



In [82]:
from sklearn.ensemble.partial_dependence import plot_partial_dependence
from sklearn.ensemble.partial_dependence import partial_dependence

In [83]:
# choose features for partial dependent plots
#for i,col in enumerate(df_X.columns): 
 #   print(i,col)

In [99]:
features = [0,1,2,3,4,8,14,16,12,13,24]
names = df_X.columns
fig, axs = plot_partial_dependence(grad_boost, df_X, features,feature_names=names, grid_resolution=50, figsize = (10,10))
fig.suptitle('Partial dependence of rental price features')
plt.subplots_adjust(top=0.9)  # tight_layout causes overlap with suptitle
plt.show()



In [100]:
features = [(0,1),(0,2),(1,2),(0,3),(0,9),(3,4)]
names = df_X.columns
fig, axs = plot_partial_dependence(grad_boost, df_X, features,feature_names=names, grid_resolution=50, figsize = (9,6))
fig.suptitle('Partial dependence of rental price features')
plt.subplots_adjust(top=0.9)  # tight_layout causes overlap with suptitle
plt.show()


Looks like bathrooms per bedroom might be a good feature to add. I tried this; it made the linear model better but the gradient boosting worse, maybe because GB already takes interaction effects into account.

Error analysis

  • plot errors
  • map errors by long/lat

In [255]:
y_pred = grad_boost.predict(df_X)
# plot predicted vs. actual

plt.scatter(y_pred,df_y, color='blue',s=1, alpha=.5)
plt.show()



In [256]:
# plot errors vs. predicted
plt.scatter(y_pred, y_pred-df_y, color='blue',s=1,alpha=.5 )
plt.show()



In [257]:
sns.distplot(y_pred-df_y)


Out[257]:
<matplotlib.axes._subplots.AxesSubplot at 0x11470f358>

In [260]:
from matplotlib.colors import Normalize
# map the errors. 
x = df_X['longitude']
y = df_X['latitude']
z = y_pred-df_y

norm = Normalize(vmin=-1,vmax=1)  # zoom in on middle of error range

plt.figure(figsize=(6,9))
plt.scatter(x,y, c=z, cmap='jet',s=8,alpha=.5,edgecolors='face', norm=norm)
#plt.xlim(-122.6,-122)
#plt.ylim(37.6,38)
plt.colorbar()
plt.show()



In [261]:
# it's hard to see a pattern in the above, but looks like errors are larger in the central city areas. 
# only show large errors, to make them more clear. 
# map the errors. 

err_cutoff = .8
print('# obs where abs(error)>{}:'.format(err_cutoff),len(z[abs(z)>err_cutoff]))
z_large_err = z[abs(z)>err_cutoff]
x_large_err = x.loc[z_large_err.index]
y_large_err = y.loc[z_large_err.index]


plt.figure(figsize=(6,9))
plt.scatter(x_large_err,y_large_err, c=z_large_err, cmap='jet',s=8,alpha=.5,edgecolors='face')

plt.colorbar()
plt.show()


# obs where abs(error)>0.8: 320

In [262]:
plt.figure(figsize=(6,6))
plt.scatter(x_large_err,y_large_err, c=z_large_err, cmap='jet',s=8,alpha=.5,edgecolors='face')
plt.xlim(-122.5,-122.2)
plt.ylim(37.7,37.9)
plt.colorbar()
plt.show()


I can't find a clear spatial pattern in the errors. It might be worth trying this on a real basemap...
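
Short of plotting on a real basemap, spatially aggregating the errors can make any pattern easier to see; a minimal sketch using the x, y, z arrays defined above, showing the mean error per hexagonal bin:


In [ ]:
# mean prediction error per hex cell; aggregation smooths out point-level noise
plt.figure(figsize=(6, 9))
plt.hexbin(x, y, C=z, reduce_C_function=np.mean, gridsize=40,
           cmap='jet', vmin=-1, vmax=1)
plt.colorbar(label='mean error (predicted - actual)')
plt.show()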

Write to database for use in rent-predictor app


In [31]:
# this particular dataset is a subset of all the features available
infile = 'ba_block_small.csv'
df = pd.read_csv(os.path.join(DATA_DIR,infile), dtype={'fips_block':str})
df = df.iloc[:, 1:]  # exclude first column, which was empty
print(df.shape)
df.head()


(16560, 70)
Out[31]:
pid date region neighborhood rent bedrooms sqft rent_sqft longitude latitude ... pctrent pctblack pctwhite pctasian y17jan y17feb y17mar bgpopden bgjobden highlowinc1500m
0 5873897352 2016-11-13 sfbay dublin / pleasanton / livermore 2250.0 1.0 687.0 3.275109 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727
1 5875136804 2016-11-14 sfbay dublin / pleasanton / livermore 1995.0 1.0 687.0 2.903930 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727
2 5869197992 2016-11-15 sfbay dublin / pleasanton / livermore 2395.0 1.0 817.0 2.931457 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727
3 5885759159 2016-11-20 sfbay dublin / pleasanton / livermore 2250.0 1.0 687.0 3.275109 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727
4 5886665547 2016-11-21 sfbay dublin / pleasanton / livermore 1995.0 1.0 629.0 3.171701 -121.888125 37.708458 ... 64.434687 0.0 0.0 0.0 False False False 10.073916 7.875687 1.504727

5 rows × 70 columns


In [37]:
fips_list = df['fips_block'].unique()
len(fips_list)


Out[37]:
262

In [45]:
cols_for_db = ['pid','fips_block','latitude','longitude']
df[cols_for_db].to_csv('/Users/lisarayle/rent_predictor/local_setup_files/data/ba_data_temp.csv', index=False)
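
The cell above writes a CSV for the app's local setup. If the app were reading from an actual database, a minimal sketch with SQLAlchemy could look like this (the connection string and table name are placeholders, not the app's real configuration):


In [ ]:
# hypothetical database write; connection string and table name are placeholders
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost:5432/rentdb')
df[cols_for_db].to_sql('listings', engine, if_exists='replace', index=False)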

In [42]:
# map the listing locations
x = df['longitude']
y = df['latitude']

plt.figure(figsize=(6,9))
plt.scatter(x,y,s=1,alpha=.5)
#plt.xlim(-122.6,-122)
#plt.ylim(37.6,38)

plt.show()



In [ ]: