Tutorial on Hedonic Regression

This material uses Python to demonstrate some aspects of hedonic regression, but the objective here is not to learn to program, but to understand the hedonic regression methodology. We begin with an example in which we generate some synthetic data using a set of coefficients and a mathematical model, and learn those coefficients using a statistical method called multiple regression.

Simple Linear Regression

Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:

$y = \beta_0 + \beta_1x$

What does each term represent?

$y$ is the response
$x$ is the feature
$\beta_0$ is the intercept
$\beta_1$ is the coefficient for x

Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To create your model, you must "learn" the values of these coefficients. And once we've learned these coefficients, we can use the model to predict values of $y$ based on new values of $x$.

Estimating ("Learning") Model Coefficients

Generally speaking, coefficients are estimated using the least squares criterion, which means we are find the line (mathematically) which minimizes the sum of squared residuals (or "sum of squared errors"):

What elements are present in the diagram?

The black dots are the observed values of x and y.
The blue line is our least squares line.
The red lines are the residuals, which are the distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?

$\beta_0$ is the intercept (the value of $y$ when $x$=0)
$\beta_1$ is the slope (the change in $y$ divided by change in $x$)

Here is a graphical depiction of those calculations:



In [18]:

    
from IPython.display import display, Image
display(Image("regression.png", width=100))



In [20]:

    
# Startup steps
import pandas as pd, numpy as np, statsmodels.api as sm
import matplotlib.pyplot as plt, matplotlib.cm as cm, matplotlib.font_manager as fm
import matplotlib.mlab as mlab
from scipy.stats import pearsonr, ttest_rel
%matplotlib inline









    



/Users/waddell/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

Generate data using a model we define:



In [21]:

    
nsample = 100
x = np.linspace(0, 10, nsample)
beta = np.array([0, 2])
e = np.random.normal(size=nsample)
X = sm.add_constant(x)
y = np.dot(X, beta) + e

Plot the data and the model. Note that the intercept is set to zero in this example initially.



In [22]:

    
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [0, 20])
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 20])
plt.show();



In [23]:

    
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.969
Model:                            OLS   Adj. R-squared:                  0.968
Method:                 Least Squares   F-statistic:                     3027.
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           1.70e-75
Time:                        12:26:28   Log-Likelihood:                -145.37
No. Observations:                 100   AIC:                             294.7
Df Residuals:                      98   BIC:                             299.9
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0404      0.208      0.194      0.846      -0.372       0.452
x1             1.9736      0.036     55.022      0.000       1.902       2.045
==============================================================================
Omnibus:                        3.182   Durbin-Watson:                   1.750
Prob(Omnibus):                  0.204   Jarque-Bera (JB):                2.545
Skew:                          -0.299   Prob(JB):                        0.280
Kurtosis:                       3.503   Cond. No.                         11.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



In [24]:

    
print('Parameters: ', results.params)
print('R2: ', results.rsquared)









    



Parameters:  [ 0.04035354  1.97359169]
R2:  0.968644727852



In [26]:

    
results.params









    Out[26]:





array([ 0.04035354,  1.97359169])



In [25]:

    
results.conf_int()









    Out[25]:





array([[-0.37164397,  0.45235105],
       [ 1.90241115,  2.04477223]])



In [21]:

    
nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x**2))
beta = np.array([1, 2, .5])
e = np.random.normal(size=nsample)



In [22]:

    
X = sm.add_constant(X)
y = np.dot(X, beta) + e



In [23]:

    
X[:10]









    Out[23]:





array([[ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.1010101 ,  0.01020304],
       [ 1.        ,  0.2020202 ,  0.04081216],
       [ 1.        ,  0.3030303 ,  0.09182736],
       [ 1.        ,  0.4040404 ,  0.16324865],
       [ 1.        ,  0.50505051,  0.25507601],
       [ 1.        ,  0.60606061,  0.36730946],
       [ 1.        ,  0.70707071,  0.49994898],
       [ 1.        ,  0.80808081,  0.65299459],
       [ 1.        ,  0.90909091,  0.82644628]])



In [24]:

    
plt.scatter(x,y, s=5)









    Out[24]:





<matplotlib.collections.PathCollection at 0x112873278>



In [15]:

    
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.997
Model:                            OLS   Adj. R-squared:                  0.997
Method:                 Least Squares   F-statistic:                 1.650e+04
Date:                Wed, 15 Nov 2017   Prob (F-statistic):          1.39e-123
Time:                        18:13:00   Log-Likelihood:                -153.79
No. Observations:                 100   AIC:                             313.6
Df Residuals:                      97   BIC:                             321.4
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.1250      0.336      3.345      0.001       0.457       1.793
x1             1.9813      0.155     12.746      0.000       1.673       2.290
x2             0.5023      0.015     33.395      0.000       0.472       0.532
==============================================================================
Omnibus:                        2.131   Durbin-Watson:                   2.151
Prob(Omnibus):                  0.345   Jarque-Bera (JB):                1.580
Skew:                           0.102   Prob(JB):                        0.454
Kurtosis:                       2.419   Cond. No.                         144.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



In [ ]:



In [ ]:



In [25]:

    
sf = pd.read_csv('data/redfin_2017-03-05-17-45-34-san-francisco-county-1-month.csv')
sf.columns









    Out[25]:





Index(['SALE TYPE', 'SOLD DATE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE',
       'ZIP', 'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'DAYS ON MARKET', '$/SQUARE FEET', 'HOA/MONTH', 'STATUS',
       'NEXT OPEN HOUSE START TIME', 'NEXT OPEN HOUSE END TIME',
       'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)',
       'SOURCE', 'MLS#', 'FAVORITE', 'INTERESTED', 'LATITUDE', 'LONGITUDE'],
      dtype='object')



In [26]:

    
sf1 = sf.rename(index=str, columns={'SALE TYPE': 'saletype',
    'SOLD DATE': 'solddate', 'PROPERTY TYPE': 'proptype', 'ADDRESS': 'address',
    'CITY': 'city', 'STATE': 'state', 'ZIP': 'zip', 'PRICE': 'price', 'BEDS': 'beds',
    'BATHS': 'baths', 'LOCATION': 'location', 'SQUARE FEET': 'sqft', 'LOT SIZE': 'lotsize',
    'YEAR BUILT': 'yrbuilt', 'DAYS ON MARKET': 'daysonmkt', '$/SQUARE FEET': 'pricesqft',
    'LATITUDE': 'latitude', 'LONGITUDE': 'longitude', 'HOA/MONTH': 'hoamonth',
    'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)': 'url',
    'STATUS': 'status', 'NEXT OPEN HOUSE START TIME': 'nextopenstart', 'NEXT OPEN HOUSE END TIME': 'nextopenend',
    'SOURCE': 'source', 'MLS#': 'mls', 'FAVORITE': 'favorite', 'INTERESTED': 'interested'
    })

sf1.head()









    Out[26]:







  
    
      
      saletype
      solddate
      proptype
      address
      city
      state
      zip
      price
      beds
      baths
      ...
      status
      nextopenstart
      nextopenend
      url
      source
      mls
      favorite
      interested
      latitude
      longitude
    
  
  
    
      0
      PAST SALE
      February-7-2017
      Single Family Residential
      521 Cayuga Ave
      San Francisco
      CA
      94112.0
      1200000.0
      2.0
      1.25
      ...
      Sold
      NaN
      NaN
      http://www.redfin.com/CA/San-Francisco/521-Cay...
      San Francisco MLS
      453167
      N
      Y
      37.728860
      -122.435308
    
    
      1
      PAST SALE
      February-10-2017
      Condo/Co-op
      95 Red Rock Way Unit 105M
      San Francisco
      CA
      94131.0
      590000.0
      0.0
      1.00
      ...
      Sold
      NaN
      NaN
      http://www.redfin.com/CA/San-Francisco/95-Red-...
      San Francisco MLS
      452726
      N
      Y
      37.746324
      -122.441517
    
    
      2
      PAST SALE
      February-10-2017
      Single Family Residential
      3157 Baker St
      San Francisco
      CA
      94123.0
      3260000.0
      5.0
      3.50
      ...
      Sold
      NaN
      NaN
      http://www.redfin.com/CA/San-Francisco/3157-Ba...
      San Francisco MLS
      451943
      N
      Y
      37.800218
      -122.446455
    
    
      3
      PAST SALE
      February-17-2017
      Single Family Residential
      2594 San Jose Ave
      San Francisco
      CA
      94112.0
      920000.0
      3.0
      2.00
      ...
      Sold
      NaN
      NaN
      http://www.redfin.com/CA/San-Francisco/2594-Sa...
      MetroList
      16067859
      N
      Y
      37.716435
      -122.450444
    
    
      4
      PAST SALE
      February-27-2017
      Single Family Residential
      130 Brazil Ave
      SAN FRANCISCO
      CA
      94112.0
      790000.0
      2.0
      1.00
      ...
      Sold
      NaN
      NaN
      http://www.redfin.com/CA/San-Francisco/130-Bra...
      MLSListings
      ML81635907
      N
      Y
      37.723920
      -122.433735
    
  

5 rows × 27 columns



In [27]:

    
sf1.describe()









    Out[27]:







  
    
      
      zip
      price
      beds
      baths
      sqft
      lotsize
      yrbuilt
      daysonmkt
      pricesqft
      hoamonth
      nextopenstart
      nextopenend
      latitude
      longitude
    
  
  
    
      count
      308.000000
      3.080000e+02
      301.000000
      293.000000
      266.000000
      139.000000
      288.000000
      306.00000
      264.000000
      159.000000
      0.0
      0.0
      309.000000
      309.000000
    
    
      mean
      93888.253247
      1.521055e+06
      2.647841
      1.880546
      1662.593985
      3194.935252
      1952.006944
      14.30719
      949.905303
      543.874214
      NaN
      NaN
      37.760820
      -122.431289
    
    
      std
      2820.378426
      1.063400e+06
      1.862499
      0.896519
      1169.031917
      4656.050424
      39.950230
      8.87308
      279.135062
      436.562243
      NaN
      NaN
      0.025052
      0.027834
    
    
      min
      59059.000000
      3.500000e+03
      0.000000
      1.000000
      300.000000
      611.000000
      1885.000000
      2.00000
      339.000000
      4.000000
      NaN
      NaN
      37.709249
      -122.510730
    
    
      25%
      94109.000000
      8.787500e+05
      2.000000
      1.000000
      978.750000
      2319.500000
      1916.750000
      6.00000
      773.750000
      342.500000
      NaN
      NaN
      37.739065
      -122.443876
    
    
      50%
      94114.000000
      1.271000e+06
      2.000000
      2.000000
      1360.000000
      2513.000000
      1941.000000
      12.00000
      963.500000
      473.000000
      NaN
      NaN
      37.762549
      -122.428527
    
    
      75%
      94122.000000
      1.813750e+06
      3.000000
      2.000000
      1904.750000
      3090.000000
      1996.000000
      23.00000
      1106.000000
      683.000000
      NaN
      NaN
      37.781520
      -122.412238
    
    
      max
      94158.000000
      1.025000e+07
      18.000000
      5.500000
      9504.000000
      56024.000000
      2017.000000
      30.00000
      2053.000000
      4413.000000
      NaN
      NaN
      37.805816
      -122.371007



In [28]:

    
plt.figure(1, figsize=(10,8), )
plt.scatter(sf1['sqft'], sf1['price'], marker=0, s=10, c='g')
#plt.axis([12, 16, 12, 16])
plt.show();



In [20]:

    
import statsmodels.api as sm
import numpy as np
from patsy import dmatrices
y, X = dmatrices('np.log(price) ~ np.log(sqft) + C(baths)', 
                 data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.743
Model:                            OLS   Adj. R-squared:                  0.732
Method:                 Least Squares   F-statistic:                     70.43
Date:                Wed, 15 Nov 2017   Prob (F-statistic):           3.65e-66
Time:                        18:13:16   Log-Likelihood:                -14.895
No. Observations:                 255   AIC:                             51.79
Df Residuals:                     244   BIC:                             90.74
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            9.1469      0.351     26.076      0.000       8.456       9.838
C(baths)[T.1.25]    -0.1167      0.189     -0.616      0.538      -0.490       0.256
C(baths)[T.1.5]      0.0383      0.080      0.482      0.630      -0.118       0.195
C(baths)[T.2.0]      0.1334      0.047      2.853      0.005       0.041       0.226
C(baths)[T.2.5]      0.2938      0.072      4.098      0.000       0.153       0.435
C(baths)[T.3.0]      0.1741      0.084      2.067      0.040       0.008       0.340
C(baths)[T.3.5]      0.5117      0.124      4.125      0.000       0.267       0.756
C(baths)[T.4.0]     -0.0601      0.150     -0.402      0.688      -0.355       0.234
C(baths)[T.4.5]      0.9103      0.171      5.324      0.000       0.574       1.247
C(baths)[T.5.0]      0.1603      0.168      0.955      0.341      -0.170       0.491
C(baths)[T.5.5]           0          0        nan        nan           0           0
np.log(sqft)         0.6632      0.052     12.872      0.000       0.562       0.765
==============================================================================
Omnibus:                       21.167   Durbin-Watson:                   1.990
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               30.400
Skew:                          -0.551   Prob(JB):                     2.50e-07
Kurtosis:                       4.283   Cond. No.                          inf
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is      0. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.






    



/Users/Paul/anaconda3/lib/python3.6/site-packages/statsmodels/regression/linear_model.py:1471: RuntimeWarning: divide by zero encountered in double_scalars
  return np.sqrt(eigvals[0]/eigvals[-1])
/Users/Paul/anaconda3/lib/python3.6/site-packages/statsmodels/base/model.py:1036: RuntimeWarning: invalid value encountered in true_divide
  return self.params / self.bse
/Users/Paul/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/Users/Paul/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/Users/Paul/anaconda3/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1818: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)



In [21]:

    
plt.hist(residuals, bins=25, normed=True, alpha=.5)
mu = residuals.mean()
variance = residuals.var()
sigma = residuals.std()
x = np.linspace(-3, 3, 100)
plt.plot(x,mlab.normpdf(x, mu, sigma));



In [22]:

    
plt.figure(1, figsize=(10,8), )
plt.plot([12, 16], [0, 0], c='b')
plt.scatter(predicted, residuals, marker=0, s=10, c='g');
plt.axis([12, 16, -0.8, 0.8])
plt.show();



In [24]:

    
plt.figure(1, figsize=(10,8), )
plt.plot([12, 16], [12, 16])
plt.scatter(observed, predicted, marker=0, s=10, c='g')
plt.axis([12, 16, 12, 16])
plt.show();



In [25]:

    
vars = pd.read_csv('data/ba_block_variables.csv', dtype={'block_id':object})



In [26]:

    
vars.shape









    Out[26]:





(109228, 1015)



In [27]:

    
smallvars = vars[['block_id','block_groups_sum_persons','block_groups_sum_acres','block_groups_total_jobs',
                 'block_groups_median_children', 'block_groups_median_persons' , 'block_groups_median_income',
                 'block_groups_prop_tenure_2', 'nodes_low_income_hh_1500m', 'nodes_high_income_hh_1500m',
                 'nodes_jobs_5000m', 'nodes_jobs_30km', 'nodes_population_400m', 'nodes_population_800m',
                 'nodes_jobs_800m', 'prop_race_of_head_1', 'prop_race_of_head_2', 'pumas_density_households',
                 'nodes_jobs_3000m_agg1', 'nodes_jobs_3000m_agg2', 'pumas_density_jobs', 'nodes_jobs_40km',
                 'nodes_jobs_3000m_agg3', 'nodes_jobs_3000m_agg4', 'nodes_jobs_3000m_agg5',
                 'block_groups_prop_persons_1', 'block_groups_prop_persons_2', 'block_groups_median_age_of_head',
                 'puma10_id_is_0607501', 'puma10_id_is_0607502', 'puma10_id_is_0607503', 'puma10_id_is_0607504']]
sv = smallvars.rename(index=str, columns={'block_groups_sum_persons': 'bgpop',
                                     'block_groups_sum_acres': 'bgacres',
                                    'block_groups_total_jobs': 'bgjobs',
                                    'block_groups_median_children': 'bgmedkids',
                                    'block_groups_median_persons': 'bgmedhhs',
                                    'block_groups_median_income': 'bgmedinc',
                                    'block_groups_prop_tenure_2': 'proprent',
                                    'nodes_low_income_hh_1500m': 'lowinc1500m',
                                    'nodes_high_income_hh_1500m': 'highinc1500m',
                                    'nodes_jobs_5000m': 'lnjobs5000m',
                                    'nodes_jobs_30km': 'lnjobs30km',
                                    'nodes_jobs_40km': 'lnjobs40km',
                                    'nodes_population_400m': 'lnpop400m',
                                    'nodes_population_800m': 'lnpop800m',
                                    'nodes_jobs_800m': 'lnjobs800m',
                                    'prop_race_of_head_1': 'propwhite',
                                    'prop_race_of_head_2': 'propblack',
                                    'pumas_density_households': 'pumahhden',
                                    'pumas_density_jobs': 'pumajobden',
                                    'nodes_jobs_3000m_agg1': 'lnbasic3000m',
                                    'nodes_jobs_3000m_agg2': 'lntcpuw3000m',
                                    'nodes_jobs_3000m_agg3': 'lnret3000m',
                                    'nodes_jobs_3000m_agg4': 'lnfire3000m',
                                    'nodes_jobs_3000m_agg5': 'lnserv3000m',
                                    'block_groups_prop_persons_1': 'prop1per',
                                    'block_groups_prop_persons_2': 'prop2per',
                                    'block_groups_median_age_of_head': 'bgmedagehd',
                                    'puma10_id_is_0607501': 'puma1',
                                    'puma10_id_is_0607502': 'puma2',
                                    'puma10_id_is_0607503': 'puma3',
                                    'puma10_id_is_0607504': 'puma4'
                                         })
sv.head()









    Out[27]:







  
    
      
      block_id
      bgpop
      bgacres
      bgjobs
      bgmedkids
      bgmedhhs
      bgmedinc
      proprent
      lowinc1500m
      highinc1500m
      ...
      lnret3000m
      lnfire3000m
      lnserv3000m
      prop1per
      prop2per
      bgmedagehd
      puma1
      puma2
      puma3
      puma4
    
  
  
    
      0
      060014271001000
      888.0
      90.106156
      465.0
      0.0
      2.0
      80000.0
      0.558603
      5.991451
      6.114498
      ...
      7.457814
      5.329694
      8.082779
      0.34414
      0.381546
      57.0
      0
      0
      0
      0
    
    
      1
      060014271001001
      888.0
      90.106156
      465.0
      0.0
      2.0
      80000.0
      0.558603
      5.991451
      6.114498
      ...
      7.457814
      5.329694
      8.082779
      0.34414
      0.381546
      57.0
      0
      0
      0
      0
    
    
      2
      060014271001002
      888.0
      90.106156
      465.0
      0.0
      2.0
      80000.0
      0.558603
      5.991451
      6.114498
      ...
      7.457814
      5.329694
      8.082779
      0.34414
      0.381546
      57.0
      0
      0
      0
      0
    
    
      3
      060014271001003
      888.0
      90.106156
      465.0
      0.0
      2.0
      80000.0
      0.558603
      6.231131
      6.339032
      ...
      7.558539
      5.431654
      8.188618
      0.34414
      0.381546
      57.0
      0
      0
      0
      0
    
    
      4
      060014271001004
      888.0
      90.106156
      465.0
      0.0
      2.0
      80000.0
      0.558603
      6.107350
      5.763724
      ...
      7.690577
      5.521359
      8.287781
      0.34414
      0.381546
      57.0
      0
      0
      0
      0
    
  

5 rows × 32 columns

Recoded detailed race code 1 .White alone 2 .Black or African American alone 3 .American Indian alone 4 .Alaska Native alone 5 .American Indian and Alaska Native tribes specified; or American .Indian or Alaska native, not specified and no other races 6 .Asian alone 7 .Native Hawaiian and Other Pacific Islander alone 8 .Some other race alone 9 .Two or more major race groups



In [29]:

    
with open('pg_engine.txt') as f:
    pg_engine = f.readlines()
from sqlalchemy import create_engine
engine = create_engine(pg_engine[0])



In [30]:

    
%%time
import pandas as pd
df = pd.read_sql_query('select * from "rental_listings"',con=engine)









    



CPU times: user 49.8 s, sys: 11.5 s, total: 1min 1s
Wall time: 2min 7s



In [31]:

    
print(df.dtypes)
df.describe()









    



pid              object
date             object
region           object
neighborhood     object
rent            float64
bedrooms        float64
sqft            float64
rent_sqft       float64
longitude       float64
latitude        float64
county           object
fips_block       object
state            object
bathrooms       float64
dtype: object






    Out[31]:







  
    
      
      rent
      bedrooms
      sqft
      rent_sqft
      longitude
      latitude
      bathrooms
    
  
  
    
      count
      1.215242e+07
      1.215242e+07
      1.215242e+07
      1.215242e+07
      1.215242e+07
      1.215242e+07
      1.037351e+07
    
    
      mean
      4.493626e+04
      1.803428e+00
      1.061739e+03
      5.942926e+01
      -9.616490e+01
      3.693181e+01
      1.540318e+06
    
    
      std
      9.081297e+06
      9.543504e-01
      9.648805e+03
      1.454028e+04
      1.685027e+01
      5.198811e+00
      3.515004e+09
    
    
      min
      1.000000e+00
      0.000000e+00
      1.000000e+00
      1.000001e-06
      -1.793005e+02
      -8.539851e+01
      0.000000e+00
    
    
      25%
      8.710000e+02
      1.000000e+00
      7.360000e+02
      9.332310e-01
      -1.150669e+02
      3.343878e+01
      1.000000e+00
    
    
      50%
      1.210000e+03
      2.000000e+00
      9.400000e+02
      1.267183e+00
      -9.420281e+01
      3.719705e+01
      1.000000e+00
    
    
      75%
      1.730000e+03
      2.000000e+00
      1.170000e+03
      1.875809e+00
      -8.128140e+01
      4.011384e+01
      2.000000e+00
    
    
      max
      2.147484e+09
      1.270000e+02
      8.388607e+06
      1.651910e+07
      1.000000e+03
      1.653830e+02
      8.494864e+12



In [32]:

    
# convert the date column to yyyy-mm-dd date format
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')



In [33]:

    
# load the 2014 census data set of MSAs
census = pd.read_csv('data/census_pop_income.csv', encoding='ISO-8859-1')
census['median_income'] = census['2014_median_income'].str.replace(',','').astype(int)
census['population'] = census['2014_pop_est'].str.replace(',','').astype(int)
census = census.drop(labels='notes', axis=1, inplace=False)
census = census.set_index('region')
census.head()









    Out[33]:







  
    
      
      2014_median_income
      2014_pop_est
      census_name
      median_income
      population
    
    
      region
      
      
      
      
      
    
  
  
    
      newyork
      67,066
      20,092,883
      New York-Newark-Jersey City, NY-NJ-PA MSA
      67066
      20092883
    
    
      losangeles
      60,514
      13,262,220
      Los Angeles-Long Beach-Anaheim, CA MSA
      60514
      13262220
    
    
      chicago
      61,598
      9,554,598
      Chicago-Naperville-Elgin, IL-IN-WI MSA
      61598
      9554598
    
    
      sfbay
      80,600
      8,607,423
      San Jose-San Francisco-Oakland, CA CSA
      80600
      8607423
    
    
      dallas
      59,530
      6,954,330
      Dallas-Fort Worth-Arlington, TX MSA
      59530
      6954330



In [39]:

    
# these are the 15 most populous metros by population, defined by census bureau 2014 estimates
most_populous_regions = census['2014_pop_est'].sort_values(ascending=False)
print(most_populous_regions.head(15))









    



region
honolulu          991,788
tulsa             969,224
fresno            965,974
albuquerque       904,587
chicago         9,554,598
newhaven          861,277
sfbay           8,607,423
dallas          6,954,330
houston         6,490,180
philadelphia    6,051,170
washingtondc    6,033,737
miami           5,929,819
atlanta         5,614,323
boston          4,732,161
phoenix         4,489,109
Name: 2014_pop_est, dtype: object



In [35]:

    
df = df.merge(census, left_on='region', right_index=True)
df.head()









    Out[35]:







  
    
      
      pid
      date
      region
      neighborhood
      rent
      bedrooms
      sqft
      rent_sqft
      longitude
      latitude
      county
      fips_block
      state
      bathrooms
      2014_median_income
      2014_pop_est
      census_name
      median_income
      population
    
  
  
    
      59
      5873902666
      2016-11-13
      cincinnati
      Cincinnati
      875.0
      1.0
      750.0
      1.166667
      -84.555389
      39.108654
      Hamilton
      390610096001000
      OH
      NaN
      55,729
      2,149,449
      Cincinnati, OH-KY-IN MSA
      55729
      2149449
    
    
      1702
      5839862773
      2016-11-13
      cincinnati
      Cincinnati, Sycamore, Mason, Symmes
      1250.0
      2.0
      1151.0
      1.086012
      -84.330700
      39.269200
      Hamilton
      390610243014010
      OH
      NaN
      55,729
      2,149,449
      Cincinnati, OH-KY-IN MSA
      55729
      2149449
    
    
      3264
      5839904702
      2016-11-13
      cincinnati
      Cincinnati, Sycamore, Mason, Symmes
      935.0
      1.0
      690.0
      1.355072
      -84.330700
      39.269200
      Hamilton
      390610243014010
      OH
      NaN
      55,729
      2,149,449
      Cincinnati, OH-KY-IN MSA
      55729
      2149449
    
    
      4376
      5839895285
      2016-11-13
      cincinnati
      Cincinnati, Sycamore, Mason, Symmes
      935.0
      1.0
      690.0
      1.355072
      -84.330700
      39.269200
      Hamilton
      390610243014010
      OH
      NaN
      55,729
      2,149,449
      Cincinnati, OH-KY-IN MSA
      55729
      2149449
    
    
      5020
      5874127885
      2016-11-13
      cincinnati
      Oakley, UT
      700.0
      3.0
      2085.0
      0.335731
      -111.245600
      40.724300
      Summit
      490439642021006
      UT
      NaN
      55,729
      2,149,449
      Cincinnati, OH-KY-IN MSA
      55729
      2149449



In [ ]:

    
# Create an HDF5 file if desired, in the data directory
#df.to_hdf('data/rents.h5','rents',append=False)



In [ ]:

    
# Load from HDF5 if desired
#df = pd.HDFStore('data/rents.h5')
#df



In [36]:

    
upper_percentile = 0.998
lower_percentile = 0.002

# how many rows would be within the upper and lower percentiles?
upper = int(len(df) * upper_percentile)
lower = int(len(df) * lower_percentile)

# get the rent/sqft values at the upper and lower percentiles
rent_sqft_sorted = df['rent_sqft'].sort_values(ascending=True, inplace=False)
upper_rent_sqft = rent_sqft_sorted.iloc[upper]
lower_rent_sqft = rent_sqft_sorted.iloc[lower]

# get the rent values at the upper and lower percentiles
rent_sorted = df['rent'].sort_values(ascending=True, inplace=False)
upper_rent = rent_sorted.iloc[upper]
lower_rent = rent_sorted.iloc[lower]

# get the sqft values at the upper and lower percentiles
sqft_sorted = df['sqft'].sort_values(ascending=True, inplace=False)
upper_sqft = sqft_sorted.iloc[upper]
lower_sqft = sqft_sorted.iloc[lower]

print('valid rent_sqft range:', [lower_rent_sqft, upper_rent_sqft])
print('valid rent range:', [lower_rent, upper_rent])
print('valid sqft range:', [lower_sqft, upper_sqft])









    



valid rent_sqft range: [0.094654788418708197, 8.8050847457627093]
valid rent range: [100.0, 9995.0]
valid sqft range: [202.0, 4428.0]



In [37]:

    
# create a boolean vector mask to filter out any rows with rent_sqft outside of the reasonable values
rent_sqft_mask = (df['rent_sqft'] > lower_rent_sqft) & (df['rent_sqft'] < upper_rent_sqft)

# create boolean vector masks to filter out any rows with rent or sqft outside of the reasonable values
rent_mask = (df['rent'] > lower_rent) & (df['rent'] < upper_rent)
sqft_mask = (df['sqft'] > lower_sqft) & (df['sqft'] < upper_sqft)

# filter the thorough listings according to these masks
filtered_listings = pd.DataFrame(df[rent_sqft_mask & rent_mask & sqft_mask])
len(filtered_listings)









    Out[37]:





7792502



In [ ]:

    
filtered_listings.describe()



In [ ]:

    
#sfbay = filtered_listings[filtered_listings['region']=='sfbay']
sfbay = df[df['region']=='sfbay']
sfbay.describe()



In [ ]:

    
#sfbay['rent_sqft'].quantile(.01)



In [ ]:

    
#sfbay['sqft'].quantile(.01)



In [ ]:

    
# create a boolean vector mask to filter out any rows with rent_sqft and sqft in Bay Area under 1 percentile
#sfbay_rent_sqft_mask = (sfbay['rent_sqft'] > sfbay['rent_sqft'].quantile(.01) )

# create boolean vector masks to filter out any rows with rent or sqft outside of the reasonable values
#sfbay_sqft_mask = (sfbay['sqft'] > sfbay['sqft'].quantile(.01) )

# filter the thorough listings according to these masks
#sfbay_filtered = pd.DataFrame(sfbay[sfbay_rent_sqft_mask & sfbay_sqft_mask])
#len(sfbay_filtered)



In [ ]:

    
upper_percentile = 0.998
lower_percentile = 0.002

# how many rows would be within the upper and lower percentiles?
upper = int(len(sfbay) * upper_percentile)
lower = int(len(sfbay) * lower_percentile)

# get the rent/sqft values at the upper and lower percentiles
rent_sqft_sorted = sfbay['rent_sqft'].sort_values(ascending=True, inplace=False)
upper_rent_sqft = rent_sqft_sorted.iloc[upper]
lower_rent_sqft = rent_sqft_sorted.iloc[lower]

# get the rent values at the upper and lower percentiles
rent_sorted = sfbay['rent'].sort_values(ascending=True, inplace=False)
upper_rent = rent_sorted.iloc[upper]
lower_rent = rent_sorted.iloc[lower]

# get the sqft values at the upper and lower percentiles
sqft_sorted = sfbay['sqft'].sort_values(ascending=True, inplace=False)
upper_sqft = sqft_sorted.iloc[upper]
lower_sqft = sqft_sorted.iloc[lower]

print('valid rent_sqft range:', [lower_rent_sqft, upper_rent_sqft])
print('valid rent range:', [lower_rent, upper_rent])
print('valid sqft range:', [lower_sqft, upper_sqft])



In [ ]:

    
# create a boolean vector mask to filter out any rows with rent_sqft outside of the reasonable values
rent_sqft_mask = (sfbay['rent_sqft'] > lower_rent_sqft) & (sfbay['rent_sqft'] < upper_rent_sqft)

# create boolean vector masks to filter out any rows with rent or sqft outside of the reasonable values
rent_mask = (sfbay['rent'] > lower_rent) & (sfbay['rent'] < upper_rent)
sqft_mask = (sfbay['sqft'] > lower_sqft) & (sfbay['sqft'] < upper_sqft)

# filter the thorough listings according to these masks
sfbay_filtered = pd.DataFrame(sfbay[rent_sqft_mask & rent_mask & sqft_mask])
len(sfbay_filtered)



In [ ]:

    
sfbay_filtered.describe()



In [ ]:

    
sf = sfbay_filtered.merge(sv, left_on='fips_block', right_on='block_id')
sf['northsf'] = sf['puma1']+sf['puma2']+sf['puma3']+sf['puma4']
sf['sqft1'] = sf['sqft']*sf['sqft']<1500
sf['sqft2'] = sf['sqft']*sf['sqft']>1499
sf['pct1per'] = sf['prop1per']*100
sf['pct2per'] = sf['prop2per']*100
sf['pctrent'] = sf['proprent']*100
sf['pctblack'] = sf['propblack']*100
sf.head()



In [ ]:

    
sf.describe()



In [ ]:

    
sf['bgpopden'] = sf['bgpop']/sf['bgacres']
sf['bgjobden'] = sf['bgjobs']/sf['bgacres']
sf['highlowinc1500m'] = sf['highinc1500m']/sf['lowinc1500m']



In [ ]:

    
sf.dtypes



In [ ]:

    
# Use either filtered_listings for the national dataset or sfbay_filtered for the Bay Area subset

import statsmodels.api as sm
import numpy as np
from patsy import dmatrices
y, X = dmatrices('np.log(rent) ~ np.log(sqft) + bedrooms + bathrooms \
                 ', 
                 data=sf, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())



In [ ]:

    
# Use either filtered_listings for the national dataset or sfbay_filtered for the Bay Area subset

import statsmodels.api as sm
import numpy as np
from patsy import dmatrices
y, X = dmatrices('np.log(rent) ~ np.log(sqft) + bedrooms + bathrooms  +\
                 np.log(bgmedinc) + lnjobs5000m + lnjobs40km + pctblack \
                  + pumahhden + pctrent + pct1per + pct2per + bgmedagehd  \
                 ', 
                 data=sf, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())



In [ ]:

    
plt.hist(residuals, bins=25, normed=True, alpha=.5)
mu = residuals.mean()
variance = residuals.var()
sigma = residuals.std()
x = np.linspace(-3, 3, 100)
plt.plot(x,mlab.normpdf(x, mu, sigma));



In [ ]:

    
plt.figure(1, figsize=(10,8), )
plt.plot([7, 9], [0, 0], c='b')
plt.scatter(predicted, residuals, marker=0, s=2, c='g');
plt.axis([7.25, 9, -1.5, 1.5])
plt.show();



In [ ]:

    
plt.figure(1, figsize=(10,8), )
plt.plot([6, 9.5], [6, 9.5])
plt.scatter(observed, predicted, marker=0, s=2, c='g')
plt.axis([6.5, 9.5, 6.5, 9.5])
plt.show();



In [ ]:

    
print(residuals.mean())
print(residuals.std())



In [ ]:

    
# If we want to use WLS we need a useful set of weights. The default produces the same results as OLS 
mod = sm.WLS(y, X, weights=1.)
res = mod.fit()
print(res.summary())



In [ ]:

    
# Warning, this is a very intensive process and will take a while
%%time
from pymc3 import Model, NUTS, sample
from pymc3.glm import glm

with Model() as model_glm:
    glm('np.log(rent) ~ np.log(sqft) + bedrooms + bathrooms', sfbay_filtered)
    trace = sample(5000)



In [ ]:

    
from pymc3 import traceplot
%matplotlib inline
traceplot(trace);



In [ ]:

    
from scipy import optimize
from pymc3 import find_MAP
with model_glm:

    # obtain starting values via MAP
    start = find_MAP(fmin=optimize.fmin_powell)

    # draw 2000 posterior samples
    trace = sample(5000, start=start)



In [ ]:

    
traceplot(trace);



In [ ]:

    
import matplotlib.pyplot as plt
import theano
import pymc3 as pm



In [ ]:

    
fig = plt.figure(figsize=(7, 7))
ax = fig.add_subplot(111, xlabel='x', ylabel='y', title='Generated data and underlying model')
ax.plot(np.log(sfbay['sqft']), np.log(sfbay['rent']), 'o', markersize=.5, color='blue', label='sampled data')
#ax.plot(x, true_regression_line, label='true regression line', lw=2.)
#pm.glm.plot_posterior_predictive(trace, samples=100,
#                                 label='posterior predictive regression lines')
plt.legend(loc=0);



In [ ]:

	saletype	solddate	proptype	address	city	state	zip	price	beds	baths	...	status	nextopenstart	nextopenend	url	source	mls	favorite	interested	latitude	longitude
0	PAST SALE	February-7-2017	Single Family Residential	521 Cayuga Ave	San Francisco	CA	94112.0	1200000.0	2.0	1.25	...	Sold	NaN	NaN	http://www.redfin.com/CA/San-Francisco/521-Cay...	San Francisco MLS	453167	N	Y	37.728860	-122.435308
1	PAST SALE	February-10-2017	Condo/Co-op	95 Red Rock Way Unit 105M	San Francisco	CA	94131.0	590000.0	0.0	1.00	...	Sold	NaN	NaN	http://www.redfin.com/CA/San-Francisco/95-Red-...	San Francisco MLS	452726	N	Y	37.746324	-122.441517
2	PAST SALE	February-10-2017	Single Family Residential	3157 Baker St	San Francisco	CA	94123.0	3260000.0	5.0	3.50	...	Sold	NaN	NaN	http://www.redfin.com/CA/San-Francisco/3157-Ba...	San Francisco MLS	451943	N	Y	37.800218	-122.446455
3	PAST SALE	February-17-2017	Single Family Residential	2594 San Jose Ave	San Francisco	CA	94112.0	920000.0	3.0	2.00	...	Sold	NaN	NaN	http://www.redfin.com/CA/San-Francisco/2594-Sa...	MetroList	16067859	N	Y	37.716435	-122.450444
4	PAST SALE	February-27-2017	Single Family Residential	130 Brazil Ave	SAN FRANCISCO	CA	94112.0	790000.0	2.0	1.00	...	Sold	NaN	NaN	http://www.redfin.com/CA/San-Francisco/130-Bra...	MLSListings	ML81635907	N	Y	37.723920	-122.433735

	zip	price	beds	baths	sqft	lotsize	yrbuilt	daysonmkt	pricesqft	hoamonth	nextopenstart	nextopenend	latitude	longitude
count	308.000000	3.080000e+02	301.000000	293.000000	266.000000	139.000000	288.000000	306.00000	264.000000	159.000000	0.0	0.0	309.000000	309.000000
mean	93888.253247	1.521055e+06	2.647841	1.880546	1662.593985	3194.935252	1952.006944	14.30719	949.905303	543.874214	NaN	NaN	37.760820	-122.431289
std	2820.378426	1.063400e+06	1.862499	0.896519	1169.031917	4656.050424	39.950230	8.87308	279.135062	436.562243	NaN	NaN	0.025052	0.027834
min	59059.000000	3.500000e+03	0.000000	1.000000	300.000000	611.000000	1885.000000	2.00000	339.000000	4.000000	NaN	NaN	37.709249	-122.510730
25%	94109.000000	8.787500e+05	2.000000	1.000000	978.750000	2319.500000	1916.750000	6.00000	773.750000	342.500000	NaN	NaN	37.739065	-122.443876
50%	94114.000000	1.271000e+06	2.000000	2.000000	1360.000000	2513.000000	1941.000000	12.00000	963.500000	473.000000	NaN	NaN	37.762549	-122.428527
75%	94122.000000	1.813750e+06	3.000000	2.000000	1904.750000	3090.000000	1996.000000	23.00000	1106.000000	683.000000	NaN	NaN	37.781520	-122.412238
max	94158.000000	1.025000e+07	18.000000	5.500000	9504.000000	56024.000000	2017.000000	30.00000	2053.000000	4413.000000	NaN	NaN	37.805816	-122.371007

	block_id	bgpop	bgacres	bgjobs	bgmedhhs	bgmedinc	proprent	lowinc1500m	highinc1500m	...	lnret3000m	lnfire3000m	lnserv3000m	prop1per	prop2per	bgmedagehd
0	060014271001000	888.0	90.106156	465.0	2.0	80000.0	0.558603	5.991451	6.114498	...	7.457814	5.329694	8.082779	0.34414	0.381546	57.0
1	060014271001001	888.0	90.106156	465.0	2.0	80000.0	0.558603	5.991451	6.114498	...	7.457814	5.329694	8.082779	0.34414	0.381546	57.0
2	060014271001002	888.0	90.106156	465.0	2.0	80000.0	0.558603	5.991451	6.114498	...	7.457814	5.329694	8.082779	0.34414	0.381546	57.0
3	060014271001003	888.0	90.106156	465.0	2.0	80000.0	0.558603	6.231131	6.339032	...	7.558539	5.431654	8.188618	0.34414	0.381546	57.0
4	060014271001004	888.0	90.106156	465.0	2.0	80000.0	0.558603	6.107350	5.763724	...	7.690577	5.521359	8.287781	0.34414	0.381546	57.0

	rent	bedrooms	sqft	rent_sqft	longitude	latitude	bathrooms
count	1.215242e+07	1.215242e+07	1.215242e+07	1.215242e+07	1.215242e+07	1.215242e+07	1.037351e+07
mean	4.493626e+04	1.803428e+00	1.061739e+03	5.942926e+01	-9.616490e+01	3.693181e+01	1.540318e+06
std	9.081297e+06	9.543504e-01	9.648805e+03	1.454028e+04	1.685027e+01	5.198811e+00	3.515004e+09
min	1.000000e+00	0.000000e+00	1.000000e+00	1.000001e-06	-1.793005e+02	-8.539851e+01	0.000000e+00
25%	8.710000e+02	1.000000e+00	7.360000e+02	9.332310e-01	-1.150669e+02	3.343878e+01	1.000000e+00
50%	1.210000e+03	2.000000e+00	9.400000e+02	1.267183e+00	-9.420281e+01	3.719705e+01	1.000000e+00
75%	1.730000e+03	2.000000e+00	1.170000e+03	1.875809e+00	-8.128140e+01	4.011384e+01	2.000000e+00
max	2.147484e+09	1.270000e+02	8.388607e+06	1.651910e+07	1.000000e+03	1.653830e+02	8.494864e+12

	2014_median_income	2014_pop_est	census_name	median_income	population
region
newyork	67,066	20,092,883	New York-Newark-Jersey City, NY-NJ-PA MSA	67066	20092883
losangeles	60,514	13,262,220	Los Angeles-Long Beach-Anaheim, CA MSA	60514	13262220
chicago	61,598	9,554,598	Chicago-Naperville-Elgin, IL-IN-WI MSA	61598	9554598
sfbay	80,600	8,607,423	San Jose-San Francisco-Oakland, CA CSA	80600	8607423
dallas	59,530	6,954,330	Dallas-Fort Worth-Arlington, TX MSA	59530	6954330

	pid	date	region	neighborhood	rent	bedrooms	sqft	rent_sqft	longitude	latitude	county	fips_block	state	bathrooms	2014_median_income	2014_pop_est	census_name	median_income	population
59	5873902666	2016-11-13	cincinnati	Cincinnati	875.0	1.0	750.0	1.166667	-84.555389	39.108654	Hamilton	390610096001000	OH	NaN	55,729	2,149,449	Cincinnati, OH-KY-IN MSA	55729	2149449
1702	5839862773	2016-11-13	cincinnati	Cincinnati, Sycamore, Mason, Symmes	1250.0	2.0	1151.0	1.086012	-84.330700	39.269200	Hamilton	390610243014010	OH	NaN	55,729	2,149,449	Cincinnati, OH-KY-IN MSA	55729	2149449
3264	5839904702	2016-11-13	cincinnati	Cincinnati, Sycamore, Mason, Symmes	935.0	1.0	690.0	1.355072	-84.330700	39.269200	Hamilton	390610243014010	OH	NaN	55,729	2,149,449	Cincinnati, OH-KY-IN MSA	55729	2149449
4376	5839895285	2016-11-13	cincinnati	Cincinnati, Sycamore, Mason, Symmes	935.0	1.0	690.0	1.355072	-84.330700	39.269200	Hamilton	390610243014010	OH	NaN	55,729	2,149,449	Cincinnati, OH-KY-IN MSA	55729	2149449
5020	5874127885	2016-11-13	cincinnati	Oakley, UT	700.0	3.0	2085.0	0.335731	-111.245600	40.724300	Summit	490439642021006	UT	NaN	55,729	2,149,449	Cincinnati, OH-KY-IN MSA	55729	2149449