Tutorial on Hedonic Regression

This material uses Python to demonstrate some aspects of hedonic regression. The objective here is not to learn to program, but to understand the hedonic regression methodology. We begin with an example in which we generate some synthetic data using a set of coefficients and a mathematical model, and learn those coefficients using a statistical method called multiple regression.

The first part of this notebook follows closely the content in the Statsmodels online documentation. We won't try to provide the theoretical basis for regression models here. You can consult any number of online resources for this, including Wikipedia's explanation of Ordinary Least Squares Regression. We will be using the Statsmodels library for this (documentation here).

The statistical model is assumed to be $Y = X\beta + \epsilon$, where $\epsilon\sim N\left(0,\sigma^{2}\Sigma\right)$

Depending on the assumptions about $\Sigma$, Statsmodels currently has four classes available:

  • GLS : generalized least squares for arbitrary covariance $\Sigma$
  • OLS : ordinary least squares for i.i.d. errors $\Sigma=\textbf{I}$
  • WLS : weighted least squares for heteroskedastic errors $\text{diag}\left (\Sigma\right)$
  • GLSAR : feasible generalized least squares with autocorrelated AR(p) errors $\Sigma=\Sigma\left(\rho\right)$

We focus here on the simple Ordinary Least Squares (OLS) model, which is the most widely used but makes strong assumptions about the errors being independently and identically distributed (i.i.d.). When these conditions are met, the OLS parameter estimates are the Best Linear Unbiased Estimates (BLUE).

More intuitively (perhaps), what linear regression using the OLS estimator attempts to do is find the vector of parameters ($\beta$) such that, when you compute the linear function $X\beta$, you generate a vector of predictions $\hat{y}$ that minimizes the sum of squared errors across all observations, $\sum_{i}\epsilon_{i}^{2}$, where each observation's error is $\epsilon_{i} = y_{i} - \hat{y}_{i}$.
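As a tiny illustration with made-up numbers (a sketch, not part of the data we generate below), here is what that objective looks like for a few candidate slopes in a one-variable, no-intercept model:

import numpy as np

x_toy = np.array([1.0, 2.0, 3.0])       # hypothetical observations of x
y_toy = np.array([2.1, 3.9, 6.2])       # hypothetical observations of y

for slope in (1.5, 2.0, 2.5):
    errors = y_toy - slope * x_toy      # error for each observation
    print(slope, (errors ** 2).sum())   # sum of squared errors is smallest near 2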


In [1]:
# Startup steps
import pandas as pd, numpy as np, statsmodels.api as sm
import matplotlib.pyplot as plt, matplotlib.cm as cm, matplotlib.font_manager as fm
from scipy.stats import norm, pearsonr, ttest_rel
%matplotlib inline

To introduce concepts, let's begin by defining a hypothetical relationship between a dependent variable $y$ and an explanatory, or independent variable $x$.

We are only going to explore correlation, but there is an implicit causal story that x is influencing y, and not the other way around.

Let's say we have a relationship in which y is expected to be twice the value of x. A pretty simple model: $y = 2x$

Another way to describe this is as a line with an intercept of zero: $y = 0 + 2x$. We will look at it initially with no intercept, then add one in, before going on to more than one independent variable.

If we plot the 'model', we can see that at a value of x=0, the intercept is 0, so the value of y will be zero. And at a value of x=10, the value of y is 2x = 2*10 = 20.


In [2]:
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [0, 20])
plt.axis([0, 10, 0, 20])
plt.show();


Now let's try to approximate real-world data, in which we might be collecting observations of x and y with some measurement error. We impose the assumption that the observations deviate from this line by a randomly distributed error, using the standard assumption that the errors are normally distributed (a bell curve). We can use the model equation to generate points that would fall exactly on the line above if there were no error, and then add random errors to them to demonstrate how regression models work.

Generate data using a model we define:


In [3]:
nsample = 50                        # number of synthetic observations
x = np.linspace(0, 10, nsample)     # evenly spaced x values from 0 to 10
X = x
beta = np.array([0, 2])             # 'true' parameters: intercept = 0, slope = 2
e = np.random.normal(size=nsample)  # normally distributed errors
X = sm.add_constant(X)              # add a column of ones for the intercept
y = np.dot(X, beta) + e             # generate y from the model plus error

Plot the data as a scatterplot.


In [4]:
plt.figure(1, figsize=(10,8), )
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 20])
plt.show();


Add line based on Intercept = 0, beta = slope of line = 2

Now we can see the original 'model' and the generated observations. What regression analysis enables is to 'learn' the parameters of a model that most closely approximates the process that generated a set of observations. In this case, we have a controlled setting, because we generated the data and know the 'true' values of the parameters of the model: intercept = 0, and slope of the line = beta = 2.

Look at the plot below with the line superimposed on the 'observed' (generated) data. We can intuit that if we tilted the line or shifted it up or down, and calculated the 'errors', or the distance of each point to the line, the sum of their squares would get bigger. The reason for squaring is that if we didn't, the negative and positive errors would more or less cancel out. So we can infer that the 'best fit' model parameters would be the ones that minimize the sum of squared errors between the observed data and the y values predicted by the model.


In [5]:
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [0, 20])
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 20])
plt.show();
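To make this intuition concrete, here is a small sketch (using the x and y generated above) that computes the sum of squared errors for the 'true' line and for a couple of tilted or shifted alternatives; the perturbed lines should produce noticeably larger values:

def sum_sq_errors(intercept, slope):
    predicted = intercept + slope * x
    return ((y - predicted) ** 2).sum()

print(sum_sq_errors(0, 2.0))   # the parameters used to generate the data
print(sum_sq_errors(0, 2.5))   # a tilted line
print(sum_sq_errors(1, 2.0))   # a shifted line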


Regenerate the data using an intercept = 2


In [6]:
nsample = 20
x = np.linspace(0, 10, 20)
X = x
beta = np.array([2, 2])
e = np.random.normal(size=nsample)
X = sm.add_constant(X)
y = np.dot(X, beta) + e

Specifying regression models using design matrices (dmatrices) in statsmodels

To fit most of the models covered by statsmodels, you will need to create two design matrices. The first is a matrix of endogenous variable(s) (i.e. dependent, response, regressand, etc.). The second is a matrix of exogenous variable(s) (i.e. independent, predictor, regressor, etc.). The OLS coefficient estimates are calculated using linear algebra to find the parameters that minimize the sum of the squared errors:

$$\hat{\beta} = (X'X)^{-1} X'y$$

where $y$ is an $N \times 1$ column vector of observations of the dependent variable (later, sales prices), and $X$ is the $N \times 2$ design matrix containing an intercept column and the x variable.
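As a sketch of what the OLS estimator is doing under the hood, we can compute $\hat{\beta}$ directly from this formula using the X and y generated above; the result should match the coefficients reported by the fitted model below:

XtX = np.dot(X.T, X)                  # X'X
Xty = np.dot(X.T, y)                  # X'y
beta_hat = np.linalg.solve(XtX, Xty)  # solve (X'X) beta_hat = X'y
print(beta_hat)                       # should be close to the true values [2, 2]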

Run a simple linear regression and compare the coefficients to the ones used to generate the data.


In [7]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.975
Model:                            OLS   Adj. R-squared:                  0.974
Method:                 Least Squares   F-statistic:                     714.4
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           6.13e-16
Time:                        09:46:04   Log-Likelihood:                -27.480
No. Observations:                  20   AIC:                             58.96
Df Residuals:                      18   BIC:                             60.95
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          2.2661      0.434      5.218      0.000         1.354     3.178
x1             1.9847      0.074     26.729      0.000         1.829     2.141
==============================================================================
Omnibus:                        3.396   Durbin-Watson:                   1.955
Prob(Omnibus):                  0.183   Jarque-Bera (JB):                1.688
Skew:                          -0.400   Prob(JB):                        0.430
Kurtosis:                       1.822   Cond. No.                         11.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [8]:
print('Parameters: ', results.params)
inter = results.params[0]
beta = results.params[1]
print('Intercept =', inter)
print('Beta = ', beta)
print('Rsquared = ', results.rsquared)


('Parameters: ', array([ 2.26606338,  1.9846515 ]))
('Intercept =', 2.2660633804329802)
('Beta = ', 1.984651495439955)
('Rsquared = ', 0.97542413925618676)

Rsquared = Explained Variation / Total Variation

Rsquared = 1 - (Unexplained Variation / Total Variation)

Rsquared = 1 - (sum of squared residuals / sum of squared deviations of the observed y from its mean)
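As a check, here is a small sketch that computes Rsquared by hand from the fitted results above and compares it to the value reported by statsmodels:

resid = results.resid                         # observed y minus predicted y
ss_res = (resid ** 2).sum()                   # unexplained variation
ss_tot = ((y - y.mean()) ** 2).sum()          # total variation around the mean
print(1 - ss_res / ss_tot, results.rsquared)  # the two values should match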

Plot the 'true' line using the original coefficients, and the 'predicted' line, using the estimated coefficients. Try with smaller and larger samples.


In [9]:
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [2, 22], c='r')
plt.plot([0,10], [inter,(inter+10*beta)], c='b')
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 22])
plt.show();


Now on to some real data and Hedonic Regression

We will use a large sample of single family housing sales from the San Francisco Bay Area to estimate a linear regression model in which the dependent variable $y$ is the price of a house at the time of sale, and $X$ is a set of exogenous, or explanatory variables.

What exactly does this give us? A statistical way to figure out what the component amenities in a house are worth, if you could buy them a la carte. Another way to think of it is, how much do house buyers in the Bay Area during this period pay, on average, for an additional unit of each amenity: square foot of living space, bedroom, bathroom, etc.

Here we use the sales transactions in San Francisco over a one-month period from early February through early March.

First we load the data from a csv file. Then we rename columns to make the data easier to work with.


In [10]:
sf = pd.read_csv('data/redfin_2017-03-05-17-45-34-san-francisco-county-1-month.csv')
sf.columns


Out[10]:
Index([u'SALE TYPE', u'SOLD DATE', u'PROPERTY TYPE', u'ADDRESS', u'CITY',
       u'STATE', u'ZIP', u'PRICE', u'BEDS', u'BATHS', u'LOCATION',
       u'SQUARE FEET', u'LOT SIZE', u'YEAR BUILT', u'DAYS ON MARKET',
       u'$/SQUARE FEET', u'HOA/MONTH', u'STATUS',
       u'NEXT OPEN HOUSE START TIME', u'NEXT OPEN HOUSE END TIME',
       u'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)',
       u'SOURCE', u'MLS#', u'FAVORITE', u'INTERESTED', u'LATITUDE',
       u'LONGITUDE'],
      dtype='object')

In [11]:
sf1 = sf.rename(index=str, columns={'SALE TYPE': 'saletype',
    'SOLD DATE': 'solddate', 'PROPERTY TYPE': 'proptype', 'ADDRESS': 'address',
    'CITY': 'city', 'STATE': 'state', 'ZIP': 'zip', 'PRICE': 'price', 'BEDS': 'beds',
    'BATHS': 'baths', 'LOCATION': 'location', 'SQUARE FEET': 'sqft', 'LOT SIZE': 'lotsize',
    'YEAR BUILT': 'yrbuilt', 'DAYS ON MARKET': 'daysonmkt', '$/SQUARE FEET': 'pricesqft',
    'LATITUDE': 'latitude', 'LONGITUDE': 'longitude', 'HOA/MONTH': 'hoamonth',
    'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)': 'url',
    'STATUS': 'status', 'NEXT OPEN HOUSE START TIME': 'nextopenstart', 'NEXT OPEN HOUSE END TIME': 'nextopenend',
    'SOURCE': 'source', 'MLS#': 'mls', 'FAVORITE': 'favorite', 'INTERESTED': 'interested'
    })

sf1.head()


Out[11]:
saletype solddate proptype address city state zip price beds baths ... status nextopenstart nextopenend url source mls favorite interested latitude longitude
0 PAST SALE February-7-2017 Single Family Residential 521 Cayuga Ave San Francisco CA 94112.0 1200000.0 2.0 1.25 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/521-Cay... San Francisco MLS 453167 N Y 37.728860 -122.435308
1 PAST SALE February-10-2017 Condo/Co-op 95 Red Rock Way Unit 105M San Francisco CA 94131.0 590000.0 0.0 1.00 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/95-Red-... San Francisco MLS 452726 N Y 37.746324 -122.441517
2 PAST SALE February-10-2017 Single Family Residential 3157 Baker St San Francisco CA 94123.0 3260000.0 5.0 3.50 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/3157-Ba... San Francisco MLS 451943 N Y 37.800218 -122.446455
3 PAST SALE February-17-2017 Single Family Residential 2594 San Jose Ave San Francisco CA 94112.0 920000.0 3.0 2.00 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/2594-Sa... MetroList 16067859 N Y 37.716435 -122.450444
4 PAST SALE February-27-2017 Single Family Residential 130 Brazil Ave SAN FRANCISCO CA 94112.0 790000.0 2.0 1.00 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/130-Bra... MLSListings ML81635907 N Y 37.723920 -122.433735

5 rows × 27 columns


In [12]:
sf1.describe()


Out[12]:
zip price beds baths sqft lotsize yrbuilt daysonmkt pricesqft hoamonth nextopenstart nextopenend latitude longitude
count 308.000000 3.080000e+02 301.000000 293.000000 266.000000 139.000000 288.000000 306.00000 264.000000 159.000000 0.0 0.0 309.000000 309.000000
mean 93888.253247 1.521055e+06 2.647841 1.880546 1662.593985 3194.935252 1952.006944 14.30719 949.905303 543.874214 NaN NaN 37.760820 -122.431289
std 2820.378426 1.063400e+06 1.862499 0.896519 1169.031917 4656.050424 39.950230 8.87308 279.135062 436.562243 NaN NaN 0.025052 0.027834
min 59059.000000 3.500000e+03 0.000000 1.000000 300.000000 611.000000 1885.000000 2.00000 339.000000 4.000000 NaN NaN 37.709249 -122.510730
25% 94109.000000 8.787500e+05 2.000000 1.000000 978.750000 2319.500000 1916.750000 6.00000 773.750000 342.500000 NaN NaN 37.739065 -122.443876
50% 94114.000000 1.271000e+06 2.000000 2.000000 1360.000000 2513.000000 1941.000000 12.00000 963.500000 473.000000 NaN NaN 37.762549 -122.428527
75% 94122.000000 1.813750e+06 3.000000 2.000000 1904.750000 3090.000000 1996.000000 23.00000 1106.000000 683.000000 NaN NaN 37.781520 -122.412238
max 94158.000000 1.025000e+07 18.000000 5.500000 9504.000000 56024.000000 2017.000000 30.00000 2053.000000 4413.000000 NaN NaN 37.805816 -122.371007

Here is a scatterplot of sqft and price


In [13]:
plt.figure(1, figsize=(10,8), )
plt.scatter(sf1['sqft'], sf1['price'], marker=0, s=10, c='g')
#plt.axis([12, 16, 12, 16])
plt.show();


Here is a scatterplot of beds and price


In [14]:
plt.figure(1, figsize=(10,8), )
plt.scatter(sf1['beds'], sf1['price'], marker=0, s=10, c='g')
#plt.axis([12, 16, 12, 16])
plt.show();


Evaluating correlations among multiple variables and price

What if we want to know how price is affected by both sqft and beds, and other variables as well? We would generally use multiple regression.

Recoding variables

Sometimes variables have larger values than you intend to use. You can either drop those records, or recode the data so that values above some limit are capped at that limit.



In [15]:
sf1['beds4'] = sf1['beds']
sf1['baths4'] = sf1['baths']

In [16]:
sf1.describe()


Out[16]:
zip price beds baths sqft lotsize yrbuilt daysonmkt pricesqft hoamonth nextopenstart nextopenend latitude longitude beds4 baths4
count 308.000000 3.080000e+02 301.000000 293.000000 266.000000 139.000000 288.000000 306.00000 264.000000 159.000000 0.0 0.0 309.000000 309.000000 301.000000 293.000000
mean 93888.253247 1.521055e+06 2.647841 1.880546 1662.593985 3194.935252 1952.006944 14.30719 949.905303 543.874214 NaN NaN 37.760820 -122.431289 2.647841 1.880546
std 2820.378426 1.063400e+06 1.862499 0.896519 1169.031917 4656.050424 39.950230 8.87308 279.135062 436.562243 NaN NaN 0.025052 0.027834 1.862499 0.896519
min 59059.000000 3.500000e+03 0.000000 1.000000 300.000000 611.000000 1885.000000 2.00000 339.000000 4.000000 NaN NaN 37.709249 -122.510730 0.000000 1.000000
25% 94109.000000 8.787500e+05 2.000000 1.000000 978.750000 2319.500000 1916.750000 6.00000 773.750000 342.500000 NaN NaN 37.739065 -122.443876 2.000000 1.000000
50% 94114.000000 1.271000e+06 2.000000 2.000000 1360.000000 2513.000000 1941.000000 12.00000 963.500000 473.000000 NaN NaN 37.762549 -122.428527 2.000000 2.000000
75% 94122.000000 1.813750e+06 3.000000 2.000000 1904.750000 3090.000000 1996.000000 23.00000 1106.000000 683.000000 NaN NaN 37.781520 -122.412238 3.000000 2.000000
max 94158.000000 1.025000e+07 18.000000 5.500000 9504.000000 56024.000000 2017.000000 30.00000 2053.000000 4413.000000 NaN NaN 37.805816 -122.371007 18.000000 5.500000

Since the maximum number of bedrooms is 18 and the maximum number of bathrooms is 5.5, let's recode the beds4 and baths4 copies we just created so that values above 4 are capped at 4.


In [17]:
#sf1.loc[:,'beds4'][sf1['beds']>3] =  4
#sf1.loc[:,'baths4'][sf1['baths']>3] =  4
sf1.loc[sf1.beds > 3, 'beds4'] =  4
sf1.loc[sf1.baths > 3, 'baths4'] =  4

sf1.describe()


Out[17]:
zip price beds baths sqft lotsize yrbuilt daysonmkt pricesqft hoamonth nextopenstart nextopenend latitude longitude beds4 baths4
count 308.000000 3.080000e+02 301.000000 293.000000 266.000000 139.000000 288.000000 306.00000 264.000000 159.000000 0.0 0.0 309.000000 309.000000 301.000000 293.000000
mean 93888.253247 1.521055e+06 2.647841 1.880546 1662.593985 3194.935252 1952.006944 14.30719 949.905303 543.874214 NaN NaN 37.760820 -122.431289 2.385382 1.875427
std 2820.378426 1.063400e+06 1.862499 0.896519 1169.031917 4656.050424 39.950230 8.87308 279.135062 436.562243 NaN NaN 0.025052 0.027834 1.034884 0.868154
min 59059.000000 3.500000e+03 0.000000 1.000000 300.000000 611.000000 1885.000000 2.00000 339.000000 4.000000 NaN NaN 37.709249 -122.510730 0.000000 1.000000
25% 94109.000000 8.787500e+05 2.000000 1.000000 978.750000 2319.500000 1916.750000 6.00000 773.750000 342.500000 NaN NaN 37.739065 -122.443876 2.000000 1.000000
50% 94114.000000 1.271000e+06 2.000000 2.000000 1360.000000 2513.000000 1941.000000 12.00000 963.500000 473.000000 NaN NaN 37.762549 -122.428527 2.000000 2.000000
75% 94122.000000 1.813750e+06 3.000000 2.000000 1904.750000 3090.000000 1996.000000 23.00000 1106.000000 683.000000 NaN NaN 37.781520 -122.412238 3.000000 2.000000
max 94158.000000 1.025000e+07 18.000000 5.500000 9504.000000 56024.000000 2017.000000 30.00000 2053.000000 4413.000000 NaN NaN 37.805816 -122.371007 4.000000 4.000000
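As an aside, the same capping can be done more concisely with pandas' clip method; this sketch is equivalent to the .loc assignments above:

sf1['beds4'] = sf1['beds'].clip(upper=4)     # cap bedrooms at 4
sf1['baths4'] = sf1['baths'].clip(upper=4)   # cap bathrooms at 4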

Now let's estimate a series of models using the sales data.

Here we specify models using R-style formula syntax, implemented by the patsy library. See http://patsy.readthedocs.org/en/latest/ for complete documentation.


In [18]:
import statsmodels.api as sm
import numpy as np
from patsy import dmatrices
y, X = dmatrices('np.log(price) ~ np.log(sqft) + beds + baths', 
                 data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.739
Model:                            OLS   Adj. R-squared:                  0.735
Method:                 Least Squares   F-statistic:                     236.4
Date:                Wed, 08 Mar 2017   Prob (F-statistic):           8.26e-73
Time:                        09:46:41   Log-Likelihood:                -16.926
No. Observations:                 255   AIC:                             41.85
Df Residuals:                     251   BIC:                             56.02
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [95.0% Conf. Int.]
--------------------------------------------------------------------------------
Intercept        7.5064      0.401     18.727      0.000         6.717     8.296
np.log(sqft)     0.9100      0.063     14.374      0.000         0.785     1.035
beds            -0.1189      0.021     -5.657      0.000        -0.160    -0.077
baths            0.1425      0.029      4.881      0.000         0.085     0.200
==============================================================================
Omnibus:                       17.160   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               20.918
Skew:                          -0.529   Prob(JB):                     2.87e-05
Kurtosis:                       3.923   Cond. No.                         198.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Experiment with the log transformations and practice interpretation

Most hedonic regression models use a log transformation of the dependent variable (price): we take the logarithm of each sale price and use that as the dependent variable. This changes the interpretation of the coefficients:

  • If the dependent variable is log-transformed and the right-hand-side variable is untransformed (in its original scale), then a one-unit increase in that variable is associated with an approximate percentage change in price given by the coefficient (exactly, price changes by a factor of $e^{\beta}$).
  • If the right-hand-side variable is also log-transformed, then a one percent change in the independent variable is associated with the percentage change in price given by the coefficient (an elasticity).
  • If neither variable is transformed, then the coefficient indicates the dollar change in price expected from a one-unit change in the independent variable.
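For example, using the fitted model res above (a sketch; the exact numbers depend on the sample): the np.log(sqft) coefficient of about 0.91 says that a 1% increase in square footage is associated with roughly a 0.91% increase in price, while the coefficient on the untransformed baths variable can be converted to an exact percentage effect by exponentiating:

coef_baths = res.params['baths']               # about 0.14 in the results above
pct_per_bath = (np.exp(coef_baths) - 1) * 100  # percent change in price per additional bath
print(pct_per_bath)                            # roughly 15 percent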

How well does the model fit the data?

The residuals appear to be roughly normally distributed, with about half positive and half negative and a mean near zero. Checking this is one way to spot whether the model is inaccurate (statistically biased).


In [19]:
plt.hist(residuals, bins=25, density=True, alpha=.5)  # histogram of the residuals
mu = residuals.mean()
sigma = residuals.std()
x = np.linspace(-3, 3, 100)
plt.plot(x, norm.pdf(x, mu, sigma));                   # overlay a normal curve with the same mean and std


Another way to look at these results is to plot the residuals against the predicted values, to see whether the errors tend to be larger at one end of the range of y than the other. Here they appear fairly uniform across the scale of y.


In [20]:
plt.figure(1, figsize=(10,8), )
plt.plot([12, 16], [0, 0], c='b')
plt.scatter(predicted, residuals, marker=0, s=10, c='g');
plt.axis([12, 16, -0.8, 0.8])
plt.show();
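A more formal check for this kind of pattern (heteroskedasticity) is the Breusch-Pagan test; here is a sketch using the fitted model above (the function lives in statsmodels.stats.diagnostic in recent versions, though its location and spelling have varied across releases):

from statsmodels.stats.diagnostic import het_breuschpagan

# Test whether the residual variance depends on the explanatory variables
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, res.model.exog)
print(lm_pvalue)   # a small p-value would suggest heteroskedastic errors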


The next plot compares observed values on the x axis to predicted values from the model on the y axis.


In [21]:
plt.figure(1, figsize=(10,8), )
plt.plot([12, 16], [12, 16])
plt.scatter(observed, predicted, marker=0, s=10, c='g')
plt.axis([12, 16, 12, 16])
plt.show();


