This material uses Python to demonstrate some aspects of hedonic regression. The objective here is not to learn to program, but to understand the hedonic regression methodology. We begin with an example in which we generate some synthetic data using a set of coefficients and a mathematical model, and learn those coefficients using a statistical method called multiple regression.
The first part of this notebook follows closely the content in the Statsmodels online documentation. We won't try to provide the theoretical basis for regression models here; you can consult any number of online resources for this, including Wikipedia's explanation of Ordinary Least Squares regression. We will be using the Statsmodels library for this (see its online documentation).
The statistical model is assumed to be $Y = X\beta + \epsilon$, where $\epsilon\sim N\left(0,\sigma^{2}\Sigma\right)$
Depending on the assumption made about $\Sigma$, Statsmodels currently has four classes available (OLS, WLS, GLS, and GLSAR).
We focus here on the simple Ordinary Least Squares (OLS) model, which is the most widely used but makes strong assumptions about the errors being independently and identically distributed (i.i.d.). When these conditions are met, the OLS parameter estimates are the Best Linear Unbiased Estimates (BLUE).
More intuitively (perhaps), what linear regression using the OLS estimator attempts to do is find the vector of parameters ($\beta$) such that the predicted values $\hat{y} = X\beta$, compared to the observed $y$, minimize the sum of squared errors across all observations, $\sum_{i}\epsilon_{i}^{2}$, where each observation's error is $\epsilon_{i} = y_{i} - \hat{y}_{i}$.
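Written out explicitly (a restatement of the same idea in the notation above), the OLS estimator solves
$$\hat{\beta} = \underset{\beta}{\arg\min} \sum_{i=1}^{N}\left(y_{i} - x_{i}\beta\right)^{2} = \underset{\beta}{\arg\min} \left(y - X\beta\right)'\left(y - X\beta\right)$$
where $x_{i}$ is the $i$-th row of $X$.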
In [1]:
# Startup steps
import pandas as pd, numpy as np, statsmodels.api as sm
import matplotlib.pyplot as plt, matplotlib.cm as cm, matplotlib.font_manager as fm
from scipy.stats import norm, pearsonr, ttest_rel
%matplotlib inline
To introduce concepts, let's begin by defining a hypothetical relationship between a dependent variable $y$ and an explanatory, or independent variable $x$.
We are only going to explore correlation, but there is an implicit causal story that $x$ influences $y$, and not the other way around.
Let's say we have a relationship in which y is expected to be twice the value of x. A pretty simple model: $y = 2x$
Another way to describe this is as a line with an intercept of zero: $y = 0 + 2x$. We will look at the model initially with no intercept, then add an intercept, before moving on to more than one independent variable.
If we plot the 'model', we can see that at $x = 0$ the value of $y$ is zero (the intercept), and at $x = 10$ the value of $y$ is $2x = 2 \times 10 = 20$.
In [2]:
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [0, 20])
plt.axis([0, 10, 0, 20])
plt.show();
Now let's try to approximate real-world data, in which we might be collecting observations of x and y with some measurement error. Let's impose the assumption that the observations deviate from this line with some randomly distributed error. We will use the standard assumption that the errors are normally distributed (a bell curve). We can use the equation for the model to generate points that would fall on the line above if there were no error, and then add random errors to them to demonstrate how regression models work.
Generate data using a model we define:
In [3]:
nsample = 50
x = np.linspace(0, 10, nsample)
X = sm.add_constant(x)               # add a column of ones so the model includes an intercept
beta = np.array([0, 2])              # true intercept = 0, true slope = 2
e = np.random.normal(size=nsample)   # i.i.d. standard normal errors
y = np.dot(X, beta) + e              # observed y = model prediction plus error
Plot the data as a scatterplot.
In [4]:
plt.figure(1, figsize=(10,8), )
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 20])
plt.show();
Add the line based on intercept = 0 and beta (the slope of the line) = 2.
Now we can see the original 'model' and the generated observations. What regression analysis enables us to do is 'learn' the parameters of a model that most closely approximates the process that generated a set of observations. In this case we have a controlled setting, because we generated the data and know the 'true' values of the parameters of the model: intercept = 0, and slope of the line = beta = 2.
Look at the plot below with the line superimposed on the 'observed' (generated) data. We can intuit that if we tilted the line or shifted it up or down and recalculated the 'errors' (the distance of each point from the line), the sum of their squares would get bigger. The reason for squaring is that, if we didn't square them, the negative and positive errors would more or less cancel out. So we can infer that the 'best fit' model parameters are the ones that minimize the sum of the squared errors between the observed data and the y values predicted by the model.
In [5]:
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [0, 20])
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 20])
plt.show();
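As a quick check of this intuition (a sketch that is not part of the original notebook; it reuses the x and y generated above), we can compute the sum of squared errors for the true line and for a tilted and a shifted alternative. The clearly wrong lines should typically produce a larger sum.
def sse(intercept, slope):
    # sum of squared errors between the observations and a candidate line
    return np.sum((y - (intercept + slope * x)) ** 2)

print('SSE for the true line (0, 2):  ', sse(0, 2))
print('SSE for a tilted line (0, 2.5):', sse(0, 2.5))
print('SSE for a shifted line (1, 2): ', sse(1, 2))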
Regenerate the data, this time using an intercept of 2.
In [6]:
nsample = 20
x = np.linspace(0, 10, nsample)
X = sm.add_constant(x)               # add a column of ones for the intercept
beta = np.array([2, 2])              # true intercept = 2, true slope = 2
e = np.random.normal(size=nsample)   # i.i.d. standard normal errors
y = np.dot(X, beta) + e
To fit most of the models covered by statsmodels, you will need to create two design matrices. The first is a matrix of endogenous variable(s) (i.e. dependent, response, regressand, etc.). The second is a matrix of exogenous variable(s) (i.e. independent, predictor, regressor, etc.). The OLS coefficient estimates are calculated using linear algebra to find the parameters that minimize the sum of the squared errors:
$$\hat{\beta} = (X'X)^{-1} X'y$$ where $y$ is an $N \times 1$ column of observations on the dependent variable and $X$ is $N \times 2$, containing a constant column for the intercept and the x variable.
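As a quick sketch of what this formula computes (not part of the original notebook; it reuses the X and y just generated), we can evaluate it directly with numpy and check that the estimates are close to the true values of 2 and 2:
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) beta_hat = X'y, i.e. (X'X)^{-1} X'y
print(beta_hat)                                # should be close to [2, 2]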
Run a simple linear regression and compare the coefficients to the ones used to generate the data.
In [7]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
In [8]:
print('Parameters: ', results.params)
inter = results.params[0]
beta = results.params[1]
print('Intercept =', inter)
print('Beta = ', beta)
print('Rsquared = ', results.rsquared)
Rsquared = Explained Variation / Total Variation
Rsquared = 1 - (Unexplained Variation / Total Variation)
Rsquared = 1 - (sum of squared residuals / sum of squared deviations of the observed y from its mean)
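As a sketch (not part of the original notebook), the same quantity can be recomputed by hand from the residuals of the fitted model and the variation of y around its mean:
ss_res = np.sum(results.resid ** 2)      # unexplained variation: sum of squared residuals
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation of y around its mean
print('R-squared by hand:     ', 1 - ss_res / ss_tot)
print('R-squared from the fit:', results.rsquared)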
Plot the 'true' line using the original coefficients, and the 'predicted' line, using the estimated coefficients. Try with smaller and larger samples.
In [9]:
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [2, 22], c='r')
plt.plot([0,10], [inter,(inter+10*beta)], c='b')
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 22])
plt.show();
We will use a large sample of single family housing sales from the San Francisco Bay Area to estimate a linear regression model in which the dependent variable $y$ is the price of a house at the time of sale, and $X$ is a set of exogenous, or explanatory variables.
What exactly does this give us? A statistical way to figure out what the component amenities in a house are worth, if you could buy them a la carte. Another way to think of it: how much do house buyers in the Bay Area during this period pay, on average, for an additional unit of each amenity (a square foot of living space, a bedroom, a bathroom, and so on)?
Here we use the sales transactions in San Francisco over a month from early-February through early-March.
First we load the data from a csv file. Then we rename columns to make the data easier to work with.
In [10]:
sf = pd.read_csv('data/redfin_2017-03-05-17-45-34-san-francisco-county-1-month.csv')
sf.columns
Out[10]:
In [11]:
sf1 = sf.rename(index=str, columns={'SALE TYPE': 'saletype',
'SOLD DATE': 'solddate', 'PROPERTY TYPE': 'proptype', 'ADDRESS': 'address',
'CITY': 'city', 'STATE': 'state', 'ZIP': 'zip', 'PRICE': 'price', 'BEDS': 'beds',
'BATHS': 'baths', 'LOCATION': 'location', 'SQUARE FEET': 'sqft', 'LOT SIZE': 'lotsize',
'YEAR BUILT': 'yrbuilt', 'DAYS ON MARKET': 'daysonmkt', '$/SQUARE FEET': 'pricesqft',
'LATITUDE': 'latitude', 'LONGITUDE': 'longitude', 'HOA/MONTH': 'hoamonth',
'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)': 'url',
'STATUS': 'status', 'NEXT OPEN HOUSE START TIME': 'nextopenstart', 'NEXT OPEN HOUSE END TIME': 'nextopenend',
'SOURCE': 'source', 'MLS#': 'mls', 'FAVORITE': 'favorite', 'INTERESTED': 'interested'
})
sf1.head()
Out[11]:
In [12]:
sf1.describe()
Out[12]:
In [13]:
plt.figure(1, figsize=(10,8), )
plt.scatter(sf1['sqft'], sf1['price'], marker=0, s=10, c='g')
#plt.axis([12, 16, 12, 16])
plt.show();
In [14]:
plt.figure(1, figsize=(10,8), )
plt.scatter(sf1['beds'], sf1['price'], marker=0, s=10, c='g')
#plt.axis([12, 16, 12, 16])
plt.show();
In [15]:
sf1['beds4'] = sf1['beds']
sf1['baths4'] = sf1['baths']
In [16]:
sf1.describe()
Out[16]:
Since the maximum number of bedrooms is 18 and the maximum number of bathrooms is 5.5, let's create recoded versions of these variables that cap each value at 4.
In [17]:
#sf1.loc[:,'beds4'][sf1['beds']>3] = 4
#sf1.loc[:,'baths4'][sf1['baths']>3] = 4
sf1.loc[sf1.beds > 3, 'beds4'] = 4
sf1.loc[sf1.baths > 3, 'baths4'] = 4
sf1.describe()
Out[17]:
Here we specify models using R-style formula syntax, which statsmodels supports via the patsy library. See http://patsy.readthedocs.org/en/latest/ for complete documentation.
In [18]:
import statsmodels.api as sm
import numpy as np
from patsy import dmatrices
y, X = dmatrices('np.log(price) ~ np.log(sqft) + beds + baths',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Most hedonic regression models use a log transformation of the dependent variable (price): we take the logarithm of each sale price and use that as the dependent variable. This changes the interpretation of the coefficients. If a right-hand-side variable is untransformed (in its original scale) and the dependent variable is log-transformed, then a one-unit increase in that variable is predicted to change the price of a house by approximately the percentage indicated by the coefficient (more precisely, the price is multiplied by $e^{\beta}$). If the right-hand-side variable is also log-transformed, then a one percent change in the independent variable is associated with the percentage change in the dependent variable indicated by the coefficient, i.e. the coefficient is an elasticity. If neither is transformed, the coefficient indicates the dollar change in price expected from a one-unit change in the independent variable.
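As an illustration (a sketch, not part of the original notebook; it assumes res.params is indexed by the patsy term names such as 'np.log(sqft)' and 'beds'), the coefficients from the model fitted above can be read off as follows:
b_sqft = res.params['np.log(sqft)']   # log-log term: an elasticity
b_beds = res.params['beds']           # untransformed term with log(price): a proportional effect
print('A 1% increase in sqft is associated with roughly a {:.2f}% change in price'.format(b_sqft))
print('One additional bedroom is associated with roughly a {:.1f}% change in price'.format(100 * (np.exp(b_beds) - 1)))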
In [19]:
plt.hist(residuals, bins=25, density=True, alpha=.5)   # normalize the histogram so it integrates to 1
mu = residuals.mean()
variance = residuals.var()
sigma = residuals.std()
x = np.linspace(-3, 3, 100)
plt.plot(x, norm.pdf(x, mu, sigma));   # overlay a normal density with the residuals' mean and standard deviation
Another way to look at these results is to plot the residuals against the predicted values, to see whether the errors appear to be larger at one end of the range of $y$ than at the other. Here they appear fairly uniform across the scale of $y$.
In [20]:
plt.figure(1, figsize=(10,8), )
plt.plot([12, 16], [0, 0], c='b')
plt.scatter(predicted, residuals, marker=0, s=10, c='g');
plt.axis([12, 16, -0.8, 0.8])
plt.show();
The next plot compares observed values on the x axis to predicted values from the model on the y axis.
In [21]:
plt.figure(1, figsize=(10,8), )
plt.plot([12, 16], [12, 16])
plt.scatter(observed, predicted, marker=0, s=10, c='g')
plt.axis([12, 16, 12, 16])
plt.show();