We want to start simply, so we'll just look at a basic forecast of the national Case-Schiller index based on population, and then add a few other variables. Schiller provides the data for this on the webpage for his book Irrational Exuberance.
We'll use pymc, a well-developed and maintained python library for bayesian statistical analysis, especially Markov Chain Monte Carlo (MCMC).
Other libraries used here include pandas and numpy. matplotlib is used for plotting, and I use a custom configuration file based on the one used in Bayesian Programming for Hackers, a great book with tons of good information on bayesian methods and graphing.
So first we need to import modules:
In [28]:
#Modules
import pymc as pm
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from IPython.core.pylabtools import figsize
%matplotlib inline
Then let's load our data. I've made a few quick changes to convert the original file named "Fig2-1.xls" so that it is easy to import:
In [208]:
#Data file
datafile = '/home/potterzot/data/irrational_exuberance/prices_population.csv'
data = pd.read_csv(datafile)
data.columns
Out[208]:
Before we do anything, let's look at a few samples of what our data look like:
In [234]:
data.loc[0:10,:]
Out[234]:
In [265]:
#Plot real house prices and population versus date
z = np.polyfit(data['date'],data['real_home_price_idx'],1)
p = np.poly1d(z)
figsize(12, 4)
plt.title("National Real Home Price Index and Population")
plt.plot(data.loc[data['date']>=1960, 'date'], data.loc[data['date']>=1960,'real_home_price_idx'], label="CS Index")
plt.plot(data.loc[data['date']>=1960, 'date'], p(data.loc[data['date']>=1960, 'date']), color='grey', linewidth=1)
plt.plot(data.loc[data['date2']>=1960, 'date2'], data.loc[data['date2']>=1960, 'us_pop'], label="Population")
plt.ylim(0,350)
legend = plt.legend()
legend.get_frame().set_alpha(0.4)
plt.tight_layout()
plt.show()
Notice that, although the scale here is scrunched together pretty badly, it seems like the population and the trendline for housing prices are fairly similar. Let's look at that more closely.
In [266]:
plt.clf()
figsize(12, 4)
plt.plot(data['date'], p(data['date']), label="Housing Price Trendline")
plt.plot(data['date2'], data.loc[:,'us_pop']/10+68, label="Population") #Here we scale and center the population
plt.title("Housing Price Trendline and US Population, Scaled and Centered")
plt.ylim(0,150)
legend = plt.legend()
legend.get_frame().set_alpha(0.4)
plt.tight_layout()
plt.show()
We can see that housing prices have increased somewhat more quickly than population, so population may not be a great predictor, but it's not a terrible place to start with a simple model. It makes sense that as population increases, the price of housing might also be expected to increases, especially in spatially-constrained geographies like dense urban-areas.
We'll refine the model substantially later, but starting with a simple model will allow us to get the basic framework of the code down. It can always be made more complicated, but starting with something complicated can make development quite a bit slower.
We'll use a bayesian approach to the forecast, in an attempt to estimate a percent change in price in each year with a sense of how confident we are in the estimate.
Our prior can just be the average annual percent growth from 1890 to 2014.
In [258]:
#Set up an annual data data file
adata = pd.DataFrame()
adata['date'] = data.loc[data['date']<=1950, 'date']
adata['rhpi'] = data.loc[data['date']<=1950, 'real_home_price_idx']
adata['pop'] = data.loc[data['date']<=1950, 'us_pop']
#Quarterly data
#qdata = pd.DataFrame()
#qdata['date'] = data.loc[data['date']>=1960, 'date']
#qdata['rhpi'] = data.loc[data['date']>=1960, 'real_home_price_idx']
#qdata['year'] = np.floor(qdata['date'])
#qdata['pop'] = data.loc[data['date2']>=qdata['year'], 'us_pop']
#qdata
#Average annual change
last_year = adata.loc[adata.index[-1],'date']
first_year = adata.loc[0, 'date']
n_years = last_year-first_year
rhpi = adata['rhpi']
delta_date = adata['date'].diff()
delta_rhpi = rhpi.diff()[1:]
avg_annual_chg = delta_rhpi.sum()/(n_years)
pct_chg = delta_rhpi / np.array(rhpi[0:-1])
#Compound rate of change
avg_rate_chg = (rhpi[rhpi.index[-1]]/rhpi[0])**(1/(n_years)) - 1
{'actual': rhpi[rhpi.index[-1]], 'estimate': rhpi[0]*(1+avg_rate_chg)**(n_years)} #check that our estimate is correct
Out[258]:
In [319]:
#Plot the year to year % changes
plt.clf()
figsize(12, 4)
plt.title("Percent change from period to period")
plt.bar(adata.loc[1:,'date'], pct_chg*100)
locs,labels = plt.yticks()
plt.yticks(locs, list(map((lambda x: "{}%".format(int(x))), locs)))
plt.show()
But what's the distribution of these percent changes? If we want to capture the way it tends to move, let's look at a histogram of the percent change:
In [407]:
#mean and standard deviation
m = pct_chg.mean()
sd = 10/(np.std(pct_chg))
# Create a normal distribution
p = pm.Normal('p', m, sd)
mdl_pct_chg = np.array([p.random() for i in np.arange(0,10000)])
#
plt.clf()
figsize(12, 4)
hist1, bins1 = np.histogram(pct_chg, bins=15)
hist2, bins2 = np.histogram(mdl_pct_chg, bins=15)
width = 0.7 * (bins[1] - bins[0])
center1 = (bins[:-1] + bins[1:]) / 2
center2 = (bins2[:-1] + bins2[1:]) / 2
plt.bar(center1, hist1, align='center', width=width)
plt.plot(center2, hist2*len(pct_chg)/10000)
plt.title("Histogram of Percent Change")
plt.show()
In [398]:
#Now create a MCMC model...
#@pm.deterministic
#def m_():
#def sd_():
m_ = pm.Normal('m', 0, 1)
sd_ = pm.Normal('sd', 0, 1)
observations = pm.Normal("obs", m, sd, value=pct_chg, observed=True)
model = pm.Model([observations, m_, sd_])
In [399]:
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000, 1)
In [437]:
#Get the traces, predicted points of the shape of m and sd
m_samples = mcmc.trace('m')[:]
sd_samples = mcmc.trace('sd')[:]
# histogram of the samples:
plt.clf()
figsize(12.5, 8)
ax = plt.subplot(211)
ax.set_autoscaley_on(False)
plt.hist(m_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\mu$", color="#A60628", normed=True)
plt.legend(loc="upper left")
plt.title(r"""Posterior distributions of the variables
$\mu,\;\tau$""")
plt.xlim([-3, 3])
plt.xlabel("$\mu$ value")
ax = plt.subplot(212)
ax.set_autoscaley_on(False)
plt.hist(sd_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\tau$", color="#7A68A6", normed=True)
plt.legend(loc="upper left")
plt.xlim([-3, 3])
plt.xlabel("$\tau$ value")
plt.show()
That probably doesn't seem like much, except that now we have an estimate of the two variables that we think are affecting the underlying percent change in housing price, assuming the change is just a random variable from a normal distribution.
But that's not really very interesting. That just says that prices change randomly as defined by our normal distribution. What we hope is that we can determine a series of factors that affect the change. Let's begin by assuming that the change in price is dependent on the change in population in the previous year.