This material uses Python to demonstrate statistical models for predicting continuous or categorical outcomes.
If the dependent variable (the variable we are trying to explain or predict) is continuous (has a large range of values, like the price of housing), then we use multiple regression.
If the dependent variable is categorical (or represents a choice outcome) with two values (rent or own, for example), then a binary logit model, or logistic regression, is usually the preferred model.
If the dependent variable is categorical (or represents a choice outcome) with a small number of values (like travel modes), then the most common model form is the Multinomial Logit (MNL) model.
In today's session we introduce Python libraries that enable estimating models of each of these types. Theory is only briefly reviewed, since these methods deserve a full course (or more) of their own. The material is mainly meant to help students who have already been exposed to these methods on another platform such as Stata, R, SAS, or SPSS, and who want to use Python for their statistical model building.
Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:
$Y = \beta_0 + \beta_1x + \epsilon$, where $\epsilon\sim N\left(0,\sigma^{2}\right)$
Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To create the model, we must "learn" or "estimate" the values of these coefficients. Once we've learned them, we can use the model to predict values of $y$ for new values of $x$.
Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line that minimizes the sum of squared residuals (also called the "sum of squared errors").
How do the model coefficients relate to the least squares line? $\beta_0$ is the intercept (the predicted value of $y$ when $x = 0$) and $\beta_1$ is the slope (the change in $y$ associated with a one-unit change in $x$).
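To make the least squares criterion concrete, here is the textbook result for the simple one-feature case (standard derivation, not anything specific to Statsmodels). The estimates minimize the sum of squared residuals,
$\min_{\beta_0,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$
and the minimizing values have a closed form:
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where $\bar{x}$ and $\bar{y}$ are the sample means. Statsmodels computes these estimates (and their multiple-regression generalization) for us below.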
In [262]:
# Startup steps
import pandas as pd, numpy as np, statsmodels.api as sm
import matplotlib.pyplot as plt, matplotlib.cm as cm, matplotlib.font_manager as fm
import matplotlib.mlab as mlab
import time, requests
from scipy.stats import pearsonr, ttest_rel
import seaborn as sns
sns.set()
%matplotlib inline
We begin by generating synthetic data from an equation whose parameters we supply. This lets us verify that the model estimation code correctly 'learns' the parameters before we use it on real data.
Below we generate 300 values of $Y$ from the equation $Y = \beta_0 + \beta_1 x + \epsilon$.
We set $\beta_0 = 0$ and $\beta_1 = 2$, and draw values of $\epsilon$ from a normal distribution with mean 0 and standard deviation 2.
In [312]:
nsample = 300
x = np.linspace(0, 10, nsample)
beta = np.array([0, 2])
e = np.random.normal(size=nsample)*2
X = sm.add_constant(x)
y = np.dot(X, beta) + e
Plot the data along with the true line $y = 2x$. Note that the intercept is set to zero in this initial example.
In [313]:
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [0, 20])
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 20])
plt.show();
Now we 'fit' the model to the data, which means we compute the values of $\beta_0$ and $\beta_1$ that minimize the sum of squared errors. We use Statsmodels for fitting the model. There are two different syntax styles for this; the first uses an X matrix and a y array, like this:
In [314]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
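The second syntax style is the R-style formula interface in statsmodels.formula.api (it uses patsy under the hood, which we return to later with dmatrices). A minimal sketch, assuming the same synthetic x and y arrays generated above; the DataFrame df is created here only for illustration:

import pandas as pd
import statsmodels.formula.api as smf

# put the synthetic data into a DataFrame so the formula can refer to columns by name
df = pd.DataFrame({'x': x, 'y': y})

# the formula interface adds the intercept automatically
formula_results = smf.ols('y ~ x', data=df).fit()
print(formula_results.summary())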
We can extract any of the results we want from the fitted model, since it is a Python object with attributes we can interrogate.
In [315]:
print('Parameters: ', results.params)
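A few other attributes of the fitted results object are often useful; these attribute names come from the statsmodels OLS results API:

print('Std. errors: ', results.bse)      # standard errors of the coefficient estimates
print('t-values:    ', results.tvalues)  # coefficient estimates divided by their standard errors
print('N obs:       ', results.nobs)     # number of observations used in the fit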
Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: If the population from which this sample was drawn was sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.
To get the 95% confidence interval around $\beta_0$ and $\beta_1$ we can do this:
In [316]:
results.conf_int()
Out[316]:
Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.
(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative hypothesis may indeed be true, except that you just don't have enough data to show that.)
As it relates to model coefficients, the conventional hypothesis test is: the null hypothesis is that there is no relationship between $x$ and $y$ (i.e., $\beta_1 = 0$), and the alternative hypothesis is that there is a relationship (i.e., $\beta_1 \neq 0$).
How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Relatedly, the p-value is the probability of observing a coefficient estimate at least this far from zero if the true coefficient were actually zero:
In [317]:
results.pvalues
Out[317]:
If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the 95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again, using 0.05 as the cutoff is just a convention.)
In this case, the p-value for $x$ is far less than 0.05, and so we believe that there is a relationship between $x$ and $y$.
Note that we generally ignore the p-value for the intercept.
The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model.
In [318]:
print('R2: ', results.rsquared)
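As a sanity check on the 'reduction in error over the null model' interpretation, here is a minimal sketch that recomputes R-squared by hand; it should match results.rsquared up to floating-point error:

ss_res = np.sum(results.resid ** 2)     # residual sum of squares for the fitted model
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares around the mean (the null model)
print('Manual R2: ', 1 - ss_res / ss_tot)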
We can also inspect the residuals (the errors), computed by comparing the predicted values of $y$ to the observed ones. The mean of the residuals should be essentially zero, and they should be approximately normally distributed.
In [319]:
results.resid.mean()
Out[319]:
In [320]:
from scipy.stats import norm
plt.rcParams['figure.figsize']=8,8
sns.set_style("white")
sns.set_style("ticks")
ax = sns.distplot(results.resid, fit=norm, kde=False)
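Beyond eyeballing the histogram, we can run a formal normality check on the residuals; here is a sketch using scipy's D'Agostino-Pearson test (the Omnibus and Jarque-Bera statistics in the regression summary serve the same purpose):

from scipy.stats import normaltest

# null hypothesis: the residuals are drawn from a normal distribution
stat, pvalue = normaltest(results.resid)
print('normaltest statistic: {:.3f}, p-value: {:.3f}'.format(stat, pvalue))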
Simple linear regression can easily be extended to include multiple features. This is called multiple linear regression:
$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n + \epsilon$
Let's add an additional x variable to our synthetic model, using x squared, and fit it again with Statsmodels.
We generate new values of $y$ using $\beta_0 = 1$, $\beta_1 = 2$, and $\beta_2 = 0.5$, with $\epsilon$ again drawn from a normal distribution with standard deviation 2.
In [323]:
nsample = 300
x = np.linspace(0, 10, 300)
X = np.column_stack((x, x**2))
beta = np.array([1, 2, .5])
e = np.random.normal(size=nsample)*2
In [324]:
X = sm.add_constant(X)
y = np.dot(X, beta) + e
In [325]:
X[:10]
Out[325]:
In [326]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
In [327]:
ax = sns.regplot(x=X[:,1], y=y, scatter_kws={"s": 10}, order=2, ci=None, truncate=True)
In [328]:
results.resid.mean()
Out[328]:
In [331]:
from scipy.stats import norm
plt.rcParams['figure.figsize']=8,8
sns.set_style("white")
sns.set_style("ticks")
ax = sns.distplot(results.resid, fit=norm, kde=False)
R-squared will always increase as you add more features to the model, even if they are unrelated to the response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.
There is an alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
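Statsmodels exposes both measures on the fitted results object, so they are easy to compare; a quick sketch using the polynomial model fitted above:

print('R-squared:          ', results.rsquared)
print('Adjusted R-squared: ', results.rsquared_adj)

Next we move from synthetic data to real data: a Redfin extract covering one month of home sales in San Francisco County.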
In [332]:
sf = pd.read_csv('data/redfin_2017-03-05-17-45-34-san-francisco-county-1-month.csv')
sf.columns
Out[332]:
In [333]:
sf1 = sf.rename(index=str, columns={'SALE TYPE': 'saletype',
'SOLD DATE': 'solddate', 'PROPERTY TYPE': 'proptype', 'ADDRESS': 'address',
'CITY': 'city', 'STATE': 'state', 'ZIP': 'zip', 'PRICE': 'price', 'BEDS': 'beds',
'BATHS': 'baths', 'LOCATION': 'location', 'SQUARE FEET': 'sqft', 'LOT SIZE': 'lotsize',
'YEAR BUILT': 'yrbuilt', 'DAYS ON MARKET': 'daysonmkt', '$/SQUARE FEET': 'pricesqft',
'LATITUDE': 'latitude', 'LONGITUDE': 'longitude', 'HOA/MONTH': 'hoamonth',
'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)': 'url',
'STATUS': 'status', 'NEXT OPEN HOUSE START TIME': 'nextopenstart', 'NEXT OPEN HOUSE END TIME': 'nextopenend',
'SOURCE': 'source', 'MLS#': 'mls', 'FAVORITE': 'favorite', 'INTERESTED': 'interested'
})
sf1.head()
Out[333]:
In [355]:
sf1.columns
Out[355]:
In [334]:
sf1.describe()
Out[334]:
In [336]:
g = sns.jointplot("sqft", "price", data=sf1, kind="reg", scatter_kws={"s": 10}, size=10)
Let's fit a simple linear regression of price on sqft, since there is clearly a strong relationship between them.
In [337]:
import statsmodels.api as sm
import numpy as np
from patsy import dmatrices
y, X = dmatrices('price ~ sqft',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
How would you intepret the R-squared value?
How can you interpret the coefficient on sqft?
A common thing to explore is whether a transformation of the dependent and/or independent variables improves the degree to which the relationship is linear, and the fit of the model. Most models of housing prices, or 'hedonic price models', are specified with the log of price, and often with the log of continuous variables like sqft. Let's look at the relationship once we log-transform both:
In [338]:
g = sns.jointplot(np.log(sf1['sqft']), np.log(sf1['price']), kind="reg", scatter_kws={"s": 8}, size=10)
And now let's re-estimate the model using the log-log transformation of price and sqft.
In [339]:
y, X = dmatrices('np.log(price) ~ np.log(sqft)',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
The model fit improves considerably. How do you interpret the coefficient for sqft now?
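One way to see it: in a log-log specification the slope is (approximately) an elasticity, since
$\beta_1 = \dfrac{d\,\ln(\text{price})}{d\,\ln(\text{sqft})} = \dfrac{d\,\text{price}/\text{price}}{d\,\text{sqft}/\text{sqft}}$
so a 1% increase in sqft is associated with roughly a $\beta_1$% increase in price.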
Next let's add the number of baths and see how that changes the model.
In [340]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Let's add bedrooms now and see how that changes the model...
In [348]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Next, let's add an indicator for homes built before 1940:
In [344]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Now let's also add the monthly homeowners association fee (hoamonth):
In [347]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940 + hoamonth',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Let's look at some diagnostics: a scatterplot matrix of the main numeric variables, the distribution of the residuals, and plots of the predictions and residuals against the observed values.
In [366]:
keepcols = [ 'price', 'beds', 'baths', 'sqft', 'lotsize', 'yrbuilt']
sf1_small=sf1[keepcols]
sf1_small.head()
Out[366]:
In [369]:
g = sns.PairGrid(sf1_small)
g.map(plt.scatter);
In [370]:
data = observed.join(predicted.to_frame(name='predicted'))
data.head()
Out[370]:
In [371]:
plt.rcParams['figure.figsize']=8,8
sns.set_style("white")
sns.set_style("ticks")
ax = sns.distplot(residuals, fit=norm, kde=False)
In [372]:
g = sns.jointplot(predicted, residuals, kind="reg", scatter_kws={"s": 8}, size=10)
In [373]:
g = sns.jointplot("np.log(price)", "predicted", data=data, kind="reg", scatter_kws={"s": 8}, size=10)
Now let's bring in some neighborhood context: block-level accessibility and demographic variables for the Bay Area, which we will merge onto the sales records using each sale's census block FIPS code.
In [389]:
vars = pd.read_csv('data/ba_block_variables.csv', dtype={'block_id':object})
vars.shape
In [390]:
smallvars = vars[['block_id','block_groups_sum_persons','block_groups_sum_acres','block_groups_total_jobs',
'block_groups_median_children', 'block_groups_median_persons' , 'block_groups_median_income',
'block_groups_prop_tenure_2', 'nodes_low_income_hh_1500m', 'nodes_high_income_hh_1500m',
'nodes_jobs_5000m', 'nodes_jobs_30km', 'nodes_population_400m', 'nodes_population_800m',
'nodes_jobs_800m', 'prop_race_of_head_1', 'prop_race_of_head_2', 'pumas_density_households',
'nodes_jobs_3000m_agg1', 'nodes_jobs_3000m_agg2', 'pumas_density_jobs', 'nodes_jobs_40km',
'nodes_jobs_3000m_agg3', 'nodes_jobs_3000m_agg4', 'nodes_jobs_3000m_agg5',
'block_groups_prop_persons_1', 'block_groups_prop_persons_2', 'block_groups_median_age_of_head',
'puma10_id_is_0607501', 'puma10_id_is_0607502', 'puma10_id_is_0607503', 'puma10_id_is_0607504']]
sv = smallvars.rename(index=str, columns={'block_groups_sum_persons': 'bgpop',
'block_groups_sum_acres': 'bgacres',
'block_groups_total_jobs': 'bgjobs',
'block_groups_median_children': 'bgmedkids',
'block_groups_median_persons': 'bgmedhhs',
'block_groups_median_income': 'bgmedinc',
'block_groups_prop_tenure_2': 'proprent',
'nodes_low_income_hh_1500m': 'lowinc1500m',
'nodes_high_income_hh_1500m': 'highinc1500m',
'nodes_jobs_5000m': 'lnjobs5000m',
'nodes_jobs_30km': 'lnjobs30km',
'nodes_jobs_40km': 'lnjobs40km',
'nodes_population_400m': 'lnpop400m',
'nodes_population_800m': 'lnpop800m',
'nodes_jobs_800m': 'lnjobs800m',
'prop_race_of_head_1': 'propwhite',
'prop_race_of_head_2': 'propblack',
'pumas_density_households': 'pumahhden',
'pumas_density_jobs': 'pumajobden',
'nodes_jobs_3000m_agg1': 'lnbasic3000m',
'nodes_jobs_3000m_agg2': 'lntcpuw3000m',
'nodes_jobs_3000m_agg3': 'lnret3000m',
'nodes_jobs_3000m_agg4': 'lnfire3000m',
'nodes_jobs_3000m_agg5': 'lnserv3000m',
'block_groups_prop_persons_1': 'prop1per',
'block_groups_prop_persons_2': 'prop2per',
'block_groups_median_age_of_head': 'bgmedagehd',
'puma10_id_is_0607501': 'puma1',
'puma10_id_is_0607502': 'puma2',
'puma10_id_is_0607503': 'puma3',
'puma10_id_is_0607504': 'puma4'
})
sv.head()
Out[390]:
Save the reduced set of variables to a csv so it can be reloaded directly, and read it back with block_id kept as a string:
In [406]:
sv.to_csv('data/smallvars.csv')
In [409]:
sv = pd.read_csv('data/smallvars.csv', converters={'block_id': str})
sv.head()
Out[409]:
Recoded detailed race code (used for the race_of_head variables):
1. White alone
2. Black or African American alone
3. American Indian alone
4. Alaska Native alone
5. American Indian and Alaska Native tribes specified; or American Indian or Alaska Native, not specified and no other races
6. Asian alone
7. Native Hawaiian and Other Pacific Islander alone
8. Some other race alone
9. Two or more major race groups
To attach these block-level variables to each sale, we first need the sale's census block: the FCC Block API returns the FIPS block code for a latitude/longitude pair.
In [382]:
# pass the FCC API lat/long and get FIPS data back - return block fips and county name
def get_fips(row):
    time.sleep(pause)
    url = 'http://data.fcc.gov/api/block/find?format=json&latitude={}&longitude={}'
    request = url.format(row['latitude'], row['longitude'])
    response = requests.get(request)
    data = response.json()
    # return multiple values as a series - this will create a dataframe with multiple columns
    return pd.Series({'fips_code':data['Block']['FIPS'], 'county':data['County']['name']})
In [383]:
%%time
pause = 0.1
fips = sf1.apply(get_fips, axis=1)
sf1 = pd.concat([sf1, fips], axis=1)
sf1.head()
In [384]:
#sf1.to_csv('data/redfin_w_block.csv')
In [403]:
#sf = pd.read_csv('data/redfin_w_block.csv', usecols=['proptype', 'price', 'beds', 'baths', 'sqft', 'lotsize',\
# 'yrbuilt', 'hoamonth', 'fips_code'], converters={'fips_code': str})
#sf.head()
Now merge the sales records (with their block FIPS codes) onto the block-level variables:
In [404]:
sf_mrg = pd.merge(sf1, sv, left_on='fips_code', right_on='block_id', how='inner')
sf_mrg.columns
Out[404]:
Finally, re-estimate the hedonic model with some of the neighborhood variables added: block group median income (bgmedinc), the share of renters (proprent), and job accessibility within 5000 meters (lnjobs5000m):
In [405]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940 + hoamonth + bgmedinc +\
proprent + lnjobs5000m',
data=sf_mrg, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
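A quick way to scan the expanded model for coefficients that are statistically distinguishable from zero at the conventional 5% level (a sketch; res.params and res.pvalues are pandas Series when the design matrices come from a formula, so boolean indexing works):

# coefficients whose p-values fall below the conventional 0.05 cutoff
significant = res.params[res.pvalues < 0.05]
print(significant)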