Statistical Modeling With Python

Multiple Regression and Discrete Choice Models

This material uses Python to demonstrate statistical models for outcomes that are continuous or categorical.

  • If the dependent variable (the variable we are trying to explain or predict) is continuous (has a large range, like price of housing), then we use multiple regression.

  • If the dependent variable is categorical (or represents a choice outcome) with two values (rent or own, for example), then a binary logit model, also known as logistic regression, is usually the preferred model.

  • If the dependent variable is categorical (or represents a choice outcome) with a small number of values (like travel modes), then the most common model form is the Multinomial Logit (MNL).

In today's session we introduce Python libraries that enable estimating models of each of these types. Theory is only briefly reviewed, as these methods warrant a full course (or more). The material is mainly meant to help students who have already encountered these methods on another platform such as Stata, R, SAS, or SPSS, and want to use Python for their statistical model building.
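As a quick preview of the discrete choice side (which we return to later), here is a minimal sketch of estimating a binary logit with statsmodels on synthetic data. The sample size and the "true" parameters (0.5 and 1.5) are assumptions for illustration only.


In [ ]:
import numpy as np
import statsmodels.api as sm

# simulate a binary choice: P(choice=1) follows a logistic curve in x
np.random.seed(0)
x = np.random.normal(size=500)
utility = 0.5 + 1.5 * x                      # assumed true parameters
p = 1 / (1 + np.exp(-utility))               # logistic choice probabilities
choice = np.random.binomial(1, p)            # simulated 0/1 outcomes

# estimate the logit and check that we recover roughly [0.5, 1.5]
logit_res = sm.Logit(choice, sm.add_constant(x)).fit(disp=0)
print(logit_res.params)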

Simple Linear Regression

Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:

$Y = \beta_0 + \beta_1x + \epsilon$, where $\epsilon\sim N\left(0,\sigma^{2}\right)$

  • $y$ is the dependent variable - the one we are trying to explain or predict
  • $x$ is the independent or explanatory variable that we use to explain or predict the value of $y$
  • $\beta_0$ is the intercept
  • $\beta_1$ is the coefficient for $x$, which is the slope of the line that minimizes the sum of the squared errors
  • $\epsilon$ is the error term, assumed to be normally distributed with mean of zero and variance of $\sigma^{2}$

Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To create your model, you must "learn" or "estimate" the values of these coefficients. And once we've learned these coefficients, we can use the model to predict values of $y$ based on new values of $x$.

Estimating Model Coefficients

Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors").

In the figure below:

  • The black dots are the observed values of x and y.
  • The red line is our least squares line.
  • The residuals are the vertical distances between the observed values and the least squares line.

How do the model coefficients relate to the least squares line?

  • $\beta_0$ is the intercept (the value of $y$ when $x$=0)
  • $\beta_1$ is the slope (the change in $y$ divided by change in $x$)

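Before turning to statsmodels, here is a minimal sketch of the closed-form least squares estimates for the single-predictor case, using a tiny made-up dataset (the numbers are assumptions for illustration):


In [ ]:
import numpy as np

# toy data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# closed-form least squares estimates for simple regression:
#   beta1 = cov(x, y) / var(x)
#   beta0 = mean(y) - beta1 * mean(x)
beta1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)   # slope should be close to 2, intercept near 0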


In [262]:
# Startup steps
import pandas as pd, numpy as np, statsmodels.api as sm
import matplotlib.pyplot as plt, matplotlib.cm as cm, matplotlib.font_manager as fm
import time, requests
from scipy.stats import pearsonr, ttest_rel
import seaborn as sns
sns.set()
%matplotlib inline

We begin by generating synthetic data from an equation whose parameters we supply. This lets us verify that the model estimation code correctly 'learns' those parameters before we use it on real data.

Below we generate 300 values of $Y$ from the equation $Y = \beta_0 + \beta_1 x + \epsilon$.

We set $\beta_0 = 0$ and $\beta_1 = 2$, and draw values of $\epsilon$ from a normal distribution with a standard deviation of 2.


In [312]:
nsample = 300
x = np.linspace(0, 10, nsample)           # evenly spaced x values on [0, 10]
beta = np.array([0, 2])                   # true [intercept, slope]
e = np.random.normal(size=nsample) * 2    # normal errors with sd = 2
X = sm.add_constant(x)                    # prepend a column of ones for the intercept
y = np.dot(X, beta) + e

Plot the data and the model. Note that the intercept is set to zero in this example initially.


In [313]:
plt.figure(1, figsize=(10,8))
plt.plot([0, 10], [0, 20])                # the true line: y = 2x
plt.scatter(x, y, marker='o', s=10, c='g')
plt.axis([0, 10, 0, 20])
plt.show()


Now we 'fit' the model to the data, which means we compute the values of $\beta_0$ and $\beta_1$ that minimize the sum of the squared errors. We use Statsmodels for fitting the model. There are two syntax styles for this: the first uses an X matrix and a y array, like this (the second, formula-based style is shown after the results):


In [314]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.887
Model:                            OLS   Adj. R-squared:                  0.886
Method:                 Least Squares   F-statistic:                     2336.
Date:                Sun, 19 Nov 2017   Prob (F-statistic):          4.68e-143
Time:                        20:57:54   Log-Likelihood:                -639.42
No. Observations:                 300   AIC:                             1283.
Df Residuals:                     298   BIC:                             1290.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0559      0.236      0.237      0.813      -0.408       0.520
x1             1.9711      0.041     48.335      0.000       1.891       2.051
==============================================================================
Omnibus:                        1.220   Durbin-Watson:                   2.060
Prob(Omnibus):                  0.543   Jarque-Bera (JB):                0.956
Skew:                           0.055   Prob(JB):                        0.620
Kurtosis:                       3.253   Cond. No.                         11.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
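The second style uses an R-style formula string via statsmodels' formula API. A minimal sketch, assuming the x and y arrays from above (the DataFrame wrapper and its column names are ours):


In [ ]:
import pandas as pd
import statsmodels.formula.api as smf

# wrap the arrays in a DataFrame so the formula can refer to columns by name
df = pd.DataFrame({'x': x, 'y': y})
formula_res = smf.ols('y ~ x', data=df).fit()   # the intercept is added automatically
print(formula_res.params)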

We can extract any of the results from the fitted model we want, since the fitted model is a Python object and has attributes we can interrogate.


In [315]:
print('Parameters: ', results.params)


Parameters:  [ 0.05590029  1.9711345 ]
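And once the coefficients are estimated, we can predict $y$ at new values of $x$. A minimal sketch (the new x values here are arbitrary):


In [ ]:
# build a design matrix for new x values (constant column plus x)
x_new = sm.add_constant(np.array([2.5, 5.0, 7.5]))
print(results.predict(x_new))   # predicted y at x = 2.5, 5.0, 7.5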

Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: if we drew 100 samples from the same population and computed an interval from each, approximately 95 of those intervals would contain the "true" coefficient.

To get the 95% confidence interval around $\beta_0$ and $\beta_1$ we can do this:


In [316]:
results.conf_int()


Out[316]:
array([[-0.40783806,  0.51963865],
       [ 1.89087973,  2.05138926]])

Hypothesis Testing and p-values

Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.

(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative hypothesis may indeed be true, except that you just don't have enough data to show that.)

As it relates to model coefficients, here is the conventional hypothesis test:

  • null hypothesis: There is no relationship between $x$ and $y$ (and thus $\beta_1$ equals zero)
  • alternative hypothesis: There is a relationship between $x$ and $y$ (and thus $\beta_1$ is not equal to zero)

How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Relatedly, the p-value is the probability of seeing a coefficient estimate at least as far from zero as the one we obtained, if the true coefficient were actually zero:


In [317]:
results.pvalues


Out[317]:
array([  8.12646896e-001,   4.68109967e-143])

If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the 95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again, using 0.05 as the cutoff is just a convention.)
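A quick sketch to verify that correspondence using the fitted results from above:


In [ ]:
# a CI that excludes zero should line up with p < 0.05
ci = results.conf_int()
print((ci[:, 0] > 0) | (ci[:, 1] < 0))   # does each CI exclude zero?
print(results.pvalues < 0.05)            # same True/False pattern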

In this case, the p-value for $x$ is far less than 0.05, and so we believe that there is a relationship between $x$ and $y$.

Note that we generally ignore the p-value for the intercept.

How Well Does the Model Fit the Data?

The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)

R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model.
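To make the definition concrete, here is a minimal sketch computing R-squared by hand from the fitted model above (one minus the ratio of residual to total sums of squares):


In [ ]:
# R-squared = 1 - SS_residual / SS_total,
# where the null model simply predicts the mean of y
ss_res = (results.resid ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
print(1 - ss_res / ss_tot)   # should match results.rsquared below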


In [318]:
print('R2: ', results.rsquared)


R2:  0.886875409381

We can also inspect the residuals (the errors) computed by comparing the predicted values of y to the observed ones. The mean of the residuals should be zero and they should be normally distributed.


In [319]:
results.resid.mean()


Out[319]:
-2.9013828376870757e-16

In [320]:
from scipy.stats import norm
plt.rcParams['figure.figsize'] = 8, 8
sns.set_style("white")
sns.set_style("ticks")
# histogram of the residuals with a fitted normal curve overlaid
ax = sns.distplot(results.resid, fit=norm, kde=False)


Multiple Linear Regression

Simple linear regression can easily be extended to include multiple features. This is called multiple linear regression:

$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n + \epsilon$

Let's add an additional x variable to our synthetic model, using x squared, and fit it again with Statsmodels.

We generate new values of $y$ using $\beta_0 = 1$, $\beta_1 = 2$ and $\beta_2 = 0.5$ (note the non-zero intercept this time).


In [323]:
nsample = 300
x = np.linspace(0, 10, nsample)
X = np.column_stack((x, x**2))            # two features: x and x squared
beta = np.array([1, 2, .5])               # true [intercept, beta_1, beta_2]
e = np.random.normal(size=nsample)*2

In [324]:
X = sm.add_constant(X)
y = np.dot(X, beta) + e

In [325]:
X[:10]


Out[325]:
array([[ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.03344482,  0.00111856],
       [ 1.        ,  0.06688963,  0.00447422],
       [ 1.        ,  0.10033445,  0.010067  ],
       [ 1.        ,  0.13377926,  0.01789689],
       [ 1.        ,  0.16722408,  0.02796389],
       [ 1.        ,  0.2006689 ,  0.04026801],
       [ 1.        ,  0.23411371,  0.05480923],
       [ 1.        ,  0.26755853,  0.07158757],
       [ 1.        ,  0.30100334,  0.09060301]])

In [326]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.991
Model:                            OLS   Adj. R-squared:                  0.991
Method:                 Least Squares   F-statistic:                 1.574e+04
Date:                Sun, 19 Nov 2017   Prob (F-statistic):          4.28e-302
Time:                        20:58:30   Log-Likelihood:                -634.18
No. Observations:                 300   AIC:                             1274.
Df Residuals:                     297   BIC:                             1285.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.7160      0.346      2.066      0.040       0.034       1.398
x1             2.0604      0.160     12.872      0.000       1.745       2.375
x2             0.4946      0.015     31.920      0.000       0.464       0.525
==============================================================================
Omnibus:                        2.309   Durbin-Watson:                   1.978
Prob(Omnibus):                  0.315   Jarque-Bera (JB):                1.854
Skew:                           0.014   Prob(JB):                        0.396
Kurtosis:                       2.616   Cond. No.                         146.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [327]:
ax = sns.regplot(x=X[:,1], y=y, scatter_kws={"s": 10}, order=2, ci=None, truncate=True)



In [328]:
results.resid.mean()


Out[328]:
-5.7928476830208332e-14

In [331]:
from scipy.stats import norm
plt.rcParams['figure.figsize']=8,8
sns.set_style("white")
sns.set_style("ticks")
ax = sns.distplot(results.resid, fit=norm, kde=False)


R-squared will always increase as you add more features to the model, even if they are unrelated to the response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.

There is an alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
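For reference, here is a minimal sketch of the adjusted R-squared formula applied to the fitted model above:


In [ ]:
# adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors
n = results.nobs
p = results.df_model
print(1 - (1 - results.rsquared) * (n - 1) / (n - p - 1))
# should match results.rsquared_adj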

Estimating a Multiple Regression on Housing Prices (Hedonic Regression)

Now let's use real data. We pull one month of housing sales transactions in San Francisco from Redfin, looking back from March 5, 2017.


In [332]:
sf = pd.read_csv('data/redfin_2017-03-05-17-45-34-san-francisco-county-1-month.csv')
sf.columns


Out[332]:
Index(['SALE TYPE', 'SOLD DATE', 'PROPERTY TYPE', 'ADDRESS', 'CITY', 'STATE',
       'ZIP', 'PRICE', 'BEDS', 'BATHS', 'LOCATION', 'SQUARE FEET', 'LOT SIZE',
       'YEAR BUILT', 'DAYS ON MARKET', '$/SQUARE FEET', 'HOA/MONTH', 'STATUS',
       'NEXT OPEN HOUSE START TIME', 'NEXT OPEN HOUSE END TIME',
       'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)',
       'SOURCE', 'MLS#', 'FAVORITE', 'INTERESTED', 'LATITUDE', 'LONGITUDE'],
      dtype='object')

In [333]:
sf1 = sf.rename(index=str, columns={'SALE TYPE': 'saletype',
    'SOLD DATE': 'solddate', 'PROPERTY TYPE': 'proptype', 'ADDRESS': 'address',
    'CITY': 'city', 'STATE': 'state', 'ZIP': 'zip', 'PRICE': 'price', 'BEDS': 'beds',
    'BATHS': 'baths', 'LOCATION': 'location', 'SQUARE FEET': 'sqft', 'LOT SIZE': 'lotsize',
    'YEAR BUILT': 'yrbuilt', 'DAYS ON MARKET': 'daysonmkt', '$/SQUARE FEET': 'pricesqft',
    'LATITUDE': 'latitude', 'LONGITUDE': 'longitude', 'HOA/MONTH': 'hoamonth',
    'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)': 'url',
    'STATUS': 'status', 'NEXT OPEN HOUSE START TIME': 'nextopenstart', 'NEXT OPEN HOUSE END TIME': 'nextopenend',
    'SOURCE': 'source', 'MLS#': 'mls', 'FAVORITE': 'favorite', 'INTERESTED': 'interested'
    })

sf1.head()


Out[333]:
saletype solddate proptype address city state zip price beds baths ... status nextopenstart nextopenend url source mls favorite interested latitude longitude
0 PAST SALE February-7-2017 Single Family Residential 521 Cayuga Ave San Francisco CA 94112.0 1200000.0 2.0 1.25 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/521-Cay... San Francisco MLS 453167 N Y 37.728860 -122.435308
1 PAST SALE February-10-2017 Condo/Co-op 95 Red Rock Way Unit 105M San Francisco CA 94131.0 590000.0 0.0 1.00 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/95-Red-... San Francisco MLS 452726 N Y 37.746324 -122.441517
2 PAST SALE February-10-2017 Single Family Residential 3157 Baker St San Francisco CA 94123.0 3260000.0 5.0 3.50 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/3157-Ba... San Francisco MLS 451943 N Y 37.800218 -122.446455
3 PAST SALE February-17-2017 Single Family Residential 2594 San Jose Ave San Francisco CA 94112.0 920000.0 3.0 2.00 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/2594-Sa... MetroList 16067859 N Y 37.716435 -122.450444
4 PAST SALE February-27-2017 Single Family Residential 130 Brazil Ave SAN FRANCISCO CA 94112.0 790000.0 2.0 1.00 ... Sold NaN NaN http://www.redfin.com/CA/San-Francisco/130-Bra... MLSListings ML81635907 N Y 37.723920 -122.433735

5 rows × 27 columns


In [355]:
sf1.columns


Out[355]:
Index(['saletype', 'solddate', 'proptype', 'address', 'city', 'state', 'zip',
       'price', 'beds', 'baths', 'location', 'sqft', 'lotsize', 'yrbuilt',
       'daysonmkt', 'pricesqft', 'hoamonth', 'status', 'nextopenstart',
       'nextopenend', 'url', 'source', 'mls', 'favorite', 'interested',
       'latitude', 'longitude'],
      dtype='object')

In [334]:
sf1.describe()


Out[334]:
zip price beds baths sqft lotsize yrbuilt daysonmkt pricesqft hoamonth nextopenstart nextopenend latitude longitude
count 308.000000 3.080000e+02 301.000000 293.000000 266.000000 139.000000 288.000000 306.00000 264.000000 159.000000 0.0 0.0 309.000000 309.000000
mean 93888.253247 1.521055e+06 2.647841 1.880546 1662.593985 3194.935252 1952.006944 14.30719 949.905303 543.874214 NaN NaN 37.760820 -122.431289
std 2820.378426 1.063400e+06 1.862499 0.896519 1169.031917 4656.050424 39.950230 8.87308 279.135062 436.562243 NaN NaN 0.025052 0.027834
min 59059.000000 3.500000e+03 0.000000 1.000000 300.000000 611.000000 1885.000000 2.00000 339.000000 4.000000 NaN NaN 37.709249 -122.510730
25% 94109.000000 8.787500e+05 2.000000 1.000000 978.750000 2319.500000 1916.750000 6.00000 773.750000 342.500000 NaN NaN 37.739065 -122.443876
50% 94114.000000 1.271000e+06 2.000000 2.000000 1360.000000 2513.000000 1941.000000 12.00000 963.500000 473.000000 NaN NaN 37.762549 -122.428527
75% 94122.000000 1.813750e+06 3.000000 2.000000 1904.750000 3090.000000 1996.000000 23.00000 1106.000000 683.000000 NaN NaN 37.781520 -122.412238
max 94158.000000 1.025000e+07 18.000000 5.500000 9504.000000 56024.000000 2017.000000 30.00000 2053.000000 4413.000000 NaN NaN 37.805816 -122.371007

In [336]:
g = sns.jointplot("sqft", "price", data=sf1, kind="reg", scatter_kws={"s": 10}, size=10)


Let's fit a simple linear regression of price on sqft, since there is clearly a strong relationship between them. This time we use the formula-based syntax, building the design matrices with patsy's dmatrices:


In [337]:
import statsmodels.api as sm
import numpy as np
from patsy import dmatrices
y, X = dmatrices('price ~ sqft', 
                 data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.577
Method:                 Least Squares   F-statistic:                     361.3
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           2.74e-51
Time:                        21:00:08   Log-Likelihood:                -3907.9
No. Observations:                 265   AIC:                             7820.
Df Residuals:                     263   BIC:                             7827.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    4.52e+05    6.6e+04      6.847      0.000    3.22e+05    5.82e+05
sqft         616.7552     32.447     19.008      0.000     552.867     680.644
==============================================================================
Omnibus:                      158.986   Durbin-Watson:                   1.874
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1792.883
Skew:                           2.184   Prob(JB):                         0.00
Kurtosis:                      14.971   Cond. No.                     3.55e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.55e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

How would you interpret the R-squared value?

How can you interpret the coefficient on sqft?

A common thing to explore is whether a transformation of the dependent and/or independent variables helps improve the degree to which the relationship is linear, and the fit of the model. Most models of housing prices, or 'hedonic price models', are specified with the log of price, and often with the log of continuous variables like sqft. Let's look at the relationship once we log-transform both:


In [338]:
g = sns.jointplot(np.log(sf1['sqft']), np.log(sf1['price']), kind="reg", scatter_kws={"s": 8}, size=10)


And now let's re-estimate the model using the log-log transformation of price and sqft.


In [339]:
y, X = dmatrices('np.log(price) ~ np.log(sqft)', 
                 data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.696
Model:                            OLS   Adj. R-squared:                  0.695
Method:                 Least Squares   F-statistic:                     603.0
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           5.17e-70
Time:                        21:00:31   Log-Likelihood:                -44.561
No. Observations:                 265   AIC:                             93.12
Df Residuals:                     263   BIC:                             100.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        8.3799      0.232     36.115      0.000       7.923       8.837
np.log(sqft)     0.7835      0.032     24.555      0.000       0.721       0.846
==============================================================================
Omnibus:                        7.938   Durbin-Watson:                   1.875
Prob(Omnibus):                  0.019   Jarque-Bera (JB):               12.858
Skew:                          -0.117   Prob(JB):                      0.00161
Kurtosis:                       4.053   Cond. No.                         97.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model fit improves considerably. How do you interpret the coefficient for sqft now?
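One way to see the log-log interpretation: the coefficient is an elasticity, so a 1% increase in sqft is associated with roughly a 0.78% increase in price. A minimal check using the fitted res object from the cell above:


In [ ]:
# in a log-log model the coefficient acts as an elasticity:
# a 1% increase in sqft scales predicted price by 1.01 ** beta
beta_sqft = res.params['np.log(sqft)']
print(1.01 ** beta_sqft)   # ~1.0078, i.e. about a 0.78% higher price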

Next let's add the number of baths and see how that changes the model.


In [340]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths', 
                 data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.705
Model:                            OLS   Adj. R-squared:                  0.703
Method:                 Least Squares   F-statistic:                     301.5
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           1.43e-67
Time:                        21:00:34   Log-Likelihood:                -32.226
No. Observations:                 255   AIC:                             70.45
Df Residuals:                     252   BIC:                             81.08
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        8.8761      0.339     26.221      0.000       8.209       9.543
np.log(sqft)     0.6874      0.053     13.080      0.000       0.584       0.791
baths            0.1124      0.030      3.695      0.000       0.052       0.172
==============================================================================
Omnibus:                       19.326   Durbin-Watson:                   1.964
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               28.444
Skew:                          -0.500   Prob(JB):                     6.66e-07
Kurtosis:                       4.296   Cond. No.                         149.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Let's add bedrooms now and see how that changes the model...


In [348]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds', 
                 data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.739
Model:                            OLS   Adj. R-squared:                  0.735
Method:                 Least Squares   F-statistic:                     236.4
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           8.26e-73
Time:                        22:59:54   Log-Likelihood:                -16.926
No. Observations:                 255   AIC:                             41.85
Df Residuals:                     251   BIC:                             56.02
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept        7.5064      0.401     18.727      0.000       6.717       8.296
np.log(sqft)     0.9100      0.063     14.374      0.000       0.785       1.035
baths            0.1425      0.029      4.881      0.000       0.085       0.200
beds            -0.1189      0.021     -5.657      0.000      -0.160      -0.077
==============================================================================
Omnibus:                       17.160   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               20.918
Skew:                          -0.529   Prob(JB):                     2.87e-05
Kurtosis:                       3.923   Cond. No.                         198.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Now let's add a dummy variable indicating whether the home was built before 1940, to test whether older homes sell at a premium.


In [344]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940', 
                 data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.739
Model:                            OLS   Adj. R-squared:                  0.734
Method:                 Least Squares   F-statistic:                     176.6
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           1.33e-71
Time:                        21:01:30   Log-Likelihood:                -16.890
No. Observations:                 255   AIC:                             43.78
Df Residuals:                     250   BIC:                             61.49
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  7.4783      0.415     18.010      0.000       6.661       8.296
yrbuilt < 1940[T.True]    -0.0097      0.037     -0.266      0.791      -0.082       0.062
np.log(sqft)               0.9150      0.066     13.838      0.000       0.785       1.045
baths                      0.1398      0.031      4.522      0.000       0.079       0.201
beds                      -0.1183      0.021     -5.586      0.000      -0.160      -0.077
==============================================================================
Omnibus:                       17.220   Durbin-Watson:                   1.929
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               20.905
Skew:                          -0.532   Prob(JB):                     2.89e-05
Kurtosis:                       3.914   Cond. No.                         205.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The pre-1940 dummy is not statistically significant here. Next let's add the monthly HOA fee. Note that hoamonth is missing for many sales, so the estimation sample shrinks from 255 to 140 observations.


In [347]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940 + hoamonth', 
                 data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.800
Model:                            OLS   Adj. R-squared:                  0.793
Method:                 Least Squares   F-statistic:                     107.5
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           3.85e-45
Time:                        21:02:57   Log-Likelihood:                 3.6599
No. Observations:                 140   AIC:                             4.680
Df Residuals:                     134   BIC:                             22.33
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  7.9919      0.610     13.111      0.000       6.786       9.197
yrbuilt < 1940[T.True]     0.1097      0.049      2.222      0.028       0.012       0.207
np.log(sqft)               0.8069      0.101      8.006      0.000       0.608       1.006
baths                      0.0887      0.051      1.749      0.083      -0.012       0.189
beds                       0.0019      0.044      0.042      0.967      -0.086       0.090
hoamonth                   0.0002   5.27e-05      3.551      0.001    8.29e-05       0.000
==============================================================================
Omnibus:                       27.843   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               54.137
Skew:                          -0.878   Prob(JB):                     1.76e-12
Kurtosis:                       5.489   Cond. No.                     2.20e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.2e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

To explore the data visually, let's keep a small subset of columns and look at all their pairwise scatterplots.


In [366]:
keepcols = [ 'price', 'beds', 'baths', 'sqft', 'lotsize', 'yrbuilt']
sf1_small=sf1[keepcols]
sf1_small.head()


Out[366]:
price beds baths sqft lotsize yrbuilt
0 1200000.0 2.0 1.25 1500.0 2265.0 1940.0
1 590000.0 0.0 1.00 650.0 NaN 1972.0
2 3260000.0 5.0 3.50 3646.0 2050.0 2003.0
3 920000.0 3.0 2.00 1287.0 2495.0 1960.0
4 790000.0 2.0 1.00 1125.0 2495.0 1923.0

In [369]:
g = sns.PairGrid(sf1_small)
g.map(plt.scatter);



Now let's compare observed and predicted values from the most recent model, and inspect the residuals.


In [370]:
data = observed.join(predicted.to_frame(name='predicted'))
data.head()


Out[370]:
np.log(price) predicted
0 13.997832 14.101815
1 13.287878 13.542935
2 14.997238 14.874004
3 13.732129 13.950430
4 13.579788 13.804406

In [371]:
plt.rcParams['figure.figsize']=8,8

sns.set_style("white")
sns.set_style("ticks")
ax = sns.distplot(residuals, fit=norm, kde=False)



In [372]:
g = sns.jointplot(predicted, residuals, kind="reg", scatter_kws={"s": 8}, size=10)



In [373]:
g = sns.jointplot("np.log(price)", "predicted", data=data, kind="reg", scatter_kws={"s": 8}, size=10)



Next we bring in a large set of block-level neighborhood variables to merge with the sales records.


In [389]:
vars = pd.read_csv('data/ba_block_variables.csv', dtype={'block_id':object})

In [26]:
vars.shape


Out[26]:
(109228, 1015)

In [390]:
smallvars = vars[['block_id','block_groups_sum_persons','block_groups_sum_acres','block_groups_total_jobs',
                 'block_groups_median_children', 'block_groups_median_persons' , 'block_groups_median_income',
                 'block_groups_prop_tenure_2', 'nodes_low_income_hh_1500m', 'nodes_high_income_hh_1500m',
                 'nodes_jobs_5000m', 'nodes_jobs_30km', 'nodes_population_400m', 'nodes_population_800m',
                 'nodes_jobs_800m', 'prop_race_of_head_1', 'prop_race_of_head_2', 'pumas_density_households',
                 'nodes_jobs_3000m_agg1', 'nodes_jobs_3000m_agg2', 'pumas_density_jobs', 'nodes_jobs_40km',
                 'nodes_jobs_3000m_agg3', 'nodes_jobs_3000m_agg4', 'nodes_jobs_3000m_agg5',
                 'block_groups_prop_persons_1', 'block_groups_prop_persons_2', 'block_groups_median_age_of_head',
                 'puma10_id_is_0607501', 'puma10_id_is_0607502', 'puma10_id_is_0607503', 'puma10_id_is_0607504']]
sv = smallvars.rename(index=str, columns={'block_groups_sum_persons': 'bgpop',
                                     'block_groups_sum_acres': 'bgacres',
                                    'block_groups_total_jobs': 'bgjobs',
                                    'block_groups_median_children': 'bgmedkids',
                                    'block_groups_median_persons': 'bgmedhhs',
                                    'block_groups_median_income': 'bgmedinc',
                                    'block_groups_prop_tenure_2': 'proprent',
                                    'nodes_low_income_hh_1500m': 'lowinc1500m',
                                    'nodes_high_income_hh_1500m': 'highinc1500m',
                                    'nodes_jobs_5000m': 'lnjobs5000m',
                                    'nodes_jobs_30km': 'lnjobs30km',
                                    'nodes_jobs_40km': 'lnjobs40km',
                                    'nodes_population_400m': 'lnpop400m',
                                    'nodes_population_800m': 'lnpop800m',
                                    'nodes_jobs_800m': 'lnjobs800m',
                                    'prop_race_of_head_1': 'propwhite',
                                    'prop_race_of_head_2': 'propblack',
                                    'pumas_density_households': 'pumahhden',
                                    'pumas_density_jobs': 'pumajobden',
                                    'nodes_jobs_3000m_agg1': 'lnbasic3000m',
                                    'nodes_jobs_3000m_agg2': 'lntcpuw3000m',
                                    'nodes_jobs_3000m_agg3': 'lnret3000m',
                                    'nodes_jobs_3000m_agg4': 'lnfire3000m',
                                    'nodes_jobs_3000m_agg5': 'lnserv3000m',
                                    'block_groups_prop_persons_1': 'prop1per',
                                    'block_groups_prop_persons_2': 'prop2per',
                                    'block_groups_median_age_of_head': 'bgmedagehd',
                                    'puma10_id_is_0607501': 'puma1',
                                    'puma10_id_is_0607502': 'puma2',
                                    'puma10_id_is_0607503': 'puma3',
                                    'puma10_id_is_0607504': 'puma4'
                                         })
sv.head()


Out[390]:
block_id bgpop bgacres bgjobs bgmedkids bgmedhhs bgmedinc proprent lowinc1500m highinc1500m ... lnret3000m lnfire3000m lnserv3000m prop1per prop2per bgmedagehd puma1 puma2 puma3 puma4
0 060014271001000 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 5.991451 6.114498 ... 7.457814 5.329694 8.082779 0.34414 0.381546 57.0 0 0 0 0
1 060014271001001 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 5.991451 6.114498 ... 7.457814 5.329694 8.082779 0.34414 0.381546 57.0 0 0 0 0
2 060014271001002 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 5.991451 6.114498 ... 7.457814 5.329694 8.082779 0.34414 0.381546 57.0 0 0 0 0
3 060014271001003 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 6.231131 6.339032 ... 7.558539 5.431654 8.188618 0.34414 0.381546 57.0 0 0 0 0
4 060014271001004 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 6.107350 5.763724 ... 7.690577 5.521359 8.287781 0.34414 0.381546 57.0 0 0 0 0

5 rows × 32 columns


In [406]:
sv.to_csv('data/smallvars.csv')

In [409]:
sv = pd.read_csv('data/smallvars.csv', converters={'block_id': str})
sv.head()


Out[409]:
Unnamed: 0 block_id bgpop bgacres bgjobs bgmedkids bgmedhhs bgmedinc proprent lowinc1500m ... lnret3000m lnfire3000m lnserv3000m prop1per prop2per bgmedagehd puma1 puma2 puma3 puma4
0 0 060014271001000 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 5.991451 ... 7.457814 5.329694 8.082779 0.34414 0.381546 57.0 0 0 0 0
1 1 060014271001001 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 5.991451 ... 7.457814 5.329694 8.082779 0.34414 0.381546 57.0 0 0 0 0
2 2 060014271001002 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 5.991451 ... 7.457814 5.329694 8.082779 0.34414 0.381546 57.0 0 0 0 0
3 3 060014271001003 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 6.231131 ... 7.558539 5.431654 8.188618 0.34414 0.381546 57.0 0 0 0 0
4 4 060014271001004 888.0 90.106156 465.0 0.0 2.0 80000.0 0.558603 6.107350 ... 7.690577 5.521359 8.287781 0.34414 0.381546 57.0 0 0 0 0

5 rows × 33 columns

Recoded detailed race code:

  • 1. White alone
  • 2. Black or African American alone
  • 3. American Indian alone
  • 4. Alaska Native alone
  • 5. American Indian and Alaska Native tribes specified; or American Indian or Alaska Native, not specified and no other races
  • 6. Asian alone
  • 7. Native Hawaiian and Other Pacific Islander alone
  • 8. Some other race alone
  • 9. Two or more major race groups


In [382]:
# pass the FCC API lat/long and get FIPS data back - return block fips and county name
def get_fips(row):
    # 'pause' (seconds to wait between requests) is set in the next cell
    time.sleep(pause)
    url = 'http://data.fcc.gov/api/block/find?format=json&latitude={}&longitude={}'
    request = url.format(row['latitude'], row['longitude'])
    response = requests.get(request)
    data = response.json()

    # return multiple values as a series - this will create a dataframe with multiple columns
    return pd.Series({'fips_code':data['Block']['FIPS'], 'county':data['County']['name']})

In [383]:
%%time
pause = 0.1
fips = sf1.apply(get_fips, axis=1)
sf1 = pd.concat([sf1, fips], axis=1)
sf1.head()


CPU times: user 1.24 s, sys: 155 ms, total: 1.39 s
Wall time: 2min 6s

In [384]:
#sf1.to_csv('data/redfin_w_block.csv')

In [403]:
#sf = pd.read_csv('data/redfin_w_block.csv', usecols=['proptype', 'price', 'beds', 'baths', 'sqft', 'lotsize',\
#    'yrbuilt', 'hoamonth', 'fips_code'], converters={'fips_code': str})
#sf.head()

Now we can merge the sales records onto the block-level variables, joining the FIPS block code (from the FCC API) to block_id.


In [404]:
sf_mrg = pd.merge(sf1, sv, left_on='fips_code', right_on='block_id', how='inner')
sf_mrg.columns


Out[404]:
Index(['saletype', 'solddate', 'proptype', 'address', 'city', 'state', 'zip',
       'price', 'beds', 'baths', 'location', 'sqft', 'lotsize', 'yrbuilt',
       'daysonmkt', 'pricesqft', 'hoamonth', 'status', 'nextopenstart',
       'nextopenend', 'url', 'source', 'mls', 'favorite', 'interested',
       'latitude', 'longitude', 'county', 'fips_code', 'block_id', 'bgpop',
       'bgacres', 'bgjobs', 'bgmedkids', 'bgmedhhs', 'bgmedinc', 'proprent',
       'lowinc1500m', 'highinc1500m', 'lnjobs5000m', 'lnjobs30km', 'lnpop400m',
       'lnpop800m', 'lnjobs800m', 'propwhite', 'propblack', 'pumahhden',
       'lnbasic3000m', 'lntcpuw3000m', 'pumajobden', 'lnjobs40km',
       'lnret3000m', 'lnfire3000m', 'lnserv3000m', 'prop1per', 'prop2per',
       'bgmedagehd', 'puma1', 'puma2', 'puma3', 'puma4'],
      dtype='object')

Finally, let's estimate a hedonic regression that adds neighborhood context to the structural attributes: block group median income, the share of renters, and job accessibility within 5000 meters.


In [405]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940 + hoamonth + bgmedinc +\
                 proprent + lnjobs5000m', 
                 data=sf_mrg, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          np.log(price)   R-squared:                       0.835
Model:                            OLS   Adj. R-squared:                  0.825
Method:                 Least Squares   F-statistic:                     82.91
Date:                Sun, 19 Nov 2017   Prob (F-statistic):           1.63e-47
Time:                        23:54:13   Log-Likelihood:                 16.989
No. Observations:                 140   AIC:                            -15.98
Df Residuals:                     131   BIC:                             10.50
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==========================================================================================
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
Intercept                  7.2517      0.597     12.146      0.000       6.071       8.433
yrbuilt < 1940[T.True]     0.0818      0.048      1.719      0.088      -0.012       0.176
np.log(sqft)               0.7452      0.094      7.955      0.000       0.560       0.931
baths                      0.1112      0.047      2.352      0.020       0.018       0.205
beds                       0.0210      0.041      0.511      0.610      -0.060       0.102
hoamonth                8.965e-05   5.31e-05      1.687      0.094   -1.55e-05       0.000
bgmedinc                2.092e-06   7.81e-07      2.680      0.008    5.48e-07    3.64e-06
proprent                   0.0951      0.147      0.647      0.519      -0.195       0.386
lnjobs5000m                0.0783      0.025      3.149      0.002       0.029       0.127
==============================================================================
Omnibus:                       14.658   Durbin-Watson:                   1.976
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               19.169
Skew:                          -0.613   Prob(JB):                     6.88e-05
Kurtosis:                       4.336   Cond. No.                     3.14e+06
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.14e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
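The large condition number mostly reflects the very different scales of the regressors (e.g., bgmedinc in dollars versus log sqft). A minimal diagnostic sketch: standardize the non-intercept columns and refit, which leaves the slopes' t-statistics unchanged but shrinks the condition number.


In [ ]:
# standardize each regressor (except the intercept) to mean 0, sd 1
X_std = X.copy()
for col in X_std.columns:
    if col != 'Intercept':
        X_std[col] = (X_std[col] - X_std[col].mean()) / X_std[col].std()

res_std = sm.OLS(y, X_std).fit()
print(res_std.condition_number)   # far smaller than 3.14e+06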
