This material uses Python to demonstrate statistical models for predicting continuous or categorical outcomes.
If the dependent variable (the variable we are trying to explain or predict) is continuous (has a large range of values, like the price of housing), then we use multiple regression.
If the dependent variable is categorical (or represents a choice outcome) with two values (rent or own, for example), then a binary logit model, or logistic regression, is usually the preferred model.
If the dependent variable is categorical (or represents a choice outcome) with a small number of values (like travel modes), then the most common model form is the Multinomial Logit (MNL) model.
In today's session we introduce Python libraries that enable estimating models of each of these types. Theory is only briefly reviewed, since these methods deserve a full course (or more) of their own. The material is mainly meant to help students who have already been exposed to these methods on another platform such as Stata, R, SAS, or SPSS, and who want to use Python for their statistical model building.
Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). It takes the following form:
$Y = \beta_0 + \beta_1x + \epsilon$, where $\epsilon\sim N\left(0,\sigma^{2}\right)$
Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To create the model, we must "learn" or "estimate" the values of these coefficients. Once we've learned them, we can use the model to predict values of $y$ for new values of $x$.
Generally speaking, coefficients are estimated using the least squares criterion, which means we find the line that minimizes the sum of squared residuals (also called the "sum of squared errors").
How do the model coefficients relate to the least squares line? $\beta_0$ is the intercept (the predicted value of $y$ when $x = 0$) and $\beta_1$ is the slope (the change in $y$ associated with a one-unit change in $x$).
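To make the least squares criterion concrete, here is the textbook result for the simple one-feature case (standard derivation, not anything specific to Statsmodels). The estimates minimize the sum of squared residuals,
$\min_{\beta_0,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$
and the minimizing values have a closed form:
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where $\bar{x}$ and $\bar{y}$ are the sample means. Statsmodels computes these estimates (and their multiple-regression generalization) for us below.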
In [262]:
# Startup steps
import pandas as pd, numpy as np, statsmodels.api as sm
import matplotlib.pyplot as plt, matplotlib.cm as cm, matplotlib.font_manager as fm
import matplotlib.mlab as mlab
import time, requests
from scipy.stats import pearsonr, ttest_rel
import seaborn as sns
sns.set()
%matplotlib inline
We begin by generating synthetic data from an equation whose parameters we supply. This lets us verify that the model estimation code correctly 'learns' the parameters before we use it on real data.
Below we generate 300 values of $Y$ from the equation $Y = \beta_0 + \beta_1 x + \epsilon$.
We set $\beta_0 = 0$ and $\beta_1 = 2$, and draw values of $\epsilon$ from a normal distribution with mean 0 and standard deviation 2.
In [312]:
nsample = 300
x = np.linspace(0, 10, nsample)
beta = np.array([0, 2])
e = np.random.normal(size=nsample)*2
X = sm.add_constant(x)
y = np.dot(X, beta) + e
Plot the data along with the true line $y = 2x$. Note that the intercept is set to zero in this initial example.
In [313]:
plt.figure(1, figsize=(10,8), )
plt.plot([0, 10], [0, 20])
plt.scatter(x, y, marker=0, s=10, c='g')
plt.axis([0, 10, 0, 20])
plt.show();
Now we 'fit' the model to the data, which means we compute the values of $\beta_0$ and $\beta_1$ that minimize the sum of squared errors. We use Statsmodels for fitting the model. There are two different syntax styles for this; the first uses an X matrix and a y array, like this:
In [314]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
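The second syntax style is the R-style formula interface in statsmodels.formula.api (it uses patsy under the hood, which we return to later with dmatrices). A minimal sketch, assuming the same synthetic x and y arrays generated above; the DataFrame df is created here only for illustration:

import pandas as pd
import statsmodels.formula.api as smf

# put the synthetic data into a DataFrame so the formula can refer to columns by name
df = pd.DataFrame({'x': x, 'y': y})

# the formula interface adds the intercept automatically
formula_results = smf.ols('y ~ x', data=df).fit()
print(formula_results.summary())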
We can extract any of the results we want from the fitted model, since it is a Python object with attributes we can interrogate.
In [315]:
print('Parameters: ', results.params)
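A few other attributes of the fitted results object are often useful; these attribute names come from the statsmodels OLS results API:

print('Std. errors: ', results.bse)      # standard errors of the coefficient estimates
print('t-values:    ', results.tvalues)  # coefficient estimates divided by their standard errors
print('N obs:       ', results.nobs)     # number of observations used in the fit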
Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: If the population from which this sample was drawn was sampled 100 times, approximately 95 of those confidence intervals would contain the "true" coefficient.
To get the 95% confidence interval around $\beta_0$ and $\beta_1$ we can do this:
In [316]:
results.conf_int()
Out[316]:
Closely related to confidence intervals is hypothesis testing. Generally speaking, you start with a null hypothesis and an alternative hypothesis (that is opposite the null). Then, you check whether the data supports rejecting the null hypothesis or failing to reject the null hypothesis.
(Note that "failing to reject" the null is not the same as "accepting" the null hypothesis. The alternative hypothesis may indeed be true, except that you just don't have enough data to show that.)
As it relates to model coefficients, the conventional hypothesis test is: the null hypothesis is that there is no relationship between $x$ and $y$ (i.e., $\beta_1 = 0$), and the alternative hypothesis is that there is a relationship (i.e., $\beta_1 \neq 0$).
How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Relatedly, the p-value is the probability of observing a coefficient estimate at least this far from zero if the true coefficient were actually zero:
In [317]:
results.pvalues
Out[317]:
If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the 95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again, using 0.05 as the cutoff is just a convention.)
In this case, the p-value for $x$ is far less than 0.05, and so we believe that there is a relationship between $x$ and $y$.
Note that we generally ignore the p-value for the intercept.
The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model.
In [318]:
print('R2: ', results.rsquared)
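As a sanity check on the 'reduction in error over the null model' interpretation, here is a minimal sketch that recomputes R-squared by hand; it should match results.rsquared up to floating-point error:

ss_res = np.sum(results.resid ** 2)     # residual sum of squares for the fitted model
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares around the mean (the null model)
print('Manual R2: ', 1 - ss_res / ss_tot)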
We can also inspect the residuals (the errors), computed by comparing the predicted values of $y$ to the observed ones. The mean of the residuals should be essentially zero, and they should be approximately normally distributed.
In [319]:
results.resid.mean()
Out[319]:
In [320]:
from scipy.stats import norm
plt.rcParams['figure.figsize']=8,8
sns.set_style("white")
sns.set_style("ticks")
ax = sns.distplot(results.resid, fit=norm, kde=False)
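Beyond eyeballing the histogram, we can run a formal normality check on the residuals; here is a sketch using scipy's D'Agostino-Pearson test (the Omnibus and Jarque-Bera statistics in the regression summary serve the same purpose):

from scipy.stats import normaltest

# null hypothesis: the residuals are drawn from a normal distribution
stat, pvalue = normaltest(results.resid)
print('normaltest statistic: {:.3f}, p-value: {:.3f}'.format(stat, pvalue))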
Simple linear regression can easily be extended to include multiple features. This is called multiple linear regression:
$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n + \epsilon$
Let's add an additional x variable to our synthetic model, using x squared, and fit it again with Statsmodels.
We generate new values of $y$ using $\beta_0 = 1$, $\beta_1 = 2$, and $\beta_2 = 0.5$, with $\epsilon$ again drawn from a normal distribution with standard deviation 2.
In [323]:
nsample = 300
x = np.linspace(0, 10, 300)
X = np.column_stack((x, x**2))
beta = np.array([1, 2, .5])
e = np.random.normal(size=nsample)*2
In [324]:
X = sm.add_constant(X)
y = np.dot(X, beta) + e
In [325]:
X[:10]
Out[325]:
In [326]:
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
In [327]:
ax = sns.regplot(x=X[:,1], y=y, scatter_kws={"s": 10}, order=2, ci=None, truncate=True)
In [328]:
results.resid.mean()
Out[328]:
In [331]:
from scipy.stats import norm
plt.rcParams['figure.figsize']=8,8
sns.set_style("white")
sns.set_style("ticks")
ax = sns.distplot(results.resid, fit=norm, kde=False)
R-squared will always increase as you add more features to the model, even if they are unrelated to the response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.
There is an alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
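Statsmodels exposes both measures on the fitted results object, so they are easy to compare; a quick sketch using the polynomial model fitted above:

print('R-squared:          ', results.rsquared)
print('Adjusted R-squared: ', results.rsquared_adj)

Next we move from synthetic data to real data: a Redfin extract covering one month of home sales in San Francisco County.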
In [332]:
sf = pd.read_csv('data/redfin_2017-03-05-17-45-34-san-francisco-county-1-month.csv')
sf.columns
Out[332]:
In [333]:
sf1 = sf.rename(index=str, columns={'SALE TYPE': 'saletype',
'SOLD DATE': 'solddate', 'PROPERTY TYPE': 'proptype', 'ADDRESS': 'address',
'CITY': 'city', 'STATE': 'state', 'ZIP': 'zip', 'PRICE': 'price', 'BEDS': 'beds',
'BATHS': 'baths', 'LOCATION': 'location', 'SQUARE FEET': 'sqft', 'LOT SIZE': 'lotsize',
'YEAR BUILT': 'yrbuilt', 'DAYS ON MARKET': 'daysonmkt', '$/SQUARE FEET': 'pricesqft',
'LATITUDE': 'latitude', 'LONGITUDE': 'longitude', 'HOA/MONTH': 'hoamonth',
'URL (SEE http://www.redfin.com/buy-a-home/comparative-market-analysis FOR INFO ON PRICING)': 'url',
'STATUS': 'status', 'NEXT OPEN HOUSE START TIME': 'nextopenstart', 'NEXT OPEN HOUSE END TIME': 'nextopenend',
'SOURCE': 'source', 'MLS#': 'mls', 'FAVORITE': 'favorite', 'INTERESTED': 'interested'
})
sf1.head()
Out[333]:
In [355]:
sf1.columns
Out[355]:
In [334]:
sf1.describe()
Out[334]:
In [336]:
g = sns.jointplot("sqft", "price", data=sf1, kind="reg", scatter_kws={"s": 10}, size=10)
Let's fit a simple linear regression of price on sqft, since there is clearly a strong relationship between them.
In [337]:
import statsmodels.api as sm
import numpy as np
from patsy import dmatrices
y, X = dmatrices('price ~ sqft',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
How would you intepret the R-squared value?
How can you interpret the coefficient on sqft?
A common thing to explore is whether a transformation of the dependent and/or independent variables improves the degree to which the relationship is linear, and the fit of the model. Most models of housing prices, or 'hedonic price models', are specified with the log of price, and often with the log of continuous variables like sqft. Let's look at the relationship once we log-transform both:
In [338]:
g = sns.jointplot(np.log(sf1['sqft']), np.log(sf1['price']), kind="reg", scatter_kws={"s": 8}, size=10)
And now let's re-estimate the model using the log-log transformation of price and sqft.
In [339]:
y, X = dmatrices('np.log(price) ~ np.log(sqft)',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
The model fit improves considerably. How do you interpret the coefficient for sqft now?
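One way to see it: in a log-log specification the slope is (approximately) an elasticity, since
$\beta_1 = \dfrac{d\,\ln(\text{price})}{d\,\ln(\text{sqft})} = \dfrac{d\,\text{price}/\text{price}}{d\,\text{sqft}/\text{sqft}}$
so a 1% increase in sqft is associated with roughly a $\beta_1$% increase in price.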
Next let's add the number of baths and see how that changes the model.
In [340]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Let's add bedrooms now and see how that changes the model...
In [348]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Next, let's add an indicator for homes built before 1940:
In [344]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Now let's also add the monthly homeowners association fee (hoamonth):
In [347]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940 + hoamonth',
data=sf1, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
Let's look at some diagnostics: a scatterplot matrix of the main numeric variables, the distribution of the residuals, and plots of the predictions and residuals against the observed values.
In [366]:
keepcols = [ 'price', 'beds', 'baths', 'sqft', 'lotsize', 'yrbuilt']
sf1_small=sf1[keepcols]
sf1_small.head()
Out[366]:
In [369]:
g = sns.PairGrid(sf1_small)
g.map(plt.scatter);
In [370]:
data = observed.join(predicted.to_frame(name='predicted'))
data.head()
Out[370]:
In [371]:
plt.rcParams['figure.figsize']=8,8
sns.set_style("white")
sns.set_style("ticks")
ax = sns.distplot(residuals, fit=norm, kde=False)
In [372]:
g = sns.jointplot(predicted, residuals, kind="reg", scatter_kws={"s": 8}, size=10)
In [373]:
g = sns.jointplot("np.log(price)", "predicted", data=data, kind="reg", scatter_kws={"s": 8}, size=10)
Now let's bring in some neighborhood context: block-level accessibility and demographic variables for the Bay Area, which we will merge onto the sales records using each sale's census block FIPS code.
In [389]:
vars = pd.read_csv('data/ba_block_variables.csv', dtype={'block_id':object})
vars.shape
In [390]:
smallvars = vars[['block_id','block_groups_sum_persons','block_groups_sum_acres','block_groups_total_jobs',
'block_groups_median_children', 'block_groups_median_persons' , 'block_groups_median_income',
'block_groups_prop_tenure_2', 'nodes_low_income_hh_1500m', 'nodes_high_income_hh_1500m',
'nodes_jobs_5000m', 'nodes_jobs_30km', 'nodes_population_400m', 'nodes_population_800m',
'nodes_jobs_800m', 'prop_race_of_head_1', 'prop_race_of_head_2', 'pumas_density_households',
'nodes_jobs_3000m_agg1', 'nodes_jobs_3000m_agg2', 'pumas_density_jobs', 'nodes_jobs_40km',
'nodes_jobs_3000m_agg3', 'nodes_jobs_3000m_agg4', 'nodes_jobs_3000m_agg5',
'block_groups_prop_persons_1', 'block_groups_prop_persons_2', 'block_groups_median_age_of_head',
'puma10_id_is_0607501', 'puma10_id_is_0607502', 'puma10_id_is_0607503', 'puma10_id_is_0607504']]
sv = smallvars.rename(index=str, columns={'block_groups_sum_persons': 'bgpop',
'block_groups_sum_acres': 'bgacres',
'block_groups_total_jobs': 'bgjobs',
'block_groups_median_children': 'bgmedkids',
'block_groups_median_persons': 'bgmedhhs',
'block_groups_median_income': 'bgmedinc',
'block_groups_prop_tenure_2': 'proprent',
'nodes_low_income_hh_1500m': 'lowinc1500m',
'nodes_high_income_hh_1500m': 'highinc1500m',
'nodes_jobs_5000m': 'lnjobs5000m',
'nodes_jobs_30km': 'lnjobs30km',
'nodes_jobs_40km': 'lnjobs40km',
'nodes_population_400m': 'lnpop400m',
'nodes_population_800m': 'lnpop800m',
'nodes_jobs_800m': 'lnjobs800m',
'prop_race_of_head_1': 'propwhite',
'prop_race_of_head_2': 'propblack',
'pumas_density_households': 'pumahhden',
'pumas_density_jobs': 'pumajobden',
'nodes_jobs_3000m_agg1': 'lnbasic3000m',
'nodes_jobs_3000m_agg2': 'lntcpuw3000m',
'nodes_jobs_3000m_agg3': 'lnret3000m',
'nodes_jobs_3000m_agg4': 'lnfire3000m',
'nodes_jobs_3000m_agg5': 'lnserv3000m',
'block_groups_prop_persons_1': 'prop1per',
'block_groups_prop_persons_2': 'prop2per',
'block_groups_median_age_of_head': 'bgmedagehd',
'puma10_id_is_0607501': 'puma1',
'puma10_id_is_0607502': 'puma2',
'puma10_id_is_0607503': 'puma3',
'puma10_id_is_0607504': 'puma4'
})
sv.head()
Out[390]:
Save the reduced set of variables to a csv so it can be reloaded directly, and read it back with block_id kept as a string:
In [406]:
sv.to_csv('data/smallvars.csv')
In [409]:
sv = pd.read_csv('data/smallvars.csv', converters={'block_id': str})
sv.head()
Out[409]:
Recoded detailed race code (used for the race_of_head variables):
1. White alone
2. Black or African American alone
3. American Indian alone
4. Alaska Native alone
5. American Indian and Alaska Native tribes specified; or American Indian or Alaska Native, not specified and no other races
6. Asian alone
7. Native Hawaiian and Other Pacific Islander alone
8. Some other race alone
9. Two or more major race groups
To attach these block-level variables to each sale, we first need the sale's census block: the FCC Block API returns the FIPS block code for a latitude/longitude pair.
In [382]:
# pass the FCC API lat/long and get FIPS data back - return block fips and county name
def get_fips(row):
    time.sleep(pause)
    url = 'http://data.fcc.gov/api/block/find?format=json&latitude={}&longitude={}'
    request = url.format(row['latitude'], row['longitude'])
    response = requests.get(request)
    data = response.json()
    # return multiple values as a series - this will create a dataframe with multiple columns
    return pd.Series({'fips_code':data['Block']['FIPS'], 'county':data['County']['name']})
In [383]:
%%time
pause = 0.1
fips = sf1.apply(get_fips, axis=1)
sf1 = pd.concat([sf1, fips], axis=1)
sf1.head()
In [384]:
#sf1.to_csv('data/redfin_w_block.csv')
In [403]:
#sf = pd.read_csv('data/redfin_w_block.csv', usecols=['proptype', 'price', 'beds', 'baths', 'sqft', 'lotsize',\
# 'yrbuilt', 'hoamonth', 'fips_code'], converters={'fips_code': str})
#sf.head()
Now merge the sales records (with their block FIPS codes) onto the block-level variables:
In [404]:
sf_mrg = pd.merge(sf1, sv, left_on='fips_code', right_on='block_id', how='inner')
sf_mrg.columns
Out[404]:
Finally, re-estimate the hedonic model with some of the neighborhood variables added: block group median income (bgmedinc), the share of renters (proprent), and job accessibility within 5000 meters (lnjobs5000m):
In [405]:
y, X = dmatrices('np.log(price) ~ np.log(sqft) + baths + beds + yrbuilt<1940 + hoamonth + bgmedinc +\
proprent + lnjobs5000m',
data=sf_mrg, return_type='dataframe')
mod = sm.OLS(y, X)
res = mod.fit()
residuals = res.resid
predicted = res.fittedvalues
observed = y
print(res.summary())
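A quick way to scan the expanded model for coefficients that are statistically distinguishable from zero at the conventional 5% level (a sketch; res.params and res.pvalues are pandas Series when the design matrices come from a formula, so boolean indexing works):

# coefficients whose p-values fall below the conventional 0.05 cutoff
significant = res.params[res.pvalues < 0.05]
print(significant)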