ADS PublicSpace-Citibike Project
a. Test whether the data follow a normal distribution using a Q-Q (quantile-quantile) plot
b. Normalize the data ((x - mean) / std); split both the original and normalized datasets into training and testing sets
a. Bivariate regression on each regressor; evaluate R-squared
b. Multivariate linear regression on all regressors; evaluate coefficients using p-values, R-squared, and the residual sum of squares
c. Ridge regression (4-fold CV); evaluate coefficients using p-values, R-squared, and the residual sum of squares
d. Lasso regression (4-fold CV); evaluate coefficients using p-values, R-squared, and the residual sum of squares
a*. OLS regression model with polynomial regressors (x)
b*. OLS regression model with transformed y
In [2]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy import stats
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
master = pd.read_csv('../data/processed/master.csv')
In [4]:
master.head()
Out[4]:
In [5]:
master.bike_lane_score.unique()
Out[5]:
In [6]:
master.shape
Out[6]:
In [33]:
# Create a new data frame with station_id, monthly ridership for June-October 2015, and all study features
df0 = master[['station_id','ridership_0615','ridership_0715','ridership_0815','ridership_0915','ridership_1015',\
'bike_lane_score', 'park', 'street_quality_score', 'subway_entrance', 'tree_score', 'traffic_volume',\
'median_hh_income', 'pop_density']]
In [34]:
df0.head(3)
Out[34]:
In [35]:
df0.subway_entrance.unique()
Out[35]:
A Q-Q (quantile-quantile) plot compares the quantiles of x against the quantiles/ppf of a specified theoretical distribution (the normal distribution by default). scipy's probplot generates this probability plot of the sample data and optionally calculates a best-fit line for it.
--- R-squared of the data being normally distributed ---
'traffic_volume': R_2 = 0.71
'median_hh_income': R_2 = 0.81
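The R-squared values above come from probplot's best-fit line; a minimal sketch (assuming the df0 columns above) of how to obtain them programmatically:
In [ ]:
# Sketch (not in the original notebook): with fit=True (the default),
# stats.probplot returns (osm, osr), (slope, intercept, r), where r is the
# correlation between the ordered sample and the theoretical normal
# quantiles; r**2 is the R_2 reported above.
(osm, osr), (slope, intercept, r) = stats.probplot(df0['traffic_volume'].dropna(), dist="norm")
print("R_2 = {:.2f}".format(r**2))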
In [10]:
#sm.qqplot(df0.iloc[:, 6], fit=True, line='q', ax=None)  # column 6 is bike_lane_score
In [36]:
measurements = df0.iloc[:, 13]  # column 13 is pop_density
stats.probplot(measurements, dist="norm", plot=plt)
Out[36]:
In [31]:
#df0 = df0.loc[:, ~df0.columns.isin(['park', 'subway_entrance'])]
#df0.head(3)
In [47]:
# Normalize the numeric (continuous-valued) columns
# using z-score standardization: (X - mean) / std
df0_n = df0.copy()
In [48]:
def normalize1(df1, df0, col):
    col_mean = df0[col].mean()
    col_std = df0[col].std()
    df1[col] = df0[col].apply(lambda c: (c - col_mean) / col_std)
    return col_mean, col_std
In [49]:
# Create a dictionary that stores the mean and std of each selected column's original values
col_n = ['ridership_0615', 'ridership_0715','ridership_0815', 'ridership_0915','ridership_1015',\
'bike_lane_score', 'street_quality_score', 'tree_score', 'traffic_volume', 'median_hh_income', 'pop_density']
col_meanstd = {}
for c in col_n:
    vmean, vstd = normalize1(df0_n, df0, c)
    col_meanstd[c] = np.array([vmean, vstd])
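For reference, the same z-score standardization could be done with scikit-learn's StandardScaler; a sketch, not part of the original pipeline (note that StandardScaler uses the population std, ddof=0, while pandas .std() uses ddof=1, so the values differ slightly):
In [ ]:
# Sketch (not used below): standardize the same columns with sklearn.
# scaler.mean_ and scaler.scale_ then play the role of col_meanstd.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df0_clean = df0.dropna(subset=col_n)  # StandardScaler does not accept NaNs
df0_sk = df0_clean.copy()
df0_sk[col_n] = scaler.fit_transform(df0_clean[col_n])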
In [50]:
df0_n.head()
Out[50]:
In [51]:
col_meanstd = pd.DataFrame(col_meanstd.items(), columns=['column', 'mean_std'])
In [52]:
col_meanstd
Out[52]:
In [55]:
# Use the 2015/09 ridership and the original factor data to run the regression
lm = smf.ols(formula = 'ridership_0915 ~ \
bike_lane_score + park + street_quality_score + subway_entrance + tree_score + traffic_volume + \
median_hh_income + pop_density', \
data = df0).fit()
In [56]:
lm.summary()
Out[56]:
In [66]:
# Drop NaNs -- normalized data
df0_nV = df0_n.dropna(subset=['ridership_0915', 'bike_lane_score', 'park', 'street_quality_score', 'subway_entrance', 'tree_score', 'traffic_volume',\
'median_hh_income', 'pop_density'])
df0_nV.shape
Out[66]:
In [67]:
# Drop NaNs -- original data
df0_V = df0.dropna(subset=['ridership_0915', 'bike_lane_score', 'park', 'street_quality_score', 'subway_entrance', 'tree_score', 'traffic_volume',\
'median_hh_income', 'pop_density'])
df0_V.shape
Out[67]:
In [104]:
######## Set training and testing sets on normalized data
X0n = df0_nV[['bike_lane_score', 'park', 'street_quality_score', 'subway_entrance', 'tree_score', 'traffic_volume',\
'median_hh_income', 'pop_density']]
y0n = df0_nV[['ridership_0915']]
In [105]:
X0n.shape
Out[105]:
In [106]:
# for standardized data: ~70/30 split (yields 300 training and 126 test rows)
X0n_train, X0n_test, y0n_train, y0n_test = train_test_split(X0n, y0n, test_size = 0.295, random_state = 222)
X0n_train.shape, X0n_test.shape, y0n_train.shape, y0n_test.shape
Out[106]:
In [107]:
# Stack X0n and y0n for convenience
Xyn_train = pd.concat([y0n_train, X0n_train], axis=1)
Xyn_test = pd.concat([y0n_test, X0n_test], axis=1)
Xyn_test.head(3)
Out[107]:
In [108]:
######## Set training and testing sets on original data
X0 = df0_V[['bike_lane_score', 'park', 'street_quality_score', 'subway_entrance', 'tree_score', 'traffic_volume',\
'median_hh_income', 'pop_density']]
y0 = df0_V[['ridership_0915']]
In [109]:
# for original data
X0_train, X0_test, y0_train, y0_test = train_test_split(X0, y0, test_size = 0.295, random_state = 222)
# Check the shapes of each
X0_train.shape, X0_test.shape, y0_train.shape, y0_test.shape
Out[109]:
In [110]:
# Stack X0 and y0 for convenience
Xy_train = pd.concat([y0_train, X0_train], axis=1)
Xy_test = pd.concat([y0_test, X0_test], axis=1)
Xy_train.head(3)
Out[110]:
In [141]:
publicspace_factor = ['bike_lane_score', 'park', 'street_quality_score', 'subway_entrance', 'tree_score', 'traffic_volume',\
'median_hh_income', 'pop_density']
dic_lm = {}   # for creating the summary table
dic_lm0 = {}  # for storing the fitted models
for i in range(0, 8):
    lm_ = smf.ols(formula = 'ridership_0915 ~ %s' % (publicspace_factor[i]), data = Xy_train).fit()
    dic_lm[publicspace_factor[i]] = [lm_.params[1], lm_.pvalues.values[1], \
                                     lm_.params[0], lm_.pvalues.values[0], \
                                     lm_.rsquared]
    dic_lm0[i] = lm_
df_lm = pd.DataFrame.from_dict(dic_lm)
df_lm.index = (['coefficient','coeff_p-value','intercept', 'intcp_p-value','r-squared'])
In [142]:
df_lm
Out[142]:
In [138]:
# Save the table as a csv file
#df_lm.to_csv('../reports/figures/ols-bivar-regre-table.csv')
In [152]:
# To better visualize how well the fitted lines match the data points, use the normalized data
publicspace_factor = ['bike_lane_score', 'park', 'street_quality_score', 'subway_entrance', 'tree_score', 'traffic_volume',\
'median_hh_income', 'pop_density']
dic_lm0_n = {}  # for storing the fitted models
for i in range(0, 8):
    lm_ = smf.ols(formula = 'ridership_0915 ~ %s' % (publicspace_factor[i]), data = Xyn_train).fit()
    dic_lm0_n[i] = lm_  # store the fitted model
In [280]:
# Plot the data points and fitted line for each factor (using normalized data)
fig = plt.figure(figsize=(12,20))
for i in range(0,8):
    plt.subplot(421+i)
    xfit = np.linspace(-2, 4, 200)
    yfit = xfit * dic_lm0_n[i].params[1] + dic_lm0_n[i].params[0]
    plt.plot(xfit, yfit, 'r',linewidth=2)
    plt.plot(Xyn_train[publicspace_factor[i]], Xyn_train.ridership_0915, '.')
    plt.title('Fig2a-{}: In-Sample points and OLS fitted Line for\n {}'.format(i+1, publicspace_factor[i]), fontsize=15)
    plt.xlabel('{}'.format(publicspace_factor[i]), fontsize=12)
    plt.ylabel('Normalized Ridership (2015/09)', fontsize=12)
plt.tight_layout(pad=3)
In [281]:
fig.savefig('../reports/figures/3.1-kz-OLS Bi-variate Regressions Figures.png')
In [276]:
# Run OLS multivariate regression on all eight factors (original training data)
lm0 = smf.ols(formula = 'ridership_0915 ~ \
bike_lane_score + park + street_quality_score + \
subway_entrance + tree_score + traffic_volume + \
median_hh_income + pop_density', \
data = Xy_train).fit()
In [277]:
lm0.summary()
Out[277]:
In [164]:
# Compute in-sample and out-of-sample RSS and R^2
y0_pred_IS = lm0.predict(X0_train)
err_IS = (y0_pred_IS - np.asarray(y0_train).T).T
RSS_IS = sum(err_IS**2)
R_2_IS = 1 - np.var(err_IS)/np.var(y0_train)
print("The in-sample residual sum of squares for OLS regression is: {0}\nThe in-sample R-squared is: {1}".format(RSS_IS, R_2_IS))
y0_pred_OS = lm0.predict(X0_test)  # compute predictions for the test set
err_OS = (y0_pred_OS - np.asarray(y0_test).T).T
RSS_OS = sum(err_OS**2)
R_2_OS = 1 - np.var(err_OS)/np.var(y0_test)
print("\nThe out-of-sample residual sum of squares for OLS regression is: {0}\nThe out-of-sample R-squared is: {1}".format(RSS_OS, R_2_OS))
In [251]:
# Plot residuals
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121, title='Fig2b-1: OLS Model In-Sample Residuals',\
                      xlabel='Citibike Station', ylabel='Ridership Residual')
ax1.plot(err_IS, '.')
ax1.plot([0,300], [0,0], 'r',linewidth=3)
ax2 = fig.add_subplot(122, title='Fig2b-2: OLS Model Out-of-Sample Residuals',\
                      xlabel='Citibike Station', ylabel='Ridership Residual')
ax2.plot(err_OS, '.')
ax2.plot([0,150], [0,0], 'r',linewidth=3)
fig.tight_layout(pad= 2)
Figure 2b-1, 2b-2: The residuals in both graphs are not evenly distributed on either side of the zero line. This means that our multivariate OLS model does not fit the data as well as we expected. It might be due to undetected outliers, or the correlation between ridership and the public space factors may simply be weak (a quick numerical check follows the save below).
In [252]:
fig.savefig('../reports/figures/3.1-kz-OLS-Multiv Regression IS & OS Residue Figures.png')
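To corroborate the visual impression, the in-sample residuals can be checked for normality directly; a minimal sketch (not in the original notebook), reusing probplot from above:
In [ ]:
# Sketch: Q-Q plot of the in-sample residuals against the normal distribution.
plt.figure(figsize=(5,4))
stats.probplot(np.asarray(err_IS).ravel(), dist="norm", plot=plt)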
In [262]:
# 4-fold cross-validation over the candidate lambdas below
# (note: normalize=True was removed in scikit-learn 1.2; see the pipeline sketch below)
alphas = [0.1, 1.0, 2.0, 5.0, 10.0, 20.0, 30.0, 50, 80, 1e2, 2*1e2, 5*1e2, 8*1e2, 1e3]
Ridge = linear_model.RidgeCV(alphas=alphas, fit_intercept=True, normalize=True, cv=4, store_cv_values=False)
lm_r = Ridge.fit(X0_train, y0_train)
In [263]:
#linear_model.RidgeCV?
In [264]:
print ('Selected Lambda by 4-fold cross validation:\n {0} \
\n\nRidge Model intercept: {1} \
\nRidge Model coefficients: {2}'
.format(lm_r.alpha_, lm_r.intercept_, lm_r.coef_))
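On scikit-learn >= 1.2, where normalize=True no longer exists, the closest replacement is standardizing inside a pipeline; a sketch (not exactly equivalent, since normalize=True divided by the l2-norm rather than the standard deviation):
In [ ]:
# Sketch for newer scikit-learn: scale the features, then run RidgeCV.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
ridge_pipe = make_pipeline(StandardScaler(), linear_model.RidgeCV(alphas=alphas, cv=4))
ridge_pipe.fit(X0_train, np.asarray(y0_train).ravel())
print(ridge_pipe[-1].alpha_)  # the selected lambda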
In [208]:
# Compute in-sample and out-of-sample RSS and R^2
y0_pred_IS_r = lm_r.predict(X0_train)
err_IS_r = y0_pred_IS_r - y0_train
RSS_IS_r = sum(err_IS_r**2)
R_2_IS_r = 1 - np.var(err_IS_r)/np.var(y0_train)
print("The residual sum of squares for IS Ridge is: {0}\nThe R-squared for IS Ridge is: {1}".format(RSS_IS_r.values, R_2_IS_r.values))
y0_pred_OS_r = lm_r.predict(X0_test)  # compute predictions for the test set
err_OS_r = y0_pred_OS_r - y0_test
RSS_OS_r = sum(err_OS_r**2)
R_2_OS_r = 1 - np.var(err_OS_r)/np.var(y0_test)
print("\nThe residual sum of squares for OS Ridge is: {0}\nThe R-squared for OS Ridge is: {1}".format(RSS_OS_r.values, R_2_OS_r.values))
In [257]:
# Plot residuals
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121, title='Fig2c-1: Ridge Model In-Sample Residuals',\
                      xlabel='Citibike Station', ylabel='Ridership Residual')
ax1.plot(err_IS_r, '.')
ax1.plot([0,450], [0,0], 'r',linewidth=3)
ax2 = fig.add_subplot(122, title='Fig2c-2: Ridge Model Out-of-Sample Residuals',\
                      xlabel='Citibike Station', ylabel='Ridership Residual')
ax2.plot(err_OS_r, '.')
ax2.plot([0,450], [0,0], 'r',linewidth=3)
fig.tight_layout(pad=3)
Figure 2c-1, 2c-2: The residuals in both graphs are not evenly distributed on either side of the zero line. This means that our Ridge model does not fit the data as well as we expected. It might be due to undetected outliers, or the correlation between ridership and the public space factors may simply be weak.
In [258]:
fig.savefig('../reports/figures/3.1-kz-Ridge Regression IS & OS Residue Figures.png')
In [198]:
# 4-fold cross-validation; with alphas=None, LassoCV chooses 20 candidate
# lambdas itself. positive=True constrains all coefficients to be non-negative.
LASSO = linear_model.LassoCV(n_alphas=20, alphas=None, \
fit_intercept=True, normalize=True, cv=4, positive=True, random_state=222, selection='random')
lm_l = LASSO.fit(X0_train, np.asarray(y0_train).ravel())
In [199]:
lm_l.alphas_
Out[199]:
In [200]:
lm_l.mse_path_
Out[200]:
In [201]:
print ('Selected Lambda by 4-fold cross validation:\n {0} \
\n\nLasso Model intercept: {1} \
\nLasso Model coefficients: {2}'
.format(lm_l.alpha_, lm_l.intercept_, lm_l.coef_))
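To see which lambda the 4-fold CV favored, the stored error path can be plotted; a sketch (not in the original notebook):
In [ ]:
# Sketch: mean 4-fold CV MSE versus each candidate lambda.
# lm_l.mse_path_ has shape (n_alphas, n_folds).
plt.semilogx(lm_l.alphas_, lm_l.mse_path_.mean(axis=1), '.-')
plt.axvline(lm_l.alpha_, color='r', linestyle='--', label='selected lambda')
plt.xlabel('lambda')
plt.ylabel('mean CV MSE')
plt.legend()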
In [279]:
# Compute in-sample and out-of-sample RSS and R^2
y0_pred_IS_l = lm_l.predict(X0_train)
y0_pred_IS_l = y0_pred_IS_l.reshape(-1, 1)  # reshape from (300,) to (300,1) to align with y0_train
err_IS_l = y0_pred_IS_l - y0_train
RSS_IS_l = (err_IS_l**2).sum()
R_2_IS_l = 1 - np.var(err_IS_l)/np.var(y0_train)
print("The residual sum of squares for IS Lasso is: {0}\nThe R-squared for IS Lasso is: {1}".format(RSS_IS_l.values, R_2_IS_l.values))
y0_pred_OS_l = lm_l.predict(X0_test)  # compute predictions for the test set
y0_pred_OS_l = y0_pred_OS_l.reshape(-1, 1)  # reshape from (126,) to (126,1)
err_OS_l = y0_pred_OS_l - y0_test
RSS_OS_l = (err_OS_l**2).sum()
R_2_OS_l = 1 - np.var(err_OS_l)/np.var(y0_test)
print("\nThe residual sum of squares for OS Lasso is: {0}\nThe R-squared for OS Lasso is: {1}".format(RSS_OS_l.values, R_2_OS_l.values))
In [260]:
# Plot residuals
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(121, title='Fig2d-1: Lasso Model In-Sample Residuals',\
                      xlabel='Citibike Station', ylabel='Ridership Residual')
ax1.plot(err_IS_l, '.')
ax1.plot([0,450], [0,0], 'r',linewidth=3)
ax2 = fig.add_subplot(122, title='Fig2d-2: Lasso Model Out-of-Sample Residuals',\
                      xlabel='Citibike Station', ylabel='Ridership Residual')
ax2.plot(err_OS_l, '.')
ax2.plot([0,450], [0,0], 'r',linewidth=3)
fig.tight_layout(pad=3)
Figure 2d-1, 2d-2: The residuals in both graphs are not evenly distributed on either side of the zero line. This means that our Lasso model does not fit the data as well as we expected. It might be due to undetected outliers, or the correlation between ridership and the public space factors may simply be weak.
In [261]:
fig.savefig('../reports/figures/3.1-kz-LASSO Regression IS & OS Residue Figures.png')