In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame, Series
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [2]:
c_cycle=("#3498db","#e74c3c","#1abc9c","#9b59b6","#f1c40f","#ecf0f1","#34495e",
                  "#446cb3","#d24d57","#27ae60","#663399", "#f7ca18","#bdc3c7","#2c3e50")
mpl.rc('font', family='Bitstream Vera Sans', size=20)
mpl.rc('lines', linewidth=2,color="#2c3e50")
mpl.rc('patch', linewidth=0,facecolor="none",edgecolor="none")
mpl.rc('text', color='#2c3e50')
mpl.rc('axes', facecolor='none',edgecolor="none",titlesize=25,labelsize=15,color_cycle=c_cycle,grid=False)
mpl.rc('xtick.major',size=10,width=0)
mpl.rc('ytick.major',size=10,width=0)
mpl.rc('xtick.minor',size=10,width=0)
mpl.rc('ytick.minor',size=10,width=0)
mpl.rc('ytick',direction="out")
mpl.rc('grid',color='#c0392b',alpha=0.3,linewidth=1)
mpl.rc('legend',numpoints=3,fontsize=15,borderpad=0,markerscale=3,labelspacing=0.2,frameon=False,framealpha=0.6,handlelength=1,handleheight=0.5)
mpl.rc('figure',figsize=(10,6),dpi=80,facecolor="none",edgecolor="none")
mpl.rc('savefig',dpi=100,facecolor="none",edgecolor="none")

In [3]:
weather = pd.read_table("daily_weather.tsv")
usage = pd.read_table("usage_2012.tsv")
stations = pd.read_table("stations.tsv")

In [4]:
weather.loc[weather['season_code'] == 1, 'season_desc'] = 'winter'
weather.loc[weather['season_code'] == 2, 'season_desc'] = 'spring'
weather.loc[weather['season_code'] == 3, 'season_desc'] = 'summer'
weather.loc[weather['season_code'] == 4, 'season_desc'] = 'fall'
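The same mapping can be written more compactly with Series.map (an equivalent sketch of the four .loc assignments above):

In [ ]:
# One-step version of the season-code mapping.
season_map = {1: 'winter', 2: 'spring', 3: 'summer', 4: 'fall'}
weather['season_desc'] = weather['season_code'].map(season_map)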

In [5]:
weather['date'] = pd.to_datetime(weather['date'])

In [6]:
month_rental = weather.groupby(weather['date'].dt.month)['total_riders'].sum()
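For a quick look at seasonality, a bar chart of the monthly totals (a sketch, using the rc styling configured above):

In [ ]:
# Total riders per calendar month (1-12).
month_rental.plot(kind='bar')
plt.xlabel('month')
plt.ylabel('total riders')
plt.show()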

In [7]:
season_mean_temp = weather.groupby('season_desc')['temp'].mean()  # renamed: 'mean' would shadow numpy's mean pulled in by %pylab

To start with, we'll need to compute the number of rentals per station per day. Use pandas to do that.


In [8]:
count = usage['station_start'].value_counts()

In [9]:
average_rental_df = DataFrame({ 'average_rental' : count / 365})

In [10]:
average_rental_df


Out[10]:
average_rental
Massachusetts Ave & Dupont Circle NW 191.369863
Columbus Circle / Union Station 151.084932
15th & P St NW 135.386301
17th & Corcoran St NW 119.306849
14th & V St NW 110.252055
Adams Mill & Columbia Rd NW 110.087671
Thomas Circle 109.865753
Eastern Market Metro / Pennsylvania Ave & 7th St SE 108.884932
16th & Harvard St NW 95.531507
21st & I St NW 91.000000
20th St & Florida Ave NW 87.536986
North Capitol St & F St NW 87.175342
14th & Rhode Island Ave NW 87.019178
7th & F St NW / National Portrait Gallery 86.493151
8th & H St NW 85.638356
Metro Center / 12th & G St NW 84.454795
Lincoln Park / 13th & East Capitol St NE 80.808219
5th & K St NW 78.709589
17th & Rhode Island Ave NW 78.512329
Calvert St & Woodley Pl NW 78.246575
Park Rd & Holmead Pl NW 77.728767
Jefferson Dr & 14th St SW 76.805479
21st & M St NW 76.605479
10th & U St NW 71.438356
14th & Harvard St NW 71.249315
New Hampshire Ave & T St NW 70.969863
Convention Center / 7th & M St NW 69.309589
25th St & Pennsylvania Ave NW 69.021918
14th & R St NW 68.334247
1st & M St NE 67.942466
... ...
King St Metro 3.967123
12th & Newton St NE 3.564384
Glebe Rd & 11th St N 3.309589
Washington Blvd & 10th St N 3.073973
Braddock Rd Metro 2.895890
Anacostia Metro 2.542466
Market Square / King St & Royal St 2.279452
Washington Blvd & 7th St N 2.189041
Good Hope Rd & MLK Ave SE 2.134247
Anacostia Ave & Benning Rd NE / River Terrace 2.084932
Saint Asaph St & Pendleton St 2.054795
Utah St & 11th St N 2.038356
Prince St & Union St 1.797260
King St & Patrick St 1.767123
Barton St & 10th St N 1.583562
Benning Rd & East Capitol St NE / Benning Rd Metro 1.567123
Arlington Blvd & N Queen St 1.391781
Pennsylvania & Minnesota Ave SE 1.309589
Anacostia Library 1.095890
Commerce St & Fayette St 1.057534
Congress Heights Metro 0.775342
Benning Branch Library 0.758904
Good Hope & Naylor Rd SE 0.739726
Minnesota Ave Metro/DOES 0.715068
Good Hope Rd & 14th St SE 0.668493
Henry St & Pendleton St 0.660274
Fairfax Village 0.536986
Branch & Pennsylvania Ave SE 0.493151
Potomac Ave & 35th St S 0.361644
Randle Circle & Minnesota Ave SE 0.361644

185 rows × 1 columns
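Note that dividing by 365 assumes every station was active for the full year. A hedged alternative (assuming the usage table has a parseable time_start timestamp column, which is not shown here) normalizes by each station's observed active days instead:

In [ ]:
# Hypothetical: divide by the number of distinct days each station appears
# in the log, rather than a flat 365. 'time_start' is an assumed column name.
usage['date'] = pd.to_datetime(usage['time_start']).dt.date
active_days = usage.groupby('station_start')['date'].nunique()
average_rental_adj = count / active_days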

a. Our stations data has a huge number of quantitative attributes: fast_food, parking, restaurant, and so on. Some are encoded as 0 or 1 (absence or presence); others represent counts. To start with, run a simple linear regression where the input (x) variables are all the various station attributes and the output (y) variable is the average number of rentals per day.


In [11]:
from sklearn import linear_model

In [12]:
# Turn the value_counts-derived frame into two named columns: station, avg_rentals.
indexed_avg_df = average_rental_df.reset_index()
indexed_avg_df.columns = ['station', 'avg_rentals']

In [15]:
average_stations_df = pd.merge(left=indexed_avg_df, right=stations, on='station')

In [16]:
# Inputs: every station-attribute column (position 8 onward); target: avg_rentals.
x = average_stations_df.iloc[:, 8:]
y = average_stations_df.iloc[:, 1:2]

In [17]:
linear_reg = linear_model.LinearRegression()
linear_reg.fit(x, y)


Out[17]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

b. Plot the predicted values (model.predict(x)) against the actual values and see how they compare.


In [18]:
plt.scatter(y, linear_reg.predict(x), s=50)
plt.xlabel('actual value')
plt.ylabel('predicted value')
plt.show()
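The scatter alone is hard to judge by eye; a numeric companion (a sketch, not part of the original exercise) is the in-sample R², which will be optimistic given the overfitting discussed below:

In [ ]:
# Coefficient of determination on the training data itself.
linear_reg.score(x, y)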


c. In this case, there are 129 input variables and only 185 rows, which means we're very likely to overfit. Look at the model coefficients and see if anything jumps out as odd.


In [19]:
linear_reg.coef_


Out[19]:
array([[  2.31721704e+00,  -2.14932061e-01,   5.55710831e-02,
         -6.33034472e+01,   1.95461168e+00,  -4.30632265e+00,
          5.51850630e+00,   1.90775768e+00,  -3.78606353e-01,
          7.83669551e-12,   2.47157737e+00,   6.92093575e+01,
          4.55079500e+00,   5.25217425e-12,  -1.24598788e-12,
          5.24796402e+00,   1.26818374e-11,   4.71211429e+00,
          9.12405995e+00,  -3.03680165e+00,   3.23088610e+00,
         -3.75439309e+01,   2.80098019e+01,  -3.85041448e+01,
         -1.56214076e+01,   2.12991919e+01,   1.52771531e+00,
          3.03238328e+00,  -4.28998558e+00,   8.94580818e+00,
          2.12991919e+01,  -2.85293190e-01,   3.24996230e+00,
          1.29635558e+01,   4.81656870e+00,  -1.97777901e+00,
         -3.44530831e+01,  -1.67244489e+01,   5.32312313e+00,
          5.04749123e+00,   1.14823383e+01,  -5.85187161e+00,
          1.48491231e+02,  -5.72315824e+00,  -1.28867680e+01,
          1.82383250e+02,  -4.46662968e-02,   4.04835029e-01,
         -6.61263124e+00,  -9.07174761e+00,   7.65223912e+01,
          6.04577483e+00,  -6.46742643e+00,  -1.93117599e+01,
          3.29802294e+01,   1.00463950e+01,   2.38344517e+01,
          5.40851589e+01,   1.56183090e+01,   5.19409583e+00,
          1.24487220e+01,  -1.82288484e+01,  -2.90906428e+00,
         -2.49165120e+01,  -2.70775807e+01,   1.76334302e+01,
         -2.49953401e+01,  -1.64753484e+01,  -1.71491578e+01,
         -2.01993823e+01,   4.59571709e+01,  -8.59494580e+01,
         -1.96050518e+01,   1.17080604e+01,   1.59601641e+00,
          3.42277281e+01,   3.42277281e+01,  -1.44175168e+01,
          3.68231342e+01,  -3.79667801e+00,  -3.79667801e+00,
          1.00733949e+02,   7.85516211e-15,  -8.42154144e+00,
          1.15940792e+01,  -6.42365644e-01,  -6.42365644e-01,
         -1.10034617e+01,   3.11225185e-14,  -3.94065511e-15,
          2.96425076e+01,  -5.62058836e-15,   1.57404934e+01,
         -4.97305262e+01,   4.74346096e-15,   2.15893669e-15,
          6.07615746e-16,  -9.13249113e-15,  -8.50400911e-15,
          4.96298604e-16,  -8.04695128e-15,   6.90004690e-15,
          4.59741844e-02,   3.51351643e-01,  -1.08045886e+00,
         -6.32501804e-01,   4.07413948e+01,   0.00000000e+00,
         -3.63886442e-01,  -4.98038835e+00,   3.29809209e+01,
          1.29899253e+01,  -3.54106371e+01,  -4.99597010e+00,
          1.74971881e+01,   0.00000000e+00,   3.83029840e+00,
         -1.03432997e+00,  -1.88116448e+00,  -6.24305687e+00,
         -1.06129934e+00,  -9.48083675e+01,   1.64476743e+00,
         -2.18760877e+01,  -2.33548825e+00,  -1.82010904e+01,
         -2.15230845e+01,  -3.49147840e+01,   6.84319009e+00]])
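The raw array is hard to read. Pairing each coefficient with its column name (a minimal sketch) makes the oddities easier to spot: a handful of features get enormous weights, while the 1e-12-scale entries suggest near-constant columns.

In [ ]:
# Attach column names and show the ten largest coefficients by magnitude.
coef = Series(linear_reg.coef_.ravel(), index=x.columns)
coef[coef.abs().sort_values(ascending=False).index[:10]]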

d. Go back and split the data into a training set and a test set. Train the model on the training set and evaluate it on the test set. How does it do?


In [20]:
# train_test_split now lives in sklearn.model_selection; the old
# sklearn.cross_validation module was deprecated in 0.18 and removed in 0.20.
from sklearn.model_selection import train_test_split

In [21]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=42)

In [22]:
lin_reg = linear_model.LinearRegression()
lin_reg.fit(x_train, y_train)


Out[22]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [23]:
plt.scatter(lin_reg.predict(x_test), y_test, s=50)
plt.xlabel('predicted value')
plt.ylabel('actual value')
plt.show()


The predictions are widely scattered against the actual values, so the unregularized model does not look accurate on held-out data.
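To quantify that impression (a sketch, not in the original), check the test-set R²; with this many features and so few rows it can easily be negative, i.e. worse than predicting the mean:

In [ ]:
# R^2 on the held-out half of the data.
lin_reg.score(x_test, y_test)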

2. a. Since we have so many variables, this is a good candidate for regularization. In particular, since we'd like to eliminate a lot of them, lasso seems like a good candidate. Build a lasso model on your training data for various values of alpha. Which variables survive?


In [24]:
from sklearn.linear_model import Lasso

In [25]:
lasso_model = Lasso(alpha=0.1)

In [26]:
lasso_model.fit(x_train, y_train)


Out[26]:
Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [27]:
lasso_model.coef_


Out[27]:
array([  2.11891665,  -8.31310629,   1.30158042,  12.38513702,
         1.81502454,  -3.97056337,   4.92886595,   0.        ,
         2.41843936,   0.        ,   2.52385879,   0.        ,
        -4.82539534,   0.        ,   0.        ,  -3.25661618,
         0.        ,   0.        ,  -0.52538975,  -3.47180753,
         3.92691686,  -0.        ,  29.8695751 ,   0.        ,
        -0.        ,  -0.        ,  -1.55760355,   2.31622019,
        -3.36219043,   1.68220631,  -0.        ,  -0.        ,
         2.37665676,  -0.26992042,   5.00393918,  -4.84679097,
        -0.        , -31.15780614,   1.25142142,   6.35268305,
        11.88400288,  -6.62417012,   0.        ,   0.        ,
        -0.        ,  25.2336119 ,   0.        ,  16.70356365,
        -0.        ,  -0.        ,  -0.        ,   0.        ,
         0.        , -10.01163694,  -0.        ,   6.6717517 ,
        -0.        ,  29.72285205,   0.        ,   0.        ,
         3.19781292,  -0.        ,  -4.31999142,  -2.19710409,
        -0.        ,   0.        ,   0.        ,   0.08881898,
         0.        ,  -0.        ,  -0.        ,  -0.        ,
         0.        ,   0.        ,  -3.92813546,  -0.        ,
        -0.        ,  -0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        , -11.14856473,
        -0.        ,   0.        ,   0.        , -13.430503  ,
         0.        ,   0.        ,   0.        ,   0.        ,
        -0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.0705824 ,   0.18625663,
        -0.97174324,   2.20307531,  42.06718366,   0.        ,
         0.57957333,  -8.75617772,   0.        ,   0.        ,
        -0.        ,   0.        ,   0.        ,   0.        ,
         6.92185157,   8.20350497,  -7.18927453, -10.16444316,
         0.        ,   0.        ,   0.48530522,  -0.        ,
        -7.3414173 , -28.84169793,   0.        ,  -0.        ,   0.        ])

In [28]:
lasso_model = Lasso(alpha=0.5)

In [29]:
lasso_model.fit(x_train, y_train)


Out[29]:
Lasso(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [30]:
lasso_model.coef_


Out[30]:
array([  7.41776749e-01,  -6.15480041e+00,   1.27822217e+00,
         0.00000000e+00,   1.13284481e-02,  -5.06650708e+00,
         7.38451675e+00,   0.00000000e+00,   2.83386812e+00,
         0.00000000e+00,   3.16726479e+00,   0.00000000e+00,
        -2.23151031e+00,   0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,  -3.54215927e+00,   2.80925056e+00,
        -0.00000000e+00,   2.14000793e+01,  -0.00000000e+00,
        -0.00000000e+00,  -0.00000000e+00,  -2.88470565e-01,
         2.12348288e+00,   2.89866304e-01,   4.90382847e-01,
        -0.00000000e+00,  -0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,   4.54885183e+00,   0.00000000e+00,
         0.00000000e+00,  -1.69598679e+01,   0.00000000e+00,
         0.00000000e+00,   9.45091113e+00,  -0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   4.96415763e+00,
        -0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,   1.38206056e-01,  -0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -4.33538268e+00,
        -0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00,  -0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
        -0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         2.78125265e-02,   1.26368691e-01,  -5.50997086e-01,
         2.54309964e+00,   1.64684082e+00,   0.00000000e+00,
         6.76843898e-01,  -5.39212224e+00,   0.00000000e+00,
         0.00000000e+00,  -0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   7.52113439e+00,
         7.00095762e+00,  -1.88569517e+00,  -4.13124568e+00,
         0.00000000e+00,   0.00000000e+00,  -0.00000000e+00,
        -0.00000000e+00,  -0.00000000e+00,  -4.68175291e+00,
         0.00000000e+00,  -0.00000000e+00,   0.00000000e+00])

In [31]:
lasso_model = Lasso(alpha=1)

In [32]:
lasso_model.fit(x_train, y_train)


Out[32]:
Lasso(alpha=1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [33]:
lasso_model.coef_


Out[33]:
array([  0.37181023,  -4.61114302,   1.30076133,  -0.        ,
         0.        ,  -5.27229216,   7.96406457,   0.        ,
         2.65616947,   0.        ,   3.02013032,   0.        ,
        -0.        ,   0.        ,   0.        ,  -0.        ,
         0.        ,  -0.        ,  -0.        ,  -2.88227526,
         2.72890434,  -0.        ,  14.34804695,  -0.        ,
        -0.        ,  -0.        ,  -0.        ,   2.37318045,
         0.54716315,   0.        ,  -0.        ,  -0.        ,
         0.        ,   0.        ,   3.83022461,   0.        ,
         0.        ,  -2.96010342,  -0.        ,   0.        ,
         4.83249053,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,   0.        ,   0.        ,   0.        ,
        -0.        ,  -0.        ,  -0.        ,   0.        ,
         0.        ,  -0.        ,  -0.        ,   0.        ,
        -0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,  -3.95111226,  -0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,   0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,   0.        ,  -0.        ,
        -0.        ,   0.        ,   0.        ,  -0.        ,
        -0.        ,   0.        ,   0.        ,  -0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
        -0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,   0.        ,   0.        ,
         0.        ,   0.        ,  -0.01805897,   0.09703119,
        -0.        ,   2.32849308,   0.        ,   0.        ,
         0.49097261,  -0.        ,   0.        ,   0.        ,
        -0.        ,   0.        ,   0.        ,   0.        ,
         6.48574809,   6.30384593,  -0.        ,  -0.        ,
         0.        ,   0.        ,  -0.        ,  -0.        ,
        -0.        ,  -0.        ,   0.        ,  -0.        ,   0.        ])
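Rather than refitting cell by cell, a short loop (a sketch; small alphas may emit convergence warnings) counts the survivors at each alpha:

In [ ]:
# Sweep several alphas and report how many columns keep nonzero weights.
for a in (0.1, 0.5, 1.0, 5.0):
    m = Lasso(alpha=a).fit(x_train, y_train)
    survivors = x.columns[m.coef_.ravel() != 0]
    print("alpha=%s: %d variables survive" % (a, len(survivors)))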

b. How does this model perform on the test set?


In [34]:
plt.scatter(lasso_model.predict(x_test), y_test, s=50)
plt.xlabel('predicted value')
plt.ylabel('actual value')
plt.show()


There is now some visible correlation between the predicted and actual values, an improvement over the unregularized fit.
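The same numeric check as before (a sketch) puts a number on the comparison:

In [ ]:
# Test-set R^2 for the alpha=1 lasso fit.
lasso_model.score(x_test, y_test)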

3. No matter how high I make alpha, the coefficient on crossing ("number of nearby crosswalks") never goes away. Try a simple linear regression on just that variable.


In [35]:
# Select the 'crossing' attribute by name (position 111 in the merged frame)
# rather than by opaque integer slicing.
x = average_stations_df[['crossing']]
y = average_stations_df[['avg_rentals']]
lin_regr = linear_model.LinearRegression()
lin_regr.fit(x, y)


Out[35]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [36]:
plt.scatter(lin_regr.predict(x), y, s=50)
plt.xlabel('predicted value')
plt.ylabel('actual value')
plt.show()


The single variable looks reasonably correlated with the actual values.
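For reference, the slope and intercept of the single-variable fit can be read off directly (a quick sketch):

In [ ]:
# The slope is the average change in rentals/day per additional nearby crosswalk.
print(lin_regr.coef_, lin_regr.intercept_)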

