Introduction

This is the first installment of Applying Machine Learning to Kaggle Datasets, a series of ipython notebooks demonstrating the methods described in the Stanford Machine Learning Course. In each noteobok, I apply one method taught in the course to an open kaggle competition.

In this notebook, I demonstrate linear regression using the DC Bicycle share competition.

Outline

  1. Import and examine the data
  2. Construct a linear model to predicut bicycle usage
  3. Optimize model paramters by solving Theta*X=y OLS problem
  4. Evaluate model results
  5. Submit the results to the Kaggle competition

Import Necessary Modules


In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import code.Linear_Regression_Funcs as LRF

1. Read Bicycle Share Data from file

A description of the data can be found here: https://www.kaggle.com/c/bike-sharing-demand/data


In [2]:
train = pd.read_csv("./data/bike_sharing_demand/train.csv",
                   index_col='datetime', parse_dates='datetime')
train.head()


Out[2]:
season holiday workingday weather temp atemp humidity windspeed casual registered count
datetime
2011-01-01 00:00:00 1 0 0 1 9.84 14.395 81 0 3 13 16
2011-01-01 01:00:00 1 0 0 1 9.02 13.635 80 0 8 32 40
2011-01-01 02:00:00 1 0 0 1 9.02 13.635 80 0 5 27 32
2011-01-01 03:00:00 1 0 0 1 9.84 14.395 75 0 3 10 13
2011-01-01 04:00:00 1 0 0 1 9.84 14.395 75 0 0 1 1

2. Construct a linear model to predict hourly bicycle counts

A list of variables to include in the linear model
  1. Time variables (Hour of Day, Day of Week, Month of Year, Day of Week * Hour of Day (Interaction))
  2. Weather variables (Weather Type: nice, misty, precipitation, nasty)
Preparing data for linear regression
  1. make vector y: endogenous variable = bicycle count each hour
  2. make matrix X: exogenous variables described above
  • Before the regression, scale the model by the average number of rides in each:
    • month
    • weather type

In [3]:
y = train['count']

In [4]:
# Get dictionaries to hold scale factors for month and weather type
scales = LRF.get_scale(train,['weather','monthly'])

In [5]:
# X is an [m x n] matrix.
#    m = number of observations
#    n = number of predictors
X = LRF.make_matrix(train,weather_scale=scales['weather'],monthly_scale=scales['monthly'])

3. Optimize Model Parameters via OLS regression

Ordinary Least Squares (OLS) assumes a squared loss function with no weights and no regularization.

a. sm.OLS.fit() solves the normal equations to estimate model parameters.

  • Pros: exact single step solution
  • Cons: does not scale with number of observations

b. sm.OLS.fit_regularized() minimizes the cost function numerically using coordinate descent.

  • Pros: poorly conditioned problems can be regularized by early stopping or a regularization term
  • Cons: slower for problems of this size, minimization via gradient descent is not implemented in statsmodels.OLS (3 Dec 2014)

In [6]:
results = sm.OLS(y,X).fit()

4. Evaluate Model Results


In [7]:
# Print OLS regression results and diagnostics
results.summary()


Out[7]:
OLS Regression Results
Dep. Variable: count R-squared: 0.963
Model: OLS Adj. R-squared: 0.962
Method: Least Squares F-statistic: 1448.
Date: Wed, 03 Dec 2014 Prob (F-statistic): 0.00
Time: 16:31:20 Log-Likelihood: -58166.
No. Observations: 10886 AIC: 1.167e+05
Df Residuals: 10693 BIC: 1.181e+05
Df Model: 193
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Months_Elapsed_1 15.6206 8.011 1.950 0.051 -0.082 31.323
Months_Elapsed_2 20.1382 7.359 2.736 0.006 5.712 34.564
Months_Elapsed_3 29.6989 6.608 4.494 0.000 16.746 42.652
Months_Elapsed_4 23.6136 5.707 4.137 0.000 12.426 34.801
Months_Elapsed_5 18.0927 5.543 3.264 0.001 7.227 28.958
Months_Elapsed_6 15.1806 5.502 2.759 0.006 4.395 25.966
Months_Elapsed_7 19.6494 5.628 3.491 0.000 8.617 30.682
Months_Elapsed_8 29.7368 5.749 5.173 0.000 18.468 41.006
Months_Elapsed_9 30.8250 5.723 5.387 0.000 19.608 42.042
Months_Elapsed_10 19.0813 5.839 3.268 0.001 7.636 30.526
Months_Elapsed_11 23.4614 6.077 3.861 0.000 11.549 35.374
Months_Elapsed_12 19.6215 6.246 3.141 0.002 7.378 31.865
Months_Elapsed_13 25.2206 5.986 4.213 0.000 13.487 36.955
Months_Elapsed_14 19.2910 5.496 3.510 0.000 8.518 30.064
Months_Elapsed_15 15.1804 5.355 2.835 0.005 4.684 25.677
Months_Elapsed_16 23.9481 5.368 4.462 0.000 13.427 34.469
Months_Elapsed_17 20.7854 5.288 3.931 0.000 10.421 31.150
Months_Elapsed_18 17.6691 5.347 3.304 0.001 7.187 28.151
Months_Elapsed_19 20.6134 5.306 3.885 0.000 10.212 31.014
Months_Elapsed_20 22.0085 5.292 4.159 0.000 11.635 32.382
Months_Elapsed_21 24.8268 5.331 4.657 0.000 14.376 35.277
Months_Elapsed_22 17.1652 5.429 3.162 0.002 6.523 27.807
Months_Elapsed_23 22.3705 5.510 4.060 0.000 11.570 33.171
Hour_1 -9.9726 8.078 -1.235 0.217 -25.806 5.861
Hour_2 -15.9362 7.999 -1.992 0.046 -31.616 -0.256
Hour_3 -19.1545 8.180 -2.342 0.019 -35.189 -3.120
Hour_4 -17.9639 8.008 -2.243 0.025 -33.660 -2.268
Hour_5 -0.1197 7.978 -0.015 0.988 -15.758 15.518
Hour_6 73.7026 8.031 9.178 0.000 57.961 89.444
Hour_7 265.1547 8.129 32.620 0.000 249.221 281.088
Hour_8 424.7341 8.062 52.681 0.000 408.930 440.538
Hour_9 200.5496 8.160 24.578 0.000 184.555 216.544
Hour_10 99.8264 8.062 12.382 0.000 84.024 115.629
Hour_11 126.4965 8.014 15.784 0.000 110.787 142.206
Hour_12 171.7776 8.117 21.161 0.000 155.866 187.689
Hour_13 165.4083 8.018 20.630 0.000 149.692 181.125
Hour_14 151.2943 8.036 18.827 0.000 135.542 167.047
Hour_15 169.5180 7.984 21.233 0.000 153.869 185.167
Hour_16 268.4027 8.056 33.318 0.000 252.612 284.194
Hour_17 520.2124 8.112 64.127 0.000 504.311 536.114
Hour_18 499.0364 8.121 61.448 0.000 483.117 514.956
Hour_19 342.4033 8.016 42.715 0.000 326.691 358.116
Hour_20 223.8352 8.032 27.867 0.000 208.090 239.580
Hour_21 148.8104 8.072 18.435 0.000 132.987 164.634
Hour_22 90.0603 8.079 11.148 0.000 74.224 105.896
Hour_23 45.2944 8.127 5.573 0.000 29.363 61.225
Day_1 2.8788 7.864 0.366 0.714 -12.536 18.293
Day_2 10.9351 7.653 1.429 0.153 -4.066 25.936
Day_3 15.9188 7.860 2.025 0.043 0.511 31.326
Day_4 28.7777 7.678 3.748 0.000 13.727 43.828
Day_5 69.3382 7.176 9.663 0.000 55.272 83.404
Day_6 70.8883 7.694 9.213 0.000 55.806 85.971
Weather_2 9.9642 1.146 8.694 0.000 7.718 12.211
Weather_3 3.1118 2.811 1.107 0.268 -2.398 8.621
Weather_4 -223.5256 92.196 -2.424 0.015 -404.246 -42.805
Day_1_*_Hour_1 -4.6686 11.771 -0.397 0.692 -27.742 18.405
Day_1_*_Hour_2 -4.7630 11.742 -0.406 0.685 -27.779 18.253
Day_1_*_Hour_3 -4.1579 11.919 -0.349 0.727 -27.521 19.205
Day_1_*_Hour_4 -3.8242 11.749 -0.326 0.745 -26.854 19.205
Day_1_*_Hour_5 -2.1925 11.740 -0.187 0.852 -25.204 20.819
Day_1_*_Hour_6 9.2137 11.785 0.782 0.434 -13.886 32.314
Day_1_*_Hour_7 12.4646 11.806 1.056 0.291 -10.678 35.608
Day_1_*_Hour_8 18.5426 11.752 1.578 0.115 -4.493 41.578
Day_1_*_Hour_9 8.6453 11.808 0.732 0.464 -14.501 31.792
Day_1_*_Hour_10 6.3671 11.779 0.541 0.589 -16.722 29.456
Day_1_*_Hour_11 -5.0526 11.681 -0.433 0.665 -27.949 17.844
Day_1_*_Hour_12 -12.9040 11.781 -1.095 0.273 -35.997 10.189
Day_1_*_Hour_13 -6.8936 11.705 -0.589 0.556 -29.838 16.051
Day_1_*_Hour_14 -8.8924 11.740 -0.757 0.449 -31.905 14.121
Day_1_*_Hour_15 -8.5768 11.682 -0.734 0.463 -31.476 14.323
Day_1_*_Hour_16 -7.9458 11.682 -0.680 0.496 -30.844 14.953
Day_1_*_Hour_17 -4.3034 11.718 -0.367 0.713 -27.272 18.665
Day_1_*_Hour_18 -13.6728 11.674 -1.171 0.242 -36.557 9.211
Day_1_*_Hour_19 -17.5595 11.602 -1.514 0.130 -40.301 5.182
Day_1_*_Hour_20 5.4856 11.678 0.470 0.639 -17.406 28.377
Day_1_*_Hour_21 9.1994 11.694 0.787 0.432 -13.724 32.122
Day_1_*_Hour_22 13.7585 11.652 1.181 0.238 -9.081 36.598
Day_1_*_Hour_23 7.0077 11.741 0.597 0.551 -16.006 30.021
Day_2_*_Hour_1 -9.2812 11.395 -0.815 0.415 -31.617 13.055
Day_2_*_Hour_2 -10.3549 11.354 -0.912 0.362 -32.610 11.900
Day_2_*_Hour_3 -9.7327 11.496 -0.847 0.397 -32.268 12.802
Day_2_*_Hour_4 -11.5499 11.446 -1.009 0.313 -33.986 10.886
Day_2_*_Hour_5 -8.5024 11.370 -0.748 0.455 -30.789 13.784
Day_2_*_Hour_6 3.4818 11.444 0.304 0.761 -18.950 25.913
Day_2_*_Hour_7 16.6284 11.531 1.442 0.149 -5.974 39.231
Day_2_*_Hour_8 41.5842 11.502 3.615 0.000 19.038 64.131
Day_2_*_Hour_9 6.5540 11.604 0.565 0.572 -16.192 29.300
Day_2_*_Hour_10 -1.8299 11.515 -0.159 0.874 -24.402 20.742
Day_2_*_Hour_11 -12.0941 11.460 -1.055 0.291 -34.558 10.369
Day_2_*_Hour_12 -12.8202 11.526 -1.112 0.266 -35.412 9.772
Day_2_*_Hour_13 -20.9115 11.466 -1.824 0.068 -43.387 1.563
Day_2_*_Hour_14 -20.6515 11.438 -1.806 0.071 -43.071 1.768
Day_2_*_Hour_15 -20.8866 11.453 -1.824 0.068 -43.338 1.564
Day_2_*_Hour_16 -35.5946 11.381 -3.128 0.002 -57.903 -13.286
Day_2_*_Hour_17 -41.7742 11.403 -3.663 0.000 -64.127 -19.421
Day_2_*_Hour_18 -7.5222 11.576 -0.650 0.516 -30.213 15.168
Day_2_*_Hour_19 -14.9571 11.386 -1.314 0.189 -37.277 7.362
Day_2_*_Hour_20 5.5813 11.388 0.490 0.624 -16.741 27.904
Day_2_*_Hour_21 14.0508 11.396 1.233 0.218 -8.288 36.389
Day_2_*_Hour_22 21.5795 11.408 1.892 0.059 -0.783 43.942
Day_2_*_Hour_23 2.5087 11.484 0.218 0.827 -20.001 25.019
Day_3_*_Hour_1 -13.7088 11.691 -1.173 0.241 -36.626 9.209
Day_3_*_Hour_2 -15.1032 11.587 -1.303 0.192 -37.816 7.609
Day_3_*_Hour_3 -15.1437 11.728 -1.291 0.197 -38.133 7.845
Day_3_*_Hour_4 -16.8648 11.619 -1.452 0.147 -39.640 5.910
Day_3_*_Hour_5 -13.5723 11.608 -1.169 0.242 -36.325 9.181
Day_3_*_Hour_6 -4.5756 11.653 -0.393 0.695 -27.417 18.266
Day_3_*_Hour_7 6.8192 11.734 0.581 0.561 -16.181 29.820
Day_3_*_Hour_8 22.1295 11.665 1.897 0.058 -0.737 44.996
Day_3_*_Hour_9 -6.0219 11.740 -0.513 0.608 -29.035 16.991
Day_3_*_Hour_10 -10.7687 11.658 -0.924 0.356 -33.621 12.083
Day_3_*_Hour_11 -8.5635 11.635 -0.736 0.462 -31.371 14.244
Day_3_*_Hour_12 -15.3395 11.668 -1.315 0.189 -38.211 7.532
Day_3_*_Hour_13 -13.5303 11.586 -1.168 0.243 -36.242 9.181
Day_3_*_Hour_14 -20.3574 11.583 -1.758 0.079 -43.062 2.348
Day_3_*_Hour_15 -17.5265 11.524 -1.521 0.128 -40.115 5.062
Day_3_*_Hour_16 -24.6346 11.576 -2.128 0.033 -47.326 -1.943
Day_3_*_Hour_17 -28.1863 11.654 -2.419 0.016 -51.030 -5.342
Day_3_*_Hour_18 -33.4563 11.659 -2.870 0.004 -56.310 -10.603
Day_3_*_Hour_19 -22.0675 11.552 -1.910 0.056 -44.712 0.577
Day_3_*_Hour_20 2.8539 11.547 0.247 0.805 -19.781 25.489
Day_3_*_Hour_21 9.8515 11.637 0.847 0.397 -12.958 32.661
Day_3_*_Hour_22 19.4362 11.639 1.670 0.095 -3.378 42.251
Day_3_*_Hour_23 16.1372 11.669 1.383 0.167 -6.736 39.010
Day_4_*_Hour_1 -17.0434 11.506 -1.481 0.139 -39.597 5.511
Day_4_*_Hour_2 -24.1076 11.425 -2.110 0.035 -46.502 -1.713
Day_4_*_Hour_3 -27.1160 11.535 -2.351 0.019 -49.727 -4.505
Day_4_*_Hour_4 -27.9054 11.476 -2.432 0.015 -50.401 -5.409
Day_4_*_Hour_5 -28.8077 11.449 -2.516 0.012 -51.249 -6.366
Day_4_*_Hour_6 -32.0347 11.520 -2.781 0.005 -54.617 -9.453
Day_4_*_Hour_7 -53.1982 11.611 -4.582 0.000 -75.959 -30.438
Day_4_*_Hour_8 -12.8399 11.508 -1.116 0.265 -35.397 9.717
Day_4_*_Hour_9 -4.0348 11.542 -0.350 0.727 -26.659 18.590
Day_4_*_Hour_10 -0.9062 11.492 -0.079 0.937 -23.433 21.620
Day_4_*_Hour_11 3.2064 11.484 0.279 0.780 -19.303 25.716
Day_4_*_Hour_12 6.8410 11.535 0.593 0.553 -15.770 29.452
Day_4_*_Hour_13 16.3048 11.425 1.427 0.154 -6.091 38.701
Day_4_*_Hour_14 17.4463 11.469 1.521 0.128 -5.036 39.929
Day_4_*_Hour_15 22.4190 11.429 1.962 0.050 0.016 44.822
Day_4_*_Hour_16 12.6954 11.529 1.101 0.271 -9.903 35.294
Day_4_*_Hour_17 -63.7130 11.583 -5.500 0.000 -86.419 -41.008
Day_4_*_Hour_18 -123.0775 11.567 -10.641 0.000 -145.750 -100.404
Day_4_*_Hour_19 -88.9573 11.472 -7.754 0.000 -111.444 -66.470
Day_4_*_Hour_20 -58.4012 11.453 -5.099 0.000 -80.851 -35.951
Day_4_*_Hour_21 -30.4182 11.483 -2.649 0.008 -52.927 -7.910
Day_4_*_Hour_22 8.5425 11.472 0.745 0.456 -13.944 31.029
Day_4_*_Hour_23 21.4113 11.529 1.857 0.063 -1.188 44.011
Day_5_*_Hour_1 -17.7230 10.810 -1.640 0.101 -38.912 3.466
Day_5_*_Hour_2 -30.6979 10.754 -2.855 0.004 -51.778 -9.618
Day_5_*_Hour_3 -52.1627 10.859 -4.804 0.000 -73.449 -30.877
Day_5_*_Hour_4 -66.7563 10.734 -6.219 0.000 -87.797 -45.715
Day_5_*_Hour_5 -82.6931 10.693 -7.733 0.000 -103.654 -61.732
Day_5_*_Hour_6 -141.7856 10.749 -13.190 0.000 -162.856 -120.715
Day_5_*_Hour_7 -299.3367 10.812 -27.687 0.000 -320.529 -278.144
Day_5_*_Hour_8 -381.9155 10.783 -35.420 0.000 -403.051 -360.780
Day_5_*_Hour_9 -95.8556 10.853 -8.832 0.000 -117.130 -74.582
Day_5_*_Hour_10 72.3466 10.784 6.709 0.000 51.208 93.485
Day_5_*_Hour_11 111.8986 10.748 10.412 0.000 90.831 132.966
Day_5_*_Hour_12 109.9479 10.812 10.169 0.000 88.754 131.142
Day_5_*_Hour_13 131.4199 10.753 12.222 0.000 110.342 152.498
Day_5_*_Hour_14 138.8437 10.776 12.884 0.000 117.720 159.968
Day_5_*_Hour_15 119.8622 10.775 11.124 0.000 98.741 140.983
Day_5_*_Hour_16 13.7431 10.837 1.268 0.205 -7.500 34.986
Day_5_*_Hour_17 -248.2790 10.913 -22.751 0.000 -269.671 -226.887
Day_5_*_Hour_18 -276.1579 10.864 -25.419 0.000 -297.454 -254.862
Day_5_*_Hour_19 -170.0055 10.811 -15.725 0.000 -191.198 -148.813
Day_5_*_Hour_20 -113.2595 10.843 -10.445 0.000 -134.514 -92.005
Day_5_*_Hour_21 -72.6400 10.854 -6.693 0.000 -93.915 -51.365
Day_5_*_Hour_22 -36.4564 10.824 -3.368 0.001 -57.673 -15.239
Day_5_*_Hour_23 -22.7051 10.926 -2.078 0.038 -44.121 -1.289
Day_6_*_Hour_1 -7.0523 11.512 -0.613 0.540 -29.618 15.513
Day_6_*_Hour_2 -19.3372 11.469 -1.686 0.092 -41.819 3.145
Day_6_*_Hour_3 -46.9686 11.607 -4.047 0.000 -69.720 -24.217
Day_6_*_Hour_4 -67.3904 11.490 -5.865 0.000 -89.913 -44.868
Day_6_*_Hour_5 -84.4756 11.438 -7.385 0.000 -106.897 -62.054
Day_6_*_Hour_6 -152.5208 11.471 -13.297 0.000 -175.005 -130.036
Day_6_*_Hour_7 -323.7666 11.562 -28.002 0.000 -346.431 -301.103
Day_6_*_Hour_8 -434.2322 11.495 -37.776 0.000 -456.764 -411.700
Day_6_*_Hour_9 -136.0725 11.563 -11.768 0.000 -158.739 -113.406
Day_6_*_Hour_10 61.8447 11.443 5.405 0.000 39.414 84.275
Day_6_*_Hour_11 90.3837 11.420 7.915 0.000 67.999 112.768
Day_6_*_Hour_12 104.8658 11.531 9.094 0.000 82.263 127.469
Day_6_*_Hour_13 114.0860 11.453 9.961 0.000 91.636 136.536
Day_6_*_Hour_14 117.7520 11.479 10.258 0.000 95.251 140.253
Day_6_*_Hour_15 96.0046 11.474 8.367 0.000 73.513 118.496
Day_6_*_Hour_16 -3.4728 11.501 -0.302 0.763 -26.016 19.070
Day_6_*_Hour_17 -285.3727 11.576 -24.653 0.000 -308.063 -262.682
Day_6_*_Hour_18 -312.0480 11.593 -26.918 0.000 -334.772 -289.324
Day_6_*_Hour_19 -212.0147 11.415 -18.574 0.000 -234.390 -189.640
Day_6_*_Hour_20 -146.8203 11.432 -12.843 0.000 -169.229 -124.411
Day_6_*_Hour_21 -112.4777 11.459 -9.816 0.000 -134.939 -90.016
Day_6_*_Hour_22 -84.0678 11.554 -7.276 0.000 -106.716 -61.419
Day_6_*_Hour_23 -74.2203 11.615 -6.390 0.000 -96.987 -51.453
Omnibus: 1688.015 Durbin-Watson: 0.755
Prob(Omnibus): 0.000 Jarque-Bera (JB): 30329.423
Skew: 0.023 Prob(JB): 0.00
Kurtosis: 11.177 Cond. No. 118.

In [8]:
# Print score on Kaggle training data
ypredict = results.predict(X)
ypredict[ypredict<0]=0  # Make sure we don't predict negative ridership!
print "score on training data = ", LRF.score(y,ypredict)


score on training data =  0.463711064666

In [9]:
# View Time series of ridership observations and predictions
start = 2700
end = start + 300
#start = 0; end = len(train)-1  # Uncomment to see entire timeseries
plt.plot(train.index[start:end],y[start:end],'-r',alpha=1,lw=3)
plt.plot(train.index[start:end],ypredict[start:end],'-b',alpha=0.3,lw=3)


Out[9]:
[<matplotlib.lines.Line2D at 0x108c77d90>]

5. Submit the results to the Kaggle competition


In [10]:
# Read test data
test = pd.read_csv("./data/bike_sharing_demand/test.csv",
                   index_col='datetime', parse_dates='datetime')

In [11]:
# Construct test model matrix
Xtest = LRF.make_matrix(test,weather_scale=scales['weather'],monthly_scale=scales['monthly'])

In [12]:
# Calculate predictions by applying model parameters to test model matrix
Ypredict = pd.DataFrame(results.predict(Xtest),index=Xtest.index)
Ypredict = Ypredict.apply(np.round)
Ypredict[Ypredict<0]=0

In [13]:
# Write to csv
Ypredict.columns = ['count']
Ypredict = Ypredict.astype(int)  # Force integers in output
Ypredict.to_csv('./predictions/Linear_Regression_Prediction.csv',sep=',')

This submission received a score of 0.57986, placing 902 of 1481 submissions. Not bad for a quick linear regression model!

6. How to improve the regression model

The evaluation function uses log of rides, so it's very important to be correct when ridership is low.

  1. Implement weighted least squares (sm.WLS(y,X)) in which the smaller values of the observations are weighted more heavily.

  2. Include more predictor variables. For example, use weather information after transforming it from the raw data. For example, try monthly temperature anomaly instead of simply temperature. Additionally, include an interaction term between temperature anomaly and month of the year (or day of week).

  3. Regularize the model results. The test data performs worse than the training data, suggesting that the model overfits the training data. Try early stopping or including a regularization term using OLS.fit_regularlized().

  4. Fine tune model predictions. For example, manually reduce ridership predictions during the last week of December, when many people will be out of town or not commuting to work.

  5. Assume that the endogenous variable follows a Poisson distribution, not a normal distribution sm.GLM(y,X,family=sm.families.Poisson()). This is cheating a little bit because this tutorial is about linear regression.