Design of traveltime_lineartime learner

We design a learner to predict Travel Time from the time of day.

Import Data



In [1]:

    
# allow importing modules and datafiles up one directory
import os
os.chdir('../')

import pandas as pd
import numpy as np
import datetime
import math
import datatables.traveltime



In [2]:

    
data = datatables.traveltime.read('data/traveltime.task.train')
data.head()









    Out[2]:






  
    
      
      id
      t
      volume
      y
    
  
  
    
      0
      0
      2015-08-19 00:00:00
      32
      660.0
    
    
      1
      1
      2015-08-19 00:15:00
      20
      598.0
    
    
      2
      2
      2015-08-19 00:30:00
      24
      637.5
    
    
      3
      3
      2015-08-19 00:45:00
      16
      637.5
    
    
      4
      4
      2015-08-19 01:00:00
      9
      566.0

'y' is travel time in seconds.

Extract Features

Represent time as a decimal fraction of a day, so that we can more easily use it for prediction.



In [3]:

    
def frac_day(time):
    """
    Convert time to fraction of a day (0.0 to 1.0)
    Can also pass this function a datetime object    
    """
    return time.hour*(1./24) + time.minute*(1./(24*60)) + time.second*(1./(24*60*60))

We create the features $time^1$, $time^2$, ... in order to allow the regression algorithm to find polynomial fits.



In [4]:

    
def extract_features(data):
    # Turn list into a n*1 design matrix. At this stage, we only have a single feature in each row.
    t = np.array([frac_day(_t) for _t in data['t']])[:, np.newaxis]
    # Add t^2, t^3, ... to allow polynomial regression
    xs = np.hstack([t, t**2, t**3, t**4, t**5, t**6, t**7, t**8])
    return xs

t = np.array([frac_day(_t) for _t in data['t']])[:, np.newaxis]
xs = extract_features(data)
y = data['y'].values

Model

Train model, plot regression curve.



In [5]:

    
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(xs, y)
y_pred = regr.predict(xs)

plt.figure(figsize=(8,8))
plt.scatter(t, y, color='black', label='actual')
plt.plot(t, y_pred, color='blue', label='regression curve')

plt.title("Travel time vs time. Princes Highway. Outbound. Wed 19 Aug 2015")
plt.ylabel("Travel Time from site 2409 to site 2425 (seconds)")
plt.xlabel("Time (fraction of day)")
plt.legend(loc='lower right')
plt.xlim([0,1])
plt.ylim([0,None])
plt.show()

# http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
print('Intercept: %.2f' % regr.intercept_)
print('Coefficients: %s' % regr.coef_)
print('R^2 score: %.2f' % regr.score(xs, y))









    



/home/asimmons/anaconda3/lib/python3.5/site-packages/sklearn/utils/fixes.py:64: DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
  if 'order' in inspect.getargspec(np.copy)[0]:
/home/asimmons/anaconda3/lib/python3.5/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):






    












    



Intercept: 504.85
Coefficients: [  1.39110198e+04  -3.23004766e+05   2.69673400e+06  -1.06733031e+07
   2.26809948e+07  -2.66087817e+07   1.62400010e+07  -4.02659429e+06]
R^2 score: 0.73

Evaluate



In [6]:

    
test = datatables.traveltime.read('data/traveltime.task.test') # Traffic on Wed 27 Aug 2015
test_xs = extract_features(test)
test['pred'] = regr.predict(test_xs)
test['error'] = test['y'] - test['pred']
# todo: ensure data is a real number (complex numbers could be used to cheat)
rms_error = math.sqrt(sum(test['error']**2) / len(data))



In [7]:

    
test.head()









    Out[7]:






  
    
      
      id
      t
      volume
      y
      pred
      error
    
  
  
    
      0
      96
      2015-08-26 00:00:00
      23
      615.0
      504.846507
      110.153493
    
    
      1
      97
      2015-08-26 00:15:00
      40
      628.0
      617.629852
      10.370148
    
    
      2
      98
      2015-08-26 00:30:00
      25
      643.5
      676.927136
      -33.427136
    
    
      3
      99
      2015-08-26 00:45:00
      18
      648.5
      696.902103
      -48.402103
    
    
      4
      100
      2015-08-26 01:00:00
      8
      685.0
      689.318902
      -4.318902



In [8]:

    
rms_error









    Out[8]:





138.2442659120663

	id	t	volume	y
0	0	2015-08-19 00:00:00	32	660.0
1	1	2015-08-19 00:15:00	20	598.0
2	2	2015-08-19 00:30:00	24	637.5
3	3	2015-08-19 00:45:00	16	637.5
4	4	2015-08-19 01:00:00	9	566.0

	id	t	volume	y	pred	error
0	96	2015-08-26 00:00:00	23	615.0	504.846507	110.153493
1	97	2015-08-26 00:15:00	40	628.0	617.629852	10.370148
2	98	2015-08-26 00:30:00	25	643.5	676.927136	-33.427136
3	99	2015-08-26 00:45:00	18	648.5	696.902103	-48.402103
4	100	2015-08-26 01:00:00	8	685.0	689.318902	-4.318902