Time Series Analysis for Car Counts

Guillaume Decerprit, PhD - 1 Nov. 2017

This document describes the thought process and findings from data exploration and forecasting of car counts.

Overview of my strategy

At the highest level, two strategies are possible with time series forecasting:

1) we ignore the true order of time and assume that the target value depends only on the features (day of week, weather, etc.). We can then infer the car count given the features and use standard supervised-learning techniques

2) we assume that correlation between time steps not only exists but, more importantly, plays a critical role in the value of the target output. We may then use standard methods (ARIMA, Kalman filters, etc.) or more complex ones (Recurrent Neural Networks, RNNs)

Data Science is largely an empirical science. After presenting some exploratory results, I first went with strategy 2) using a Deep LSTM Recurrent Neural Net, but was not very successful, so I tried strategy 1) with a Random Forest, which turned out to work. Not willing to give up too easily, I then refined my Deep Net and eventually got interesting results.

Below is a table that summarizes a (tiny) fraction of the information on the most common methodologies for Time Series Forecasting:

Method        | Pros | Cons
ARIMA         | robust, fast | does not handle multi-dimensional features; hard to forget trends
Kalman Filter | better at capturing trends, handles noise easily | Markov model
Deep LSTM RNN | can adapt to changing trends, can handle multi-D features from multiple time steps, can learn extremely complex representations | needs enough data

In this document, I will go in the same order as my findings and start with the Deep Net, followed by the Random Forest (RF).

It is divided into 4 sections:

A) Data Exploration  
B) Deep Net Approach  
C) Random Forest Approach  
D) Conclusions / Extension  

In [5]:
####################
# test libraries
####################
try:
    import mxnet
except ImportError:
    !pip2 install mxnet

try:
    import seaborn
except ImportError:
    !pip2 install seaborn

try:
    import sklearn
except ImportError:
    !pip2 install sklearn

In [6]:
####################
# necessary imports. !! pip install mxnet if mxnet not installed !!
####################
from __future__ import print_function
import mxnet as mx  #  !! pip install mxnet if mxnet not installed !!
from mxnet import nd, autograd
import numpy as np
from collections import defaultdict
mx.random.seed(1234)
# ctx = mx.gpu(0)
ctx = mx.cpu(0)

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import scipy.fftpack
from scipy import stats
try:
    from pandas.tools import plotting
    from pandas.tools.plotting import autocorrelation_plot
except ImportError:  # pandas >= 0.20 moved these into pandas.plotting
    from pandas import plotting
    from pandas.plotting import autocorrelation_plot
try:
    from sklearn.model_selection import train_test_split
except ImportError:
    from sklearn.cross_validation import train_test_split
from numpy import polyfit
from datetime import datetime
sns.set_style('whitegrid')
#sns.set_context('notebook')
sns.set_context('poster')
# Make inline plots vector graphics instead of raster graphics
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('pdf', 'png')

A) Data exploration

Data Structure

The data consists of a multi-dimensional (4D) daily time series with one target value (car count) and 3 associated features. Two of those features are categorical (day of week and cloud/clear) and one is continuous (weather).

To get some sense of the data, let's display a few basic plots.


In [7]:
######################################
# MAKE SURE data.csv IS IN THE SAME PATH AS THIS JUPYTER NOTEBOOK
######################################
original_data = pd.read_csv("data.csv", index_col=0)
dict_days_to_int = {'Monday': 1, 'Tuesday': 2, 'Wednesday': 3, 'Thursday': 4, 'Friday': 5, 'Saturday': 6, 'Sunday': 7}
original_data.sort_index(inplace=True)
original_data['date_']=original_data.index
original_data['current_month'] = original_data['date_'].apply(lambda x: pd.Timestamp(x).month)
original_data['current_year'] = original_data['date_'].apply(lambda x: pd.Timestamp(x).year-2009)
original_data['day_of_week_int'] = original_data['day.of.week'].apply(lambda x: dict_days_to_int[x])
original_data['cloudy_or_not_cloudy'] = original_data['cloud.indicator'].apply(lambda x: 0 if x=='clear' else 1)
# For the Random Forest, we bin the car # in small bins of 2 cars. This increases training statistics at the cost of a negligible drop in precision
BIN_SIZE = 2
original_data['bucketed_car_count'] = original_data['car.count'].apply(lambda x: np.floor(x/BIN_SIZE))
full_data = original_data.dropna()
# Split data in cloudy data and non-cloudy data
data_no_clouds = original_data[original_data['cloudy_or_not_cloudy']==0].dropna()
data_clouds = original_data[original_data['cloudy_or_not_cloudy']==1].dropna()
data_no_clouds['previous_bucketed_car_count'] = data_no_clouds['bucketed_car_count'].shift(1).dropna()

plotting.scatter_matrix(full_data.sample(300)[['car.count', 'weather']])
plotting.scatter_matrix(full_data.sample(300)[['car.count', 'day_of_week_int']])

plt.show()

data_no_clouds['car.count.previous'] = data_no_clouds['car.count'].shift(1)
data_no_clouds = data_no_clouds.dropna()

# correlation coefficients (Pearson correlation is scale-invariant, so no explicit normalization is needed)
print(data_no_clouds.loc[:, ['car.count', 'car.count.previous', 'weather', 'day_of_week_int']].corr())


A few remarks:

Car count is roughly between 0 and 250, and no missing values were observed. Interestingly, weather does not seem to have much influence on car count (no clear correlation). A slight trend is observed in car count vs. day of week. Let's look at the cloud indicator by plotting a few charts for cloudy and non-cloudy data.


In [8]:
full_data.index = full_data.index.to_datetime()
data_no_clouds.index = data_no_clouds.index.to_datetime()
data_clouds.index = data_clouds.index.to_datetime()

#######################
# resampling to weekly data for better visibilty
# this doesn't change any of the conclusions on the data
#######################
resampled_full_data = full_data.resample('W')
resampled_no_clouds = data_no_clouds.resample('W')
resampled_clouds = data_clouds.resample('W')

resampled_full_data['car.count'].plot(label='all data')
resampled_no_clouds['car.count'].plot(label='no clouds data')
resampled_clouds['car.count'].plot(label='clouds data')

resampled_full_data['ii'] = resampled_full_data.reset_index().index
resampled_no_clouds['ii'] = resampled_no_clouds.reset_index().index
resampled_no_clouds = resampled_no_clouds.dropna()

poly_fit = polyfit(resampled_full_data['ii'], resampled_full_data['car.count'], 3)
a, b, c, d = poly_fit[0], poly_fit[1], poly_fit[2], poly_fit[3]

poly_fit2 = polyfit(resampled_no_clouds['ii'], resampled_no_clouds['car.count'], 3)
a2, b2, c2, d2 = poly_fit2[0], poly_fit2[1], poly_fit2[2], poly_fit2[3]

resampled_no_clouds['car.count_fit'] = resampled_no_clouds['ii'].apply(lambda x: a2*x*x*x + b2*x*x + c2*x + d2)
resampled_no_clouds['untrended_car_count'] = resampled_no_clouds['car.count'] - resampled_no_clouds['car.count_fit']
resampled_no_clouds['car.count_fit'].plot(label='3rd order poly fit (trend)')
resampled_full_data['car.count_fit'] = resampled_full_data['ii'].apply(lambda x: a*x*x*x + b*x*x + c*x + d)
resampled_full_data['untrended_car_count'] = resampled_full_data['car.count'] - resampled_full_data['car.count_fit']

print(full_data.head(3))
plt.legend(loc='upper right')

plt.show()


           day.of.week  car.count  weather cloud.indicator       date_  \
2010-01-01      Friday        101      0.1           clear  2010-01-01   
2010-01-02    Saturday         34      0.2          cloudy  2010-01-02   
2010-01-03      Sunday        113      0.4           clear  2010-01-03   

            current_month  current_year  day_of_week_int  \
2010-01-01              1             1                5   
2010-01-02              1             1                6   
2010-01-03              1             1                7   

            cloudy_or_not_cloudy  bucketed_car_count  
2010-01-01                     0                  50  
2010-01-02                     1                  17  
2010-01-03                     0                  56  

This was a critical step in the exploratory phase: the data README suggests that clouds and weather may affect the visibility of the parking lot. It was unclear to me whether that means the car count would be corrected or not. By breaking the plots down into cloudy/non-cloudy/all, one can interpret the charts with the two following hypotheses:

Hyp. 1: clouds introduce a strong variance and a (clearly visible) bias through which we essentially 'miss' cars, but they do not impact the intrinsic behavior of shoppers. In other words, if we could count cars on the ground directly and accurately, clouds would correlate at most marginally with the car count.

Hyp. 2: clouds do actually influence the number of cars (at least more so than the count's intrinsic dispersion).

Given the much larger dispersion ($\sigma_{clouds} \gg \sigma_{clear}$) and the negligible correlation of car # with 'weather', I picked Hyp. 1. It is important to keep in mind that this was a choice, and the entire analysis (especially the RF, where cloud/clear would be an important feature to add) should be revised should Hyp. 2 turn out to be true.

The box plot and histograms below display the bias and the different shapes of the cloudy vs. non-cloudy car # distributions, further demonstrated by the extremely low p-value of the Kolmogorov-Smirnov test (which strongly indicates that the two samples do not come from the same underlying distribution).



In [5]:
print('\n===> p-value for Kolmogorov test: {0}\n'.format(stats.ks_2samp(data_clouds['car.count'], data_no_clouds['car.count'])[1]))
fig, axes_fig1 = plt.subplots(1,1, figsize=(10,8))
plt.hist(data_clouds['car.count'], 20, alpha=0.5, label='clouds')
plt.hist(data_no_clouds['car.count'], 20, alpha=0.5, label='no clouds')
plt.legend(loc='upper right')
plt.xlabel('car count')
fig, axes_fig2 = plt.subplots(1,1, figsize=(10,8))
full_data.sample(500).loc[:, ['cloudy_or_not_cloudy', 'car.count']].boxplot(by='cloudy_or_not_cloudy', ax=axes_fig2)
plt.show()


===> p-value for Kolmogorov test: 6.11456216541e-230

Note: For the sake of completeness, I did this very exercise of including the cloud feature in the RF and found that predictions were less accurate, basically doubling the error bars for any day, cloudy or not. This made me conclude that Hyp. 1 is more likely and that clouds, which are very random by nature, introduce significant noise into the car count.

Some exploration in the Frequency domain

The trend above suggests at least a multi-year pseudo-periodicity in the time series.
It is always interesting and sometimes informative to quickly analyze the time series in the Frequency (Fourier) domain. Let's plot:

  • the Fourier spectrum of this time series. It will tell us what the most significant (pseudo-)periodic time scales in the series are and, more importantly, whether the available data actually covers at least a few of those periods. This is especially important for the Deep Net, to ensure a good temporal representation is learned.

In [6]:
def plot_fft(data):
    """Plot the Fourier Transform of a time series
    """
    # Number of samplepoints
    N = len(data)
    # sample spacing
    T = 1.0 
    x = np.linspace(0.0, N*T, N)
    y = data
    yf = scipy.fftpack.fft(y)
    xf = np.linspace(0.0, (T*N/2), N/2)
    fig, ax = plt.subplots(1,1, figsize=(8,6))
    yf = (2.0/N * np.abs(yf[:N//2]))
    axes = ax.plot(xf[1:50], yf[1:50])  # point 0 is for T=+inf, i.e. a constant (actually prop. to the mean value)
    ax.set_xlabel("Period [day]")
    plt.show()

plot_fft(data_no_clouds['car.count'].dropna())
# --> autocorr reveals most info is in previous day, ==> markov style
f1 = autocorrelation_plot(resampled_no_clouds['car.count'])
f2 = autocorrelation_plot(resampled_no_clouds['untrended_car_count'])
_ = f2.set_xlabel("Lag [day]")


We find the multi-year pseudo-periodicity again (about 280 weeks, or ~5 years), which is consistent with the previous remarks. The green curve on the bottom plot is the 'un-trended' one, where the 3rd-order polynomial fit was subtracted from the car count. One can see that all auto-correlations are gone once the trend is removed.
This means that we need to incorporate two dominant time scales in our forecaster:

  • one long-term pseudo-periodicity. We will incorporate this one as an integer representing the current year of each data point and use that as a feature in our Deep Net and RF
  • one short-term (few days) pseudo-periodicity (can be seen when zooming on the autocorr plot, see also correlation coeffs above) that we can represent by using the last known car count value as a feature itself (last few for the Deep Net)

B) Deep LSTM RNN Net Approach

The next few Cells are code snippets needed to prepare the data and implement the RNN. I re-used some code I had worked on (and the MXNet tutorials helped!).

Here is a brief overview of LSTM Cells (we assume familiarity with RNNs).

Long short-term memory (LSTM) RNNs

An LSTM block has mechanisms to enable "memorizing" information for an extended number of time steps. We use the LSTM block with the following transformations that map inputs to outputs across blocks at consecutive layers and consecutive time steps: $\newcommand{\xb}{\mathbf{x}} \newcommand{\RR}{\mathbb{R}}$

$$f_t = \sigma(X_t W_{xf} + h_{t-1} W_{hf} + b_f) \text{ [forget gate]}$$
$$i_t = \sigma(X_t W_{xi} + h_{t-1} W_{hi} + b_i) \text{ [input gate]}$$
$$g_t = \text{tanh}(X_t W_{xg} + h_{t-1} W_{hg} + b_g) \text{ [candidate gate]}$$
$$o_t = \sigma(X_t W_{xo} + h_{t-1} W_{ho} + b_o) \text{ [output gate]}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \text{ [cell state]}$$
$$h_t = o_t \odot \text{tanh}(c_t) \text{ [new hidden state]}$$

where $\odot$ is an element-wise multiplication operator, and for all $\xb = [x_1, x_2, \ldots, x_k]^\top \in \RR^k$ the two activation functions:

$$\sigma(\xb) = \left[\frac{1}{1+\exp(-x_1)}, \ldots, \frac{1}{1+\exp(-x_k)}\right]^\top,$$
$$\text{tanh}(\xb) = \left[\frac{1-\exp(-2x_1)}{1+\exp(-2x_1)}, \ldots, \frac{1-\exp(-2x_k)}{1+\exp(-2x_k)}\right]^\top.$$

In the transformations above, the memory cell $c_t$ stores the "long-term" memory in vector form. In other words, the information cumulatively captured and encoded up to time step $t$ is stored in $c_t$ and is only passed along the same layer over different time steps.

Given the previous cell state $c_{t-1}$ and hidden state $h_{t-1}$ (together with the current input $X_t$), the input gate $i_t$ and forget gate $f_t$ help the memory cell decide how to overwrite or keep the memory information. The output gate $o_t$ further lets the LSTM block decide how to retrieve the memory information to generate the current state $h_t$, which is passed to both the next layer of the current time step and the next time step of the current layer. Such decisions are made using the hidden-layer parameters $W$ and $b$ with different subscripts; these parameters are inferred during the training phase.

Data Preparation


In [7]:
#############################
# Utility functions for data preparation
#############################
def get_list_unique_block_indices(len_data=100, seq_length=5, n_samples=10):
    """ returns a list of unique random int that serve as index of the first element of a block of data
        args:
            len_data (int): length of the data set
            seq_length (int): length of the blocks to extract
            n_samples (int): number of blocks to extract
    """
    set1 = set(np.random.randint(len_data // seq_length, size=n_samples)*seq_length)
    full_set = set1
    while len(full_set) < n_samples:
        set2 = set(np.random.randint(len_data // seq_length, size=n_samples)*seq_length)
        full_set = full_set | set2
    returned_list = list(full_set)[0:n_samples]
    assert(len(returned_list) == n_samples)
    return returned_list

def extract_random_sequence(data, seq_length=5, block_start_index=None):
    columns_subset = ['car.count', 'day_of_week_int', 'weather', 'current_month', 'current_year']
    if block_start_index is None:
        block_start_index = np.random.randint(len(data)-seq_length)
    data_subset = data.reset_index().loc[block_start_index:block_start_index+seq_length-1, columns_subset]
    assert(len(data_subset) == (seq_length))
    out_data = [list(i) for i in data_subset.values]
    return out_data

def create_batch_ND_time_series(full_data, seq_length=10, num_samples=4):
    out_data = []
    # get a list of non-overlapping random sequence start indices
    all_samples_start_indices = get_list_unique_block_indices(len(full_data), seq_length, num_samples)
    assert(len(all_samples_start_indices) == num_samples)
    for one_random_start_index in all_samples_start_indices:
        out_data.append(extract_random_sequence(full_data, seq_length, one_random_start_index))
        assert(len(out_data[-1]) == (seq_length))
    return out_data

In [8]:
#############################
# Data Preparation
#############################
SEQ_LENGTH = 3  # we use the last 3 days as inputs
NUM_FEATURES = 4  # we use 4 features (weather, day of week, month, year)
BATCH_SIZE = 32
BATCH_SIZE_TEST = 1

# let's divide the data into train (~90%), dev (~1%), test (~9%) -- see percent_train / percent_dev below
# in sequences of SEQ_LENGTH+1 days
data_no_clouds_length = len(data_no_clouds)
# the actual length of extracted sequence is SEQ_LENGTH + 1 so that we can do the shift of +1 for labels
total_num_of_sequences = data_no_clouds_length // (SEQ_LENGTH+1) - 1

# extract sequences of length SEQ_LENGTH+1; the extra point provides the shifted labels
all_random_sequences = create_batch_ND_time_series(data_no_clouds, seq_length=SEQ_LENGTH+1, num_samples=total_num_of_sequences)

percent_train, percent_dev = 0.9, 0.91  # we used dev set to try a few sets of parameters
n_seq_train = int(total_num_of_sequences*percent_train)
n_seq_dev = int(total_num_of_sequences*percent_dev) - int(total_num_of_sequences*percent_train)
n_seq_test = len(all_random_sequences) - int(total_num_of_sequences*percent_dev)

data_train = np.array(all_random_sequences[0:n_seq_train])
data_dev = np.array(all_random_sequences[n_seq_train:n_seq_train+n_seq_dev])
data_test = np.array(all_random_sequences[n_seq_train+n_seq_dev:])

print('all data_length ', data_no_clouds_length)
print('SHAPES of ALL, TRAIN, DEV, TEST:') 
print(np.array(all_random_sequences).shape)
print(np.array(data_train).shape)
print(np.array(data_dev).shape)
print(np.array(data_test).shape)

assert(data_train.shape == (n_seq_train, SEQ_LENGTH+1, NUM_FEATURES+1))  # +1 for the target value (car.count)
assert(data_dev.shape == (n_seq_dev, SEQ_LENGTH+1, NUM_FEATURES+1))
assert(data_test.shape == (n_seq_test, SEQ_LENGTH+1, NUM_FEATURES+1))


all data_length  1352
SHAPES of ALL, TRAIN, DEV, TEST:
(337, 4, 5)
(303, 4, 5)
(3, 4, 5)
(31, 4, 5)

In [9]:
num_batches_train = data_train.shape[0] // BATCH_SIZE
num_batches_test = data_test.shape[0] // BATCH_SIZE_TEST
dim_inputs = NUM_FEATURES + 1  # dimension of the input vector for a given time stamp
# for labels, we just keep the time-series prediction which has index of 0, we dont keep the features.
# See Remarks Below
indices_outputs = [0]
dim_outputs = len(indices_outputs) # same comment

# inputs are from t0 to t_seq_length - 1. because the last point is kept for the 
# output ("label") of the penultimate point 
data_train_inputs = data_train[:, :-1, :]
data_train_labels = data_train[:, 1:, indices_outputs]
data_test_inputs = data_test[:, :-1, :]
data_test_labels = data_test[:, 1:, indices_outputs]

train_data_inputs = nd.array(data_train_inputs).reshape((num_batches_train, BATCH_SIZE, SEQ_LENGTH, dim_inputs))
train_data_labels = nd.array(data_train_labels).reshape((num_batches_train, BATCH_SIZE, SEQ_LENGTH, dim_outputs))
test_data_inputs = nd.array(data_test_inputs).reshape((num_batches_test, BATCH_SIZE_TEST, SEQ_LENGTH, dim_inputs))
test_data_labels = nd.array(data_test_labels).reshape((num_batches_test, BATCH_SIZE_TEST, SEQ_LENGTH, dim_outputs))

train_data_inputs = nd.swapaxes(train_data_inputs, 1, 2)
train_data_labels = nd.swapaxes(train_data_labels, 1, 2)
test_data_inputs = nd.swapaxes(test_data_inputs, 1, 2)
test_data_labels = nd.swapaxes(test_data_labels, 1, 2)

print('num_mini-batches_train={0} | seq_length={2} | mini-batch_size={1} | dim_input={3} | dim_output={4}'.format(num_batches_train, BATCH_SIZE, SEQ_LENGTH, dim_inputs, dim_outputs))
print('train_data_inputs shape: ', train_data_inputs.shape)
print('train_data_labels shape: ', train_data_labels.shape)


num_mini-batches_train=9 | seq_length=3 | mini-batch_size=32 | dim_input=5 | dim_output=1
train_data_inputs shape:  (9L, 3L, 32L, 5L)
train_data_labels shape:  (9L, 3L, 32L, 1L)

In [10]:
num_hidden_units = [64, 16]  # num of hidden units in each hidden LSTM layer
num_hidden_layers = len(num_hidden_units)  # num of hidden LSTM layers
num_units_layers = [dim_inputs] + num_hidden_units

norm_factor = 0.1  # normalization factor for weight initialization. set to 0.01 for prediction of small values (<1)

########################
#  Weights connecting the inputs to the hidden layer
########################
Wxg, Wxi, Wxf, Wxo, Whg, Whi, Whf, Who, bg, bi, bf, bo = {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {} 
for i_layer in range(1, num_hidden_layers+1):
    num_inputs = num_units_layers[i_layer-1]
    num_hidden_units = num_units_layers[i_layer]
    Wxg[i_layer] = nd.random_normal(shape=(num_inputs,num_hidden_units), ctx=ctx) * norm_factor
    Wxi[i_layer] = nd.random_normal(shape=(num_inputs,num_hidden_units), ctx=ctx) * norm_factor
    Wxf[i_layer] = nd.random_normal(shape=(num_inputs,num_hidden_units), ctx=ctx) * norm_factor
    Wxo[i_layer] = nd.random_normal(shape=(num_inputs,num_hidden_units), ctx=ctx) * norm_factor

    ########################
    #  Recurrent weights connecting the hidden layer across time steps
    ########################
    Whg[i_layer] = nd.random_normal(shape=(num_hidden_units, num_hidden_units), ctx=ctx) * norm_factor
    Whi[i_layer] = nd.random_normal(shape=(num_hidden_units, num_hidden_units), ctx=ctx) * norm_factor
    Whf[i_layer] = nd.random_normal(shape=(num_hidden_units, num_hidden_units), ctx=ctx) * norm_factor
    Who[i_layer] = nd.random_normal(shape=(num_hidden_units, num_hidden_units), ctx=ctx) * norm_factor

    ########################
    #  Bias vector for hidden layer
    ########################
    bg[i_layer] = nd.random_normal(shape=num_hidden_units, ctx=ctx) * norm_factor
    bi[i_layer] = nd.random_normal(shape=num_hidden_units, ctx=ctx) * norm_factor
    bf[i_layer] = nd.random_normal(shape=num_hidden_units, ctx=ctx) * norm_factor
    bo[i_layer] = nd.random_normal(shape=num_hidden_units, ctx=ctx) * norm_factor

########################
# Weights to the output nodes
########################
Why = nd.random_normal(shape=(num_units_layers[-1], dim_outputs), ctx=ctx) * norm_factor
by = nd.random_normal(shape=dim_outputs, ctx=ctx) * 0.01

Attach the gradients


In [11]:
params = []
for i_layer in range(1, num_hidden_layers+1):
    params += [Wxg[i_layer], Wxi[i_layer], Wxf[i_layer], Wxo[i_layer], Whg[i_layer], Whi[i_layer], Whf[i_layer], Who[i_layer], bg[i_layer], bi[i_layer], bf[i_layer], bo[i_layer]]

params += [Why, by]  # add the output layer

for param in params:
    param.attach_grad()

Loss utility functions


In [12]:
def rmse(yhat, y):
    return nd.mean(nd.sqrt(nd.sum(nd.power(y - yhat, 2), axis=0, exclude=True)))

def average_rmse_loss(outputs, labels, last_element_only=False):
#     assert(len(outputs) == len(labels))
    total_loss = 0.
    zipped_elements = list(zip(outputs, labels))  # list() so the last element can be indexed (also keeps Python 3 compatibility)
    if last_element_only:
        output, label = zipped_elements[-1]
        total_loss = rmse(output, label)
        return total_loss
    for (output, label) in zipped_elements:
        total_loss = total_loss + rmse(output, label)
    return total_loss / len(outputs)

Optimizer

Note: we will use the Adam optimizer, which converges much faster and is much more robust than plain stochastic gradient descent.


In [13]:

def SGD(params, learning_rate):
    for param in params:
        param[:] = param - learning_rate * param.grad

def adam(params, learning_rate, M , R, index_adam_call, beta1, beta2, eps):
    k = -1
    for param in params:
        k += 1
        M[k] = beta1 * M[k] + (1. - beta1) * param.grad
        R[k] = beta2 * R[k] + (1. - beta2) * (param.grad)**2
        # bias correction: since we initialized M & R to zeros, they're biased toward zero on the first few iterations
        m_k_hat = M[k] / (1. - beta1**(index_adam_call))
        r_k_hat = R[k] / (1. - beta2**(index_adam_call))
        if((np.isnan(M[k].asnumpy())).any() or (np.isnan(R[k].asnumpy())).any()):
            raise(ValueError('Nans!!'))
        param[:] = param - learning_rate * m_k_hat / (nd.sqrt(r_k_hat) + eps)
    
    return params, M, R

Deep LSTM Net core functions


In [14]:
def single_lstm_unit_calcs(X, c, Wxg, h, Whg, bg, Wxi, Whi, bi, Wxf, Whf, bf, Wxo, Who, bo):
    g = nd.tanh(nd.dot(X, Wxg) + nd.dot(h, Whg) + bg)
    i = nd.sigmoid(nd.dot(X, Wxi) + nd.dot(h, Whi) + bi)
    f = nd.sigmoid(nd.dot(X, Wxf) + nd.dot(h, Whf) + bf)
    o = nd.sigmoid(nd.dot(X, Wxo) + nd.dot(h, Who) + bo)
    #######################
    c = f * c + i * g
    h = o * nd.tanh(c)
    return c, h

def deep_lstm_rnn(inputs, h, c, temperature=1.0):
    """
        h: dict of nd.arrays, each key is the index of a hidden layer (from 1 to whatever). 
        Index 0, if any, is the input layer
    """
    outputs = []
    # inputs is one BATCH of sequences; after the swapaxes above, its shape is (seq_length, batch_size, input_dim)
    # (input_dim is 1 for a univariate series, vocab_size for characters, n for n parallel time series)
    for X in inputs:
        # X is batch of one time stamp. E.g. if each batch has 37 sequences, then the first value of X will be a set of the 37 first values of each of the 37 sequences 
        # that means each iteration on X corresponds to one time stamp, but it is done in batches of different sequences
        h[0] = X # the first hidden layer takes the input X as input 
        for i_layer in range(1, num_hidden_layers+1):
            # lstm units now have the 2 following inputs: 
            # i) h_t from the previous layer (equivalent to the input X for a non-deep lstm net), 
            # ii) h_t-1 from the current layer (same as for non-deep lstm nets)
            c[i_layer], h[i_layer] = single_lstm_unit_calcs(h[i_layer-1], c[i_layer], Wxg[i_layer], h[i_layer], Whg[i_layer], bg[i_layer], Wxi[i_layer], Whi[i_layer], bi[i_layer], Wxf[i_layer], Whf[i_layer], bf[i_layer], Wxo[i_layer], Who[i_layer], bo[i_layer])
        yhat_linear = nd.dot(h[num_hidden_layers], Why) + by
        # yhat is a batch of several values of the same time stamp
        # this is basically the prediction of the sequence, which overlaps most of the input sequence, plus one point (character or value)
#         yhat = nd.sigmoid(yhat_linear)
        yhat = yhat_linear # we cant use a 1.0-bounded activation function since outputs (car count) can be greater than 1.0
        outputs.append(yhat) # outputs has same shape as inputs, i.e. a list of batches of data points.

    return (outputs, h, c)

Test and visualize predictions


In [15]:
INDEX_TARGET_VALUE = 0
def test_prediction(one_input_seq, one_label_seq, temperature=1.0):
      # WE ASSUME the first value in input vector is the variable of interest
    #####################################
    # Set the initial state of the hidden representation ($h_0$) to the zero vector
    #####################################  # some better initialization needed??
    h, c = {}, {}
    for i_layer in range(1, num_hidden_layers+1):
        h[i_layer] = nd.zeros(shape=(BATCH_SIZE_TEST, num_units_layers[i_layer]), ctx=ctx)
        c[i_layer] = nd.zeros(shape=(BATCH_SIZE_TEST, num_units_layers[i_layer]), ctx=ctx)
    
    outputs, h, c = deep_lstm_rnn(one_input_seq, h, c, temperature=temperature)

    return outputs[-1][0].asnumpy()[INDEX_TARGET_VALUE], one_label_seq.asnumpy()[-1].flatten()[INDEX_TARGET_VALUE], outputs, one_label_seq

def check_prediction(index):
    if index >= len(test_data_inputs):
        index = np.random.randint(len(test_data_inputs))
        
    o, label, outputs, labels = test_prediction(test_data_inputs[index], test_data_labels[index], temperature=1.0)    

    prediction = round(o, 3)
    true_label = round(label, 3)
    outputs = [float(i.asnumpy().flatten()[INDEX_TARGET_VALUE]) for i in outputs]  # if batch_size_test=1 then this float() will work, otherwise, nope.
    true_labels = list(test_data_labels[index].asnumpy()[:,:,INDEX_TARGET_VALUE].flatten())
    
    df = pd.DataFrame([outputs, true_labels]).transpose()
    df.columns = ['predicted', 'true']

    return df

Time to Train the Deep Net! (about 6 min for 2000 epochs on a 2015 15'' MBP)

Another tweak that I applied to this Net is to compute the loss only on the last predicted value, not on the entire sequence. This is achieved by setting last_element_only to True in the call to average_rmse_loss().
Usually, the loss is averaged over the full sequence, but by using only the last value, I force the Net to learn a representation that is entirely trained to predict one target value from a full sequence, as opposed to being trained to predict one full sequence from another full sequence. My rationale was to dedicate the training entirely to our value of interest; the intermediate values of the output sequence can still be used by the Net as some sort of internal representation (a bit like additional hidden units).

During training, a plot of the predicted vs. true sequence is shown, along with another chart that shows the (training) loss and (test) error vs. the training epoch.


In [16]:
epochs = 2000  # one epoch is one pass over the entire training set
moving_loss = 0.
moving_rel_error = 0.  # moving average of the relative test error (also used as a fallback when the true label is 0)
learning_rate = 0.002  # no more than 0.002, even with Adam, otherwise, convergence will likely saturate

# Adam Optimizer stuff
beta1 = .9
beta2 = .999
index_adam_call = 0
# M & R arrays to keep track of momenta in adam optimizer. params is a list that contains all ndarrays of parameters
M = {k: nd.zeros_like(v) for k, v in enumerate(params)}
R = {k: nd.zeros_like(v) for k, v in enumerate(params)}

df_moving_loss = pd.DataFrame(columns=['Loss', 'Error'])
df_moving_loss.index.name = 'Epoch'

# needed to update plots on the fly
%matplotlib notebook
fig, axes_fig1 = plt.subplots(1,1, figsize=(6,3))
fig2, axes_fig2 = plt.subplots(1,1, figsize=(6,3))


for e in range(epochs):
    ############################
    # Attenuate the learning rate by a factor of 2 every 1000 epochs
    ############################
    if ((e+1) % 1000 == 0):
        learning_rate = learning_rate / 2.0  # TODO check if its ok to adjust learning_rate when using Adam Optimizer
    h, c = {}, {}
    for i_layer in range(1, num_hidden_layers+1):
        h[i_layer] = nd.zeros(shape=(BATCH_SIZE, num_units_layers[i_layer]), ctx=ctx)
        c[i_layer] = nd.zeros(shape=(BATCH_SIZE, num_units_layers[i_layer]), ctx=ctx)

    for i in range(num_batches_train):
        data_one_hot = train_data_inputs[i]
        label_one_hot = train_data_labels[i]
        with autograd.record():
            outputs, h, c = deep_lstm_rnn(data_one_hot, h, c)
            loss = average_rmse_loss(outputs, label_one_hot, last_element_only=True)
            loss.backward()
#         SGD(params, learning_rate)
        index_adam_call += 1  # needed for bias correction in Adam optimizer
        params, M, R = adam(params, learning_rate, M, R, index_adam_call, beta1, beta2, 1e-8)
        
        ##########################
        #  Keep a moving average of the losses
        ##########################
        if (i == 0) and (e == 0):
            moving_loss = nd.mean(loss).asscalar()
        else:
            moving_loss = .99 * moving_loss + .01 * nd.mean(loss).asscalar()
        df_moving_loss.loc[e] = round(moving_loss, 4)

    ############################
    #  Predictions and plots
    ############################
    data_prediction_df = check_prediction(index=e)
    
    if not (e%50):
        axes_fig1.clear()
        data_prediction_df.plot(ax=axes_fig1)
        fig.canvas.draw()
    prediction = round(data_prediction_df.tail(1)['predicted'].values.flatten()[-1], 3)
    true_label = round(data_prediction_df.tail(1)['true'].values.flatten()[-1], 3)
    if true_label != 0:
        rel_error = round(100. * np.abs(prediction / (true_label+1e-5) - 1.0), 2)
    else:
        rel_error = moving_rel_error
    if not (e%50):
        print("Epoch = {0} | Loss = {1} | Prediction = {2} True = {3} Error = {4}".format(e, moving_loss, prediction, true_label, rel_error ))
    if not (e%50):
        axes_fig2.clear()
    if e == 0:
        moving_rel_error = rel_error
    else:
        moving_rel_error = .99 * moving_rel_error + .01 * rel_error

    df_moving_loss.loc[e, ['Error']] = moving_rel_error
    if not (e%50):
        axes_loss_plot = df_moving_loss.plot(ax=axes_fig2, secondary_y='Loss', color=['r','b'])
        axes_loss_plot.right_ax.grid(False)
#     axes_loss_plot.right_ax.set_yscale('log')
        fig2.canvas.draw()
    
%matplotlib inline


Epoch = 0 | Loss = 120.121385886 | Prediction = 0.164 True = 115.0 Error = 99.86
/usr/local/lib/python2.7/site-packages/matplotlib/axes/_base.py:2787: UserWarning: Attempting to set identical left==right results
in singular transformations; automatically expanding.
left=0.0, right=0.0
  'left=%s, right=%s') % (left, right))
Epoch = 50 | Loss = 129.912569467 | Prediction = 18.073 True = 153.0 Error = 88.19
Epoch = 100 | Loss = 114.524337391 | Prediction = 33.574 True = 140.0 Error = 76.02
Epoch = 150 | Loss = 99.0900367482 | Prediction = 48.923 True = 135.0 Error = 63.76
Epoch = 200 | Loss = 83.7230199096 | Prediction = 64.216 True = 159.0 Error = 59.61
Epoch = 250 | Loss = 68.3881308609 | Prediction = 79.482 True = 152.0 Error = 47.71
Epoch = 300 | Loss = 53.2300928567 | Prediction = 94.605 True = 102.0 Error = 7.25
Epoch = 350 | Loss = 40.4476926927 | Prediction = 108.103 True = 134.0 Error = 19.33
Epoch = 400 | Loss = 32.3094987179 | Prediction = 118.99 True = 159.0 Error = 25.16
Epoch = 450 | Loss = 28.0650737245 | Prediction = 127.087 True = 125.0 Error = 1.67
Epoch = 500 | Loss = 25.7128142502 | Prediction = 133.328 True = 125.0 Error = 6.66
Epoch = 550 | Loss = 21.0741573674 | Prediction = 127.333 True = 133.0 Error = 4.26
Epoch = 600 | Loss = 15.6376537638 | Prediction = 124.227 True = 150.0 Error = 17.18
Epoch = 650 | Loss = 13.1248963494 | Prediction = 119.585 True = 130.0 Error = 8.01
Epoch = 700 | Loss = 12.2414558273 | Prediction = 131.457 True = 125.0 Error = 5.17
Epoch = 750 | Loss = 11.632709775 | Prediction = 123.486 True = 120.0 Error = 2.9
Epoch = 800 | Loss = 11.4883882638 | Prediction = 118.233 True = 133.0 Error = 11.1
Epoch = 850 | Loss = 11.1306284069 | Prediction = 148.703 True = 138.0 Error = 7.76
Epoch = 900 | Loss = 10.7408217361 | Prediction = 119.285 True = 133.0 Error = 10.31
Epoch = 950 | Loss = 10.4067367105 | Prediction = 136.648 True = 133.0 Error = 2.74
Epoch = 1000 | Loss = 10.1806864155 | Prediction = 131.376 True = 149.0 Error = 11.83
Epoch = 1050 | Loss = 9.6002499653 | Prediction = 129.865 True = 133.0 Error = 2.36
Epoch = 1100 | Loss = 9.40852998331 | Prediction = 123.35 True = 134.0 Error = 7.95
Epoch = 1150 | Loss = 9.13181589646 | Prediction = 124.228 True = 129.0 Error = 3.7
Epoch = 1200 | Loss = 8.81483250796 | Prediction = 142.823 True = 125.0 Error = 14.26
Epoch = 1250 | Loss = 8.69964197434 | Prediction = 152.254 True = 147.0 Error = 3.57
Epoch = 1300 | Loss = 8.5647633724 | Prediction = 153.171 True = 136.0 Error = 12.63
Epoch = 1350 | Loss = 8.34626953241 | Prediction = 135.648 True = 120.0 Error = 13.04
Epoch = 1400 | Loss = 8.40675757918 | Prediction = 124.694 True = 152.0 Error = 17.96
Epoch = 1450 | Loss = 8.20718842113 | Prediction = 153.125 True = 147.0 Error = 4.17
Epoch = 1500 | Loss = 8.10436369619 | Prediction = 144.072 True = 133.0 Error = 8.32
Epoch = 1550 | Loss = 8.03653038988 | Prediction = 135.171 True = 136.0 Error = 0.61
Epoch = 1600 | Loss = 7.75606049116 | Prediction = 139.036 True = 120.0 Error = 15.86
Epoch = 1650 | Loss = 7.53326302315 | Prediction = 141.999 True = 153.0 Error = 7.19
Epoch = 1700 | Loss = 7.50471777738 | Prediction = 120.604 True = 133.0 Error = 9.32
Epoch = 1750 | Loss = 7.48445976351 | Prediction = 146.15 True = 150.0 Error = 2.57
Epoch = 1800 | Loss = 7.20212238673 | Prediction = 146.36 True = 149.0 Error = 1.77
Epoch = 1850 | Loss = 7.08246718491 | Prediction = 133.661 True = 152.0 Error = 12.07
Epoch = 1900 | Loss = 7.14242242583 | Prediction = 136.849 True = 150.0 Error = 8.77
Epoch = 1950 | Loss = 6.89232063099 | Prediction = 139.622 True = 149.0 Error = 6.29

Error Analysis

Hurrah! The Deep Net worked and seems to do OK on forecasting. Let's get some error statistics using the test set.


In [17]:
def compute_test_errors(bias_correction=1.):
    list_abs_rel_error, list_rel_error, list_abs_error = [], [], []
    for e in range(0, data_test_labels.shape[0]):
        data_prediction_df = check_prediction(index=e)
        prediction = round(data_prediction_df.tail(1)['predicted'].values.flatten()[-1] * bias_correction, 3) 
        true_label = round(data_prediction_df.tail(1)['true'].values.flatten()[-1], 3)
        if true_label != 0:
            list_abs_rel_error.append( round(100. * np.abs(prediction / (true_label) - 1.0), 2) )
            list_rel_error.append( round(100. * (prediction / (true_label) - 1.0), 2) )
            list_abs_error.append( round(prediction - true_label, 0) )
        else:
            continue
    return list_abs_rel_error, list_rel_error, list_abs_error

list_abs_rel_error, list_rel_error, _ = compute_test_errors(bias_correction=1.0)

bias_correction = 100. / (100. + np.mean(list_rel_error))  # estimate of the bias

# recompute errors using "de-biased" model
list_abs_rel_error, list_rel_error, list_abs_error = compute_test_errors(bias_correction=bias_correction)

%matplotlib inline
plt.hist(list_abs_rel_error, 20, alpha=0.4, label='abs percent error')
plt.hist(list_rel_error, 20, alpha=0.4, label='percent error')
plt.hist(list_abs_error, 20, alpha=0.4, label='abs error [in # of cars]')
plt.legend(loc='upper right')
plt.xlabel('unbiased errors')
plt.show()
# note: bias_correction is a multiplicative factor close to 1.0 (not a percentage)
print('Estimate of the Bias: {0}% | Median Abs Rel Error: {1}%'.format(round(bias_correction,1), round(np.median(list_abs_rel_error), 1)))


Estimate of the Bias: 1.0% | Median Abs Rel Error: 9.6%

Conclusions on the Deep Net approach

  • It is remarkable to see the convergence happening quickly and firmly given the relatively small data set at hand.
  • Forecasting performance is acceptable: about 15% uncertainty after 2000 epochs on the data. By estimating the bias and correcting our model accordingly, the median absolute relative error falls to about 9.5%.
  • Forecast accuracy is thus about +-9.5%, or roughly +-15 cars on average (95% CL, see the red histogram). This means that if we predict, say, 150 cars on a given day, there is a 95% chance that the range [135-165] covers the true value. Lower confidence levels will obviously narrow this range (a quick numerical illustration follows below).
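
This is a minimal sketch of that arithmetic (the 9.5% figure is the de-biased median absolute relative error computed above; the 150-car prediction is just a hypothetical example):

pred = 150.            # hypothetical point prediction
rel_uncert = 0.095     # approximate relative uncertainty estimated on the test set
print('predicted {0:.0f} cars -> approx. range [{1:.0f}, {2:.0f}]'.format(pred, pred * (1. - rel_uncert), pred * (1. + rel_uncert)))
# -> predicted 150 cars -> approx. range [136, 164], i.e. roughly +-15 cars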

C) Random Forest Approach

A Random Forest is an estimator that fits a large number of decision trees on various subsamples of the dataset and uses averaging to improve prediction accuracy and reduce over-fitting.
As explained in A), we use the previous-day car count as one of the features, as well as weekday, month, year and weather features.


In [18]:
columns_subset = ['previous_bucketed_car_count', 'day_of_week_int', 'weather', 'current_month', 'current_year']

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

X_train, X_test, Y_train, Y_test = train_test_split(data_no_clouds[columns_subset], data_no_clouds['bucketed_car_count'],test_size=0.1)

clf = RandomForestClassifier(n_estimators=3500, max_depth=8)
clf.fit(X_train, Y_train)


Out[18]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=8, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=3500, n_jobs=1, oob_score=False,
            random_state=None, verbose=0, warm_start=False)

In [19]:
#####################
# Execute predictions on test set
#####################
predictions = clf.predict(X_test)

#####################
# Compute error metrics on test set
#####################
def get_list_errors(predictions, Y_test, bias_corr=1.00):
    list_deviations = []
    for p, t in zip(predictions, Y_test):
        list_deviations.append(p*bias_corr - t)
    return list_deviations

list_deviations = get_list_errors(predictions, Y_test) 
med = round(np.mean(list_deviations)*BIN_SIZE, 1)
sigma = round(np.std(list_deviations)*BIN_SIZE, 1)
print('Prediction bias {0} | sigma: +-{1}'.format(med, sigma))
plt.hist(np.array(list_deviations)*BIN_SIZE)  # convert to an array so the multiplication scales the values (a Python list would be repeated instead)
plt.xlabel('Error [# of cars]')

#####################
# Rank the features
#####################
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), indices)
plt.xlim([-1, X_train.shape[1]])
plt.show()


Prediction bias -1.5 | sigma: +-20.1
Feature ranking:
1. feature 0 (0.273949)
2. feature 2 (0.248428)
3. feature 3 (0.188335)
4. feature 1 (0.159505)
5. feature 4 (0.129783)

Conclusions on the Random Forest

Random Forests are easy to implement, easy to interpret, fast to train and generally accurate, as they are less prone to overfitting than many other approaches.

  • 3-4 models were tested against the dev set and a rather simple one was kept for testing, with 3500 trees and a max depth fixed at 8. Max depth is unbounded by default, but given the limited data set I figured an unbounded depth could introduce some overfitting.
  • Bias is low, at about -5% (another benefit of averaging over many decision trees with a deep-enough RF).
  • Forecast uncertainty is reasonable, about +-18 cars at the 95th percentile (which we can call a 95% C.L. because the deviations look approximately Gaussian). It is remarkable that those values are very close to what we get with the Deep Net. A sane (optimistic?) explanation is that we have reached (or got very close to) the intrinsic dispersion of the time series itself.
  • One benefit of Random Forests is their ability to rank features based on how much 'discerning' power each has. It turns out that the last known non-cloudy value and the weather are the most predictive features, followed by the current month and the weekday. The former was somewhat expected given the rather significant correlation between consecutive days; the latter is a bit more surprising, as we could naively think that people shop more on weekends. However, we don't measure how many people shop, we measure cars, and it could well be that people walk or take public transit on fair-weather days, as opposed to rainy days when everybody would drive and park. It would be interesting to renormalize by the weather and look at the trend of car # vs. weekday (a minimal sketch of this idea follows below).
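
The sketch below assumes we simply bin the continuous 'weather' variable (the 4 bins are an arbitrary choice):

# within a fixed weather bin, any remaining weekday trend is decoupled from the weather effect
data_no_clouds['weather_bin'] = pd.cut(data_no_clouds['weather'], bins=4)
weekday_vs_weather = data_no_clouds.groupby(['weather_bin', 'day_of_week_int'])['car.count'].mean().unstack()
print(weekday_vs_weather)  # rows: weather bins, columns: weekday (1=Monday ... 7=Sunday)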

Note: Train error can be computed using get_list_errors(clf.predict(X_train), Y_train) and is about 7%.

D) General Conclusions

  • Both approaches, though radically different, have decent predictive power and exhibit similar uncertainties (~15%). This is a very healthy sign: it is unlikely to be a coincidence and suggests that we are hopefully close to the inherent dispersion of the true distribution itself.

  • On a cloudy day, just use the RF or Deep Net normally, since we assumed clouds add noise (bias + variance) but no signal, i.e. they do not impact the actual car # significantly.

Next natural steps would be:

  • gather more data, including more features, more independent (ground?) measurements
  • try more models, including more classic ones like ARIMA to have a baseline performance (a minimal sketch of such a baseline is given right after this list)
  • improve the models presented here: batch-normalize the car count for the Deep Net, try a few more RF models; and try an entirely new model
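
The ARIMA baseline sketch (not run as part of this analysis) assumes statsmodels is installed; the order (2, 1, 1) is an arbitrary placeholder rather than a tuned choice:

from statsmodels.tsa.arima_model import ARIMA  # classic ARIMA implementation in statsmodels

series = data_no_clouds['car.count'].astype(float).values
train_series, test_series = series[:-30], series[-30:]   # hold out the last 30 clear days

arima_fit = ARIMA(train_series, order=(2, 1, 1)).fit(disp=0)
forecast = arima_fit.forecast(steps=len(test_series))[0]  # forecast() returns (values, std errors, conf. intervals)
print('ARIMA median abs rel error: {0:.1f}%'.format(100. * np.median(np.abs(forecast / test_series - 1.))))

Comparing its error with the Deep Net / RF figures above would give a quick sanity check of how much the extra model complexity buys us.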

The tradeoff between how much ‘human-designed’ features we can inject in our model versus ‘let the deep net figure out everything’ is a tough one and it is generally worth trying both -- time permitting.

Further Extensions include:

  • Extend the model to allow conditional inference (say we are missing the weather data: we can just draw some Monte Carlo samples from the weather distribution given the other features and extract the median of the predictions); a minimal sketch of this idea follows below
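
The sketch reuses the RF trained above; for simplicity it samples weather from its marginal empirical distribution rather than from its distribution conditioned on the other features, and the feature values in the commented usage line are purely hypothetical:

def predict_with_missing_weather(clf, partial_features, weather_samples, n_draws=200):
    """Median RF prediction over weather values drawn from an empirical sample."""
    predictions = []
    for w in np.random.choice(weather_samples, size=n_draws):
        row = dict(partial_features)   # previous count, weekday, month, year
        row['weather'] = w             # fill in the missing feature with a sampled value
        predictions.append(clf.predict(pd.DataFrame([row])[columns_subset])[0])
    return np.median(predictions) * BIN_SIZE  # convert the bucketed prediction back to a car count

# hypothetical usage:
# partial_features = {'previous_bucketed_car_count': 60, 'day_of_week_int': 5, 'current_month': 6, 'current_year': 4}
# predict_with_missing_weather(clf, partial_features, data_no_clouds['weather'].values)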