In [ ]:
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Overview

This notebook demonstrates how to sequence data for a time-series problem, and then how to build deep learning and statistical models for forecasting.

Dataset

The Iowa Liquor Sales dataset from BigQuery Public Datasets is used in this example. The dataset contains wholesale liquor purchases in the state of Iowa from 2012 to the present.

Objective

In this lab, you'll learn how to build time-series forecasting models with TensorFlow, and then how to deploy them with Google Cloud AI Platform.

Install packages and dependencies

Restarting the kernel may be required to use new packages.


In [ ]:
%pip install -U statsmodels --user
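
If you do need to restart the kernel after installing packages, one common approach (shown here as an optional sketch) is to ask IPython to restart it programmatically. Note that running this cell interrupts the current session:


In [ ]:
# Optional: restart the kernel so that newly installed packages can be imported.
# Running this cell restarts the kernel and interrupts the current session.
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)  # True requests a kernel restart rather than a full shutdown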

Import libraries and define constants


In [ ]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from google.cloud import storage
from pandas.plotting import register_matplotlib_converters
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Flatten, LSTM, MaxPooling1D

register_matplotlib_converters() # Avoids a matplotlib warning when plotting pandas datetime values

In [ ]:
# Enter your project, region, and bucket. Then run the cell to make sure the
# Cloud SDK uses the right project for all the commands in this notebook.

PROJECT = "your-project-name" # REPLACE WITH YOUR PROJECT NAME
BUCKET = "your-bucket-id" # REPLACE WITH YOUR BUCKET ID - YOU CAN USE YOUR PROJECT NAME IF UNSURE
REGION = "us-central1" # REPLACE WITH YOUR BUCKET REGION e.g. us-central1
BUCKET_URI = "gs://" + BUCKET

assert PROJECT != 'your-project-name', "Don't forget to change the project variables!"
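
Optionally, you can also point the Cloud SDK itself at the same project. This is a sketch that assumes the gcloud CLI is available in your notebook environment:


In [ ]:
# Optional: configure the gcloud CLI to use the same project (assumes gcloud is installed)
!gcloud config set project $PROJECT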

In [ ]:
target_col = 'y' # What we are predicting
ts_col = 'ds' # Time series column

if os.path.exists('iowa_daily.csv'):
    input_file = 'iowa_daily.csv' # File created in previous lab
else:
    input_file = 'data/iowa_daily.csv'

n_features = 2 # Two features: y (previous values) and whether the date is a holiday
n_input_steps = 30 # Lookback window
n_output_steps = 7 # How many steps to predict forward
n_seasons = 7 # Weekly periodicity

train_split = 0.75 # Fraction of the data used for training (the remainder is the test set)
epochs = 1000 # Maximum passes through the data (early stopping will usually end training well before this)
patience = 5 # Stop training if the validation loss fails to decrease for this many epochs

Create a Cloud Storage bucket

The following steps are required, regardless of your notebook environment.

When you submit a training job using the Cloud SDK, you upload a Python package containing your training code to a Cloud Storage bucket. AI Platform runs the code from this package. In this tutorial, AI Platform also saves the trained model that results from your job in the same bucket. You can then create an AI Platform model version based on this output in order to serve online predictions.


In [ ]:
storage_client = storage.Client()
try:
    bucket = storage_client.get_bucket(BUCKET)
    print("Bucket exists, let's not recreate it.")
except Exception:
    bucket = storage_client.create_bucket(BUCKET)
    print('Created bucket: ' + BUCKET)
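
To verify that the bucket is accessible, you can list it (an optional sketch; assumes the gsutil CLI is available in your environment):


In [ ]:
# Optional: list the bucket to confirm it exists and is accessible
!gsutil ls -b $BUCKET_URI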

Load data


In [ ]:
df = pd.read_csv(input_file, index_col='ds', parse_dates=True)

df.head()

In [ ]:
# Split data
size = int(len(df) * train_split)
df_train, df_test = df[0:size].copy(deep=True), df[size:len(df)].copy(deep=True)

df_train.head()

In [ ]:
# Notice any outliers in the data? Make note of the value.

# The _ = syntax avoids printing out the returned plot object. _ is a Python convention for "I don't care about this variable"
_ = sns.lineplot(data=df_train)

In [ ]:
# TODO-1 : Update the threshold below to remove the outlier (NOTE: scale is in millions)

threshold = 2e6 # Set this to a level which will flag outliers (any values above it)
assert threshold != -1, 'Set the threshold to the minimum that will eliminate outlier(s)'
# Set any values above the threshold to NaN (not a number)
df_train.loc[df_train[target_col] > threshold, target_col] = np.nan

# Interpolate the missing values (e.g. [3, NaN, 5] becomes [3, 4, 5])
df_train = df_train.interpolate()

In [ ]:
# Review the updated chart to see if outliers still exist

_ = sns.lineplot(data=df_train)

Utility functions

There's no need to review the utility functions in depth.

They provide helpers for tasks such as scaling variables and creating time-series sequences.

Scale the inputs and outputs


In [ ]:
# For neural networks to converge quicker, it is helpful to scale the values.
# For example, each feature might be transformed to have a mean of 0 and std. dev. of 1.
#
# We are working with a mix of features, input timesteps, output horizon, etc.,
# which doesn't work out-of-the-box with common scaling utilities.
# So, here are a couple of wrappers to handle scaling and inverting the scaling.

feature_scaler = StandardScaler()
target_scaler = StandardScaler()

def scale(df, 
          fit=True, 
          target_col=target_col,
          feature_scaler=feature_scaler,
          target_scaler=target_scaler):
    """
    Scale the input features, using a separate scaler for the target.
    
    Parameters: 
    df (pd.DataFrame): Input dataframe
    fit (bool): Whether to fit the scaler to the data (only apply to training data)
    target_col (str): Name of the column being predicted
    feature_scaler (StandardScaler): Scaler used for features
    target_scaler (StandardScaler): Scaler used for target
      
    Returns: 
    df_scaled (pd.DataFrame): Scaled dataframe   
    """    
    target = df[target_col].values.reshape(-1, 1)
    if fit:
        target_scaler.fit(target)
    data_scaled = target_scaler.transform(target)
    
    features = df.loc[:, df.columns != target_col].values
    if features.shape[1]:
        if fit:
            feature_scaler.fit(features)
        features_scaled = feature_scaler.transform(features)
        data_scaled = np.concatenate([data_scaled, features_scaled], axis=1)

    # Note: this assumes the target is the first column of df (as in this dataset), so the column labels line up
    df_scaled = pd.DataFrame(data_scaled, columns=df.columns)
    
    return df_scaled

def inverse_scale(data, target_scaler=target_scaler):
    """
    Transform the scaled values of the target back into their original form.
    The features are left alone, as we're assuming that the output of the model only includes the target.
    
    Parameters: 
    data (np.array): Input array
    target_scaler (StandardScaler): Scaler used for target
      
    Returns: 
    data_scaled (np.array): Array transformed back to the original scale
    """    
    
    data_scaled = np.empty([data.shape[1], data.shape[0]])
    for i in range(data.shape[1]):
        # Reshape to a 2D column before inverse-transforming (newer scikit-learn expects 2D), then flatten back
        data_scaled[i] = target_scaler.inverse_transform(data[:, i].reshape(-1, 1)).ravel()
    return data_scaled.transpose()

df_train_scaled = scale(df_train)
df_test_scaled = scale(df_test, fit=False)
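
As a quick optional check, the scaled training columns should now have roughly zero mean and unit standard deviation:


In [ ]:
# Optional sanity check: after StandardScaler, each training column should have
# a mean close to 0 and a standard deviation close to 1.
df_train_scaled.describe().loc[['mean', 'std']]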

Create sequences of time-series data


In [ ]:
def reframe(data, n_input_steps = n_input_steps, n_output_steps = n_output_steps):

    # Iterate through data and create sequences of features and outputs
    df = pd.DataFrame(data)
    cols=list()
    for i in range(n_input_steps, 0, -1):
        cols.append(df.shift(i))
    for i in range(0, n_output_steps):
        cols.append(df.shift(-i))
        
    # Concatenate values and remove any missing values
    df = pd.concat(cols, axis=1)
    df.dropna(inplace=True)
    
    # Split the data into feature and target variables
    n_feature_cols = n_input_steps * n_features
    features = df.iloc[:,0:n_feature_cols]
    target_cols = [i for i in range(n_feature_cols, n_feature_cols + n_output_steps * n_features, n_features)]
    targets = df.iloc[:,target_cols]

    return (features, targets)

X_train_reframed, y_train_reframed = reframe(df_train_scaled)
X_test_reframed, y_test_reframed = reframe(df_test_scaled)
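
As a quick optional check, each feature row should contain n_input_steps * n_features values, and each target row n_output_steps values:


In [ ]:
# Optional sanity check of the reframed shapes
print('Features:', X_train_reframed.shape)  # (samples, n_input_steps * n_features)
print('Targets: ', y_train_reframed.shape)  # (samples, n_output_steps)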

Evaluate results


In [ ]:
def print_stats(timestep, y_true, y_pred, chart=True, table=False, dec=3):
    '''
    Helper function to print overall summary statistics and stats for each time step
    '''
    
    # Print summary statistics
    print('=== t+' + str(timestep) + ' ===')
    print('R^2: ' + str(np.round(r2_score(y_true, y_pred), dec)))
    print('MAE: ' + str(np.round(mean_absolute_error(y_true, y_pred), dec)))
    print('')

    df_y_true = pd.DataFrame(y_true)
    df_y_true[target_col + '_pred'] = np.round(y_pred, dec)
    
    # Show plot of actuals vs predictions and a sample of values
    if table:
        print(str(df_y_true.head(5)) + '\n')
        print(str(df_y_true.tail(5)) + '\n')
    if chart:
        sns.lineplot(data=df_y_true[[target_col, target_col+'_pred']])
        plt.show()
        
def evaluate(y_pred,
             exclude_timesteps=n_input_steps,
             y_true=df_test):
    '''
    Helper function to transform predictions to match size and indices of actuals.
    
    For example, n_timesteps from the test data will be required to make a prediction,
    so the number of predictions will be fewer than the number of test samples.
    
    Parameters:
    y_pred (np.array): Predictions
    exclude_timesteps (int): Number of leading timesteps to trim from the dataset
    y_true (pd.DataFrame): Actuals
    '''
        
    # Number of outputs (future timesteps)
    outputs = y_pred.shape[1]

    # Lists of actual and predicted values for each time step
    # For example, y_true_eval[2] will contain actual values 3 time steps out
    # These specific lists enable computing the accuracy for specific time steps
    y_true_eval, y_pred_eval = list(), list()

    # Actual and predicted values combined across all time steps (to compute overall accuracy metrics)
    y_true_all, y_pred_all = np.array([]), np.array([])
    
    # Append entries to lists for each output timestep
    for t in range(outputs):
        if exclude_timesteps:
            y_true_eval.append(y_true[exclude_timesteps+t : outputs*-2 + t].copy())
            y_pred_eval.append(y_pred[:outputs*-1 - 1,t])          
        else:
            y_true_eval.append(y_true[t:].copy())
            y_pred_eval.append(y_pred[:-1,t] if t == 0 else y_pred[:-1*t-1,t])
        
        # Append the output values to the combined lists
        y_true_all = np.concatenate([y_true_all, y_true_eval[t].values[:,0]], axis=0)
        y_pred_all = np.concatenate([y_pred_all, y_pred_eval[t]], axis=0)

    # Print aggregate statistics across all time steps (only if predicting multiple time steps)
    if outputs > 1:
        print_stats('(1-' + str(outputs) + ')', y_true_all, y_pred_all, False)

    # Print stats for each future time step
    for t in range(outputs):    
        print_stats(t+1, y_true_eval[t][target_col], y_pred_eval[t], True)

ML Models

In this section, we will build models using popular neural network architectures for time-series data.

Long Short-Term Memory (LSTM)


In [ ]:
# Reshape test data to match model inputs and outputs

X_train = X_train_reframed.values.reshape(X_train_reframed.shape[0], n_input_steps, n_features)
X_test = X_test_reframed.values.reshape(X_test_reframed.shape[0], n_input_steps, n_features)
y_train = y_train_reframed.values.reshape(y_train_reframed.shape[0], n_output_steps, 1)
y_test = y_test_reframed.values.reshape(y_test_reframed.shape[0], n_output_steps, 1)
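
As a quick optional check, the inputs should now have shape (samples, n_input_steps, n_features) and the outputs (samples, n_output_steps, 1):


In [ ]:
# Optional sanity check of the reshaped tensors
print('X_train:', X_train.shape, ' y_train:', y_train.shape)
print('X_test: ', X_test.shape, ' y_test: ', y_test.shape)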

In [ ]:
# TODO-2: Try increasing and decreasing the number of LSTM units and see if you notice any accuracy improvements.
# Run the next cell to evaluate the results in more detail.

model = Sequential([
    LSTM(64, input_shape=[n_input_steps, n_features], recurrent_activation=None),
    Dense(n_output_steps)])

model.compile(optimizer='adam', loss='mae')

early_stopping = EarlyStopping(monitor='val_loss', patience=patience)
_ = model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=epochs, callbacks=[early_stopping])

In [ ]:
# Predict the results, and then reverse the transformation that scaled all values to a mean of 0 and std. dev. of 1
preds = model.predict(X_test)
y_pred_lstm = inverse_scale(preds)

# Evaluate the overall results and for each time step
evaluate(y_pred_lstm)

# The output shows the R^2 value (closer to 1 is better) and the MAE (mean absolute error) across the entire prediction window.
# It also shows individual plots for 1 day out, 2 days out, etc., comparing actual vs. predicted values.

Convolutional Neural Network (CNN)


In [ ]:
# TODO-3: Try adjusting the # of filters (pattern types) and kernel size (size of the sliding window)

model = Sequential([
    Conv1D(filters=32, kernel_size=4, activation='tanh', padding='causal', input_shape=[n_input_steps, n_features]),
    Flatten(),
    Dense(n_output_steps)])

model.compile(optimizer='adam', loss='mae')

early_stopping = EarlyStopping(monitor='val_loss', patience=patience)
_ = model.fit(x=X_train, y=y_train, validation_data=(X_test, y_test), epochs=epochs, callbacks=[early_stopping])

In [ ]:
preds = model.predict(X_test)
y_pred_cnn = inverse_scale(preds)

evaluate(y_pred_cnn)

Ensembled Model

Often, combining multiple models that use diverse approaches can lead to better results than any individual model.


In [ ]:
# TODO: Try adjusting the weights to find the optimal blend of models.

models = [y_pred_cnn, y_pred_lstm]
weights = [1, 3]

y_pred_ensemble = np.average(np.array(models), axis=0, weights=weights)
evaluate(y_pred_ensemble)

Statistical Models

This section is optional and can take 20+ minutes to run, depending on the environment. Feel free to skip to the Conclusion section.

We will next implement a popular statistical method for time-series analysis: exponential smoothing. Exponential smoothing estimates future values by weighting recent observations more heavily. The Holt-Winters method used here applies "triple" exponential smoothing, which also accounts for trend and seasonality.

We can also ensemble classical and machine learning methods for a potentially even more accurate result.
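
To build intuition for the recency weighting, here is a minimal sketch (not part of the lab's model) of simple, single exponential smoothing, where each forecast blends the latest observation with the previous forecast; the smoothing factor and sample values below are hypothetical:


In [ ]:
# Simple (single) exponential smoothing, for illustration only:
# forecast_{t+1} = alpha * actual_t + (1 - alpha) * forecast_t
alpha = 0.5                    # Hypothetical smoothing factor; larger alpha weights recent observations more
series = [10, 12, 13, 12, 15]  # Hypothetical observations
forecast = series[0]           # Initialize the forecast with the first observation
for actual in series[1:]:
    forecast = alpha * actual + (1 - alpha) * forecast
print(round(forecast, 2))      # One-step-ahead forecast after the last observation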

Exponential Smoothing


In [ ]:
# We will use a walk-forward approach, in which a model is fit on all of the historical data available.
# As we progress through the test set to evaluate the model, we will fit a new model for each row in the test set.
# Each new model is fit not only on the training data, but also on the prior test data.

hist = df_train[target_col].copy() # Predict based on historical data. Start with the training data
hist.index.freq = pd.infer_freq(hist.index) # To avoid warnings, explicitly specify the dataframe frequency
n_pred = len(df_test) + 1 # Number of predictions: 1 from the training data alone, plus 1 after appending each test row
y_pred_es = np.empty([n_pred,n_output_steps]) # Array to hold the predictions: one row per prediction, each containing the n_output_steps forecasted time steps

for t in range(n_pred):
    # Note: newer versions of statsmodels use `damped_trend` (formerly `damped`) and accept `use_boxcox` in the constructor rather than in fit()
    mod = ExponentialSmoothing(hist, seasonal_periods=n_seasons, trend='add', seasonal='add', damped_trend=True, use_boxcox=False)
    res = mod.fit()
    pred = res.forecast(n_output_steps)
    y_pred_es[t] = pred.values
    if t < n_pred - 1:
        hist.loc[df_test.iloc[t].name] = df_test[target_col].iloc[t] # Append the latest test data row to the history, for fitting the next model
        hist.index.freq = pd.infer_freq(hist.index)

In [ ]:
evaluate(y_pred_es, 0)

Ensemble ML and Statistical Models

If the performance of the ML and statistical models is similar, ensembling them can be helpful, because their approaches are quite different.


In [ ]:
# Let's start by adjusting the sizes of the prediction arrays to match.
# Some methods predict the initial timesteps of the test set.
# Others start after the first sequence length.
# So, we will remove the test data that doesn't exist in both sets.

def trunc(df, n_input_steps=n_input_steps, n_output_steps=n_output_steps):
    return df[n_input_steps: -n_output_steps]

y_pred_es_trunc = trunc(y_pred_es)
y_true_trunc = trunc(df_test)

In [ ]:
models = [y_pred_cnn, y_pred_lstm, y_pred_es_trunc]
weights = [1, 1, 1]

y_pred_ensemble = np.average(np.array(models), axis=0, weights=weights)

evaluate(y_pred_ensemble, 0, y_true_trunc)

Conclusion

Great job! You've now completed the modeling portion of this workshop. We've covered:

  • Removing outliers from the data
  • Multi-step forecasting (we're predicting 7 days out)
  • Including additional features in a time-series model (in this case, holidays)
  • Neural network architectures for time-series forecasting: LSTM and CNN
  • Statistical models, including Holt-Winters Exponential Smoothing
  • Ensembling models