A time series is a series of data points indexed (or listed or graphed) in time order. Most commonly, a time series is a sequence taken at successive equally spaced points in time. Thus it is a sequence of discrete-time data.
So far in this course you have seen various kinds of data. A time series dataset is different: it adds an explicit order dependence between observations, a time dimension. This additional dimension is both a constraint and a structure that provides a source of additional information.
Time series forecasting is the use of a model to predict future values of a time series based on previously observed values. To do that, one first needs to understand or model the stochastic mechanism that gives rise to the observed series. This model can then be used to predict or forecast future values of the series based on its history.
Assume we have a time series: $x_1, x_2, x_3, \ldots, x_N$
We have observed $T$ values and wish to predict the next $T'$ values. We want to model the relation: $$ \underbrace{(x_{T+1}, x_{T+2}, \ldots, x_{T+T'})}_{T'\text{ is forecast horizon}} = r(\underbrace{x_1, x_2, \ldots, x_T}_{T\text{ is history}}) $$
$T$ is called history and $T'$ is called forecast horizon.
The standard approach is to slide a moving window over the time series to construct the following matrices: $$ X = \overbrace{\begin{bmatrix} x_1 & x_2 & \dots & x_T \\ x_2 & x_3 & \dots & x_{T+1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N-T-T'+1} & x_{N-T-T'+2} & \dots & x_{N-T'} \end{bmatrix}}^{\text{input}} \quad Y = \overbrace{\begin{bmatrix} x_{T+1} & x_{T+2} & \dots & x_{T+T'} \\ x_{T+2} & x_{T+3} & \dots & x_{T+T'+1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N-T'+1} & x_{N-T'+2} & \dots & x_{N} \end{bmatrix}}^{\text{output}} $$
We will learn the multi-input multi-output (MIMO) regression relation: $Y = r(X)$
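As a minimal sketch of this construction (a toy NumPy example, not part of the exercise), assume a series of 10 values with history $T = 3$ and horizon $T' = 1$:
import numpy as np

series = np.arange(1, 11)          # x_1, ..., x_10  (N = 10)
T, T_prime = 3, 1                  # history and forecast horizon
n_rows = len(series) - T - T_prime + 1

# Each row of X is a history window; the matching row of Y is the window that follows it
X = np.array([series[i:i + T] for i in range(n_rows)])
Y = np.array([series[i + T:i + T + T_prime] for i in range(n_rows)])
print(X.shape, Y.shape)            # (7, 3) (7, 1)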
The objectives of this notebook are to transform a time series into a supervised-learning regression matrix using pandas, and to train and evaluate regression models that forecast future values.
For questions, comments and suggestions, please contact parantapa[dot]goswami[at]viseo[dot]com
Initially we require:
In [1]:
# Write code to import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# For visualizing plots in this notebook
%matplotlib inline
In [91]:
# We start by importing the data using pandas
# Hint 1: use "read_csv" method, Note that comma (",") is the field separator
# Hint 2: this data file already includes a header
weather = pd.read_csv("weatherdata.csv", sep=",")
# We take a sneak peek at the data
# Hint: use dataframe "head" method with "n" parameter
weather.head(n=5)
Out[91]:
For the sake of simplicity, we will use only the MaxTemp variable to perform time series forecasting.
Also, we are not interested in the exact time here, but in the sequence itself. We ignore the "Date" column; the indices of the dataframe still preserve the order, so the index values will be treated as time stamps. This satisfies our requirement and simplifies the problem.
For this exercise create a new dataframe from the "MaxTemp" column.
In [92]:
# Write code to create a new dataframe from weather
# Hint: call "pandas.DataFrame" and pass the desired column of weather
temperature = pd.DataFrame(weather["MaxTemp"])
# To check if all is good
temperature.head()
Out[92]:
In [95]:
# Write code to generate a line plot of temperature
# Hint: use "DataFrame.plot.line()" on our dataframe temperature
temperature.plot.line()
Out[95]:
The Python pandas library is an efficient tool for working with time series data. Please see the pandas documentation for the detailed functionality.
The shift() method
A key method to help transform time series data into a supervised learning problem is the pandas DataFrame.shift() method. Given a DataFrame, the shift() method can be used to create copies of columns that are pushed forward or pulled back.
This is the behavior required to create columns of lag observations as well as columns of forecast observations for a time series dataset in a supervised learning format.
Notes:
- shift() with a positive integer $k$ pushes the data forward $k$ steps (rows of NaN values are added at the front).
- shift() with a negative integer $-k$ pulls the data back $k$ steps (rows of NaN values are added at the end), as illustrated in the sketch below.
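Toy sketch of both behaviors (hypothetical values, not the weather data; pandas is imported above as pd):
s = pd.Series([10, 20, 30, 40])
print(s.shift(1).tolist())    # [nan, 10.0, 20.0, 30.0]  -> pushed forward, NaN added at the front
print(s.shift(-1).tolist())   # [20.0, 30.0, 40.0, nan]  -> pulled back, NaN added at the end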
In [96]:
# Write code to try "shift()" method to push forward the data 1 step
# Hint: use head() at the end of your code to see the first lines
temperature.shift(1).head()
Out[96]:
In [97]:
# Write code to try "shift()" method to pull back the data 1 step
# Hint: use head() at the end of your code to see the first lines
temperature.shift(-1).head()
Out[97]:
In [98]:
# Write code to create an empty DataFrame
# Hint: "pandas.DataFrame()" without any arguments creates an empty DataFrame
reg_mat = pd.DataFrame()
Step 2: Assume the history $T = 5$. You therefore need five history columns; each column except "t" is generated by a call to shift().
Note that you have to maintain the order shown in the equations above. If "t" is the current time, the history columns of the regression matrix should be in the following order: $$t-4, t-3, t-2, t-1, t$$
To get the "t-i" column, the time series has to be pushed forward i steps. For each shift, store the newly generated column in the reg_mat dataframe with the column name "t-i".
In [100]:
# Write code to generate columns t-4, t-3, t-2, t-1, t IN THIS ORDER for reg_mat
# Hint: you do not need any shift to store the column "t"
reg_mat["t-4"] = temperature.shift(4)
reg_mat["t-3"] = temperature.shift(3)
reg_mat["t-2"] = temperature.shift(2)
reg_mat["t-1"] = temperature.shift(1)
reg_mat["t"] = temperature
# To check if all is good
reg_mat.head(10)
Out[100]:
Observe how the NaN values are added.
Step 3: Assume the horizon $T' = 2$. This time you have to use shift() twice, in the other direction. Again you need to maintain the order; the generated columns will be "t+1" and "t+2". The final ordering of the columns of the entire regression matrix should be:
$$t-4, t-3, t-2, t-1, t, t+1, t+2$$
To get the "t+i" column, the time series has to be pulled back i steps. For each shift, store the newly generated column in the reg_mat dataframe with the column name "t+i".
In [101]:
# Write code to generate columns t+1, t+2 IN THIS ORDER for reg_mat
reg_mat["t+1"] = temperature.shift(-1)
reg_mat["t+2"] = temperature.shift(-2)
# To check if all is good
reg_mat.head(10)
Out[101]:
The above approach works for a small, known history and horizon. To automate it, use a for loop.
Notes: Use the Python range() function wisely to control the loops. You also have to generate the column names dynamically, and carefully choose positive or negative values for the shift() calls.
In [108]:
history = 5
horizon = 2
# STEP 1: Create an empty DataFrame
reg_mat = pd.DataFrame()
# STEP 2: For loop to generate the history columns
# Hint 1: use "reversed()" method to reverse the output of "range()" method
# Hint 2: generate column names by adding loop variable i with string "t-"
for i in reversed(range(1, history)):
    column_name = "t-" + str(i)
    reg_mat[column_name] = temperature.shift(i)
# Generating the column "t"
reg_mat["t"] = temperature
# STEP 3: For loop to generate the forecast/future columns
# Hint: generate column names by adding loop variable i with string "t+"
for i in range(1, horizon+1):
    column_name = "t+" + str(i)
    reg_mat[column_name] = temperature.shift(-i)
# To check if all is good
reg_mat.head(10)
Out[108]:
We will ignore the rows containing NaN values; use the DataFrame.dropna() method to do this.
You then have to reset the indices using the DataFrame.reset_index() method. Setting its drop parameter to True discards the old indices instead of keeping them as a column.
In both methods, setting inplace to True performs the operation on the dataframe itself, without returning a new one. The small sketch below illustrates the effect on the index.
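(A toy DataFrame, not reg_mat; numpy and pandas are already imported above as np and pd.)
demo = pd.DataFrame({"a": [np.nan, 1.0, np.nan, 2.0, 3.0]})
demo.dropna(inplace=True)                  # rows 1, 3, 4 survive, so the index is [1, 3, 4]
demo.reset_index(drop=True, inplace=True)  # index becomes [0, 1, 2]; old indices are discarded
print(demo)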
In [111]:
# Write code to drop rows with NaN values from reg_mat inplace
reg_mat.dropna(inplace=True)
# Write code to reset index of reg_mat inplace, with dropping old indices
reg_mat.reset_index(drop=True, inplace=True)
# To check if all is good
reg_mat.head(10)
Out[111]:
All machine learning algorithm implementations work efficiently with NumPy matrices and arrays, while our data is currently a pandas DataFrame. pandas lets us convert a dataframe to a NumPy array via the DataFrame.values attribute (the older DataFrame.as_matrix() method is deprecated in recent pandas versions).
In [116]:
# Write code to convert entire reg_mat into a numpy matrix
reg_mat_numpy = reg_mat.values  # as_matrix() is deprecated in newer pandas versions
Our goal is to model time series forecasting as a machine learning regression task. It is time to identify the inputs and outputs.
Question: What is your input X
here? What is your output y
here?
In [118]:
# Write code to create the input matrix X by selecting the correct columns of reg_mat_numpy
# Hint: first k columns of a numpy matrix M can be selected by M[:,:k]
X = reg_mat_numpy[:, :history]
# Write code to create the output matrix y by selecting the correct columns of reg_mat_numpy
# Hint: last k columns of a numpy matrix M can be selected by M[:,-k:]
y = reg_mat_numpy[:, -horizon:]
In [130]:
# Write a function to put everything together
def regression_matrix(df, history, horizon):
    """Build the regression matrix from the series df and return (X, y)."""
    reg_mat = pd.DataFrame()
    # History columns t-(history-1), ..., t-1
    for i in reversed(range(1, history)):
        column_name = "t-" + str(i)
        reg_mat[column_name] = df.shift(i)
    # Current time column t
    reg_mat["t"] = df
    # Forecast columns t+1, ..., t+horizon
    for i in range(1, horizon+1):
        column_name = "t+" + str(i)
        reg_mat[column_name] = df.shift(-i)
    # Drop rows with NaN values and renumber the index
    reg_mat.dropna(inplace=True)
    reg_mat.reset_index(drop=True, inplace=True)
    # Convert to a NumPy array and split into input/output matrices
    reg_mat_numpy = reg_mat.values
    X = reg_mat_numpy[:, :history]
    y = reg_mat_numpy[:, -horizon:]
    return X, y
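A quick sanity check of the helper, using the temperature dataframe loaded above (only the shapes are inspected):
X_demo, y_demo = regression_matrix(temperature, history=5, horizon=2)
# Rows dropped: 4 at the front (largest lag) and 2 at the end (largest lead)
print(X_demo.shape, y_demo.shape)   # (len(temperature) - 6, 5) and (len(temperature) - 6, 2)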
Finally, we have all the modules necessary to train a regression model.
The forecast horizon is fixed at the beginning based on the requirement. We will first try to predict only 1 step into the future, that is, we set horizon = 1. We will set history = 20.
In [144]:
# Write code to generate X and y matrices from dataframe temperature
# Set history to 20 and horizon to 1
# Hint: use the method "regression_matrix" you just wrote.
history, horizon = 20, 1
X, y = regression_matrix(temperature, history, horizon)
A portion of the dataset is to be set aside and used only for testing. Fortunately sklearn provides a train_test_split() function to do that; you can specify a ratio via its test_size parameter.
In this exercise, we will retain $20\%$ data for testing.
In [145]:
# Importing the module
from sklearn.model_selection import train_test_split
# Write code for splitting the data into train and test sets.
# Hint: use "train_test_split" on X and y, and test size should be 0.2 (20%)
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.2)
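Note that train_test_split shuffles the rows by default. As an optional alternative (a sketch, not required by this exercise), a strictly chronological split keeps the time order:
# Optional alternative: keep the time order, so the last 20% of windows form the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)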
In [146]:
# Write code to import LinearRegression from sklearn
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
R2 = lin_reg.score(X_test, y_test)
print("Linear Regression R2 = ", R2)
We will use the native matplotlib.pyplot.plot() method to plot the original y_test against y_predicted.
Note: X-axis points have to be generated for plotting; they are simply the integers 0, 1, ..., len(y_test)-1.
In [152]:
# Write code to predict values for the test set using the LinearRegression.predict() method.
y_predicted = lin_reg.predict(X_test)
# Generating x-axis points
x_points = np.arange(y_test.shape[0])
# Write code to visualize y_test and y_predicted in a single plot
# Hint 1: use matplotlib.pyplot.plot() method
# Hint 2: choose different colors for 2 curves
plt.plot(x_points, y_test[:,0], "b--")
plt.plot(x_points, y_predicted[:,0], "r--")
Out[152]:
Let's repeat the above procedure for forecast horizon $4$, i.e. predicting values for the next $4$ time stamps.
Python scikit-learn provides a wrapper called MultiOutputRegressor for multi-target regression. You can pass it a standard sklearn regressor instance directly, which will be used as the base estimator.
We will use it with LinearRegression and RandomForestRegressor.
In [141]:
# Write code to generate X and y matrices.
# This time set history to 20 and horizon to 4
history, horizon = 20, 4
X, y = regression_matrix(temperature, history, horizon)
# Write code for splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [142]:
# Write code to import MultiOutputRegressor module
from sklearn.multioutput import MultiOutputRegressor
# Write code to train a MultiOutputRegressor model using LinearRegression.
# Test its performance on the test set
lin_reg = MultiOutputRegressor(LinearRegression())
lin_reg.fit(X_train, y_train)
lin_reg_R2 = lin_reg.score(X_test, y_test)
print("Linear Regression R2 = ", lin_reg_R2)
In [143]:
# Write code to import necessary module for RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
# Write code to train a MultiOutputRegressor model using RandomForestRegressor.
# Test its performance on the test set
rfr = MultiOutputRegressor(RandomForestRegressor())
rfr.fit(X_train, y_train)
rfr_R2 = rfr.score(X_test, y_test)
print("Random Forest Regressor R2 = ", rfr_R2)