The goal of this notebook is to introduce some key concepts for getting started with time series analysis. After working through this short notebook, you should have a sense of where to start with analyzing time series data and how you can begin generating insights.
This tutorial leans heavily on content found here and here, both of which can be used as jumping-off points for further reading. While useful, those are much longer tutorials; our goal here is to give you something streamlined to get you started quickly!
Note: Please send along any errata or comments you may have.
This notebook uses data from the files "NH_seaice_extent_final_v2.csv" and "Antarctic_Temperatures.csv". In addition, this notebook requires the following libraries: pandas, numpy, matplotlib, seaborn, and statsmodels.
The data we will work with in this tutorial are real measurements of Antarctic ice extent. As you can imagine, there is probably a strong seasonal aspect to ice melting and re-freezing, as well as a possible long-term climate change trend. The original data are on a daily timescale, but we will average within months to simplify things a bit.
You should be able to just run these lines to get your data prepped for the tutorial.
In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# read data into a pandas dataframe
df = pd.read_csv('NH_seaice_extent_final_v2.csv')
In [3]:
# light munging to deal with file format and dump unnecessary columns
df.columns = [col.strip() for col in df.columns.tolist()]
df = df.drop(0) # this is a descriptor row, not actual data
df = df.drop(['Missing','Source Data'], axis=1)
df.Extent = pd.to_numeric(df.Extent)
df.Year = pd.to_numeric(df.Year)
df.Month = pd.to_numeric(df.Month)
df.head()
# Year, Month, Day, Extent (in 10^6 square km)
Out[3]:
In [7]:
# average the daily extent within each (Year, Month) to get a monthly series
monthly = df.groupby(['Year', 'Month'])['Extent'].mean()
monthly = monthly.reset_index()
# build a proper DatetimeIndex from the Year/Month columns (using day 1 of each month)
monthly['Day'] = 1
monthly.index = pd.to_datetime(monthly.loc[:, ['Year', 'Month', 'Day']])
In [8]:
monthly.head()
Out[8]:
Stationary time series data have a constant mean and variance over time, as well as constant autocorrelation over time. If there is seasonality, its frequency and amplitude should remain roughly constant over time. Additionally, there shouldn't be a trend for the mean to increase or decrease as time goes on.
Of course, real data rarely behave like this; they are almost always non-stationary. In this example, we'll look at data describing the extent of Antarctic ice measured monthly. You can imagine that this varies both within a year (a seasonal component) and across years (a trend).
A good discussion of the importance of stationary data to time series analysis can be found here. A key takeaway is that stationarity is a necessary condition for many of the assumptions that underlie statistical tests. In short, many of the assumptions we can apply to independent random variables also apply to stationary time series data.
For almost all time series analyses we are going to want to apply (at least some) statistical tests before we dive into modeling. Therefore, the first thing to do when working with time series data is to assess whether it is stationary.
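To build intuition for what stationarity looks like, here is a minimal, illustrative sketch (not part of the ice analysis; the simulated series and the seed are made up purely for demonstration). It compares white noise, which is stationary, with a random walk (the cumulative sum of that noise), which is not:
In [ ]:
# illustrative only: a stationary series vs. a non-stationary one
# the white noise has a constant mean and variance, while the random walk
# wanders and its variance grows over time
np.random.seed(42)
noise = np.random.randn(500)       # stationary: white noise
random_walk = np.cumsum(noise)     # non-stationary: random walk
fig, (ax1, ax2) = plt.subplots(2)
ax1.plot(noise)
ax1.set_title('Stationary: white noise')
ax2.plot(random_walk)
ax2.set_title('Non-stationary: random walk')
plt.tight_layout()
sns.despine()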
In [9]:
# plot ice extent over time
sns.set_context('talk')
monthly.plot(y='Extent')  # the DatetimeIndex provides the x-axis
sns.despine()
One useful tool is to plot the rolling mean, which helps you visualize any trend over time. Additionally, it is smart to plot the rolling standard deviation as a line graph to see whether it drifts away from a flat line.
In [10]:
# calculate rolling mean: win = size of window to calculate over
# the window should usually be at least the size of your seasonal period (12 months here)
win = 12
rollMean = monthly.Extent.rolling(window=win, center=False).mean()
rollStd = monthly.Extent.rolling(window=win, center=False).std()
# plot the rolling stats
sns.set_context('talk')
fig, (ax1, ax2) = plt.subplots(2)
ax1.plot(monthly.Extent, label='Raw Data')
ax1.plot(rollMean, label='Rolling Mean')
ax1.set_ylabel('Mean Extent')
ax1.legend()
ax2.plot(rollStd, label='Rolling Std')
ax2.set_xlabel('Year')
ax2.set_ylabel('Extent Std')
sns.despine()
Based on these plots, it looks like neither the mean nor the standard deviation is stable over time. But the variation is small, so it is still hard to judge by eye whether or not this time series is stationary. Instead, we can use a statistical test to decide whether the data are stationary.
Time series models are described in terms of autoregressive (AR) lags. An AR(1) model states that the value at time t, Yt, depends on the value at time t-1. So we can represent the time series as a model:
Y_t = Alpha + Rho * Y_{t-1} + Error_t
When |Rho| < 1, the time series is stationary: the influence of past shocks dies out and the series fluctuates around a constant mean. (In the extreme case Rho = 0, the value does not depend on its previous value at all and is just noise around Alpha.) When Rho = 1, the series is a random walk and is non-stationary, because shocks accumulate and never die out. Alpha is a coefficient that we want to estimate such that the error term is minimized, but let's worry about that later.
The Dickey-Fuller test asks whether Rho in an AR(1) model is equal to 1, i.e. whether the series has a unit root (H0: Rho = 1, non-stationary; H1: Rho < 1, stationary).
The augmented Dickey-Fuller (ADF) test just expands this to work for AR(n) models.
A p-value < 0.05 lets you reject the null hypothesis, indicating that your time series IS STATIONARY.
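To make this concrete, here is a small, illustrative simulation (not part of the ice analysis; the series, seed, and Rho values are made up for demonstration) that runs the ADF test on a stationary AR(1) series with Rho = 0.5 and on a random walk with Rho = 1:
In [ ]:
# illustrative only: ADF test on simulated AR(1) series
from statsmodels.tsa.stattools import adfuller
np.random.seed(0)
n = 500
eps = np.random.randn(n)
ar1 = np.zeros(n)    # stationary AR(1): Y_t = 0.5 * Y_{t-1} + error
walk = np.zeros(n)   # random walk:      Y_t = Y_{t-1} + error
for t in range(1, n):
    ar1[t] = 0.5 * ar1[t-1] + eps[t]
    walk[t] = walk[t-1] + eps[t]
print('AR(1), Rho=0.5   p-value: %.4f' % adfuller(ar1)[1])    # should be very small: stationary
print('Random walk      p-value: %.4f' % adfuller(walk)[1])   # typically large: non-stationary
The AR(1) series should produce a very small p-value, while the random walk typically does not.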
In [11]:
from statsmodels.tsa.stattools import adfuller
In [12]:
dftest = adfuller(monthly.Extent, autolag='AIC')
# the statsmodels output is a plain tuple, so let's use a function to format it nicely
def adf_output(dftest):
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '# of Lags Used', 'Number of Observations Used'])
    # the fifth element is a dictionary of critical values to unpack
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)
adf_output(dftest)
In this output we can see that the p-value is much larger than 0.05, so we cannot reject the null hypothesis: the data are not stationary!
Common sources of non-stationarity are trend and seasonality. Trend is when the mean drifts over time, while seasonality is variation that repeats on a fixed cycle (like ice melting in summer but accumulating in winter).
In addition to removing these components for forecasting purposes, most data scientists will want to identify them and look at them historically to see if there are possible explanations for changes in them over time. Perhaps the trend in ice melting reverses around the time electric cars are introduced, for example.
The simplest way to estimate and remove trend is to calculate a rolling mean and subtract it out:
In [13]:
# calculate rolling mean
win = 12
rollMean = monthly.Extent.rolling(window=win, center=False).mean()
# plot
sns.set_context('talk')
plt.plot(monthly.Extent, label='Raw Data')
plt.plot(rollMean, color='red', label='Rolling Mean')
plt.xlabel('Year')
plt.ylabel('Extent')
plt.legend()
plt.title("Raw Data and the Rolling Mean")
sns.despine()
Notice that in the plot above, a few data points are missing from the front of the rolling mean series: the first win - 1 values are NaN, because a full window of observations is needed before the first rolling value can be computed. We can remove those NAs and then detrend the time series. This loss of data is a good reason to keep the window as small as possible.
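As a tiny illustration of where those NaNs come from (using a made-up toy series, purely for demonstration):
In [ ]:
# with a window of 3, the first two rolling means are NaN because a full
# window of three observations is needed before the first value can be computed
demo = pd.Series([1, 2, 3, 4, 5])
print(demo.rolling(window=3).mean())   # NaN, NaN, 2.0, 3.0, 4.0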
In [14]:
# subtract mean
demeaned = monthly.Extent - rollMean
demeaned.dropna(inplace=True)
# did this do anything for our stationarity?
demeaned_adf = adfuller(demeaned, autolag='AIC')
adf_output(demeaned_adf)
With the output above, you can see that this simple transformation made the data stationary.
Below, you can see that the moving average is now more stable.
In [15]:
# rolling mean and std
win = 12
roll_demean = demeaned.rolling(window=win, center=False).mean()
roll_demean_std = demeaned.rolling(window=win, center=False).std()
# plot
plt.plot(roll_demean, color='red', label='mean')
plt.plot(roll_demean_std, color='black', label='std')
plt.legend(loc=2)
sns.despine()
This is also apparent in the lack of a decreasing trend in our demeaned time series:
In [16]:
# plot the de-meaned time series
plt.plot(demeaned)
sns.despine()
Another common method for removing trend is differencing. In this technique, you subtract from each value the value at the previous time step (i.e. the series shifted by one period). Much like demeaning, this operation leaves you with just the change across time, which often leads to a stationary series. You can then model that change and superimpose it back on the trend.
In [17]:
# construct difference curve and plot
diffed = monthly.Extent - monthly.Extent.shift()
diffed.dropna(inplace=True)
plt.plot(diffed)
sns.despine()
In [18]:
diffed_adf = adfuller(diffed, autolag='AIC')
adf_output(diffed_adf)
Differencing can also be useful when done multiple times, depending on the time series.
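For example, a second-order difference is just the difference of the differenced series. Here is a minimal sketch (the variable names are ours, and whether extra differencing helps depends entirely on your series); note that pandas' .diff() is equivalent to the shift-and-subtract used above:
In [ ]:
# second-order differencing: difference the already-differenced series
diffed2 = monthly.Extent.diff().diff().dropna()
diffed2_adf = adfuller(diffed2, autolag='AIC')
adf_output(diffed2_adf)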
In [19]:
from statsmodels.tsa.seasonal import seasonal_decompose
# split the series into trend, seasonal, and residual components
# (additive model: observed = trend + seasonal + residual)
decomp = seasonal_decompose(monthly.Extent, model="additive")
In [20]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(4)
ax1.plot(monthly.Extent)    # observed series
ax2.plot(decomp.trend)      # trend component
ax3.plot(decomp.seasonal)   # seasonal component
ax4.plot(decomp.resid)      # residual component
sns.despine()
It is worth noting that the statsmodels documentation itself describes this as a naive decomposition. The R statistical language has substantially more support for time series analysis than Python does, so it may be worth exploring STL decomposition in R if this decomposition is not getting you very far.
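(As an aside, newer versions of statsmodels also ship an STL implementation. A minimal sketch, assuming statsmodels 0.11 or later is installed; the variable names are ours:)
In [ ]:
# STL decomposition in Python (requires statsmodels >= 0.11)
from statsmodels.tsa.seasonal import STL
# period=12 makes the monthly seasonality explicit rather than inferred
stl_result = STL(monthly.Extent, period=12).fit()
fig, (ax1, ax2, ax3) = plt.subplots(3)
ax1.plot(stl_result.trend)      # trend component
ax2.plot(stl_result.seasonal)   # seasonal component
ax3.plot(stl_result.resid)      # residual component
sns.despine()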
Regardless of how you do your decomposition, the residual component is what you would use to do your modeling.
A major benefit of decomposition is that now you can visualize the trend and seasonal components separately, allowing you to make inferences from those sources of data.
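For instance, one rough, purely illustrative way to put a number on the long-term trend is to fit a straight line to the trend component (the variable names are ours, and a linear fit is a strong simplification):
In [ ]:
# fit a straight line to the trend component and express the slope per year
trend = decomp.trend.dropna()
years = (trend.index - trend.index[0]).days / 365.25   # elapsed time in years
slope, intercept = np.polyfit(years, trend.values, 1)
print('Approximate trend: %.3f (10^6 sq km) per year' % slope)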
In [21]:
# drop the NaNs at the edges of the residual component, then test it for stationarity
resid = decomp.resid.dropna()
decomp_adf = adfuller(resid, autolag='AIC')
adf_output(decomp_adf)
In [22]:
# add monthly Antarctic temperature data to the mix
tmp = pd.read_csv("Antarctic_Temperatures.csv")
# reshape from one column per month to one row per (Year, Month)
av_temp = pd.melt(tmp, id_vars=['Year'], value_vars=['1','2','3','4','5','6','7','8','9','10','11','12'])
av_temp.columns = ['Year', 'Month', 'Temperature']
# build a DatetimeIndex that matches the ice extent data
av_temp['Day'] = 1
av_temp.index = pd.to_datetime(av_temp.loc[:, ['Year', 'Month', 'Day']])
av_temp.head()
Out[22]:
In [23]:
monthly['Temperature'] = av_temp.Temperature
monthly.head()
Out[23]:
In [24]:
# z-score both series so they can be compared on the same scale
plt.plot((monthly.Extent - np.mean(monthly.Extent)) / np.std(monthly.Extent), label='Ice extent')
plt.plot((monthly.Temperature - np.mean(monthly.Temperature)) / np.std(monthly.Temperature), label='Temperature')
plt.legend()
sns.despine()
While hard to see in this graph, it looks like peaks in temperature might precede drops in ice coverage. Let's test this relationship statistically.
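Before the formal test below, a quick and informal check is to correlate ice extent with temperature shifted by a few months (purely illustrative; the strong shared seasonality inflates these correlations, so treat them as a rough eyeball check only):
In [ ]:
# informal lead/lag check: correlation of ice extent with lagged temperature
for lag in range(0, 7):
    corr = monthly.Extent.corr(monthly.Temperature.shift(lag))
    print('Temperature lagged %d months: correlation = %.2f' % (lag, corr))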
Granger Causality is a statistical method for testing the relationship between two time series. This can be a great way to supplement an analysis and move beyond simply plotting the decomposed time series.
At its heart, this method asks whether values from a second time series provide more information about the future values of your first time series than just considering past values of the first time series:
"X is said to Granger-cause Y if Y can be better predicted using the histories of both X and Y than it can by using the history of Y alone." A good explanation of Granger Causality can be found here and here (the first is much simpler, the second is for a real deep dive!).
An important caveat of Granger causality is that it is entirely possible to find that X significantly Granger-causes Y and, simultaneously, that Y Granger-causes X. With well-chosen variables, you can curb this issue to some degree. For example, while perhaps possible, it is much more likely that changes in temperature cause ice melting than that ice melting causes changes in temperature. It may also be easier to find unidirectional causality in trend components than in seasonal components.
First, both time series should be stationary, so we'll run an ADF test on the temperature data and use the differenced data for monthly ice extent.
In [25]:
temp_adf = adfuller(monthly.Temperature, autolag='AIC')
adf_output(temp_adf)
The statsmodels Granger causality output is a bit hard to parse:
In [26]:
from statsmodels.tsa.stattools import grangercausalitytests
# grangercausalitytests checks whether the SECOND column Granger-causes the first,
# so putting temperature second tests whether temperature causes changes in ice extent
granger = pd.DataFrame({'diffed_extent': diffed, 'temperature': av_temp.Temperature})
granger = granger.dropna()
In [27]:
granger_test = grangercausalitytests(granger, maxlag=12)
Here you can see that statsmodels runs a test at every lag up to maxlag, increasing the number of lag terms (and therefore the numerator degrees of freedom for the F-tests) each time. If you observe a p-value below 0.05 for a given lag, the data are consistent with the second column Granger-causing the first at that lag. Here you see that the data are consistent with Granger causality at lags all the way out to a year.
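Given the caveat above about bidirectionality, it is also worth running the test in the other direction. A minimal sketch (the variable names are ours) that simply swaps the column order so the test asks whether changes in ice extent Granger-cause temperature:
In [ ]:
# swap the columns: the second column is always the candidate cause
granger_reverse = granger[['temperature', 'diffed_extent']]
granger_reverse_test = grangercausalitytests(granger_reverse, maxlag=12)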
This notebook is geared toward introducing core concepts of time series analytics. There are a few key takeaways:
While this notebook focused on time series analytics, a very common use case for time series data is forecasting the future behavior of a system. As mentioned before, R has much more developed tools for time series forecasting than Python, but it is certainly possible to do it in Python. Here are some terms to look into for getting started with forecasting techniques:
Some good reading on forecasting time series data can be found here and here.
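To give a flavor of what forecasting looks like in Python, here is a minimal, illustrative sketch using a seasonal ARIMA model (assuming a reasonably recent statsmodels, 0.12 or later; the (1,1,1)x(1,1,1,12) order is an arbitrary placeholder rather than a carefully selected model, and the variable names are ours):
In [ ]:
# illustrative only: fit a seasonal ARIMA model and forecast the next 12 months
from statsmodels.tsa.arima.model import ARIMA
# give the series an explicit monthly frequency so the forecast gets proper dates
extent = monthly.Extent.asfreq('MS')
model = ARIMA(extent, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit()
forecast = results.forecast(steps=12)
plt.plot(extent, label='Observed')
plt.plot(forecast, label='Forecast')
plt.legend()
sns.despine()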
Hopefully now you're ready to dig into your own data!