Intro

A notebook that explores time-series and techniques to analyze them.


Time-Series

"A time-series is sequence of measurements from a system that varies in time"

A time-series is generally decomposed into three major components (a synthetic example is sketched after the setup cell below):

  • Trend: persistent change over time.
  • Seasonality: regular periodic variation. There can be multiple seasonalities, each spanning a different time-frame (day, week, month, year, etc.).
  • Noise: random variation.

In [ ]:
%matplotlib notebook

import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.set_context("paper")

Moving Average

Moving Average (also rolling/running average or moving/running mean) is a technique that helps to extract the trend from a series. It reduces noise and decreases the impact of outliers. It consists of dividing the series into overlapping windows of fixed size $N$ and taking the average value of each window. It follows that the first $N-1$ values will be undefined, since they don't have enough predecessors to compute the average.

Exponentially-Weighted Moving Average (EWMA) is an alternative that gives more importance to recent values.


In [ ]:
# rolling mean basic example 
series = np.arange(10)
pd.Series(series).rolling(3).mean()
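
The same computation can be done by hand with NumPy (a minimal sketch, not part of the original notebook), which makes the undefined first $N-1$ values explicit:

In [ ]:
# Manual moving average with window N=3: each output is the mean of the current
# element and its two predecessors; the first N-1 outputs stay undefined (NaN).
N = 3
x = np.arange(10, dtype=float)
manual = np.full_like(x, np.nan)
manual[N-1:] = np.convolve(x, np.ones(N) / N, mode='valid')
manual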

In [ ]:
# EWMA basic example: the first argument is the decay expressed as center of mass (com=3)
series = np.arange(10)
pd.Series(series).ewm(com=3).mean()

In [ ]:
# EWMA on a series that switches from 1s to 0s: recent values dominate, so the mean quickly decays towards 0
series = [1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0]
pd.Series(series).ewm(span=2).mean()

In [ ]:
pd.Series(series).ewm(span=2).mean().plot()

Filling missing values

Basic Pandas methods (compared in the sketch after this list):

  • pad/ffill: fill values forward
  • bfill/backfill: fill values backward
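
A minimal sketch comparing the two directions on a toy series (the values are illustrative):

In [ ]:
s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.fillna(method='ffill'))  # forward fill: the gaps become 1.0
print(s.fillna(method='bfill'))  # backward fill: the gaps become 4.0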

In [ ]:
# Toy arrays to play with
a = np.arange(20)
b = a * a
b_empty = b.astype('float')
# Add missing values and get a Pandas Series
b_empty[[0, 5, 6, 15]] = np.nan
c = pd.Series(b_empty)

In [ ]:
# Visualize how the filling method works (top: original with gaps, bottom: backward fill)
fig, axes = plt.subplots(2)
sns.pointplot(x=np.arange(20), y=c, ax=axes[0])
sns.pointplot(x=np.arange(20), y=c.fillna(method='bfill'), ax=axes[1])
plt.show()

Interpolation

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html

'linear' ignores the index and treats the values as equally spaced, while 'time' uses the datetime index so that the interpolated value accounts for the actual length of each interval (it works on daily and higher resolution data).
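
A minimal sketch of the difference on an irregularly spaced datetime index (the dates are illustrative):

In [ ]:
# 'linear' treats the points as equally spaced, 'time' weights by the actual gaps
idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-10'])
s = pd.Series([1.0, np.nan, 10.0], index=idx)
print(s.interpolate(method='linear'))  # missing point becomes 5.5 (midway in position)
print(s.interpolate(method='time'))    # missing point becomes 2.0 (one day out of nine)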

Stationarity

A time-series is stationary when its statistical properties (such as mean and variance) stay constant over time. A quick visual check is to plot the rolling mean and rolling standard deviation and verify that they don't drift.


In [ ]:
df = pd.read_csv("time_series.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index("datetime", inplace=True)
df.fillna(method='pad', axis=0, inplace=True)
df.head()

In [ ]:
# Determine rolling statistics on the original series
rolmean = df['val'].rolling(window=12).mean()
rolstd = df['val'].rolling(window=12).std()

# Plot rolling statistics
plt.plot(df['val'], color='blue', label='Original')
plt.plot(rolmean, color='red', label='Rolling Mean')
plt.plot(rolstd, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.show()

In [ ]:
# Remove the trend by subtracting the exponentially-weighted moving average
new_df = df.copy()
new_df['val'] = new_df['val'] - new_df['val'].ewm(halflife=12).mean()
new_df.plot()
plt.show()

Check on a more complex time-series example


In [ ]:
df = pd.read_csv("time_series.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index("datetime", inplace=True)
df.head()

In [ ]:
df.plot()
plt.show()

In [ ]:
df.fillna(method='pad', axis=0).plot()
plt.show()

In [ ]:
null_indexes = [i for i, isnull in enumerate(pd.isnull(df['val'].values)) if isnull]

In [ ]:
# Ground-truth values for the points that are missing in the series
y = [32.69,32.15,32.61,29.3,28.96,28.78,31.05,29.58,29.5,30.9,31.26,31.48,29.74,29.31,29.72,28.88,30.2,27.3,26.7,27.52]

In [ ]:
filled = df['val'].interpolate(method='time').values
predict = filled[null_indexes]
len(predict)==len(y)

In [ ]:
# Sum of absolute relative errors between the interpolated and the correct values
d = sum(abs((y[i] - predict[i]) / y[i]) for i in range(len(y)))

In [ ]:
d

Correlation


There exist different methods to analyze correlation for time-series. When we compare two different time-series we talk about cross-correlation, while in auto-correlation a time-series is compared with itself (which can reveal seasonality). Both can be computed in a normalized form, which is useful, for example, when the series have different scales.

Correlation between two time-series $x$ and $y$ of length $N$ is defined as

$$ corr(x, y) = \sum_{n=0}^{N-1} x[n]*y[n] $$

while normalized correlation is defined as

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

For auto-correlation we shift the time-series by an interval called the lag, and then compare the shifted version with the original one to measure the strength of the correlation (a process sometimes also called serial-correlation, especially when lag=1). The idea is that a series' values are not random, independent events, but have some level of dependency on preceding values. This dependency is the pattern we are trying to discover.

Suggestions: check correlation after removing the trend, and identify the seasonality most appropriate for your case.
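
A minimal sketch of lagged auto-correlation on a synthetic seasonal series (the period of 50 samples is illustrative):

In [ ]:
# Auto-correlation of a noisy sine wave at different lags: the correlation is
# strongly negative at half the period and peaks again at the full period (lag=50).
t = np.arange(500)
seasonal = pd.Series(np.sin(2 * np.pi * t / 50) + np.random.normal(scale=0.3, size=t.size))
for lag in [1, 10, 25, 50]:
    print("lag={:3d} autocorr={:.2f}".format(lag, seasonal.autocorr(lag=lag)))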


In [ ]:
a = np.array([1,2,-2,4,2,3,1,0])
b = np.array([2,3,-2,3,2,4,1,-1])
c = np.array([-2,0,4,0,1,1,0,-2])

In [ ]:
print("a and b correlate value = {}".format(np.correlate(a, b)[0]))
print("a and c correlate value = {}".format(np.correlate(a, c)[0]))

In [ ]:
def normalized_cross_correlation(a, v):
    # cross-correlation is simply the dot product of our arrays
    cross_cor = np.dot(a, v)
    norm_term = np.sqrt(np.sum(a**2) * np.sum(v**2))
    return cross_cor/norm_term

In [ ]:
normalized_cross_correlation(a, c)

In [ ]:
print("a and a/2 correlate value = {}".format(np.correlate(a, a/2)[0]))
print("a and a/2 normalized correlate value = {}".format(normalized_cross_correlation(a, a/2)))

ARIMA


In [ ]:
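# A minimal ARIMA sketch (assumption: statsmodels is installed; the order
# (2, 1, 2) is illustrative and not tuned for this dataset).
from statsmodels.tsa.arima.model import ARIMA

model = ARIMA(df['val'].dropna(), order=(2, 1, 2))
fitted = model.fit()
print(fitted.summary())
fitted.forecast(steps=10)  # forecast the next 10 points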