# Intro

A notebook that explores time-series data and techniques to analyze it.

Resources:

## Time-Series

"A time-series is a sequence of measurements from a system that varies in time"

A time-series is generally decomposed into three major components:

• Trend: persistent change over time.
• Seasonality: regular periodic variation. There can be multiple seasonalities, each spanning a different time-frame (day, week, month, year, etc.).
• Noise: random variation.
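As a sketch, the three components can be combined to build a synthetic series (the coefficients and the weekly period are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily series built from the three components
t = np.arange(365)
trend = 0.05 * t                             # persistent change over time
seasonality = 3 * np.sin(2 * np.pi * t / 7)  # weekly periodic variation
noise = np.random.RandomState(0).randn(365)  # random variation
series = pd.Series(trend + seasonality + noise)
```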


In [ ]:

%matplotlib notebook

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_context("paper")



# Moving Average

A Moving Average (also rolling/running average, or moving/running mean) is a technique that helps extract the trend from a series. It reduces noise and limits the impact of outliers. It consists of dividing the series into overlapping windows of fixed size $N$ and taking the average value of each window. It follows that the first $N-1$ values are undefined, since they do not have enough predecessors to compute the average.

Exponentially-Weighted Moving Average (EWMA) is an alternative that gives more importance to recent values.
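To make the weighting explicit, here is a sketch of the EWMA computed by hand, using the weights pandas applies by default (`adjust=True`, with `alpha` derived from `span`):

```python
import numpy as np
import pandas as pd

def ewma_manual(x, span):
    # alpha derived from span, as in pandas: alpha = 2 / (span + 1)
    alpha = 2.0 / (span + 1)
    out = np.empty(len(x), dtype=float)
    for t in range(len(x)):
        # weights decay geometrically: oldest value gets (1-alpha)^t, newest gets 1
        weights = (1 - alpha) ** np.arange(t, -1, -1)
        out[t] = np.dot(weights, x[: t + 1]) / weights.sum()
    return out

x = np.arange(10, dtype=float)
manual = ewma_manual(x, span=3)
pandas_result = pd.Series(x).ewm(span=3).mean().values
print(np.allclose(manual, pandas_result))  # True
```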



In [ ]:

# rolling mean basic example
series = np.arange(10)
pd.Series(series).rolling(3).mean()




In [ ]:

# EWMA basic example (the first positional argument of ewm is the center of mass, com)
series = np.arange(10)
pd.Series(series).ewm(com=3).mean()




In [ ]:

# EWMA on a series that drops to a long run of 0s
series = [1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0]
pd.Series(series).ewm(span=2).mean()




In [ ]:

pd.Series(series).ewm(span=2).mean().plot()



# Filling missing values

Basic Pandas methods:

• ffill/pad: propagate the last valid value forward
• bfill/backfill: fill values backward


In [ ]:

# Arrays to play with
a = np.arange(20)
b = a * a
b_empty = b.astype('float')
# Add missing values and wrap in a Pandas Series
b_empty[[0, 5, 6, 15]] = np.nan
c = pd.Series(b_empty)




In [ ]:

# Visualize how the filling method works
import matplotlib.pyplot as plt
fig, axes = plt.subplots(2)
sns.pointplot(x=np.arange(20), y=c, ax=axes[0])
sns.pointplot(x=np.arange(20), y=c.bfill(), ax=axes[1])
plt.show()



## Interpolation

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html

From the interpolate docs: 'linear' ignores the index and treats the values as equally spaced, while 'time' uses the index to interpolate over daily and higher-resolution data, given the length of the interval.
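A minimal sketch of the difference between the two methods, on a made-up three-point series with an irregularly spaced daily index:

```python
import numpy as np
import pandas as pd

# Gap of 1 day between the first two points, 3 days between the last two,
# so 'time' weights the interpolation by actual elapsed time
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])
s = pd.Series([0.0, np.nan, 4.0], index=idx)

print(s.interpolate(method="linear"))  # midpoint: treats points as equally spaced
print(s.interpolate(method="time"))    # one day out of four has elapsed
```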

# Stationarity [TOFIX]



In [ ]:

# Assumes df has already been loaded and has a 'datetime' column
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index("datetime", inplace=True)




In [ ]:

# Determine rolling statistics (pd.rolling_mean/pd.rolling_std were removed
# from pandas; use the .rolling() accessor instead)
import matplotlib.pyplot as plt
rolmean = new_df.rolling(window=12).mean()
rolstd = new_df.rolling(window=12).std()

# Plot rolling statistics
plt.plot(new_df, color='blue', label='Original')
plt.plot(rolmean, color='red', label='Rolling Mean')
plt.plot(rolstd, color='black', label='Rolling Std')
plt.legend()
plt.show()




In [ ]:

import matplotlib.pyplot as plt

new_df = df.copy()
# pd.ewma was removed from pandas; use the .ewm() accessor instead
new_df['val'] = new_df['val'] - new_df['val'].ewm(halflife=12).mean()
new_df.plot()
plt.show()



### Check on more complex example Time Series



In [ ]:

df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index("datetime", inplace=True)




In [ ]:

import matplotlib.pyplot as plt
df.plot()
plt.show()








In [ ]:

# Positions of the missing values in the series
null_indexes = [i for i, isnull in enumerate(pd.isnull(df['val'].values)) if isnull]




In [ ]:

# Ground-truth values for the missing entries
y = [32.69,32.15,32.61,29.3,28.96,28.78,31.05,29.58,29.5,30.9,31.26,31.48,29.74,29.31,29.72,28.88,30.2,27.3,26.7,27.52]




In [ ]:

filled = df['val'].interpolate(method='time').values
predict = filled[null_indexes]
len(predict)==len(y)




In [ ]:

# Sum of absolute relative errors between predictions and ground truth
d = sum([abs((y[i]-predict[i])/y[i]) for i in range(len(y))])




In [ ]:

d



# Correlation

Different methods exist to analyze correlation in time-series. When comparing two different time-series we talk about cross-correlation, while in auto-correlation a time-series is compared with itself (which can reveal seasonality). Both categories can use normalization, useful for example when the series have different scales (and it also deals well with zero values).

Correlation between two time-series $x$ and $y$ of length $N$ is defined as

$$corr(x, y) = \sum_{n=0}^{N-1} x[n]*y[n]$$

while normalized correlation is defined as

$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n]*y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 * \sum_{n=0}^{N-1} y[n]^2}}$$

For auto-correlation we shift the time-series by an interval called the lag, and then compare the shifted version with the original one to measure the strength of the correlation (a process sometimes also called serial-correlation, especially when lag=1). The idea is that a series' values are not random independent events, but should have some level of dependency on the preceding values. This dependency is the pattern we are trying to discover.

Suggestions: check correlation after removing the trend, and understand which seasonality is most appropriate for your case.
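As a sketch of the idea: a noisy series with period 4 should auto-correlate strongly at lag 4 (`Series.autocorr` computes the Pearson correlation between the series and a lagged copy; the series below is made up):

```python
import numpy as np
import pandas as pd

# Periodic signal (period 4) plus a little noise
rng = np.random.RandomState(0)
t = np.arange(200)
s = pd.Series(np.sin(2 * np.pi * t / 4) + 0.1 * rng.randn(200))

# Strong positive correlation at the period, strong negative at half-period
for lag in range(1, 6):
    print("lag={} autocorr={:.2f}".format(lag, s.autocorr(lag=lag)))
```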



In [ ]:

a = np.array([1,2,-2,4,2,3,1,0])
b = np.array([2,3,-2,3,2,4,1,-1])
c = np.array([-2,0,4,0,1,1,0,-2])




In [ ]:

print("a and b correlate value = {}".format(np.correlate(a, b)[0]))
print("a and c correlate value = {}".format(np.correlate(a, c)[0]))




In [ ]:

def normalized_cross_correlation(a, v):
    # cross-correlation is simply the dot product of our arrays
    cross_cor = np.dot(a, v)
    norm_term = np.sqrt(np.sum(a**2) * np.sum(v**2))
    return cross_cor / norm_term




In [ ]:

normalized_cross_correlation(a, c)




In [ ]:

print("a and a/2 correlate value = {}".format(np.correlate(a, a/2)[0]))
print("a and a/2 normalized correlate value = {}".format(normalized_cross_correlation(a, a/2)))



# ARIMA



In [ ]: