Notebook that explores time-series and techniques to analyze them.
Resources:
"A time-series is a sequence of measurements from a system that varies in time."
A time-series is generally decomposed into three major components: a trend (the long-term direction), a seasonal component (periodic fluctuations), and a residual (the remaining noise).
In [ ]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set_context("paper")
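The three components mentioned above can be separated with a simple additive decomposition: a centered rolling mean for the trend, per-position averages for the seasonal part, and whatever remains as residual. The sketch below uses a synthetic series and an assumed period of 12; both are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Synthetic additive series: linear trend + seasonal cycle (period 12) + noise
rng = np.random.default_rng(0)
t = np.arange(120)
series = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 1, 120))

period = 12
# Trend: centered rolling mean over one full period
trend = series.rolling(window=period, center=True).mean()
# Seasonal: average detrended value for each position within the period
detrended = series - trend
seasonal = detrended.groupby(t % period).transform('mean')
# Residual: whatever the other two components do not explain
residual = series - trend - seasonal
```

Averaging over exactly one period makes the rolling mean cancel the seasonal cycle, so the trend estimate is not contaminated by it; the first and last `period // 2` values of the trend are undefined.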
Moving Average (also rolling/running average or moving/running mean) is a technique that helps to extract the trend from a series. It reduces noise and decreases the impact of outliers. It consists of dividing the series into overlapping windows of fixed size $N$ and taking the average value of each. It follows that the first $N-1$ values will be undefined, since they don't have enough predecessors to compute the average.
Exponentially-Weighted Moving Average (EWMA) is an alternative that gives more importance to recent values.
In [ ]:
# rolling mean basic example
series = np.arange(10)
pd.Series(series).rolling(3).mean()
In [ ]:
# ewma basic example (com=3, i.e. alpha = 1/(1+3))
series = np.arange(10)
pd.Series(series).ewm(com=3).mean()
In [ ]:
# ewm on a series ending in a long run of 0s
series = [1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0]
pd.Series(series).ewm(span=2).mean()
In [ ]:
pd.Series(series).ewm(span=2).mean().plot()
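To see where those numbers come from, the weighting can be reproduced by hand. With pandas' default `adjust=True`, each output is a weighted average of all values seen so far, with weight $(1-\alpha)^i$ for the value $i$ steps back, and $\alpha = 2/(\mathrm{span}+1)$. This is a verification sketch, not part of the original notebook.

```python
import numpy as np
import pandas as pd

series = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
span = 2
alpha = 2 / (span + 1)  # pandas' span-to-alpha conversion

# Manual EWMA with adjust=True: weights decay geometrically with age
manual = []
for t in range(len(series)):
    weights = (1 - alpha) ** np.arange(t, -1, -1)  # oldest value gets the smallest weight
    manual.append(np.dot(weights, series[:t + 1]) / weights.sum())

pandas_ewma = pd.Series(series).ewm(span=span).mean()
print(np.allclose(manual, pandas_ewma))  # → True
```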
Basic Pandas methods:
In [ ]:
# Example arrays to play with
a = np.arange(20)
b = a*a
b_empty = np.array(a*a).astype('float')
# Add missing values and get a Pandas Series
b_empty[[0, 5, 6, 15]] = np.nan
c = pd.Series(b_empty)
In [ ]:
# Visualize how the filling method works
fig, axes = plt.subplots(2)
sns.pointplot(np.arange(20), c, ax=axes[0])
sns.pointplot(np.arange(20), c.fillna(method='bfill'), ax=axes[1])
plt.show()
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
'linear' ignores the index and treats the values as equally spaced, while 'time' uses the datetime index to weigh the interpolation by the actual length of each interval (it works on daily and higher-resolution data).
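The difference only shows up with an irregularly spaced index. A minimal sketch (the dates and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Irregularly spaced timestamps: 1 day before the gap, 3 days after it
idx = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-05'])
s = pd.Series([0.0, np.nan, 8.0], index=idx)

# 'linear' ignores the index: the NaN becomes the midpoint, 4.0
print(s.interpolate(method='linear'))
# 'time' weighs by elapsed time: 1 day out of 4 between 0 and 8 gives 2.0
print(s.interpolate(method='time'))
```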
In [ ]:
df = pd.read_csv("time_series.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index("datetime", inplace=True)
df.fillna(method='pad', axis=0, inplace=True)
df.head()
In [ ]:
# Determine rolling statistics
rolmean = new_df.rolling(window=12).mean()
rolstd = new_df.rolling(window=12).std()
# Plot rolling statistics
plt.plot(new_df, color='blue', label='Original')
plt.plot(rolmean, color='red', label='Rolling Mean')
plt.plot(rolstd, color='black', label='Rolling Std')
plt.legend(loc='best')
plt.show()
In [ ]:
new_df = df.copy()
# Remove the trend by subtracting the exponentially-weighted moving average
new_df['val'] = new_df['val'] - new_df['val'].ewm(halflife=12).mean()
new_df.plot()
plt.show()
In [ ]:
df = pd.read_csv("time_series.csv")
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index("datetime", inplace=True)
df.head()
In [ ]:
df.plot()
plt.show()
In [ ]:
df.fillna(method='pad', axis=0).plot()
plt.show()
In [ ]:
null_indexes = [i for i, isnull in enumerate(pd.isnull(df['val'].values)) if isnull]
In [ ]:
# missing_values_correct
y = [32.69,32.15,32.61,29.3,28.96,28.78,31.05,29.58,29.5,30.9,31.26,31.48,29.74,29.31,29.72,28.88,30.2,27.3,26.7,27.52]
In [ ]:
filled = df['val'].interpolate(method='time').values
predict = filled[null_indexes]
len(predict)==len(y)
In [ ]:
# Sum of absolute percentage errors between true and interpolated values
d = sum(abs((y[i] - predict[i]) / y[i]) for i in range(len(y)))
In [ ]:
d
There exist different methods to analyze correlation for time-series. When we compare two different time-series we talk about cross-correlation, while in auto-correlation a time-series is compared with itself (which can reveal seasonality). Both categories can use normalization, useful for example when the series have different scales or contain values close to zero.
Correlation between two time-series $x$ and $y$ of length $N$ is defined as
$$corr(x, y) = \sum_{n=0}^{N-1} x[n] \, y[n]$$
while normalized correlation is defined as
$$norm\_corr(x,y)=\dfrac{\sum_{n=0}^{N-1} x[n] \, y[n]}{\sqrt{\sum_{n=0}^{N-1} x[n]^2 \, \sum_{n=0}^{N-1} y[n]^2}}$$
For auto-correlation we shift the time-series by an interval called the lag, and then compare the shifted version with the original one to measure the strength of the correlation (a process sometimes also called serial-correlation, especially when lag=1). The idea is that a series' values are not independent random events, but should have some level of dependency on preceding values. This dependency is the pattern we are trying to discover.
Suggestions: check correlation after removing the trend, and identify the seasonality most appropriate for your case.
In [ ]:
a = np.array([1,2,-2,4,2,3,1,0])
b = np.array([2,3,-2,3,2,4,1,-1])
c = np.array([-2,0,4,0,1,1,0,-2])
In [ ]:
print("a and b correlate value = {}".format(np.correlate(a, b)[0]))
print("a and c correlate value = {}".format(np.correlate(a, c)[0]))
In [ ]:
def normalized_cross_correlation(a, v):
# cross-correlation is simply the dot product of our arrays
cross_cor = np.dot(a, v)
norm_term = np.sqrt(np.sum(a**2) * np.sum(v**2))
return cross_cor/norm_term
In [ ]:
normalized_cross_correlation(a, c)
In [ ]:
print("a and a/2 correlate value = {}".format(np.correlate(a, a/2)[0]))
print("a and a/2 normalized correlate value = {}".format(normalized_cross_correlation(a, a/2)))
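The same normalized correlation can be turned into an auto-correlation at a given lag by comparing the series with a shifted copy of itself. The helper below and the periodic toy series are illustrative additions, not part of the original notebook.

```python
import numpy as np

def autocorrelation(x, lag):
    # Compare the series with a copy of itself shifted by `lag` steps,
    # using the normalized correlation defined above
    a, b = x[:-lag], x[lag:]
    return np.dot(a, b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

# A perfectly periodic series correlates maximally with itself at its period
x = np.tile([1.0, 2.0, -1.0, -2.0], 10)
print(autocorrelation(x, 4))  # lag equal to the period → 1.0
print(autocorrelation(x, 2))  # half a period out of phase → -1.0
```

Scanning the lag over a range of values and plotting the result (a correlogram) is a common way to spot the dominant seasonality.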