A time series $Z$ is observed mixed with noise: $$Y = Z + E,$$ where $E$ is white noise, $$ \operatorname{cov}(E_i, E_j) = \begin{cases} \sigma_e^2,& i = j \\ 0,& i \neq j \end{cases} $$
$Z$ depends on six other time series $X_i$, $i = 1, \dots, 6$.
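For concreteness, data of this form could be simulated as follows (a purely illustrative sketch with a made-up signal; the actual series is loaded from data.csv below):
import numpy as np

rng = np.random.RandomState(0)
T = 100
Z_example = np.sin(np.linspace(0, 8 * np.pi, T))  # a hypothetical underlying signal
E_example = rng.normal(0., 1., T)                 # white noise: cov is sigma_e^2 on the diagonal, 0 elsewhere
Y_example = Z_example + E_example                 # the observed series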
In [1]:
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (!code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<a href="javascript:code_toggle()">Toggle</a> the raw code.''')
Out[1]:
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import pylab
pylab.rcParams['figure.figsize'] = (15.0, 15.0)
In [3]:
from matplotlib import rc
rc('font', size=18)
In [4]:
from numpy import loadtxt, arange, log, delete, array, ones, concatenate
from numpy.linalg import lstsq
from scipy.stats import f
from scipy.stats.mstats import normaltest
from pandas.stats.moments import ewma  # removed in modern pandas; use Series.ewm(span=...).mean() instead
from pandas import Series
In [5]:
from common import *
In [6]:
dataset = loadtxt('data.csv', skiprows=1, delimiter=',')
t = arange(dataset.shape[0])
y = dataset[:, 0]
xs = dataset[:, 1:].transpose()
In [7]:
F_TAIL_CRITICAL = 0.1
In [8]:
plt.plot(t, y)
plt.title('$Y$')
plt.ylabel('Value')
plt.xlabel('Time')
plt.show()
In [9]:
for i, x in enumerate(xs):
    plt.plot(t, x, label='$X_%d$'%(i+1))
plt.ylabel('Value')
plt.xlabel('Time')
plt.legend()
plt.show()
In [10]:
span = 6
smoothed_y = data_smoother(y, span=span)
plt.plot(t, y, label='Source')
plt.plot(t, smoothed_y, label='Smoothed with $%d$'%span)
plt.title('$Y$ smoothing')
plt.ylabel('Value')
plt.xlabel('Time')
plt.legend()
plt.show()
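data_smoother is imported from common, which is not shown in this notebook; a minimal sketch of what it plausibly does, assuming it is a thin wrapper around an exponentially weighted moving average:
def data_smoother(data, span):
    # EWMA of the series; on modern pandas this replaces the deprecated
    # pandas.stats.moments.ewma imported above.
    return Series(data).ewm(span=span).mean().values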
Anomaly points are the ones that distort the smoothing.
We can find them as the points that noticeably inflate the standard deviation of the smoothing errors.
Given a candidate anomaly point $t$, we can build a new time series $Y_{fixed}\left( t \right)$ without this point and check whether the point is really an anomaly: $$ V\left( t \right) = \frac{\operatorname{Var}\left( Y - Y^{smoothed} \right)} {\operatorname{Var}\left( Y_{fixed}\left(t\right) - Y_{fixed}^{smoothed}\left(t\right) \right)} > V_{critical}. $$
This is an F-test, valid because the errors are normally distributed by the initial conditions, and we can run it as follows: $$ \max\limits_{t}{F_{T, T-1}\left( V\left( t \right) \right)} > F_{T, T-1}^{critical} \Longrightarrow t_{anomaly} = \operatorname*{arg\,max}\limits_{t}{F_{T, T-1}\left( V\left( t \right) \right)}, $$ where $F_{T, T-1}$ is the CDF of Fisher's distribution with $\left( T, T-1 \right)$ degrees of freedom.
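get_f_test_values used below is defined in common; the name and signature follow the notebook, but this body is only a plausible sketch of the idea:
def get_f_test_values(y, span):
    # Variance of the smoothing errors of the original series.
    base_var = (y - data_smoother(y, span)).var()
    ratios = [1.]                              # endpoints cannot be patched this way
    for i in range(1, y.size - 1):
        # Replace point i by the mean of its neighbours and re-smooth.
        y_fixed = y.copy()
        y_fixed[i] = .5 * (y[i - 1] + y[i + 1])
        fixed_var = (y_fixed - data_smoother(y_fixed, span)).var()
        ratios.append(base_var / fixed_var)    # V(t) from the formula above
    return array(ratios + [1.])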
Let's take a look at the variances of the fixed time series, each of which omits one specific point.
In [11]:
anomalies = get_f_test_values(y, span)
f_test = get_f_weights(anomalies)
plt.plot(t, anomalies)
plt.title('Variance ratios between series without anomaly points and the original one')
plt.xlabel('Anomaly point')
plt.ylabel('Variance')
plt.show()
The next plot shows the Fisher distribution tail weight for each fixed time series.
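get_f_weights and display_variances_without_anomalies also live in common; sketches under the same caveat (the tail weight uses scipy's F-distribution, and a small weight marks a suspicious point):
def get_f_weights(ratios):
    # Tail weight (survival function) of the F-distribution with
    # (T, T - 1) degrees of freedom for each variance ratio V(t).
    T = ratios.size
    return f.sf(ratios, T, T - 1)

def display_variances_without_anomalies(t, y, span, exclude=()):
    # Patch the anomalies found so far, recompute the tail weights,
    # plot them and return the most suspicious remaining point.
    y_fixed = y.copy()
    for point in exclude:
        y_fixed[point] = .5 * (y_fixed[point - 1] + y_fixed[point + 1])
    weights = get_f_weights(get_f_test_values(y_fixed, span))
    plt.plot(t, weights)
    plt.xlabel('Anomaly point')
    plt.ylabel('F tail weight')
    plt.show()
    return weights.min(), weights.argmin()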
In [12]:
min_f_tail, suspicious_point = display_variances_without_anomalies(t, y, span)
In [13]:
anomaly_points = []
while min_f_tail <= F_TAIL_CRITICAL:
    anomaly_points.append(suspicious_point)
    min_f_tail, suspicious_point = display_variances_without_anomalies(t, y, span, anomaly_points)
In [14]:
y_without_anomalies = y.copy()
# t_without_anomalies = t[:-len(anomaly_points)]
t_without_anomalies = t.copy()
for anomaly_point in anomaly_points:
    y_without_anomalies[anomaly_point] = .5 * (
        y_without_anomalies[anomaly_point - 1] + y_without_anomalies[anomaly_point + 1])
In [15]:
smoothed_y_without_anomalies = data_smoother(y_without_anomalies, span=span)
plt.plot(t_without_anomalies, y_without_anomalies, label='Source')
plt.plot(t_without_anomalies, smoothed_y_without_anomalies, label='Smoothed with $%d$'%span)
plt.title('$Y$ smoothing')
plt.ylabel('Value')
plt.xlabel('Time')
plt.legend()
plt.show()
We need to find a proper span for the exponentially weighted moving average in order to get the further calculations right.
We only know that the noise is normally distributed, so we can use Ralph D'Agostino's normality test and choose the span for which the errors look most normal, i.e. get the highest p-value.
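get_best_span is another helper from common; a sketch, assuming it scans a range of candidate spans (the range itself is an assumption) and keeps the one with the highest p-value:
def get_best_span(y, spans=range(2, 20)):
    best_span, best_p = None, -1.
    for span in spans:
        errors = y - data_smoother(y, span)
        _, p_value = normaltest(errors)  # D'Agostino's K^2 normality test
        if p_value > best_p:
            best_span, best_p = span, p_value
    return best_span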
In [16]:
best_span = get_best_span(y_without_anomalies)
In [17]:
smoothed_y_without_anomalies = data_smoother(y_without_anomalies, span=best_span)
errors = y_without_anomalies - smoothed_y_without_anomalies
trend = smoothed_y_without_anomalies.copy()
plt.plot(t_without_anomalies, y_without_anomalies, label='Source')
plt.plot(t_without_anomalies, smoothed_y_without_anomalies, label=r'Smoothed with $%d$, $\sigma=%2.1f$'%(
best_span, errors.std()))
plt.title('$Y$ smoothing')
plt.ylabel('Value')
plt.xlabel('Time')
plt.legend()
plt.show()
The trend line may not capture all of the dependence between time and value; part of that dependence remains in the errors (the difference between the data and the trend).
Assume we have an estimate $\tilde{Y}$ of the data $Y$, with the error defined as $$ \varepsilon = Y - \tilde{Y}. $$ Our supposition is that there exists a $\tau$ such that $$ \varepsilon_{t + \tau} = \delta_{t + \tau} + e_{t + \tau}, $$ where $e$ is pure noise and $\delta$ is a part that depends linearly on the previous observations: $$ \delta_{t + \tau} = a_0 + \sum_{i = 0}^{\tau - 1} \delta_{t + i} \cdot a_{i + 1} $$
Let's take a look at $\varepsilon$
In [18]:
plt.plot(t_without_anomalies, errors)
plt.title(r'$Y$ smoothing errors $\varepsilon$')
plt.ylabel('Value')
plt.xlabel('Time')
plt.show()
To find $\tau$, the lag, we need to calculate the autocorrelation for several values of the lag.
In [19]:
max_lag = errors.size - 1
autocorrelation = array([Series(errors).autocorr(i) for i in range(max_lag)])
plt.plot(list(range(max_lag)), abs(autocorrelation), label='Absolute value')
plt.plot(list(range(max_lag)), autocorrelation, label='Raw value')
plt.title('Autocorrelation with specific lags')
plt.xlabel('Lag')
plt.ylabel('Autocorrelation')
plt.legend()
plt.show()
The next step is to choose the lag and estimate the coefficients $a_i$ by the least squares method, as sketched below.
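get_errors_estimate is hidden in common as well; a plausible least-squares implementation, together with a one-step variant of get_errors_forecast (In [28] below calls get_errors_forecast with extra arguments for in-sample multi-step forecasts, so the real helper is richer than this sketch):
def get_errors_estimate(errors, lag):
    # Design matrix: a column of ones for a_0 plus `lag` lagged error values per row.
    X = array([concatenate(([1.], errors[i:i + lag])) for i in range(errors.size - lag)])
    coefficients = lstsq(X, errors[lag:])[0]
    autoregressors = X.dot(coefficients)  # the linearly dependent part, delta
    return coefficients, autoregressors

def get_errors_forecast(errors, coefficients):
    # One-step forecast: a_0 + sum_i a_{i+1} * (last `lag` errors)
    lag = coefficients.size - 1
    return coefficients[0] + errors[-lag:].dot(coefficients[1:])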
In [20]:
lag = 2
coefficients, autoregressors = get_errors_estimate(errors, lag)
Now we have three pieces: the difference between the source data and the trend, the part of it recovered by the autoregressive analysis, and the noise.
In [21]:
plt.plot(errors[lag:], label='Difference between source and trend')
plt.plot(autoregressors, label='Dependent part of difference')
plt.plot(errors[lag:] - autoregressors, label='Noise')
plt.legend()
plt.show()
Now we are ready to complete our model with the autoregressive component.
In [22]:
smoothed_y_without_anomalies = data_smoother(y_without_anomalies, span=best_span)[lag:] + autoregressors
errors = y_without_anomalies[lag:] - smoothed_y_without_anomalies
plt.plot(t_without_anomalies, y_without_anomalies, label='Source')
plt.plot(t_without_anomalies[lag:], smoothed_y_without_anomalies, label=r'Smoothed, $\sigma=%2.1f$'%(
errors.std()))
plt.title('$Y$ smoothing')
plt.ylabel('Value')
plt.xlabel('Time')
plt.legend()
plt.show()
The trend line we have can be forecast $L$ steps ahead by the formula $$ Y_{t + L} = \alpha_{0 t} + \alpha_{1 t} \cdot L + \frac{1}{2} \cdot \alpha_{2 t} \cdot L^2, $$ where $$ \begin{cases} \alpha_{0 t} = 3 \cdot \left( S_t^1 - S_t^2 \right) + S_t^3, \\ \alpha_{1 t} = \frac{\alpha}{2 \cdot \beta^2} \cdot \left[ \left( 6 - 5 \cdot \alpha \right) \cdot S_t^1 - 2 \cdot \left( 5 - 4 \cdot \alpha \right) \cdot S_t^2 + \left( 4 - 3 \cdot \alpha \right) \cdot S_t^3 \right], \\ \alpha_{2 t} = \frac{\alpha^2}{\beta^2} \cdot \left( S_t^1 - 2 \cdot S_t^2 + S_t^3 \right) \end{cases} $$ and, with $\beta = 1 - \alpha$, $$ \begin{cases} S_t^1 = \alpha \cdot Y_t + \beta \cdot S_{t-1}^1 = \alpha \cdot \sum\limits_{i=0}^{t-1} \beta^i \cdot Y_{t-i} + \beta^t \cdot Y_0, \\ S_t^2 = \alpha \cdot S_t^1 + \beta \cdot S_{t-1}^2, \\ S_t^3 = \alpha \cdot S_t^2 + \beta \cdot S_{t-1}^3. \end{cases} $$
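The helpers get_S, get_a0, get_a1, get_a2 and forecast used below come from common; these sketches simply transcribe the formulas above (the seeding of $S$ with $Y_0$ and the default time index are assumptions):
def get_S(y, alpha):
    # S_t = alpha * y_t + (1 - alpha) * S_{t-1}, seeded with y_0
    S = [y[0]]
    for value in y[1:]:
        S.append(alpha * value + (1 - alpha) * S[-1])
    return array(S)

def get_a0(S1, S2, S3, alpha):
    return 3 * (S1 - S2) + S3

def get_a1(S1, S2, S3, alpha):
    beta = 1 - alpha
    return alpha / (2 * beta ** 2) * (
        (6 - 5 * alpha) * S1 - 2 * (5 - 4 * alpha) * S2 + (4 - 3 * alpha) * S3)

def get_a2(S1, S2, S3, alpha):
    beta = 1 - alpha
    return alpha ** 2 / beta ** 2 * (S1 - 2 * S2 + S3)

def forecast(a0, a1, a2, L, time=-1):
    # Y_{time + L}; by default the forecast is made from the last point
    return a0[time] + a1[time] * L + .5 * a2[time] * L ** 2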
In [23]:
alpha = 2. / (span + 1)
In [24]:
S1 = get_S(trend, alpha)
S2 = get_S(S1, alpha)
S3 = get_S(S2, alpha)
In [25]:
a0 = get_a0(S1, S2, S3, alpha)
a1 = get_a1(S1, S2, S3, alpha)
a2 = get_a2(S1, S2, S3, alpha)
In [26]:
plt.plot(t_without_anomalies, y_without_anomalies, label='Source')
plt.plot(t_without_anomalies, trend, label='Trend')
plt.plot(t_without_anomalies, [forecast(a0, a1, a2, 1, time) for time in t_without_anomalies], label='Trend with one step forecast')
plt.plot(t_without_anomalies, [forecast(a0, a1, a2, 2, time) for time in t_without_anomalies], label='Trend with two steps forecast')
plt.plot(t_without_anomalies, [forecast(a0, a1, a2, 3, time) for time in t_without_anomalies], label='Trend with three steps forecast')
plt.title('$Y$ trend inline forecast')
plt.ylabel('Value')
plt.xlabel('Time')
plt.legend()
plt.show()
In [27]:
trend_forecasted = concatenate((trend, [forecast(a0, a1, a2, steps) for steps in range(1, 4)]))
plt.plot(t_without_anomalies, y_without_anomalies, label='Source')
time_forecasted = concatenate((t_without_anomalies, list(range(t_without_anomalies.size, t_without_anomalies.size + 3))))
plt.plot(time_forecasted, trend_forecasted, label='Trend with three steps forecast')
plt.title('$Y$ trend forecast')
plt.ylabel('Value')
plt.xlabel('Time')
plt.legend()
plt.show()
In [28]:
plt.plot(errors[lag:], label='Difference between source and trend')
plt.plot(autoregressors, label='Dependent part of difference')
for forecast_steps in (1, 2, 3):
    autoregressors_forecast = [get_errors_forecast(errors, coefficients, forecast_steps, time)
                               for time in range(forecast_steps, autoregressors.size)]
    plt.plot(concatenate((autoregressors[:forecast_steps], autoregressors_forecast)),
             label='Errors with %d steps forecast'%forecast_steps)
plt.legend()
plt.show()
In [29]:
errors_forecasted = autoregressors.copy()
for forecast_steps in (1, 2, 3):
    errors_forecasted = concatenate((errors_forecasted, [get_errors_forecast(errors_forecasted, coefficients)]))
In [30]:
# the first `lag` trend points have no autoregressive correction
source_series_forecasted = concatenate((trend_forecasted[:lag], trend_forecasted[lag:] + errors_forecasted))
plt.plot(t_without_anomalies, y_without_anomalies, label='Source')
plt.plot(time_forecasted, source_series_forecasted, label='Smoothed with three-step forecast')
plt.legend()
plt.show()