Basic statistical analysis of time series

This example demonstrates basic statistical analysis of a time series.


In [1]:
import numpy as np
from scipy import stats

import matplotlib.pyplot as plt
%matplotlib inline

import evapy

Generating a Gaussian stochastic signal

Let's start by generating a Gaussian time series signal. The duration is set to 3 hours and the sampling rate is 10 Hz.


In [2]:
t = np.arange(0., 3*3600., 0.1)

The signal will be generated using a rectangular power spectrum between periods of 5 and 15 seconds. It is important to use a sufficiently fine frequency discretization to obtain a non-repeating approximation.


In [3]:
start_period = 15.
stop_period = 5.

discrete_size = 1000

In [4]:
amp_norm = stats.norm.rvs(size=discrete_size)
phase = stats.uniform.rvs(size=discrete_size)*2.*np.pi

signal_mean = 0.

In [5]:
freq = np.linspace(2.*np.pi*1./start_period, 2.*np.pi*1./stop_period, num=discrete_size)

Generate the signal by superposing a large number of cosine functions with stochastic amplitudes and phases.


In [6]:
signal = np.zeros(len(t)) + signal_mean

for i, freq_i in enumerate(freq):
    signal = signal + amp_norm[i]*np.cos(freq_i*t + phase[i])
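
The same signal can be generated without the Python loop by broadcasting over a (discrete_size, len(t)) cosine matrix. A minimal vectorized sketch; note that it builds temporary arrays of roughly 1 GB for this record length, so the loop is more memory-friendly:


In [ ]:
# Vectorized equivalent of the loop above (memory-hungry: builds
# discrete_size x len(t) intermediate arrays).
signal_vec = signal_mean + np.sum(
    amp_norm[:, None]*np.cos(freq[:, None]*t[None, :] + phase[:, None]), axis=0)

np.allclose(signal, signal_vec)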

In [7]:
plt.figure('Snapshot of signal', figsize=(16,9))
plt.title('Snapshot of the signal')
plt.plot(t[1500:2000], signal[1500:2000])
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.grid()
plt.show()


Statistical analysis of the signal

Let's see if the signal resembles a Gaussian signal with zero mean.


In [10]:
descr_stats = stats.describe(signal)

print('mean: {}'.format(descr_stats.mean))
print('std: {}'.format(np.sqrt(descr_stats.variance)))
print('skewness: {}'.format(descr_stats.skewness))


mean: -0.0003558229724597345
std: 22.36264071482929
skewness: 0.001286657211675139
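
These estimates agree with theory: each cosine component with amplitude a_i ~ N(0, 1) and a uniform random phase contributes a variance of E[a_i^2]/2 = 1/2, so the total variance is approximately discrete_size/2 = 500.


In [ ]:
# Theoretical standard deviation of the superposition: sqrt(N/2).
print('theoretical std: {}'.format(np.sqrt(discrete_size/2.)))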

Estimate the normal distribution parameters from the data and plot the normal PDF together with a histogram of the data.


In [11]:
params = stats.norm.fit(signal)
signal_dist = stats.norm(*params)

In [12]:
x = np.linspace(-100, 100, num=200)

In [14]:
plt.figure('Signal histogram', figsize=(16,9))
plt.title('Signal histogram and Normal PDF')
plt.plot(x, signal_dist.pdf(x), color='r')
plt.hist(signal, bins=50, density=True)
plt.show()
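

A normal Q-Q plot is another quick check; for a Gaussian signal, the points fall on a straight line.


In [ ]:
plt.figure('Normal Q-Q plot', figsize=(16,9))
stats.probplot(signal, dist='norm', plot=plt)
plt.show()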


It is clear that the signal is indeed Gaussian!

Extreme value analysis

Finding the peaks

The evapy package provides robust and fast methods to identify upcrossings and peaks. Declustered peaks refer to the largest peak per upcrossing; thus, for a narrow-banded signal, all peaks are declustered. In this example, declustering is performed per mean-upcrossing.


In [15]:
peak_index = evapy.evstats.argrelmax(signal)
peak = signal[peak_index]

peak_dc_index = evapy.evstats.argrelmax_decluster(signal, x_up=np.mean(signal))
peak_dc = signal[peak_dc_index]
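
To make the declustering idea concrete, here is a minimal plain-NumPy sketch of the same operation (an illustration only; evapy's actual implementation may differ): find the mean-upcrossing indices, then keep the largest peak within each upcrossing segment.


In [ ]:
# Illustrative declustering sketch (not evapy's implementation):
# indices where the signal crosses its mean from below ...
m = np.mean(signal)
up = np.where((signal[:-1] < m) & (signal[1:] >= m))[0] + 1
# ... and, within each upcrossing segment, the index of the largest value.
segments = np.split(np.arange(len(signal)), up)[1:]
dc_sketch = np.array([seg[np.argmax(signal[seg])] for seg in segments])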

In [16]:
plt.figure('Finding the peaks', figsize=(16,9))
plt.title('Finding the peaks')
plt.plot(t, signal, 'k-', label='signal')
plt.plot(t[peak_dc_index], peak_dc, 'r^', markersize=12, label='peak declustered')
plt.plot(t[peak_index], peak, 'bo', label='peak')
plt.xlabel('Time [s]')
plt.ylabel('Amplitude')
plt.xlim(100,200)
plt.grid()
plt.legend()
plt.show()


Counting upcrossings

Let's find the number of upcrossings for different levels using the evapy package.


In [17]:
uc_0 = evapy.evstats.argupcross(signal, x_up=np.mean(signal))

print('number of upcrossings above the mean: {}'.format(len(uc_0)))


number of upcrossings above the mean: 1514

Since the signal is 10800 seconds long, this means that Tz (the mean zero-crossing period) of the signal is


In [18]:
t[-1]/len(uc_0)


Out[18]:
7.133355350066052
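
This can be compared with the theoretical mean zero-crossing period of a Gaussian process, Tz = 2*pi*sqrt(m0/m2), computed from the spectral moments of the rectangular spectrum. With a flat spectral density on [w1, w2], m0 is proportional to (w2 - w1) and m2 to (w2**3 - w1**3)/3, which gives about 7.2 s here, close to the estimate above:


In [ ]:
# Theoretical Tz for the rectangular spectrum used above.
w1 = 2.*np.pi/start_period
w2 = 2.*np.pi/stop_period
print('theoretical Tz: {}'.format(2.*np.pi*np.sqrt((w2 - w1)/((w2**3 - w1**3)/3.))))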

Now, let's find the upcrossing rate at the mean+std level.


In [19]:
uc_1 = evapy.evstats.argupcross(signal, x_up=np.mean(signal)+np.std(signal))

In [20]:
print('number of upcrossings above the mean+std: {}'.format(len(uc_1)))


number of upcrossings above the mean+std: 933

The upcrossing rate is then:


In [21]:
len(uc_1)/t[-1]


Out[21]:
0.08638968879341474
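
For a Gaussian process, Rice's formula predicts that the upcrossing rate decays as nu(x) = nu_0*exp(-(x - m)**2/(2*sigma**2)), so the rate at one standard deviation above the mean should be about exp(-1/2) ~ 0.61 times the mean-upcrossing rate:


In [ ]:
# Rice's formula check: predicted vs. observed rate at mean + std.
nu_0 = len(uc_0)/t[-1]
print('predicted rate: {}'.format(nu_0*np.exp(-0.5)))
print('observed rate: {}'.format(len(uc_1)/t[-1]))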

Statistical analysis of peaks

Now that we have established the peaks, let's investigate their distribution using the different distribution functions the evapy package offers.


In [22]:
peak_index = evapy.evstats.argrelmax(signal)
peak = signal[peak_index]

peak_dc_index = evapy.evstats.argrelmax_decluster(signal, x_up=np.mean(signal))
peak_dc = signal[peak_dc_index]

r = np.linspace(-10, 100, num=100)

Using the Rayleigh distribution:


In [23]:
params_ray = evapy.distributions.rayleigh.fit(peak_dc)
peak_dc_ray = evapy.distributions.rayleigh(*params_ray)

In [24]:
print('location: {}'.format(params_ray[0]))
print('scale: {}'.format(params_ray[1]))


location: -1.6435433313514798
scale: 23.33482891334915
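
For a narrow-banded Gaussian process, the declustered peaks are expected to be approximately Rayleigh distributed with a scale parameter equal to the standard deviation of the signal, so the fitted scale of about 23.3 should be close to the signal std:


In [ ]:
print('signal std: {}'.format(np.std(signal)))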

In [26]:
plt.figure('Peaks histogram and Rayleigh distribution', figsize=(16,9))
plt.plot(r, peak_dc_ray.pdf(r), color='r')
plt.hist(peak_dc, bins=25, density=True)
plt.show()


Using the Weibull distribution:

Note: Sometimes, the estimation algorithm fails if the starting value for the scale parameter is too far off. This is a known issue and we are working on it. In the meantime, you may provide a reasonable starting value for the scale parameter.


In [27]:
params_wb = evapy.distributions.weibull.fit(peak_dc, scale=20.)
peak_dc_wb = evapy.distributions.weibull(*params_wb)

In [28]:
print('shape: {}'.format(params_wb[0]))
print('location: {}'.format(params_wb[1]))
print('scale: {}'.format(params_wb[2]))


shape: 2.064018486408185
location: -2.102492679918142
scale: 33.617562038180125

The shape parameter should be close to 2, since the Rayleigh distribution is a Weibull distribution with shape parameter 2. Sometimes the estimation algorithm fails and diverges for the 3-parameter Weibull when using MLE.

Let's see how the histogram looks.


In [30]:
plt.figure('Peaks histogram and Weibull distribution', figsize=(16,9))
plt.plot(r, peak_dc_wb.pdf(r), color='r')
plt.hist(peak_dc, bins=25, density=True)
plt.show()


Often, it helps to provide a good starting value for the scale parameter, or to lock the location parameter based on prior knowledge. In our case, the signal mean is a reasonable choice.


In [31]:
params_wb = evapy.distributions.weibull.fit(peak_dc, floc=np.mean(signal))
peak_dc_wb = evapy.distributions.weibull(*params_wb)

In [32]:
print('shape: {}'.format(params_wb[0]))
print('location: {}'.format(params_wb[1]))
print('scale: {}'.format(params_wb[2]))


shape: 1.8455102728096229
location: -0.0003558229724597345
scale: 31.01194237617016

In [34]:
plt.figure('Peaks histogram and Weibull distribution', figsize=(16,9))
plt.plot(r, peak_dc_wb.pdf(r), color='r')
plt.hist(peak_dc, bins=25, density=True)
plt.show()


Using the generalized exponential tail distribution:

Note that the generalized exponential tail distribution is particularly suitable for truncated data. Let's truncate our peak data at 35.


In [35]:
peak_dc_trunc = peak_dc[peak_dc>=35.]

We have now discarded most of our data:


In [36]:
print('original data size: {}'.format(len(peak_dc)))
print('truncated data size: {}'.format(len(peak_dc_trunc)))


original data size: 1514
truncated data size: 468

WARNING: At this point, the estimation algorithm for the generalized exponential tail distribution is very unstable. It is recommended to lock the location parameter and provide a good starting value for the scale parameter. We are working on it!


In [37]:
params_get = evapy.distributions.genexptail.fit(peak_dc_trunc, scale=20., floc=np.mean(signal))
peak_dc_get = evapy.distributions.genexptail(*params_get)

In [38]:
print('shape: {}'.format(params_get[0]))
print('q: {}'.format(params_get[1]))
print('location: {}'.format(params_get[2]))
print('scale: {}'.format(params_get[3]))


shape: 1.672369041261915
q: 6.055141376708667
location: -0.0003558229724597345
scale: 25.014685329726746

In [40]:
plt.figure('Peaks histogram and generalized exponential tail distribution', figsize=(16,9))
plt.plot(r, peak_dc_get.pdf(r), color='r')
plt.hist(peak_dc_trunc, bins=15, density=True)
plt.show()


Estimating the largest maximum

We can use the relationship between the peak distributions and the maxima distributions to establish an estimate of the largest maximum. First, let's see what the largest observed value in our signal is.


In [41]:
max_obs = np.max(signal)
print('largest maxima: {}'.format(max_obs))


largest maxima: 91.11669792144568

Using Rayleigh:


In [42]:
rloc, rscale = evapy.distributions.rayleigh.fit(peak_dc, floc=0.)
rN = len(peak_dc)

In [43]:
maxima_ray_dist = evapy.distributions.acer_o1(2., rN, loc=rloc, scale=np.sqrt(2.)*rscale)

In [44]:
median_maxima = maxima_ray_dist.ppf(0.5)
print('median maxima: {}'.format(median_maxima))


median maxima: 87.47910129889922
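
As a rough cross-check using scipy only: if the N declustered peaks are treated as independent with CDF F, the largest of them has CDF F(x)**N, so its median is the quantile F^-1(0.5**(1/N)). This sketch assumes independent peaks and should land in the same range as the estimate above:


In [ ]:
# Median of the largest of N iid peaks: F^{-1}(0.5**(1/N)).
loc, scale = stats.rayleigh.fit(peak_dc, floc=0.)
print('median maxima (iid check): {}'.format(
    stats.rayleigh.ppf(0.5**(1./len(peak_dc)), loc=loc, scale=scale)))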

Using Weibull:


In [45]:
wshape, wloc, wscale = evapy.distributions.weibull.fit(peak_dc, floc=0.)
wN = len(peak_dc)

In [46]:
maxima_wb_dist = evapy.distributions.acer_o1(wshape, wN, loc=wloc, scale=wscale)

In [47]:
median_maxima = maxima_wb_dist.ppf(0.5)
print('median maxima: {}'.format(median_maxima))


median maxima: 93.65923694847491

Using generalized exponential tail:


In [48]:
gshape, qn, gloc, gscale = evapy.distributions.genexptail.fit(peak_dc_trunc, floc=0., scale=20.)
gN = len(peak_dc_trunc)

In [49]:
maxima_get_dist = evapy.distributions.acer_o1(gshape, gN*qn, loc=gloc, scale=gscale)

In [50]:
median_maxima = maxima_get_dist.ppf(0.5)
print('median maxima: {}'.format(median_maxima))


median maxima: 88.76801208798118
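
All three estimates of the median largest maximum (87.5, 93.7 and 88.8) are in good agreement with the largest observed value of 91.1.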
