Data loading and visualization

Data loading

We begin by importing the necessary libraries for data visualization, time series analysis, and data frame manipulation.


In [283]:
from scipy import signal
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
%matplotlib inline
import pandas as pd
from datetime import datetime
from statsmodels.tsa.stattools import adfuller, acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA
import statsmodels.api as smapi
import statsmodels as sm

Then, we load the data into a data frame, using $\texttt{timestamp}$ as the index and $\texttt{Consommation}$ as the label.


In [284]:
'''Data loading'''
f_names_national = [
    '2012 Conso Prod.csv',
    '2013 Conso Prod.csv',
    '2014 Conso Prod.csv',
    '2015 Conso Prod.csv'
]

pd.set_option('max_columns', 100)

datas = []
for f_name in f_names_national:
    data = pd.read_csv('data/' + f_name, delimiter='\t', encoding="ISO-8859-1", low_memory=False)
    # keep only the rows that have both a consumption value and a date
    data = data[data.Consommation.notnull()]
    data = data[data.Date.notnull()]
    # build a single datetime column from the Date and Heures columns
    data['timestamp'] = [str(d) + ' ' + str(t) for d, t in zip(data['Date'].values, data['Heures'].values)]
    data['timestamp'] = pd.to_datetime(data['timestamp'], format='%Y-%m-%d %H:%M')
    datas.append(data)

data_final = pd.concat(datas).reset_index()
data_final = data_final.sort_values(by=['Date', 'Heures'], ascending=[False, False])

# half-hourly consumption series indexed by timestamp
ts = pd.Series(data_final['Consommation'].values, index=data_final['timestamp'].values)

Data visualization

We then visualize the data over the full range provided, to get a sense of what type of model we should use.


In [285]:
fig, ax = plt.subplots(figsize=(20,5))
ax.plot(ts[::10].index, ts[::10])  # plot every 10th point to keep the figure light
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.set_title(" Electricity consumption in France in MW")
# plt.savefig('visualization.png')


Out[285]:
<matplotlib.text.Text at 0xc0cf1d0>

As we can see, there is a lot of variation in the plot, so it is more informative to visualize the rolling mean, which is smoother while still representing the data well.

If we take 48 points (one day, since measurements are taken every 30 minutes) as the window length, we obtain the following.


In [286]:
fig, ax = plt.subplots(figsize=(20,5))
rolling_mean = ts.rolling(center=False,window=48).mean()
ax.plot(rolling_mean.index, rolling_mean)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.set_title(" Electricity consumption in France in MW")


Out[286]:
<matplotlib.text.Text at 0xbdc5908>

The figure still shows a lot of variation, so we try a larger window of 1 week (7*48 points),


In [287]:
fig, ax = plt.subplots(figsize=(20,5))
rolling_mean = ts.rolling(center=False,window=48*7).mean()
ax.plot(rolling_mean.index, rolling_mean)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.set_title(" Electricity consumption in France in MW")


Out[287]:
<matplotlib.text.Text at 0xc18b748>

and then 1 month (30*48 points).


In [288]:
fig, ax = plt.subplots(figsize=(20,5))
rolling_mean = ts.rolling(center=False,window=48*30).mean()
ax.plot(rolling_mean.index, rolling_mean)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
ax.set_title(" Electricity consumption in France in MW")


Out[288]:
<matplotlib.text.Text at 0xc239b00>

Stationarization

Getting the period

With a 1-month window, we get a smooth curve, which lets us identify the periodicity of our time series. ARMA models are not suited to modeling periodic time series, so we need to preprocess the data by removing this periodicity. Let's first find the dominant frequency of the data.


In [289]:
rolling_mean = rolling_mean.dropna()
fig, ax = plt.subplots(figsize=(20,5))
# standardize the smoothed series; fs is the sampling frequency in Hz (one point every 30 minutes)
f, PSD = signal.periodogram((rolling_mean.values - np.mean(rolling_mean.values)) / np.std(rolling_mean.values), fs=1/(30*60))
ax.semilogy(f, PSD)
plt.xlabel('frequency [Hz]')
plt.ylabel('PSD [V**2/Hz]')
plt.ylim([1e-2, 1e9])
plt.show()



In [290]:
fig, ax = plt.subplots(figsize=(20,5))
f, PSD = signal.periodogram((rolling_mean.values-np.mean(rolling_mean.values))/np.std(rolling_mean.values), fs = 1/(30*60))
ax.plot(f, PSD)
plt.xlabel('frequency [Hz]')
plt.ylabel('PSD [V**2/Hz]')
plt.xlim([0, 0.0000001])
plt.show()
freq_Hz = f[PSD.argmax(axis=0)]  # dominant frequency in Hz
print(freq_Hz)
print(freq_Hz*60*60*24*365.25)   # number of cycles per year
print(1/(freq_Hz*60*30))         # period expressed in number of 30-minute samples


3.23519373149e-08
1.02094949701
17172.25

In [291]:
rolling_mean.values.shape


Out[291]:
(68689,)

The PSD (Power Spectral Density) detects the dominant frequency, which is about $3.23 \times 10^{-8}$ Hz; multiplied by the number of seconds in a year, this gives $\approx 1.02$, i.e. roughly one cycle per year. In terms of number of points, the period is $\texttt{period} = \frac{1}{3.23 \times 10^{-8} \times 60 \times 30} \approx 17172$. As a manual check, our range contains approximately 4 periods, so the period would be the range divided by 4, i.e. $\texttt{period} = \frac{70000}{4} = 17500$, which turns out to be a good approximation.
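
The same conversion can be done directly in code as a quick sanity check (a minimal sketch, reusing $\texttt{freq\_Hz}$ and the 30-minute sampling step from the cells above):

dt = 30 * 60                                        # sampling step in seconds
period_samples = 1 / (freq_Hz * dt)                 # ~17172 points per period
cycles_per_year = freq_Hz * 60 * 60 * 24 * 365.25   # ~1.02, i.e. roughly one cycle per year
print(period_samples, cycles_per_year)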

Seasonal decomposition

We can try to decompose our data into trend, seasonality and residuals. Let's first split it into training and test sets.


In [292]:
ts_train = data_final[data_final['Date'] <= '2015-10-31']
ts_train = pd.Series(ts_train['Consommation'].values, index = ts_train['timestamp'].values )
plt.plot(ts_train)


Out[292]:
[<matplotlib.lines.Line2D at 0x1d637f98>]

In [293]:
ts_test = data_final[data_final['Date'] > '2015-10-31']
ts_test=pd.Series(ts_test['Consommation'].values, index = ts_test['timestamp'].values )
plt.plot(ts_test)


Out[293]:
[<matplotlib.lines.Line2D at 0x311b34e0>]

We then perform the decomposition


In [294]:
seasonal_dec = seasonal_decompose(ts_train.values, freq=17172)  # period found above (~1 year of 30-minute samples)

data_trend = seasonal_dec.trend
data_seasonal = seasonal_dec.seasonal
data_residual = seasonal_dec.resid

plt.subplot(221)
plt.plot(ts_train, label='Original')
plt.legend(loc='best')
plt.subplot(222)
plt.plot(data_trend, label='Trend')
plt.legend(loc='best')
plt.subplot(223)
plt.plot(data_seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(224)
plt.plot(data_residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
np.mean(seasonal_dec.trend[np.logical_not(np.isnan(seasonal_dec.trend))])


Out[294]:
54655.856447632519

As we can see, the residuals have an amplitude of about 15000 MW (roughly a quarter of the mean value of the series), and the seasonal component has an amplitude of about 25000 MW. The trend is nearly constant (around 55000 MW). The main task is therefore the prediction of the residuals, which look like white noise. We could simply generate white noise with the same characteristics (mean and variance), but fitting an ARIMA model to forecast these residuals can give better results.

To be more confident about the stationarity of the data, we can use the (augmented) Dickey-Fuller statistical test. We therefore define a function that tests the stationarity of a time series.


In [295]:
def test_stationarity(ts, window):
    '''Plot rolling statistics of ts and run the augmented Dickey-Fuller test.'''

    orig = plt.plot(ts, color='blue', label='Original')

    rolling_mean = ts.rolling(center=False, window=window).mean()
    rolling_std = ts.rolling(center=False, window=window).std()

    mean = plt.plot(rolling_mean, color='red', label='Rolling Mean')
    std = plt.plot(rolling_std, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(ts, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

In [296]:
test_stationarity(ts_train, 48*30)


Results of Dickey-Fuller Test:
Test Statistic                -7.956634e+00
p-value                        3.025025e-12
#Lags Used                     6.200000e+01
Number of Observations Used    6.713700e+04
Critical Value (10%)          -2.566793e+00
Critical Value (5%)           -2.861583e+00
Critical Value (1%)           -3.430447e+00
dtype: float64

The results of the Dickey-Fuller test confirm the observation that the data are stationary (the test statistic $-7.95$ is well below the 1% critical value $-3.43$). We can still difference the series to see what effect that has on stationarity.


In [297]:
ts_train_diff = ts_train - ts_train.shift()  # first difference of the training series
plt.plot(ts_train_diff)
ts_train_diff.dropna(inplace=True)
test_stationarity(ts_train_diff, 48*30)


Results of Dickey-Fuller Test:
Test Statistic                   -32.912833
p-value                            0.000000
#Lags Used                        62.000000
Number of Observations Used    67136.000000
Critical Value (10%)              -2.566793
Critical Value (5%)               -2.861583
Critical Value (1%)               -3.430447
dtype: float64

By differencing the data once, the series becomes even more stationary (the test statistic drops from about $-8$ to $-33$).


In [298]:
ts_train_diff2 = ts_train_diff - ts_train_diff.shift()
ts_train_diff2.dropna(inplace=True)
test_stationarity(ts_train_diff2, 48*30)


Results of Dickey-Fuller Test:
Test Statistic                   -59.581937
p-value                            0.000000
#Lags Used                        62.000000
Number of Observations Used    67135.000000
Critical Value (10%)              -2.566793
Critical Value (5%)               -2.861583
Critical Value (1%)               -3.430447
dtype: float64

We notice that the more we difference the data, the more stationary it becomes.

Since the original series is already stationary, heavy differencing is not needed; in the ARIMA models below we simply work on the first difference, i.e. we fix the degree of differencing d to 1 and fit the ARMA part to the once-differenced series.

Autocorrelation and Partial Autocorrelation

We focus now on finding the parameters of the ARMA model.


In [302]:
# drop the NaNs introduced at the edges by the decomposition, then compute the residual statistics
data_residual = data_residual[np.logical_not(np.isnan(data_residual))]
mn = np.mean(data_residual)
stdd = np.std(data_residual)
print(mn)
print(stdd)


185.888727454
5960.50100603

In [303]:
# white noise with the same mean and standard deviation as the residuals, for visual comparison
samples = np.random.normal(mn, stdd, size=54000)
plt.plot(samples)


Out[303]:
[<matplotlib.lines.Line2D at 0xc12ed30>]

We now try to model the first difference with an ARMA model, choosing the AR and MA orders from the autocorrelation and partial autocorrelation functions.


In [304]:
acf_lag = acf(ts_train_diff, nlags=20)
pacf_lag = pacf(ts_train_diff, nlags=20, method='ols')

In [305]:
plt.subplot(211) 
plt.plot(acf_lag)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_train_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_train_diff)),linestyle='--',color='gray')
plt.title('Autocorrelation Function')

plt.subplot(212)
plt.plot(pacf_lag)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-1.96/np.sqrt(len(ts_train_diff)),linestyle='--',color='gray')
plt.axhline(y=1.96/np.sqrt(len(ts_train_diff)),linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function')
plt.tight_layout()

#plt.savefig('auto_partial.png')


The order of the MA part appears to be 4, since that is the lag at which the autocorrelation function starts to become negative.

The order of the AR part appears to be 2 (the same argument, applied to the partial autocorrelation function instead of the autocorrelation function).
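
As a cross-check, statsmodels also provides ready-made ACF/PACF plots with confidence bands (a minimal sketch, assuming the same $\texttt{ts\_train\_diff}$ as above; $\texttt{plot\_acf}$ and $\texttt{plot\_pacf}$ are not imported in the cells above):

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(12, 6))
plot_acf(ts_train_diff, lags=20, ax=axes[0])    # cut-off suggests the MA order q
plot_pacf(ts_train_diff, lags=20, ax=axes[1])   # cut-off suggests the AR order p
plt.tight_layout()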

Building the model

AR Model


In [306]:
model = ARIMA(ts_train, order=(2, 1, 0))  # AR(2) on the first difference
AR_results = model.fit(disp=-1)  
plt.plot(ts_train_diff)
plt.plot(AR_results.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((AR_results.fittedvalues-ts_train_diff)**2))


C:\Users\Lenovo\Anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:165: ValueWarning: No frequency information was provided, so inferred frequency -30T will be used.
  % freq, ValueWarning)
Out[306]:
<matplotlib.text.Text at 0x24808ba8>

MA Model


In [307]:
model = ARIMA(ts_train, order=(0, 1, 4))  # MA(4) on the first difference
MA_results = model.fit(disp=-1)  
plt.plot(ts_train_diff)
plt.plot(MA_results.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((MA_results.fittedvalues-ts_train_diff)**2))


C:\Users\Lenovo\Anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:165: ValueWarning: No frequency information was provided, so inferred frequency -30T will be used.
  % freq, ValueWarning)
Out[307]:
<matplotlib.text.Text at 0x247530f0>

ARMA model


In [308]:
model = ARIMA(ts_train, order=(2, 1, 4))  # ARMA(2, 4) on the first difference
ARIMA_results = model.fit(disp=-1)  
plt.plot(ts_train_diff)
plt.plot(ARIMA_results.fittedvalues, color='red')
sum((ARIMA_results.fittedvalues - ts_train_diff)**2) / len(ts_train_diff)  # mean squared error of the fit
#plt.savefig('ARMA.png')


C:\Users\Lenovo\Anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:165: ValueWarning: No frequency information was provided, so inferred frequency -30T will be used.
  % freq, ValueWarning)
Out[308]:
879913.6204776417

We notice that the fitted values have the same shape as the differenced series but not quite the right amplitude, so one option is to rescale them by a factor so that they match the minimum and maximum. To compare with the original (non-differenced) series, we plot the cumulative sum of the fitted values against the training data:


In [309]:
plt.plot(ts_train)
plt.plot(np.cumsum(ARIMA_results.fittedvalues), color='red')  # cumulative sum of the fitted differences, to compare with the original series (up to a constant offset)
#plt.savefig('cumsum.png')


Out[309]:
[<matplotlib.lines.Line2D at 0x31131438>]
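
The amplitude rescaling mentioned above is not shown in the cells; a minimal sketch could look like this (assuming $\texttt{ARIMA\_results}$ and $\texttt{ts\_train\_diff}$ from the cells above; the scaling factor is a simple illustrative choice, not a fitted parameter):

# rescale the fitted values so their range matches that of the differenced series
fitted = ARIMA_results.fittedvalues
scale = (ts_train_diff.max() - ts_train_diff.min()) / (fitted.max() - fitted.min())
plt.plot(ts_train_diff)
plt.plot(fitted * scale, color='red')
plt.title('Fitted values rescaled to the amplitude of the differenced series')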