ARIMA Grid Search

I found a tutorial and I don't like it. So, finally decided to find an easier solution to do grid search for ARIMA, and to add cross validation in a way that's easier to understand.

Download the data here (no need login): https://datamarket.com/data/set/22r0/sales-of-shampoo-over-a-three-year-period#!ds=22r0&display=line



In [6]:

    
import pandas as pd
import numpy as np
import statsmodels.api as sm
import itertools



In [2]:

    
series = pd.read_csv('shampoo_sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
series.head()









    Out[2]:





Month
1-01    266.0
1-02    145.9
1-03    183.1
1-04    119.3
1-05    180.3
Name: Sales of shampoo over a three year period, dtype: float64



In [3]:

    
def parser(x):
    return pd.datetime.strptime('190'+x, '%Y-%m')

series = pd.read_csv('shampoo_sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
series.head()









    Out[3]:





Month
1901-01-01    266.0
1901-02-01    145.9
1901-03-01    183.1
1901-04-01    119.3
1901-05-01    180.3
Name: Sales of shampoo over a three year period, dtype: float64



In [4]:

    
series.values









    Out[4]:





array([266. , 145.9, 183.1, 119.3, 180.3, 168.5, 231.8, 224.5, 192.8,
       122.9, 336.5, 185.9, 194.3, 149.5, 210.1, 273.3, 191.4, 287. ,
       226. , 303.6, 289.9, 421.6, 264.5, 342.3, 339.7, 440.4, 315.9,
       439.3, 401.3, 437.4, 575.5, 407.6, 682. , 475.3, 581.3, 646.9])



In [9]:

    
# Method 1 - Without cross validation, just simply grid search for ARIMA
p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))

bestAIC = np.inf
bestParam = None

# grid search
for param in pdq:
    try:
        model = sm.tsa.statespace.SARIMAX(series.values,
                                          order=param,
                                          enforce_stationarity=False,
                                          enforce_invertibility=False)
        results = model.fit()
        #if current run of AIC is better than the best one so far, overwrite it
        if results.aic < bestAIC:
            bestAIC = results.aic
            bestParam = param
    except:
        pass
    
print('the best AIC and order are:',bestAIC,bestParam)









    



the best AIC and order are: 355.5008750421533 (1, 2, 2)



In [8]:

    
# Method 2 - Add cross validation with grid search for ARIMA
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=10)
for train_index, test_index in tscv.split(series.values):
    print("TRAIN:", train_index, "TEST:", test_index)









    



TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [ 9 10 11]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12 13 14]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] TEST: [15 16 17]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17] TEST: [18 19 20]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] TEST: [21 22 23]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] TEST: [24 25 26]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26] TEST: [27 28 29]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29] TEST: [30 31 32]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32] TEST: [33 34 35]



In [10]:

    
import warnings
warnings.filterwarnings("ignore")

p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))

bestAIC = np.inf
best_cvAIC = np.inf
bestParam = None
best_cvParam = None

X = series.values

# grid search
for param in pdq:
    for train_index, test_index in tscv.split(series.values):
        X_train, X_test = X[train_index], X[test_index]
        try:
            model = sm.tsa.statespace.SARIMAX(X_train,
                                              order=param,
                                              enforce_stationarity=False,
                                              enforce_invertibility=False)
            results = model.fit()
            #if current run of AIC is better than the best one so far, overwrite it
            if results.aic < bestAIC:
                bestAIC = results.aic
                bestParam = param

            # apply the best order on testig data
            best_model = sm.tsa.statespace.SARIMAX(X_test,
                                              order=bestParam,
                                              enforce_stationarity=False,
                                              enforce_invertibility=False)
            results = best_smodel.fit()
            if results.aic < best_cvAIC:
                best_cvAIC = results.aic
                best_cvParam = bestParam
        except:
            pass
    
print('the best AIC, the best CV AIC and order are:',bestAIC,best_cvAIC,bestParam)









    



the best AIC, the best CV AIC and order are: 33.19568230991611 inf (0, 1, 1)

References:

https://machinelearningmastery.com/grid-search-arima-hyperparameters-with-python/
- I don't like this one, it implemented everything while there are libraries to use. I think the beauty of python is you have many existing libraries to use, why build your own when there are better choices? With so much time you can do many other things...
https://www.kaggle.com/sbongo/another-look-at-forecasting-gridsearch-arima
- This is a much better one
https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html
- About SARIMAX model
- About statesapce: https://www.statsmodels.org/dev/statespace.html