I found a tutorial and I don't like it. So, finally decided to find an easier solution to do grid search for ARIMA, and to add cross validation in a way that's easier to understand.

Download the data here (no need login): https://datamarket.com/data/set/22r0/sales-of-shampoo-over-a-three-year-period#!ds=22r0&display=line


In [6]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import itertools

In [2]:
series = pd.read_csv('shampoo_sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True)
series.head()


Out[2]:
Month
1-01    266.0
1-02    145.9
1-03    183.1
1-04    119.3
1-05    180.3
Name: Sales of shampoo over a three year period, dtype: float64

In [3]:
def parser(x):
    return pd.datetime.strptime('190'+x, '%Y-%m')

series = pd.read_csv('shampoo_sales.csv', header=0, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)
series.head()


Out[3]:
Month
1901-01-01    266.0
1901-02-01    145.9
1901-03-01    183.1
1901-04-01    119.3
1901-05-01    180.3
Name: Sales of shampoo over a three year period, dtype: float64

In [4]:
series.values


Out[4]:
array([266. , 145.9, 183.1, 119.3, 180.3, 168.5, 231.8, 224.5, 192.8,
       122.9, 336.5, 185.9, 194.3, 149.5, 210.1, 273.3, 191.4, 287. ,
       226. , 303.6, 289.9, 421.6, 264.5, 342.3, 339.7, 440.4, 315.9,
       439.3, 401.3, 437.4, 575.5, 407.6, 682. , 475.3, 581.3, 646.9])

In [9]:
# Method 1 - Without cross validation, just simply grid search for ARIMA
p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))

bestAIC = np.inf
bestParam = None

# grid search
for param in pdq:
    try:
        model = sm.tsa.statespace.SARIMAX(series.values,
                                          order=param,
                                          enforce_stationarity=False,
                                          enforce_invertibility=False)
        results = model.fit()
        #if current run of AIC is better than the best one so far, overwrite it
        if results.aic < bestAIC:
            bestAIC = results.aic
            bestParam = param
    except:
        pass
    
print('the best AIC and order are:',bestAIC,bestParam)


the best AIC and order are: 355.5008750421533 (1, 2, 2)

In [8]:
# Method 2 - Add cross validation with grid search for ARIMA
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=10)
for train_index, test_index in tscv.split(series.values):
    print("TRAIN:", train_index, "TEST:", test_index)


TRAIN: [0 1 2 3 4 5] TEST: [6 7 8]
TRAIN: [0 1 2 3 4 5 6 7 8] TEST: [ 9 10 11]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11] TEST: [12 13 14]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14] TEST: [15 16 17]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17] TEST: [18 19 20]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20] TEST: [21 22 23]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23] TEST: [24 25 26]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26] TEST: [27 28 29]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29] TEST: [30 31 32]
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32] TEST: [33 34 35]

In [10]:
import warnings
warnings.filterwarnings("ignore")

p = d = q = range(0, 3)
pdq = list(itertools.product(p, d, q))

bestAIC = np.inf
best_cvAIC = np.inf
bestParam = None
best_cvParam = None

X = series.values

# grid search
for param in pdq:
    for train_index, test_index in tscv.split(series.values):
        X_train, X_test = X[train_index], X[test_index]
        try:
            model = sm.tsa.statespace.SARIMAX(X_train,
                                              order=param,
                                              enforce_stationarity=False,
                                              enforce_invertibility=False)
            results = model.fit()
            #if current run of AIC is better than the best one so far, overwrite it
            if results.aic < bestAIC:
                bestAIC = results.aic
                bestParam = param

            # apply the best order on testig data
            best_model = sm.tsa.statespace.SARIMAX(X_test,
                                              order=bestParam,
                                              enforce_stationarity=False,
                                              enforce_invertibility=False)
            results = best_smodel.fit()
            if results.aic < best_cvAIC:
                best_cvAIC = results.aic
                best_cvParam = bestParam
        except:
            pass
    
print('the best AIC, the best CV AIC and order are:',bestAIC,best_cvAIC,bestParam)


the best AIC, the best CV AIC and order are: 33.19568230991611 inf (0, 1, 1)

References: