auto_arima

Pmdarima brings R's auto.arima functionality to Python by wrapping statsmodels' ARIMA and SARIMAX models in a single scikit-learn-esque estimator (pmdarima.arima.ARIMA) and adding several layers of tests for the degree of differencing and seasonal differencing to identify the optimal model parameters.

Pmdarima ARIMA models:

  • Are fully picklable for easy persistence and model deployment
  • Can handle seasonal terms (unlike statsmodels' ARIMA class)
  • Follow sklearn model fit/predict conventions

In [1]:
import numpy as np
import pmdarima as pm

print('numpy version: %r' % np.__version__)
print('pmdarima version: %r' % pm.__version__)


numpy version: '1.15.3'
pmdarima version: '1.3.0-dev0'

We'll start by defining an array of data from an R time-series, wineind:

> forecast::wineind
       Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
1980 15136 16733 20016 17708 18019 19227 22893 23739 21133 22591 26786 29740
1981 15028 17977 20008 21354 19498 22125 25817 28779 20960 22254 27392 29945
1982 16933 17892 20533 23569 22417 22084 26580 27454 24081 23451 28991 31386
1983 16896 20045 23471 21747 25621 23859 25500 30998 24475 23145 29701 34365
1984 17556 22077 25702 22214 26886 23191 27831 35406 23195 25110 30009 36242
1985 18450 21845 26488 22394 28057 25451 24872 33424 24052 28449 33533 37351
1986 19969 21701 26249 24493 24603 26485 30723 34569 26689 26157 32064 38870
1987 21337 19419 23166 28286 24570 24001 33151 24878 26804 28967 33311 40226
1988 20504 23060 23562 27562 23940 24584 34303 25517 23494 29095 32903 34379
1989 16991 21109 23740 25552 21752 20294 29009 25500 24166 26960 31222 38641
1990 14672 17543 25453 32683 22449 22316 27595 25451 25421 25288 32568 35110
1991 16052 22146 21198 19543 22084 23816 29961 26773 26635 26972 30207 38687
1992 16974 21697 24179 23757 25013 24019 30345 24488 25156 25650 30923 37240
1993 17466 19463 24352 26805 25236 24735 29356 31234 22724 28496 32857 37198
1994 13652 22784 23565 26323 23779 27549 29660 23356

Note that the frequency of the data is 12:

> frequency(forecast::wineind)
[1] 12
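That frequency of 12 is what the seasonal differencing term (D) acts on: a seasonal difference subtracts the observation from m = 12 steps earlier, while ordinary differencing (d) subtracts the immediately preceding value. A minimal numpy sketch with a stand-in series (illustrative values, not the actual wineind data):

```python
import numpy as np

m = 12
y = np.arange(36, dtype=float)  # stand-in monthly series

# Seasonal difference (D=1): subtract the value from m steps earlier
seasonal_diff = y[m:] - y[:-m]

# First difference (d=1) of the seasonally differenced series
first_diff = np.diff(seasonal_diff)
```

This d=1, D=1 combination is exactly what the auto_arima calls below fix in advance.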

In [2]:
from pmdarima.datasets import load_wineind

# this is a dataset from R
wineind = load_wineind().astype(np.float64)

Fitting an ARIMA

We will first fit a seasonal ARIMA. Note that you do not need to call auto_arima in order to fit a model; if you already know the order and seasonality of your data, you can simply fit an ARIMA with those hyper-parameters:


In [3]:
from pmdarima.arima import ARIMA

fit = ARIMA(order=(1, 1, 1), seasonal_order=(0, 1, 1, 12)).fit(y=wineind)

Also note that your data does not have to exhibit seasonality to work with an ARIMA. We could fit an ARIMA to the same data with no seasonal terms at all, though on a strongly seasonal series like this one it will likely perform worse.


In [4]:
fit = ARIMA(order=(1, 1, 1), seasonal_order=None).fit(y=wineind)

Finding the optimal model hyper-parameters using auto_arima

If you are unsure (as is common) of the best parameters for your model, let auto_arima figure it out for you. auto_arima is similar to an ARIMA-specific grid search, but (by default) uses the more intelligent stepwise algorithm laid out by Hyndman and Khandakar (2008). If stepwise is False, the models are fit as in a grid search. Note that it is possible for auto_arima not to find a model that converges; if so, it raises a ValueError.
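To see why the stepwise algorithm matters, consider the size of the exhaustive grid for the bounds used below (max_p=3, max_q=3, with d and D fixed). The seasonal bounds max_P=2 and max_Q=2 are an assumption here (pmdarima's documented defaults), purely for illustration:

```python
from itertools import product

# Every (p, q, P, Q) combination an exhaustive search would have to fit,
# with d=1 and D=1 held fixed as in the calls below
p_range = range(0, 4)   # 0..max_p
q_range = range(0, 4)   # 0..max_q
P_range = range(0, 3)   # 0..max_P (assumed default of 2)
Q_range = range(0, 3)   # 0..max_Q (assumed default of 2)

grid = list(product(p_range, q_range, P_range, Q_range))
print(len(grid))  # 144 candidate models
```

Compare that with the stepwise trace below, which reaches a comparable answer after only 19 fits.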

Fitting a stepwise search:


In [5]:
# fitting a stepwise model:
stepwise_fit = pm.auto_arima(wineind, start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                             start_P=0, seasonal=True, d=1, D=1, trace=True,
                             error_action='ignore',  # don't want to know if an order does not work
                             suppress_warnings=True,  # don't want convergence warnings
                             stepwise=True)  # set to stepwise

stepwise_fit.summary()


Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3066.760, BIC=3082.229, Fit time=0.725 seconds
Fit ARIMA: order=(0, 1, 0) seasonal_order=(0, 1, 0, 12); AIC=3133.376, BIC=3139.564, Fit time=0.029 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(1, 1, 0, 12); AIC=3099.734, BIC=3112.109, Fit time=0.334 seconds
Fit ARIMA: order=(0, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3066.930, BIC=3079.305, Fit time=0.229 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 1, 12); AIC=3067.548, BIC=3086.110, Fit time=1.106 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 0, 12); AIC=3088.088, BIC=3100.463, Fit time=0.144 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(0, 1, 2, 12); AIC=3068.000, BIC=3086.563, Fit time=1.005 seconds
Fit ARIMA: order=(1, 1, 1) seasonal_order=(1, 1, 2, 12); AIC=3068.915, BIC=3090.571, Fit time=2.380 seconds
Fit ARIMA: order=(2, 1, 1) seasonal_order=(0, 1, 1, 12); AIC=3067.447, BIC=3086.010, Fit time=0.435 seconds
Fit ARIMA: order=(1, 1, 0) seasonal_order=(0, 1, 1, 12); AIC=3094.571, BIC=3106.946, Fit time=0.242 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3066.742, BIC=3085.305, Fit time=0.524 seconds
Fit ARIMA: order=(2, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=3070.634, BIC=3095.384, Fit time=1.342 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 1, 12); AIC=3068.010, BIC=3089.666, Fit time=0.704 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 0, 12); AIC=3090.957, BIC=3106.426, Fit time=0.139 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(0, 1, 2, 12); AIC=3067.731, BIC=3089.387, Fit time=1.046 seconds
Fit ARIMA: order=(1, 1, 2) seasonal_order=(1, 1, 2, 12); AIC=3069.703, BIC=3094.453, Fit time=2.791 seconds
Fit ARIMA: order=(0, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(2, 1, 2) seasonal_order=(0, 1, 1, 12); AIC=3068.673, BIC=3090.330, Fit time=0.650 seconds
Fit ARIMA: order=(1, 1, 3) seasonal_order=(0, 1, 1, 12); AIC=3068.810, BIC=3090.467, Fit time=0.730 seconds
Total fit time: 14.564 seconds
Out[5]:
Statespace Model Results
Dep. Variable: y No. Observations: 176
Model: SARIMAX(1, 1, 2)x(0, 1, 1, 12) Log Likelihood -1527.371
Date: Mon, 27 May 2019 AIC 3066.742
Time: 11:17:17 BIC 3085.305
Sample: 0 HQIC 3074.278
- 176
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -100.7331 72.197 -1.395 0.163 -242.236 40.770
ar.L1 -0.5123 0.390 -1.312 0.189 -1.277 0.253
ma.L1 -0.0806 0.404 -0.200 0.842 -0.872 0.711
ma.L2 -0.4430 0.224 -1.978 0.048 -0.882 -0.004
ma.S.L12 -0.4025 0.054 -7.448 0.000 -0.508 -0.297
sigma2 7.663e+06 7.3e+05 10.495 0.000 6.23e+06 9.09e+06
Ljung-Box (Q): 48.70 Jarque-Bera (JB): 21.57
Prob(Q): 0.16 Prob(JB): 0.00
Heteroskedasticity (H): 1.18 Skew: -0.61
Prob(H) (two-sided): 0.54 Kurtosis: 4.31


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 8.14e+14. Standard errors may be unstable.

If you don't want to use the stepwise search, auto_arima can fit a random search by enabling random=True. If your random search returns too many invalid (nan) models, you might try increasing n_fits or making it an exhaustive search (stepwise=False, random=False).


In [6]:
rs_fit = pm.auto_arima(wineind, start_p=1, start_q=1, max_p=3, max_q=3, m=12,
                       start_P=0, seasonal=True, d=1, D=1, trace=True,
                       n_jobs=-1,  # We can run this in parallel by controlling this option
                       error_action='ignore',  # don't want to know if an order does not work
                       suppress_warnings=True,  # don't want convergence warnings
                       stepwise=False, random=True, random_state=42,  # we can fit a random search (not exhaustive)
                       n_fits=25)

rs_fit.summary()


Total fit time: 38.591 seconds
Out[6]:
Statespace Model Results
Dep. Variable: y No. Observations: 176
Model: SARIMAX(2, 1, 3)x(1, 1, 1, 12) Log Likelihood -1524.252
Date: Mon, 27 May 2019 AIC 3066.504
Time: 11:19:38 BIC 3094.348
Sample: 0 HQIC 3077.809
- 176
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -112.2831 131.424 -0.854 0.393 -369.870 145.304
ar.L1 -1.7290 0.052 -33.349 0.000 -1.831 -1.627
ar.L2 -0.8938 0.053 -16.983 0.000 -0.997 -0.791
ma.L1 1.1245 0.066 17.022 0.000 0.995 1.254
ma.L2 -0.3031 0.085 -3.547 0.000 -0.471 -0.136
ma.L3 -0.6783 0.050 -13.698 0.000 -0.775 -0.581
ar.S.L12 0.1278 0.157 0.815 0.415 -0.179 0.435
ma.S.L12 -0.5523 0.154 -3.586 0.000 -0.854 -0.250
sigma2 7.548e+06 0.003 2.24e+09 0.000 7.55e+06 7.55e+06
Ljung-Box (Q): 43.96 Jarque-Bera (JB): 16.51
Prob(Q): 0.31 Prob(JB): 0.00
Heteroskedasticity (H): 1.07 Skew: -0.58
Prob(H) (two-sided): 0.80 Kurtosis: 4.04


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 9.91e+24. Standard errors may be unstable.

Inspecting goodness of fit

We can look at how well the model fits in-sample data:


In [7]:
from bokeh.plotting import figure, show, output_notebook
import pandas as pd

# init bokeh
output_notebook()

def plot_arima(truth, forecasts, title="ARIMA", xaxis_label='Time',
               yaxis_label='Value', c1='#A6CEE3', c2='#B2DF8A', 
               forecast_start=None, **kwargs):
    
    # make truth and forecasts into pandas series
    n_truth = truth.shape[0]
    n_forecasts = forecasts.shape[0]
    
    # always plot truth the same
    truth = pd.Series(truth, index=np.arange(truth.shape[0]))
    
    # if no defined forecast start, start at the end
    if forecast_start is None:
        idx = np.arange(n_truth, n_truth + n_forecasts)
    else:
        # offset the forecast index by its starting point
        idx = np.arange(forecast_start, forecast_start + n_forecasts)
    forecasts = pd.Series(forecasts, index=idx)
    
    # set up the plot
    p = figure(title=title, plot_height=400, **kwargs)
    p.grid.grid_line_alpha=0.3
    p.xaxis.axis_label = xaxis_label
    p.yaxis.axis_label = yaxis_label
    
    # add the lines
    p.line(truth.index, truth.values, color=c1, legend='Observed')
    p.line(forecasts.index, forecasts.values, color=c2, legend='Forecasted')
    
    return p



In [8]:
in_sample_preds = stepwise_fit.predict_in_sample()
in_sample_preds[:10]


Out[8]:
array([  -66.61074424, 10098.89197661, 12090.04716934, 16290.75198825,
       15943.17987042, 17257.4782733 , 17797.61571866, 20199.85515377,
       21475.00800993, 20787.12947566])
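Plotting aside, a quick numeric summary such as RMSE can also characterize in-sample fit. A sketch with stand-in arrays (substitute wineind and in_sample_preds in practice, ideally skipping the first few predictions, which are distorted by the differencing boundary, as the -66.6 value above shows):

```python
import numpy as np

# Stand-in arrays for observed values and in-sample predictions
truth = np.array([100., 110., 105., 120.])
preds = np.array([102., 108., 109., 118.])

# Root-mean-squared error of the in-sample fit
rmse = float(np.sqrt(np.mean((truth - preds) ** 2)))
print(round(rmse, 3))  # 2.646
```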

In [9]:
show(plot_arima(wineind, in_sample_preds, 
                title="Original Series & In-sample Predictions", 
                c2='#FF0000', forecast_start=0))


Predicting future values

After your model is fit, you can forecast future values using the predict function, just like in scikit-learn:


In [10]:
next_25 = stepwise_fit.predict(n_periods=25)
next_25


Out[10]:
array([21966.69715352, 25983.70920565, 30225.30978848, 35417.12496267,
       13011.3618805 , 19639.41027742, 21506.58176159, 23674.33998321,
       21685.76394174, 23669.77818948, 26955.35255979, 22755.2506162 ,
       19808.16131764, 23578.7818493 , 27845.86719673, 32923.89426476,
       10475.6877784 , 17024.74533391, 18831.64797186, 20929.546713  ,
       18876.02414315, 20792.57512611, 24011.97546913, 19745.03906623,
       16731.45362497])

In [11]:
# call the plotting func
show(plot_arima(wineind, next_25))


Updating your model

ARIMAs create forecasts by using the latest observations. Over time, your forecasts will drift, and you'll need to update the model with the observed values. There are several solutions to this problem:

  • Fit a new ARIMA with the new data added to your training sample
    • You can either re-use the order discovered in the auto_arima function, or re-run auto_arima altogether.
  • Use the update method (preferred). This allows the model to update its parameters by taking several more MLE steps on the new observations (controlled by the maxiter argument), starting from the parameters it has already discovered. This approach helps you avoid over-fitting.

For this example, let's update our existing model with the next_25 we just computed, as if they were actually observed values.


In [12]:
stepwise_fit.update(next_25, maxiter=10)  # take 10 more steps
stepwise_fit.summary()


Out[12]:
Statespace Model Results
Dep. Variable: y No. Observations: 201
Model: SARIMAX(1, 1, 2)x(0, 1, 1, 12) Log Likelihood -1748.493
Date: Mon, 27 May 2019 AIC 3508.986
Time: 11:25:31 BIC 3528.404
Sample: 0 HQIC 3516.853
- 201
Covariance Type: opg
coef std err z P>|z| [0.025 0.975]
intercept -100.7331 72.099 -1.397 0.162 -242.045 40.578
ar.L1 -0.5122 0.390 -1.314 0.189 -1.277 0.252
ma.L1 -0.0806 0.403 -0.200 0.842 -0.871 0.710
ma.L2 -0.4431 0.224 -1.981 0.048 -0.882 -0.005
ma.S.L12 -0.4024 0.054 -7.458 0.000 -0.508 -0.297
sigma2 7.663e+06 7.1e+05 10.789 0.000 6.27e+06 9.05e+06
Ljung-Box (Q): 54.62 Jarque-Bera (JB): 44.08
Prob(Q): 0.06 Prob(JB): 0.00
Heteroskedasticity (H): 0.54 Skew: -0.66
Prob(H) (two-sided): 0.01 Kurtosis: 4.97


Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 7.7e+14. Standard errors may be unstable.

In [14]:
updated_data = np.concatenate([wineind, next_25])

In [15]:
# visualize new forecasts
show(plot_arima(updated_data, stepwise_fit.predict(n_periods=10)))


