Stock market analysis with pmdarima

This example follows a post on Towards Data Science (TDS), demonstrating how pmdarima can simplify time series analysis.


In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline

import pmdarima as pm
print(f"Using pmdarima {pm.__version__}")


Using pmdarima 1.3.0-dev0

Import the data

pmdarima contains an embedded datasets submodule that allows us to try out models on common datasets. We can load the MSFT stock data from pmdarima 1.3.0+:


In [2]:
from pmdarima.datasets.stocks import load_msft

df = load_msft()
df.head()


Out[2]:
Date Open High Low Close Volume OpenInt
0 1986-03-13 0.06720 0.07533 0.06720 0.07533 1371330506 0
1 1986-03-14 0.07533 0.07533 0.07533 0.07533 409569463 0
2 1986-03-17 0.07533 0.07533 0.07533 0.07533 176995245 0
3 1986-03-18 0.07533 0.07533 0.07533 0.07533 90067008 0
4 1986-03-19 0.07533 0.07533 0.07533 0.07533 63655515 0

Split the data

As in the blog post, we'll use 80% of the samples as training data. Note that a train/test split for a time series differs from one for a non-temporal dataset: the observations must not be shuffled, since order must be preserved if we hope to discover any notable trends.


In [3]:
train_len = int(df.shape[0] * 0.8)
train_data, test_data = df[:train_len], df[train_len:]

y_train = train_data['Open'].values
y_test = test_data['Open'].values

print(f"{train_len} train samples")
print(f"{df.shape[0] - train_len} test samples")


6386 train samples
1597 test samples
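The order-preserving split above can be sketched in plain Python; the 80% ratio is the only assumption here (recent pmdarima versions also ship a `train_test_split` helper in `pmdarima.model_selection` that does the same thing):

```python
def temporal_train_test_split(series, train_frac=0.8):
    """Split a sequence into train/test sets without shuffling,
    preserving temporal order."""
    train_len = int(len(series) * train_frac)
    return series[:train_len], series[train_len:]

# Toy example: ten ordered observations
obs = list(range(10))
train, test = temporal_train_test_split(obs)
print(train)  # first 8 observations, in order
print(test)   # last 2 observations, in order
```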

Pre-modeling analysis

The TDS post fixed p at 5 based on some lag plot analysis:


In [18]:
from pandas.plotting import lag_plot

fig, axes = plt.subplots(3, 2, figsize=(12, 16))
fig.suptitle('MSFT Autocorrelation plot')  # plt.title would only title the last subplot

# The axis coordinates for the plots
ax_idcs = [
    (0, 0),
    (0, 1),
    (1, 0),
    (1, 1),
    (2, 0),
    (2, 1)
]

for lag, ax_coords in enumerate(ax_idcs, 1):
    ax_row, ax_col = ax_coords
    axis = axes[ax_row][ax_col]
    lag_plot(df['Open'], lag=lag, ax=axis)
    axis.set_title(f"Lag={lag}")
    
plt.show()


All lags look fairly linear, which is a good indicator that an auto-regressive model is an appropriate choice. Rather than fixing p ourselves, we'll allow auto_arima to select the lag term for us, up to a maximum of 6.
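The linearity in the lag plots corresponds to high lag-k autocorrelation, which we can also check numerically. A minimal NumPy sketch (the random-walk series here is illustrative, not the MSFT data — but random walks behave much like price series):

```python
import numpy as np

def lag_autocorr(y, lag):
    """Pearson correlation between the series and its lagged copy."""
    y = np.asarray(y, dtype=float)
    return np.corrcoef(y[:-lag], y[lag:])[0, 1]

# A random walk (like many price series) is strongly autocorrelated
rng = np.random.default_rng(42)
walk = np.cumsum(rng.normal(size=2000))
for lag in range(1, 7):
    print(f"lag={lag}: r={lag_autocorr(walk, lag):.3f}")
```

Values near 1 at every lag are the numeric analogue of the tight diagonal clouds in the lag plots.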

Estimating the differencing term

Next, we estimate the differencing term, d, using two statistical tests (KPSS and ADF), taking the maximum of the two to guard against under-differencing:


In [19]:
from pmdarima.arima import ndiffs

kpss_diffs = ndiffs(y_train, alpha=0.05, test='kpss', max_d=6)
adf_diffs = ndiffs(y_train, alpha=0.05, test='adf', max_d=6)
n_diffs = max(adf_diffs, kpss_diffs)

print(f"Estimated differencing term: {n_diffs}")


Estimated differencing term: 1
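Intuitively, d = 1 means one round of first differences is enough to make the series stationary. First differencing removes a linear trend entirely, as a quick NumPy check shows (toy data, not the MSFT series):

```python
import numpy as np

# A trending (non-stationary) toy series: y_t = 0.5 * t + 3
t = np.arange(100)
y = 0.5 * t + 3.0

# One round of first differencing: y'_t = y_t - y_{t-1}
dy = np.diff(y)

print(dy[:5])    # constant differences: the trend is gone
print(dy.std())  # zero variance around the constant mean
```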

Fit the model

Now we use auto_arima to fit a model on the training data, fixing d at the estimated value and letting the stepwise search choose p and q:


In [74]:
auto = pm.auto_arima(y_train, d=n_diffs, seasonal=False, stepwise=True,
                     suppress_warnings=True, error_action="ignore", max_p=6,
                     max_order=None, trace=True)


Fit ARIMA: order=(2, 1, 2); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(0, 1, 0); AIC=5640.878, BIC=5654.402, Fit time=0.004 seconds
Fit ARIMA: order=(1, 1, 0); AIC=5640.426, BIC=5660.711, Fit time=0.074 seconds
Fit ARIMA: order=(0, 1, 1); AIC=5640.350, BIC=5660.635, Fit time=0.050 seconds
Fit ARIMA: order=(1, 1, 1); AIC=5640.152, BIC=5667.198, Fit time=0.607 seconds
Fit ARIMA: order=(1, 1, 2); AIC=nan, BIC=nan, Fit time=nan seconds
Fit ARIMA: order=(2, 1, 1); AIC=5641.440, BIC=5675.249, Fit time=0.556 seconds
Total fit time: 2.311 seconds
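The stepwise search above compares candidates by AIC (lower is better), which trades goodness of fit against complexity: AIC = 2k − 2 ln L̂, where k is the number of estimated parameters and L̂ the maximized likelihood. A small sketch with hypothetical log-likelihoods (not the actual fitted values from the trace):

```python
def aic(n_params, log_likelihood):
    """Akaike information criterion: 2k - 2*ln(L-hat)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: (order, parameter count k, max log-likelihood)
candidates = [
    ("(0,1,0)", 2, -2818.4),
    ("(1,1,1)", 4, -2816.1),
    ("(2,1,1)", 5, -2815.7),
]
best = min(candidates, key=lambda c: aic(c[1], c[2]))
for order, k, ll in candidates:
    print(order, round(aic(k, ll), 1))
print("selected:", best[0])
```

Note how (2,1,1) fits slightly better (higher log-likelihood) but loses on AIC once its extra parameter is penalized — the same reason the trace settles on (1, 1, 1).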

In [75]:
print(auto.order)


(1, 1, 1)

In [76]:
from sklearn.metrics import mean_squared_error
from pmdarima.metrics import smape

model = auto

def forecast_one_step():
    """Forecast the next observation along with its confidence interval."""
    fc, conf_int = model.predict(n_periods=1, return_conf_int=True)
    return (
        fc.tolist()[0],
        np.asarray(conf_int).tolist()[0])

forecasts = []
confidence_intervals = []

for new_ob in y_test:
    fc, conf = forecast_one_step()
    forecasts.append(fc)
    confidence_intervals.append(conf)
    
    # Updates the existing model with a small number of MLE steps
    model.update(new_ob)
    
print(f"Mean squared error: {mean_squared_error(y_test, forecasts)}")
print(f"SMAPE: {smape(y_test, forecasts)}")


Mean squared error: 0.3416473178248818
SMAPE: 0.981464018635346
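SMAPE (symmetric mean absolute percentage error) is scale-independent, which makes it a natural companion to MSE for price series. One common definition, which I believe pmdarima follows, averages 2|F − A| / (|A| + |F|) over the samples and expresses it as a percentage; a minimal sketch:

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric MAPE: mean of 2|F-A| / (|A|+|F|), as a percentage."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(2.0 * np.abs(f - a) / (np.abs(a) + np.abs(f)))

# Toy check: a 10% over-forecast on a single point
print(round(smape([100.0], [110.0]), 4))  # 2*10/210 * 100 ≈ 9.5238
```

Under this convention, the SMAPE of ~0.98 above means the one-step forecasts are off by under 1% on average.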

In [86]:
fig, axes = plt.subplots(2, 1, figsize=(12, 12))

# --------------------- Actual vs. Predicted --------------------------
axes[0].plot(y_train, color='blue', label='Training Data')
axes[0].plot(test_data.index, forecasts, color='green', marker='o',
             label='Predicted Price')

axes[0].plot(test_data.index, y_test, color='red', label='Actual Price')
axes[0].set_title('Microsoft Prices Prediction')
axes[0].set_xlabel('Dates')
axes[0].set_ylabel('Prices')

axes[0].set_xticks(np.arange(0, 7982, 1300))
axes[0].set_xticklabels(df['Date'][0:7982:1300])
axes[0].legend()


# ------------------ Predicted with confidence intervals ----------------
axes[1].plot(y_train, color='blue', label='Training Data')
axes[1].plot(test_data.index, forecasts, color='green',
             label='Predicted Price')

axes[1].set_title('Prices Predictions & Confidence Intervals')
axes[1].set_xlabel('Dates')
axes[1].set_ylabel('Prices')

conf_int = np.asarray(confidence_intervals)
axes[1].fill_between(test_data.index,
                     conf_int[:, 0], conf_int[:, 1],
                     alpha=0.9, color='orange',
                     label="Confidence Intervals")

axes[1].set_xticks(np.arange(0, 7982, 1300))
axes[1].set_xticklabels(df['Date'][0:7982:1300])
axes[1].legend()


Out[86]:
<matplotlib.legend.Legend at 0x12532d080>
