In [1]:
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:
So far, we have only considered the same month values of the SWN and LDA scores as exogenous controls in ARIMA models trying to predict the CCI values. However, the latter might be best explained by controlling for lagged values of the explanatory variables. Before including these explicitly in our machine learning models, we test out the value of setting up Vector Autoregression (VAR).
VAR models are a popular method in time-series econometrics, which estimate all variables as endogenous responses of the lagged values of all the variables. Hence, this system of equations approach is useful in estimating impulse-responses between different variables, capturing potential general equilibrium and dynamic relationships between the variables. Moreover, these models can be set up as to make roll-forward predictions, avoiding potential overfitting. This roll-forward specification for making predictions is called the Dynamic VAR, and we capture the R2 of its predictions as our measurement of the model's predictive quality.
In this Notebook, we first set up a Dynamic VAR model of the CCI score and the SWN score and evaluate its predictive power. Then, we set up a final Dynamic VAR model with the CCI, SWN and LDA Topics and evaluate its predictive quality.
Given that this model works with lagged values as predictors, we see no value in working only with the data of the first 15 days for prediction purposes. Hence, we work with the SWN data for the full month and not just the first two weeks of data.
In [4]:
# Load Datasets
dateparse = lambda dates: pd.datetime.strptime(dates, '%Y-%m')
cci = pd.read_csv('Economic_Sentiment_Forecast/CCI.csv', parse_dates=True, index_col='TIME',date_parser=dateparse)
cci = cci["Value"]
cci.columns = ["CCI"]
cci = cci['1990-01-01':]
dateparse = lambda dates: pd.datetime.strptime(dates, '%Y-%m-%d')
df2 = pd.read_csv('./AC209_Project_data/daily_sentiment_1990_2016.csv', parse_dates=True, index_col='date',date_parser=dateparse)
df2 = df2[["avg_score"]]
df2.columns = ["value"]
sen = df2[df2.index.day < 32].resample('MS').mean()
sen.columns = ["SEN"]
df = pd.concat([cci, sen], axis=1)
df.columns = ['CCI', 'SEN']
df = df.ix['1990-01-01':'2016-08-01']
In [10]:
def test_stationarity(timeseries):
#Determing rolling statistics
rolmean = pd.rolling_mean(timeseries, window=12)
rolstd = pd.rolling_std(timeseries, window=12)
#Plot rolling statistics:
fig, ax = plt.subplots(1,1, figsize=(20, 5))
ax = plt.plot(timeseries, color='blue',label='Original')
ax = plt.plot(rolmean, color='red', label='Rolling Mean')
#ax = plt.plot(rolstd, color='black', label = 'Rolling Std')
#ax = plt.plot(timeseries-rolmean, color='green', label = 'Noise')
ax = plt.legend(loc='best')
ax = plt.title('CCI and its Rolling Mean')
plt.show(block=False)
#Perform Dickey-Fuller test:
for j in ['nc', 'c', 'ct']:
print ''
print 'Results of Dickey-Fuller Test (Reg. {}):'.format(j)
dftest = tsa.adfuller(timeseries, autolag='AIC', regression = j)
dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
for key,value in dftest[4].items():
dfoutput['Critical Value (%s)'%key] = value
print dfoutput
In [5]:
def is_unit_root(data, threshold="5%"):
nc = tsa.adfuller(data, regression="nc")
c = tsa.adfuller(data, regression="c")
ct = tsa.adfuller(data, regression="ct")
votes = 0
for test in [nc, c, ct]:
if(abs(test[0]) < abs(test[4][threshold])):
votes += 1
return votes >= 2, {"nc": nc, "c":c, "ct":ct}
In [13]:
print 'We cannot reject null hypothesis that CCI is Unit Root: '
test_stationarity(df['CCI'])
print ''
print ''
print 'We cannot reject null hypothesis that SWN is Unit Root: '
test_stationarity(df['SEN'])
Surprisingly, we cannot reject the null hypothesis that the SWN series for the full month follows a random walk, while we were able to confidently do so with the data for the first 15 days. For this reason, we will work with both series in one-month differences for our Dynamic VAR models.
In [14]:
dfs = pd.concat([cci.diff(), sen.diff()], axis=1)
dfs.columns = ['CCI_d', 'SEN_d']
dfs = dfs.ix['1990-01-02':'2016-08-01']
In [15]:
# Define application of VAR to relevant dataset
var = tsa.VAR(dfs)
In [16]:
# Select lag order by information criteria
var.select_order(10)
Out[16]:
From the information criteria outlined above for lag order selection, it seems that the most prevalent lag order is 5. Hence, we proceed with the VAR Model with this specification.
In [19]:
# Fit VAR model with 5 lags (simpler model of those identified)
result = var.fit(5)
In [20]:
print 'Results Summary of the VAR model: '
result.summary()
Out[20]:
In [21]:
# Forecasting
lag = result.k_ar
fc = result.forecast(df.values[-lag:], 36)
In [22]:
# Impulse Response Function
irf = result.irf(36)
irf.plot()
In [268]:
# Cummulative Impulse Response Function
irf.plot_cum_effects(orth=False)
Interestingly, a positive shock on the SWN score is associated with a negative change in the CCI score. There seems to be no statistically significant effect in the inverse direction for the earliest lags, but lags accumulate in a way that a statistically significant effect on the SWN score is noted.
In [24]:
# Granger Causality
print ''
print 'SEN Granger Causes CCI_d?'
a = result.test_causality('CCI_d', 'SEN_d')
print ''
print 'CCI_d Granger Causes SEN_d?'
a = result.test_causality('SEN_d', 'CCI_d')
The model suggests that both variables 'Granger' cause each other, meaning that the null hypotheses that shocks in one do not affect the other can be confidently rejected.
With the same lag order, we now estimate the rolling-forward models to obtain out of sample predictions.
In [26]:
# Dynamic Specification of the VAR Model
var = tsa.DynamicVAR(dfs, lag_order=5, window_type='expanding')
In [29]:
print 'Coefficients for each lag in each of the two models: '
var.coefs.major_xs(datetime(2015, 11, 01)).T
Out[29]:
In [30]:
preds = var.forecast(1)
var.plot_forecast(1)
In [39]:
raw_preds = preds['CCI_d']
cci_n = cci['1991-07-01':]
sum_preds = raw_preds.cumsum()
pred = pd.Series(cci_n.ix[0], index=cci_n.index)
pred = pred.add(sum_preds, fill_value=0)
from sklearn.linear_model import LinearRegression as LR
x = np.asarray(pred[-12:]).reshape(12, 1)
y = np.asarray(cci_n[-12:]).reshape(12, 1)
lr = LR().fit(x, y)
a = lr.score(x, y)
plt.figure(figsize = (15,5))
plt.plot(cci_n)
plt.plot(pred, color = 'r')
plt.title('CCI: Actual vs. Dynamic VAR Prediction \n R$^2$ on last year of data = {}'.format(a))
plt.legend(['Actual', 'Dynamic VAR Prediction'], loc='best')
plt.tight_layout()
plt.show()
In contrast to the simple ARIMA and ARIMA + SWN, this model does worse in predicting the last 12 months of CCI data.
We now include the LDA topics into the VAR setting. As was discussed in previous Notebooks, LDA topics are not only non-stationary but they behave with a strong seasonal component. For this reason, we include the 12 month differences of the LDA topics into our VAR model.
In [64]:
# Load datasets
top = np.load('topicsByMonthBigrams8.npy')
top = pd.DataFrame(data = top, index = sen.index)
top.columns = ['T1','T2','T3','T4','T5','T6','T7','T8']
top1 = np.load('topicsByMonthBigrams8.npy')
top1 = pd.DataFrame(data = top1, index = sen.index)
top1.columns = ['T1','T2','T3','T4','T5','T6','T7','T8']
top1 = top1 - top1.shift(12)
top1.dropna(inplace=True)
top2 = np.load('topicsByMonthBigrams8.npy')
top2 = pd.DataFrame(data = top2, index = sen.index)
top2.columns = ['T1','T2','T3','T4','T5','T6','T7','T8']
for i in ['T1','T2','T3','T4','T5','T6','T7','T8']:
temp1 = tsa.seasonal_decompose(top2[i].values, freq = 12)
top2[i] = temp1.trend + temp1.resid
top2.dropna(inplace=True)
dfs = pd.concat([cci, sen], axis = 1)
dfs = dfs - dfs.shift(1)
dfs = pd.concat([dfs, top1], axis = 1)
dfs.dropna(inplace=True)
dfs.columns = ['CCI', 'SEN', 'T1','T2','T3','T4','T5','T6','T7','T8']
In [65]:
# Define application of VAR to relevant dataset
var = tsa.VAR(dfs)
# Select lag order by information criteria
var.select_order(12)
Out[65]:
We observe 2 information criteria pointing to an optimal lag order of 2, while 2 other information criteria point to an optimal lag order of 12. We opt to select the simpler model with a lag order of 2. Given the very high number of regressors in this specification, we do not show the summary of results, impulse-response and cummulative impulse-response graphs, and other preliminary results to the dynamic VAR specification.
In [72]:
# Dynamic Specification of the VAR Model
var = tsa.DynamicVAR(dfs, lag_order=2, window_type='expanding')
preds = var.forecast(1)
In [73]:
raw_preds = preds['CCI']
cci_n = cci['1993-01-01':]
sum_preds = raw_preds.cumsum()
pred = pd.Series(cci_n.ix[0], index=cci_n.index)
pred = pred.add(sum_preds, fill_value=0)
In [74]:
from sklearn.linear_model import LinearRegression as LR
x = np.asarray(pred[-12:]).reshape(12, 1)
y = np.asarray(cci_n[-12:]).reshape(12, 1)
lr = LR().fit(x, y)
a = lr.score(x, y)
plt.figure(figsize = (15,5))
plt.plot(cci_n)
plt.plot(pred, color = 'r')
plt.title('CCI: Actual vs. Dynamic VAR Prediction \n R$^2$ for the last year of data = {}'.format(a))
plt.legend(['Actual', 'Dynamic VAR Prediction'], loc='best')
plt.tight_layout()
plt.show()
The roll-forward R2 score for the last 12 months of data suggests that the resulting predictions are only marginally better than average CCI values.
These results suggest that we should not proceed with VAR style specifications, as roll-forward predictions for the last year of data coming from other simpler models outperform both VAR specifications. This however, does not rule out lagged values of the regressors a possible controls that can help improve predictions on current CCI values. In order to control for these possible effects, we proceed now into specifying new machine-learning models that use lags of SWN and LDA topics, along with their current values, as regressors for current values of CCI as unique response.