The data was collected with the Tropical Atmosphere Ocean (TAO) array which was developed by the international Tropical Ocean Global Atmosphere (TOGA) program. The TAO array consists of nearly 70 moored buoys spanning the equatorial Pacific, measuring oceanographic and surface meteorological variables critical for improved detection, understanding and prediction of seasonal-to-interannual climate variations originating in the tropics, most notably those related to the El Nino/Southern Oscillation (ENSO) cycles.
In this project, we will perform some models for specific meteorological variables in order to make predictions in the future. Then we will make a comparison between these models.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
%matplotlib inline
#%matplotlib qt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = (15, 7)
import warnings
warnings.filterwarnings('ignore')
import sys
import os.path
import pandas as pd
In [2]:
folder = os.path.join('data')
def get_data(directory):
filename_csv = os.path.join(directory, 'elnino.csv')
data = pd.read_csv(filename_csv, index_col=0)
return pd.concat([data], axis=1)
data = get_data(folder)
attributes = data.columns.tolist()
print(attributes)
In the beginning, we started by exploring our data, by looking at the different features and their interactions.
In [3]:
#Renaiming the columns:
data.columns = ['Yr', 'Mon', 'Day', 'Date', 'Lat', 'Long', 'ZW', 'MW', 'Humid', 'AT', 'SST']
attributes = data.columns.tolist()
print(attributes)
In [4]:
data.head()
Out[4]:
In [5]:
data.tail()
Out[5]:
In [6]:
data.dtypes
Out[6]:
Plotting of all the features, in order to see their variability:
In [7]:
# We convert data types:
data = data.convert_objects(convert_numeric=True)
from utils import plotFeatures
plotFeatures(data)
We remark some redundant behavior on our data that we can see it clearly with the air temperature and the sea surface temperature features. We can deduce that measures were taken for different years and that the x-axis of observations represents redendant periods of years.
We also noticed that we haven't the same number of measures because the amount of data recorded was evolving along the years. This is shown clearly with Base Map animation.
In [8]:
from utils import baseMapResult
longitude, latitude, years = baseMapResult(data)
In [9]:
import matplotlib
from mpl_toolkits.basemap import Basemap
import matplotlib.animation as animation
#from utils import _blit_draw
# Modify the function to plot the years dynamicaly
def _blit_draw(self, artists, bg_cache):
# Handles blitted drawing, which renders only the artists given instead
# of the entire figure.
updated_ax = []
for a in artists:
# If we haven't cached the background for this axes object, do
# so now. This might not always be reliable, but it's an attempt
# to automate the process.
if a.axes not in bg_cache:
# bg_cache[a.axes] = a.figure.canvas.copy_from_bbox(a.axes.bbox)
# change here
bg_cache[a.axes] = a.figure.canvas.copy_from_bbox(a.axes.figure.bbox)
a.axes.draw_artist(a)
updated_ax.append(a.axes)
# After rendering all the needed artists, blit each axes individually.
for ax in set(updated_ax):
# and here
# ax.figure.canvas.blit(ax.bbox)
ax.figure.canvas.blit(ax.figure.bbox)
# MONKEY PATCH!!
matplotlib.animation.Animation._blit_draw = _blit_draw
# Animation-------------------------------------------------------
# draw map with markers for float locations
%matplotlib qt
fig=plt.figure()
ax = plt.axes()
ttl = ax.text(0, 1, '', transform = ax.transAxes,fontsize=20)
m = Basemap(projection='hammer',lon_0=180)
m.drawmapboundary(fill_color='aqua')
m.fillcontinents(color='#cc9966',lake_color='#99ffff')
x,y = m(0, 0)
point = m.plot(x, y, 'ro', markersize=5)[0]
def getLong(i):
long = []
for ind,lo in enumerate(longitude):
if years[ind] == 80+(i%19):
long.append(lo)
return long
def getLat(i):
lat = []
for ind,lo in enumerate(latitude):
if years[ind] == 80+(i%19):
lat.append(lo)
return lat
def init():
ttl.set_text('')
point.set_data([], [])
return point, ttl
# animation function. This is called sequentially
def animate(i):
lons = getLong(i)
lats = getLat(i)
x, y = m(lons, lats)
point.set_data(x, y)
if i == 19:
ttl.set_text("Year : " + str(1980))
else :
ttl.set_text("Year : " + str(1980+i))
return point, ttl
# call the animator. blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(plt.gcf(), animate, init_func=init,
frames=20, interval=500, blit=True)
plt.show()
From the Base Map plots, we can also see that the series of buoys in the equatorial Pacific are not always in the same positions.
Values of latitudes and longitudes in the data showed that the bouys moved around to different locations. The latitude values stayed within a degree from the approximate location. Yet the longitude values were sometimes as far as five degrees off of the approximate location.
So in order to rearrange them, we collect groups of buyos in different zones (as shown in the second Base Map figure). These latters were deduced dynamically from the existing latitudes and longitudes by rounding these coordinates. Doing so, we will have a considerable amount of data that will be better regrouped, which will help us in the construction of the predictive model.
In [10]:
%matplotlib inline
from utils import plotGeneralTendencies
plotGeneralTendencies(data)
From these graphs, we observe a clear increase of the air and sea surface temperatures. There is a stagnation in the beginning of the 90's but the increase restarts at the end of this decade.
We observe also a more or less constant evolution of humidity where it's mentioning that these measures started from 1989 (not like the other features from 1980) until 1998.
The mean evolution of zonal winds is in the range of negative values and it's are around -4 [m/s]. As explained in the description of the data, negative values for zonal winds means that they are coming from the West.
For meridional winds, we have almost positive values (which means that they are coming from the North). The speed of these winds raised once from around 1.5 [m/s] to 3 [m/s] from 1985 to 1987. Since this year, the speed of meridional winds is more or less constant.
We remark also that we have a very high variance. This can be due to the fact that the noise in the full oceanic-atmospheric-land system is significant and can make extraction of coherent signals very challenging.
Then we try to plot the correlation between each meteorological variable, in order to see if there is any dependancy between them, so that we can select those with the highest correlation to have a better fitting in the prediction model.
In [11]:
from utils import plotFeaturesCorrelation
plotFeaturesCorrelation(data)
We see that we have a linear relation between the air and the sea surface temperaures, which indicates a high relation between these two features. That is justified also with the Pearson correlation factor that is close to 1. The closer we are to this value the more dependance we have between features.
For winds, we see that we have no linear relationship between them. So we conclude that they are independant from each other. This is also proved with the Pearson correlation factor that is close to 0.
From the plots of correlations between the air temperature and the other meteorological variables (winds and humidity), we see that we don't have a linear relationship between them, which is justified also wih Pearson correlation factor that has a maximum absolute value of 0.385 (but still far from 1).
And since the air temperature is linear to the sea surface temperature, we can deduce that this latter is also independent from the other meteorological variables.
We try also to plot seperately the correlation beween humidity and winds:
In [12]:
from utils import plotCorrelationHW
plotCorrelationHW(data)
Now we concentrate on a specific zone to explore its behavior all over the years.
As we have seen before, we tried to better regroup buyos in order to have a well defined zones (see second figure of the Base Map). Then we pick one of these zones where we have the maximum number of measures (of air temperature) and we try to simulate its behavior along the years.
After picking the zone with the highest number of measurements , we try to rearrange our data in order to have an ascending recording of measures in respect of the dates. This will help us to perform a time series prediction.
In [13]:
df_zone = data.copy()
df_zone.sort_index(by=['Lat', 'Long'], ascending=[True, False], inplace=True)
# Round coordinates to create geographical areas
df_zone['Lat_rounded'] = (df_zone['Lat'].round()).astype(int)
df_zone['Long_rounded'] = (((df_zone['Long']*0.1).round())*10).astype(int)
df_zone.head()
Out[13]:
In [14]:
# Delete rows with NaN in 'AT' column
data_zone = df_zone[np.isfinite(df_zone['AT'])]
# Average of features over index
data_zone = data_zone.groupby(['Yr','Mon','Lat_rounded','Long_rounded']).agg({'AT': 'mean'})
df_zone = df_zone.groupby(['Lat_rounded','Long_rounded','Yr','Mon'], as_index=True).agg({'ZW':'mean','MW':'mean','Humid':'mean','AT':'mean','SST':'mean'})
# Count nbr of valid data of AT for each zone
nb_measures = df_zone.reset_index().groupby(['Lat_rounded','Long_rounded'])['AT'].nunique()
nb_measures_df = pd.DataFrame(nb_measures)
# Extract index of the zone with maximal nbr of data
index_zone = nb_measures_df.idxmax()
coord = index_zone.get_value('AT')
print('Selected zone coordinates: ', coord)
# Take AT values of previous selected zone
data_ts = data_zone.iloc[(data_zone.index.get_level_values('Lat_rounded') == coord[0]) & (data_zone.index.get_level_values('Long_rounded') == coord[1])]
We try to perform an histogram that gives us an intuition about the behavior of the meteorological variable of the selected zone for a specific month (for example the month of July) along the years:
In [15]:
# Select features value for the specific zone:
df_zone_hist = df_zone.iloc[(df_zone.index.get_level_values('Lat_rounded') == coord[0]) & (df_zone.index.get_level_values('Long_rounded') == coord[1])]
# Extracting values for March:
df_lat_long_july = df_zone_hist.iloc[df_zone_hist.index.get_level_values('Mon') == 7]
# Plotting histograms:
rcParams['figure.figsize'] = 15, 15
df_lat_long_july.columns = ['Meridional Winds', 'Humidity', 'Air Temp', 'Sea Surface Temp', 'Zonal Winds']
df_lat_long_july.unstack(level=1).plot(kind='bar', subplots=True)
Out[15]:
Here we perform the evolution of each feature for the month of July, at a latitude = 0 and a longitude = -110, evolving along the years (81 for 1981 until 97 for 1997).
We can see more clearly from these histograms that for some features there is many data that are missing (especially for the humidity).
The prediction of the air temperature on a specific zone will be studied in this section, by performing prediction using an ARMA model.
The zone with the largest number of measurments for the air temperature over the years will be considered. The prediction will be done with a time-step of one month.
In [16]:
# Multi index [['Year','Day']] to column of df
data_ts.reset_index(inplace=True)
# Add column Day for datetime series creation
data_ts['Day'] = data['Day']
data_ts['Day'] = data_ts['Day'].replace(data_ts['Day'],value=1)
# Change year values to required format
data_ts['Yr'] = data_ts['Yr'].add(1900)
data_ts[0:4]
Out[16]:
In [17]:
# Convert int values into string
data_ts['Yr'] = data_ts['Yr'].astype(str)
data_ts['Mon'] = data_ts['Mon'].astype(str)
data_ts['Day'] = data_ts['Day'].astype(str)
# Convert date into datetime values
date_df = pd.to_datetime(pd.DataFrame({'year': data_ts['Yr'],'month': data_ts['Mon'],'day': data_ts['Day']}))
# Set the Date as the index of the df
data_ts.index = date_df
data_ts.index.name = 'Date'
In [18]:
rcParams['figure.figsize'] = (15, 7)
ts = data_ts['AT']
plt.plot(ts)
plt.grid()
In [19]:
from utils import test_stationarity
test_stationarity(ts)
The air temperature seems to be a stationary signal. In fact, no seasonality can be observed and the rolling mean does not have any trend and is almost constant over the interval of time.
The results from the Dickey-Fuller test underline this observation. In fact as the test statistic is small than the 1% critical value, it can be said that the signal is at 99% stationary.
The values of the air temperature can be used directly to create an ARMA model.
In [20]:
from utils import plotACF_PACF
plotACF_PACF(ts)
From the autocorrelation and the partial autocorrelation functions the parameters (p,d) of the ARMA model can be deduced.
When an AR (Auto Regression), MA (Moving Average) and ARMA model were tested with these values, the best results were obtained with p = 5 and d = 5 for an ARMA model. The best result was selected from the RSS (Residual Sum of Squares) value.
In [21]:
from statsmodels.tsa.arima_model import ARIMA
model = ARIMA(ts, order=(5,0,5))
results_ARIMA = model.fit(disp = -1)
plt.plot(ts)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts)**2))
Out[21]:
In [22]:
from sklearn.metrics import mean_squared_error
ts = data_ts['AT']
size = int(len(ts) * 0.66)
train, test = ts[0:size], ts[size:len(ts)]
history = [x for x in train]
predictions = list()
for t in range(len(test)):
model = ARIMA(history, order=(5,0,5))
model_fit = model.fit(disp=0)
output = model_fit.forecast()
yhat = output[0][0]
predictions.append(yhat)
obs = test[t]
history.append(obs)
print('predicted=%f, expected=%f' % (yhat, obs))
error = mean_squared_error(test, predictions)
print('Test MSE: %.3f' % error)
A train and test set were extracted from the entire data. The values of the train set are used to estimate one step ahead the value of the air temperature. The estimated value is compared to the one in the test set.
The first value of the test set is then appended to the end of the train set and a new predicition is performed.
The obtained MSE for this zone is of 1.567°C.
In [23]:
from utils import plotResARMA
plotResARMA(test,predictions)
It can be noticed from the plot that when consecutive data are present the prediction is correct and when several months are missing as between the middle of 1995 and the middle of 1996 the deviation of the prediction and correct value is large.
For a better accuracy of the predictions with time series using the ARMA model, more values with a specific time-step are required.
In the following we try to predict the air temperature as a function of the position and the time.
We sort the data with respect to time and we clean by removing the columns we are not interested in such as Humidity, Date, Winds and Sea Surface Temperature.
In [24]:
sortedDataElNino = data.copy()
sortedDataElNino = sortedDataElNino.sort(['Date'],ascending=[True])
ReducedData = sortedDataElNino.drop({'Humid','Date','SST','MW','ZW'},axis = 1)
ReducedData = ReducedData[np.isfinite(ReducedData['AT'])]
AirTemp = ReducedData['AT']
X = ReducedData.drop({'AT'},axis = 1)
X.head()
Out[24]:
In [25]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X,AirTemp)
d = {'Features' : X.columns,'Coefficients' : lm.coef_}
coeffs = pd.DataFrame(d)
coeffs
Out[25]:
We use the 66% of the first data as a train set (past values), and the rest as a test set (future values), in order to make sure to not train on the future.
In [26]:
percTrain = 0.66;
IndexTrainTest = round(percTrain*len(X.index))
X_train = X[0:IndexTrainTest]
Y_train = AirTemp[0:IndexTrainTest]
X_test = X[IndexTrainTest+1:len(X.index)-1]
Y_test = AirTemp[IndexTrainTest+1:len(X.index)-1]
In [27]:
from utils import RegressLin
from utils import RegressLasso
from utils import RegressRidge
from utils import RegressElasticNet
from utils import RegressPoly
timeLin, trainPredLin, testPredLin = RegressLin(X_train,X_test,Y_train)
timeLas, trainPredLas, testPredLas = RegressLasso(X_train,X_test,Y_train, 1)
timeRidge, trainPredRidge, testPredRidge = RegressRidge(X_train,X_test,Y_train)
timeElastic, trainPredElastic, testPredElastic = RegressElasticNet(X_train,X_test,Y_train)
timePoly5, trainPredPoly5, testPredPoly5 = RegressPoly(X_train,X_test,Y_train,5)
timePoly4, trainPredPoly4, testPredPoly4 = RegressPoly(X_train,X_test,Y_train,4)
timePoly3, trainPredPoly3, testPredPoly3 = RegressPoly(X_train,X_test,Y_train,3)
timePoly2, trainPredPoly2, testPredPoly2 = RegressPoly(X_train,X_test,Y_train,2)
In order to select the best model, we consider two metrics,
The MSE (Mean Squared Error) that gives a measure of the quality of an estimator:
$\operatorname{MSE}=\frac{1}{n}\sum_{i=1}^n(\hat{Y}_i - Y_i)^2$
Where:
$\hat{Y}_i$: represents the predictions
$Y_i$: represents the real data
The $R^2$ score defined as follow:
$R^2 \equiv 1 - {SS_{\rm res}\over SS_{\rm tot}}$
Where:
$SS_\text{tot}=\sum_i (y_i-\bar{y})^2$
$SS_\text{res}=\sum_i (y_i - f_i)^2=\sum_i e_i^2$
$y_i$ a collection of data, $f_i$ the predicted associated values
$\bar{y}$: mean of observed data: $\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i $
$R^2$ represents the goodness of the fit of a model with the real data. When $R^2=1$ means that a regression line perfectly matches the data.
Let us compare the performance of the regression models :
In [28]:
from utils import metricDisplay
metricDisplay(Y_train,Y_test,trainPredLin,testPredLin,trainPredLas,testPredLas,trainPredRidge,testPredRidge,trainPredElastic,testPredElastic,trainPredPoly5,testPredPoly5,trainPredPoly4,testPredPoly4,trainPredPoly3,testPredPoly3,trainPredPoly2,testPredPoly2)
The four first regressions give more or less similar results. The best one according to the MSE on the test set is the polynomial regression of degree 2. In the same time, this regression has the highest R2 score (0.58 which is the closest to 1) comparing to the first four models.
We see that increasing the degree of the regression will induce an overfitting of the data.
We can also have a look at their fitting time.
In [29]:
# Display fitting time
x = [1,2,3,4,5,6,7,8]
y = [timeLin,timeLas,timeElastic,timeRidge,timePoly2,timePoly3,timePoly4,timePoly5]
plt.stem(x,y,'-.', markerfmt='bo',linewidth = 9)
plt.ylabel('Fitting time [s]',fontsize=14)
my_xticks = ['Linear','LASSO','ElasticNet','Ridge','Ployn. 2','Ployn. 3','Ployn. 4','Ployn. 5']
plt.xticks(x, my_xticks,fontsize=14)
plt.grid(True)
plt.yscale('log')
plt.show()
Now we can visually assess the quality of the polynomial regression of order 2 by computing the residuals, i.e. the difference between the predicted and the true values and also, in the second graph, the relationship between the measured and predicted values. Idealy, it should follow the black, dashed line.
In [30]:
# View of the regression
f, axarr = plt.subplots(2, 1, sharex = True)
#Residuals
axarr[0].scatter(trainPredPoly2,trainPredPoly2-np.array(Y_train).astype(np.float), c='r')
axarr[0].scatter(testPredPoly2,testPredPoly2-np.array(Y_test).astype(np.float), c='y')
axarr[0].hlines(y=0, xmin = 24.5, xmax = 28)
axarr[0].set_title('Residuals (Polynomial regression, degree 2), train(red) and test(yellow)',fontsize = 14)
# Other view
axarr[1].scatter(testPredPoly2,np.array(Y_test).astype(np.float),c='y')
axarr[1].plot([testPredPoly2.min(), testPredPoly2.max()], [testPredPoly2.min(), testPredPoly2.max()], 'k--', lw=4)
axarr[1].set_xlabel('Predicted',fontsize=14)
axarr[1].set_ylabel('Measured',fontsize=14)
plt.show()
In classical machine learning problem, we can use cross validation to find a suitable hyperparameter.
As we have a time serie, we have to do something else because we cannot predict the past knowing future values.
Train on fold 1 –> Test on fold 2
Train on fold 1+2 –> Test on fold 3
Train on fold 1+2+3 –> Test on fold 4
Train on fold 1+2+3+4 –> Test on fold 5
Train on fold 1+2+3+4+5 –> Test on fold 6
Train on fold 1+2+3+4+5+6 –> Test on fold 7
Train on fold 1+2+3+4+5+6+7 –> Test on fold 8
Train on fold 1+2+3+4+5+6+7+8 –> Test on fold 9
Train on fold 1+2+3+4+5+6+7+8+9 –> Test on fold 10
...
$\Rightarrow$ and then we compute the mean value of the metrics we obtain after each test.
The minimization problem of the Lasso is given by :
\begin{aligned} \min_{w} (\frac{1}{2 n_{samples}} ||y - Xw||^2_2 + \alpha ||w||_1) \end{aligned}with y the real data, x the train data and $\alpha$ an hyperparameter that we want to estimate.
In [31]:
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# "cross-validation" for time series
nbFolds = 20;
length1Fold = int(np.floor(float(X_train.shape[0]) / nbFolds))
MSETestLas = []
R2TestLas = []
meanMSELas = []
meanR2Las = []
alphas = np.logspace(-7, 1, num=10)
for index,a in enumerate(alphas):
for i in range(2, nbFolds + 1):
X = X_train[:(length1Fold*i)]
y = Y_train[:(length1Fold*i)]
split = float(i-1)/i
index = int(np.floor(X.shape[0] * split))
# folds used to train the model
X_trainFolds = X[:index]
y_trainFolds = y[:index]
# fold used to test the model
X_testFold = X[(index + 1):]
y_testFold = y[(index + 1):]
lassoreg = Lasso(alpha=a,normalize=True, max_iter=1e5)
lassoreg.fit(X_trainFolds,y_trainFolds)
trainPredLas = lassoreg.predict(X_trainFolds)
testPredLas = lassoreg.predict(X_testFold)
MSETestLas.append(mean_squared_error(testPredLas, np.array(y_testFold).astype(np.float)))
R2TestLas.append(r2_score(np.array(y_testFold).astype(np.float),testPredLas))
meanMSELas.append(np.mean(MSETestLas))
meanR2Las.append(np.mean(R2TestLas))
The best value of $\alpha$ is where the mean squared error is the smallest or where the $R^2$ score is the closest to one.
In [32]:
idx_min = meanMSELas.index(min(meanMSELas))
f, axarr = plt.subplots(2, 1, sharex=True)
axarr[0].stem(alphas,meanMSELas)
axarr[0].set_ylim([2.95,3.25])
axarr[0].set_xscale('log')
axarr[0].grid(True)
axarr[0].set_ylabel('MSE',fontsize=14)
axarr[0].annotate('Minimum : alpha = '+str(alphas[idx_min]), xy=(alphas[idx_min], meanMSELas[idx_min]), xytext=(1e-4, 3.15),
arrowprops=dict(facecolor='black', shrink=0.5),
)
axarr[1].set_xlabel('alpha',fontsize=14)
axarr[1].stem(alphas,meanR2Las)
axarr[1].set_xscale('log')
axarr[1].grid(True)
axarr[1].set_ylabel('R2 Score',fontsize=14)
plt.show()
After optaining the optimal parameter (alpha) for Lasso model, we can know perform its metrics and compared to the previous Lasso model and the polynomial regression model of order 2:
In [33]:
opt_alpha = alphas[idx_min]
timeLas, trainPredLas, testPredLas = RegressLasso(X_train,X_test,Y_train, opt_alpha)
MSETrainLas = mean_squared_error(trainPredLas, np.array(Y_train).astype(np.float))
MSETestLas = mean_squared_error(testPredLas, np.array(Y_test).astype(np.float))
R2TrainLas = r2_score(np.array(Y_train).astype(np.float),trainPredLas)
R2TestLas = r2_score(np.array(Y_test).astype(np.float),testPredLas)
print('Mean squared error, train set : ',MSETrainLas,', test set : ', MSETestLas)
print('R2 score, train set : ',R2TrainLas,', test set : ',R2TrainLas)
With $\alpha = 1$, we had:
Mean squared error, train set : 3.47892681224 , test set : 3.00262798816
R2 score, train set : 0.0 , test set : 0.0
The new $\alpha$ give better results than the default one.
We see that the MSE of the test set for Lasso regression is smaller than the first one, but still worse than the polynomial regression of order 2.
So we can conclude that, with a split of (2/3) of the data for the train set and the remaining (1/3) for the test set, we have the polynomial regression of order 2 that performs better than the cited regression models performed above.
We choose this specific split of train and test sets because we want to compare the ARMA model that has this splitting proportionality. This will be performed below:
We start by organizing the data as it was done for the pre-processing of the ARMA model:
In [34]:
sortedDataElNino_2 = data_ts.copy()
AirTemp = sortedDataElNino_2['AT']
X = sortedDataElNino_2.drop({'AT'},axis = 1)
X.head()
Out[34]:
In [35]:
# We make the same split as done for the ARMA part:
percTrain = 0.66;
IndexTrainTest = round(percTrain*len(X.index))
X_train = X[0:IndexTrainTest]
Y_train = AirTemp[0:IndexTrainTest]
X_test = X[IndexTrainTest+1:len(X.index)-1]
Y_test = AirTemp[IndexTrainTest+1:len(X.index)-1]
In [36]:
# We perform the polynomial regression model of order 2:
timePoly2, trainPredPoly2, testPredPoly2 = RegressPoly(X_train,X_test,Y_train,2)
In [37]:
# We display its metrics:
MSETrainPoly2 = mean_squared_error(trainPredPoly2, np.array(Y_train).astype(np.float))
MSETestPoly2 = mean_squared_error(testPredPoly2, np.array(Y_test).astype(np.float))
R2TrainPoly2 = r2_score(np.array(Y_train).astype(np.float),trainPredPoly2)
R2TestPoly2 = r2_score(np.array(Y_test).astype(np.float),testPredPoly2)
print('Mean squared error, train set : ',MSETrainPoly2,', test set : ', MSETestPoly2)
print('R2 score, train set : ',R2TrainPoly2,', test set : ',R2TrainPoly2)
We see that the ARMA model has a lower MSE for the test set (1.567 < 6.285).
So we can finally conclude that the best regression model between all the performed ones for our data is the ARMA model.