Prediction (out of sample)


In [1]:
%matplotlib inline

In [2]:
import numpy as np
import statsmodels.api as sm

Artificial data


In [3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
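
The noise added to y_true is drawn with np.random.normal, so the exact numbers printed below will differ from run to run. A minimal sketch, assuming reproducible output is wanted (the seed value is arbitrary):

np.random.seed(12345)  # arbitrary seed -- fixes the noise draw so reruns match
y = y_true + sig * np.random.normal(size=nsample)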

Estimation


In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.988
Method:                 Least Squares   F-statistic:                     1308.
Date:                Fri, 13 Mar 2020   Prob (F-statistic):           1.63e-44
Time:                        14:01:21   Log-Likelihood:                 9.8561
No. Observations:                  50   AIC:                            -11.71
Df Residuals:                      46   BIC:                            -4.064
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9487      0.071     70.093      0.000       4.807       5.091
x1             0.4996      0.011     45.884      0.000       0.478       0.522
x2             0.5446      0.043     12.723      0.000       0.458       0.631
x3            -0.0202      0.001    -21.095      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        4.223   Durbin-Watson:                   1.747
Prob(Omnibus):                  0.121   Jarque-Bera (JB):                3.132
Skew:                          -0.478   Prob(JB):                        0.209
Kurtosis:                       3.767   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
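
Everything in the summary table is also available programmatically as attributes of the results object; a minimal sketch:

print(olsres.params)    # estimated coefficients
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared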

In-sample prediction


In [5]:
ypred = olsres.predict(X)
print(ypred)


[ 4.44449549  4.9435382   5.40034568  5.78523744  6.07924451  6.27722602
  6.3887138   6.43634636  6.45214939  6.47227409  6.53105761  6.65538154
  6.86025535  7.14635038  7.49988953  7.89491082  8.29753285  8.67152141
  8.9842427   9.21202492  9.34404751  9.38411894  9.35005096  9.27073222
  9.18138191  9.11776337  9.11030827  9.17911677  9.3306551   9.55669324
  9.83565699 10.13617267 10.4222216  10.65905798 10.81891744 10.88557872
 10.85702871 10.74579325 10.57687913 10.38366559 10.20242016 10.06634003
 10.00009786 10.01578809 10.11093991 10.26892315 10.46168062 10.6543378
 10.81093264 10.900324  ]
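
For the estimation sample, predict(X) simply reproduces the stored fitted values; a quick check (sketch):

print(np.allclose(ypred, olsres.fittedvalues))  # True -- in-sample predictions equal the fitted values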

Create a new sample of explanatory variables, Xnew, then predict and plot


In [6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)


[10.88825915 10.73319506 10.4564891  10.10681195  9.74823138  9.44452619
  9.24357095  9.16561462  9.19832293  9.2997982 ]
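
predict returns point predictions only. The results object also exposes get_prediction, whose summary_frame includes the predicted mean along with confidence and prediction intervals; a minimal sketch:

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # columns: mean, mean_se, mean_ci_*, obs_ci_*
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])  # point prediction and 95% prediction interval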

Plot comparison


In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
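
The plot is easier to read with axis labels and a marker where the estimation sample ends; a sketch, with arbitrary styling choices:

ax.set_xlabel("x1")
ax.set_ylabel("y")
ax.axvline(x1.max(), color="k", linestyle="--", alpha=0.5)  # boundary between in- and out-of-sample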


Predicting with formulas

Using formulas can make both estimation and prediction a lot easier.


In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we do not want any expansion magic from using **2. Inside I(), the expression is evaluated as ordinary Python arithmetic rather than as formula operators.


In [9]:
res.params


Out[9]:
Intercept           4.948678
x1                  0.499605
np.sin(x1)          0.544605
I((x1 - 5) ** 2)   -0.020167
dtype: float64
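
Apart from the variable names, these are the same estimates the array interface produced above; a quick consistency check (sketch):

print(np.allclose(res.params.values, olsres.params))  # True -- same fit, different interface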

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are generated automatically.


In [10]:
res.predict(exog=dict(x1=x1n))


Out[10]:
0    10.888259
1    10.733195
2    10.456489
3    10.106812
4     9.748231
5     9.444526
6     9.243571
7     9.165615
8     9.198323
9     9.299798
dtype: float64
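
These values match the out-of-sample predictions from the array interface; a quick check (sketch):

print(np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred))  # True -- formula and array interfaces agree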